Page 1 of 2 12 LastLast
Results 1 to 15 of 21
  1. #1
    Join Date
    Jan 2011
    Posts
    24

    Unanswered: order of index columns

    Hello,

    I'd like to get some help. Designing indexes in Db2 manual has this sentence:

    Note: Whenever possible, order the columns in an index key from the most distinct to the least distinct. This provides the best performance.


    I don't quite understand the meaning of most distinct and least distinct. Must be my English. Can you explain please?


    Example, I want to create an index on columns:

    count(distinct col1) -> returns 250 rows
    count(distinct col2) -> returns 100 rows
    count(distinct col3) -> returns 10 rows
    count(distinct col4) -> returns 1 row


    If I want to place them in the order of most distinct to least distinct, it will be: (col1, col2, col3, col4). Correct?

    Why placing them in the order of most distinct to least distinct provides the best performance? Can you explain please.


    Thank you.

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Bear with me for a moment in the explanation. I'm pretty sure that it will make sense when I get done...

    Think of the rows in the table as marbles, and the columns in the index as though they divided those marbles into "bags".

    When you want to find a specific marble and you work from the "most selective" column (column 1) you pick one of the 250 bags that will be relatively small in terms of the marbles that it contains. It probably won't even hold as many bags as the next level (column) might have because so many marbles (and their bags) had already been excluded from your search.

    If you approach the problem from the other end and use the least selective column (column 4), you get one bag... and waste the effort of opening it to find ten more bags inside of it! You then open the appropriate bag from those ten, and find another bunch of bags!

    When all things are equal (and they aren't always), you should put the most selective columns first in the index column sequence.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    In your example, none of the columns have a particularly high cardinality (a lot of distinct values), unless that particular table is fairly small, in which case indexes are usually irrelevant.

    However, if the columns were more unique and it was a larger table, then if you have a WHERE clause that only had column 1 in it, then it would perform better if it were the first column of the index. If all four columns are always supplied in the WHERE clause when retrieving, updating, or deleting a row from that table, then then the order of the columns in the index doesn't make enough difference to worry about.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  4. #4
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Quote Originally Posted by Pat Phelan View Post
    Bear with me for a moment in the explanation. I'm pretty sure that it will make sense when I get done...

    Think of the rows in the table as marbles, and the columns in the index as though they divided those marbles into "bags".

    When you want to find a specific marble and you work from the "most selective" column (column 1) you pick one of the 250 bags that will be relatively small in terms of the marbles that it contains. It probably won't even hold as many bags as the next level (column) might have because so many marbles (and their bags) had already been excluded from your search.

    If you approach the problem from the other end and use the least selective column (column 4), you get one bag... and waste the effort of opening it to find ten more bags inside of it! You then open the appropriate bag from those ten, and find another bunch of bags!

    When all things are equal (and they aren't always), you should put the most selective columns first in the index column sequence.

    -PatP
    What happens when someone loses their marbles?
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  5. #5
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by Marcus_A View Post
    What happens when someone loses their marbles?
    That never bothered me one bit!

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  6. #6
    Join Date
    Jan 2011
    Posts
    24
    Thanks guys. Appreciate your explanation.

    Marcus, please elaborate on:

    Quote Originally Posted by Marcus_A View Post
    If all four columns are always supplied in the WHERE clause when retrieving, updating, or deleting a row from that table, then then the order of the columns in the index doesn't make enough difference to worry about.

    What you said applies to our situation: all four columns are always supplied in the WHERE clause. The sql is coded in the view and it looks likes:

    create view ....
    from taba, tabb... more tables...
    where taba.c1 = tabb.c1 and
    taba.c2 = tabb.c2
    taba.c3 = tabb.c3
    taba.c4 = tabb.c4
    ....

    We execute select * from view.


    Additional Q: we have another query that looks like: select count(*) from view where c5 = ? and c2 = ?

    Option 1: ind1 on (c1,c2,c3,c4) and ind2 on (c5,c2)
    Option 2: ind3 on (c1,c5,c2,c3,c4)


    What do you recommend: Option 1 or Option 2? Can db2 do use ind3 to process:

    select * from view (where only c1,c2,c3,c4 are supplied - coded in the view) and
    select count(*) from view where c5 = ? and c2 = ?

  7. #7
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Quote Originally Posted by Mikky View Post
    Thanks guys. Appreciate your explanation.

    Marcus, please elaborate on:




    What you said applies to our situation: all four columns are always supplied in the WHERE clause. The sql is coded in the view and it looks likes:

    create view ....
    from taba, tabb... more tables...
    where taba.c1 = tabb.c1 and
    taba.c2 = tabb.c2
    taba.c3 = tabb.c3
    taba.c4 = tabb.c4
    ....

    We execute select * from view.


    Additional Q: we have another query that looks like: select count(*) from view where c5 = ? and c2 = ?

    Option 1: ind1 on (c1,c2,c3,c4) and ind2 on (c5,c2)
    Option 2: ind3 on (c1,c5,c2,c3,c4)


    What do you recommend: Option 1 or Option 2? Can db2 do use ind3 to process:

    select * from view (where only c1,c2,c3,c4 are supplied - coded in the view) and
    select count(*) from view where c5 = ? and c2 = ?
    Maybe you didn't understand what I said. I meant if all 4 columns are always supplied for EVERY SINGLE SQL statement where you are updating, deleting, or selecting from the table (via a view or with base table). That is apparently not the case since you showed a different query on the same table that has different predicates.

    It is hard to say what the optimum solution is because you haven't said what is the frequency of each query and what is the importance of each query, and what is the cost of extra indexes:
    • cost of space for extra indexes
    • performance cost of inserting new index entries
    • performance cost of deleting index entries when a row is deleted
    • etc

    But I can tell you that option 2 is probably a poor choice. Option 1 will definitely work with the SQL you showed us, but we can't tell if there are any other SQL statements besides these two. If these are the only two, then option 1 is fine.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  8. #8
    Join Date
    Jan 2011
    Posts
    24
    Yes, I understood what you said. If all 4 columns are always supplied for EVERY SINGLE SQL statement, then the order of the columns in the index doesn't make much difference. What DB2 manual says about ordering the column in the index from the most distinct to least distinct won't provide much benefit if all 4 columns are always provided for every sql.


    One additional Q.

    If I have an index on (col1,col2,col3) and query: select ... from table where col2 = x.

    Can DB2 use this index?


    Thank you.

  9. #9
    Join Date
    Dec 2007
    Location
    Richmond, VA
    Posts
    1,328
    Provided Answers: 5
    First off, the order of those 4 always supplied values does matter, refer to Pat's first explanation. Second, the query in your view only supplies those four columns for one of your 2 tables, so one of them will be a tablespace scan, unless in the use of the view there are actual values supplied for those columns.
    Lastly, your newest question. It is possible for DB2 to choose a non matching index scan of the index you describe. Which means that DB2 will scan every row in the index to see if col2 = x. If you want matching index access, meaning go directly to the rows where col2 = x, then col2 has to be your leading index.

    Dave Nance

  10. #10
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by Mikky View Post
    What DB2 manual says about ordering the column in the index from the most distinct to least distinct won't provide much benefit if all 4 columns are always provided for every sql.
    Actually that's not true. See my first post, check the results of EXPLAIN for both index structures, and measure the elapsed time with some (several thousand rows) complex data. The order of the index columns does matter even if all columns are included in the search.

    Quote Originally Posted by Mikky View Post
    If I have an index on (col1,col2,col3) and query: select ... from table where col2 = x.
    Based on the query you provided, no it can't. There can be more complex cases where this index will make a difference in some versions of DB2, but not in the example that you posted.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  11. #11
    Join Date
    Jul 2013
    Location
    Moscow, Russia
    Posts
    666
    Provided Answers: 55
    Quote Originally Posted by Mikky View Post
    If I have an index on (col1,col2,col3) and query: select ... from table where col2 = x.

    Can DB2 use this index?
    Hi,

    it might be possible with db2 10.1 and newer.
    This technique is called jump scan.
    You can read this blog post on this as well.
    Regards,
    Mark.

  12. #12
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Quote Originally Posted by dav1mo View Post
    First off, the order of those 4 always supplied values does matter, refer to Pat's first explanation.
    It won't matter enough to even measure the difference. Try it.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  13. #13
    Join Date
    Dec 2007
    Location
    Richmond, VA
    Posts
    1,328
    Provided Answers: 5
    The difference is very measurable in some cases. Try a 100+ mil row table where the column distribution is something like:
    col1 98000000
    col2 254
    col3 12
    col4 4

    The difference will shock you, since you dont think there is any.

  14. #14
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by Marcus_A View Post
    It won't matter enough to even measure the difference. Try it.
    Actually I already did, and the difference was small (262 ms using spufi) on a largish data set (about 10M rows) using the same breakdown that was originally posted (250, 100, 10, 1). Each query returned 40 identical rows, and the query with the most selective column first was the fastest.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  15. #15
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Quote Originally Posted by dav1mo View Post
    The difference is very measurable in some cases. Try a 100+ mil row table where the column distribution is something like:
    col1 98000000
    col2 254
    col3 12
    col4 4

    The difference will shock you, since you dont think there is any.
    Can you post the DDL and SQL statement you used?
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •