# Thread: order of index columns

1. Registered User
Join Date
Jan 2011
Posts
24

## Unanswered: order of index columns

Hello,

I'd like to get some help. Designing indexes in Db2 manual has this sentence:

Note: Whenever possible, order the columns in an index key from the most distinct to the least distinct. This provides the best performance.

I don't quite understand the meaning of most distinct and least distinct. Must be my English. Can you explain please?

Example, I want to create an index on columns:

count(distinct col1) -> returns 250 rows
count(distinct col2) -> returns 100 rows
count(distinct col3) -> returns 10 rows
count(distinct col4) -> returns 1 row

If I want to place them in the order of most distinct to least distinct, it will be: (col1, col2, col3, col4). Correct?

Why placing them in the order of most distinct to least distinct provides the best performance? Can you explain please.

Thank you.

2. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Bear with me for a moment in the explanation. I'm pretty sure that it will make sense when I get done...

Think of the rows in the table as marbles, and the columns in the index as though they divided those marbles into "bags".

When you want to find a specific marble and you work from the "most selective" column (column 1) you pick one of the 250 bags that will be relatively small in terms of the marbles that it contains. It probably won't even hold as many bags as the next level (column) might have because so many marbles (and their bags) had already been excluded from your search.

If you approach the problem from the other end and use the least selective column (column 4), you get one bag... and waste the effort of opening it to find ten more bags inside of it! You then open the appropriate bag from those ten, and find another bunch of bags!

When all things are equal (and they aren't always), you should put the most selective columns first in the index column sequence.

-PatP

3. Registered User
Join Date
May 2003
Location
USA
Posts
5,737
In your example, none of the columns have a particularly high cardinality (a lot of distinct values), unless that particular table is fairly small, in which case indexes are usually irrelevant.

However, if the columns were more unique and it was a larger table, then if you have a WHERE clause that only had column 1 in it, then it would perform better if it were the first column of the index. If all four columns are always supplied in the WHERE clause when retrieving, updating, or deleting a row from that table, then then the order of the columns in the index doesn't make enough difference to worry about.

4. Registered User
Join Date
May 2003
Location
USA
Posts
5,737
Originally Posted by Pat Phelan
Bear with me for a moment in the explanation. I'm pretty sure that it will make sense when I get done...

Think of the rows in the table as marbles, and the columns in the index as though they divided those marbles into "bags".

When you want to find a specific marble and you work from the "most selective" column (column 1) you pick one of the 250 bags that will be relatively small in terms of the marbles that it contains. It probably won't even hold as many bags as the next level (column) might have because so many marbles (and their bags) had already been excluded from your search.

If you approach the problem from the other end and use the least selective column (column 4), you get one bag... and waste the effort of opening it to find ten more bags inside of it! You then open the appropriate bag from those ten, and find another bunch of bags!

When all things are equal (and they aren't always), you should put the most selective columns first in the index column sequence.

-PatP
What happens when someone loses their marbles?

5. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Originally Posted by Marcus_A
What happens when someone loses their marbles?
That never bothered me one bit!

-PatP

6. Registered User
Join Date
Jan 2011
Posts
24

Originally Posted by Marcus_A
If all four columns are always supplied in the WHERE clause when retrieving, updating, or deleting a row from that table, then then the order of the columns in the index doesn't make enough difference to worry about.

What you said applies to our situation: all four columns are always supplied in the WHERE clause. The sql is coded in the view and it looks likes:

create view ....
from taba, tabb... more tables...
where taba.c1 = tabb.c1 and
taba.c2 = tabb.c2
taba.c3 = tabb.c3
taba.c4 = tabb.c4
....

We execute select * from view.

Additional Q: we have another query that looks like: select count(*) from view where c5 = ? and c2 = ?

Option 1: ind1 on (c1,c2,c3,c4) and ind2 on (c5,c2)
Option 2: ind3 on (c1,c5,c2,c3,c4)

What do you recommend: Option 1 or Option 2? Can db2 do use ind3 to process:

select * from view (where only c1,c2,c3,c4 are supplied - coded in the view) and
select count(*) from view where c5 = ? and c2 = ?

7. Registered User
Join Date
May 2003
Location
USA
Posts
5,737
Originally Posted by Mikky

What you said applies to our situation: all four columns are always supplied in the WHERE clause. The sql is coded in the view and it looks likes:

create view ....
from taba, tabb... more tables...
where taba.c1 = tabb.c1 and
taba.c2 = tabb.c2
taba.c3 = tabb.c3
taba.c4 = tabb.c4
....

We execute select * from view.

Additional Q: we have another query that looks like: select count(*) from view where c5 = ? and c2 = ?

Option 1: ind1 on (c1,c2,c3,c4) and ind2 on (c5,c2)
Option 2: ind3 on (c1,c5,c2,c3,c4)

What do you recommend: Option 1 or Option 2? Can db2 do use ind3 to process:

select * from view (where only c1,c2,c3,c4 are supplied - coded in the view) and
select count(*) from view where c5 = ? and c2 = ?
Maybe you didn't understand what I said. I meant if all 4 columns are always supplied for EVERY SINGLE SQL statement where you are updating, deleting, or selecting from the table (via a view or with base table). That is apparently not the case since you showed a different query on the same table that has different predicates.

It is hard to say what the optimum solution is because you haven't said what is the frequency of each query and what is the importance of each query, and what is the cost of extra indexes:
• cost of space for extra indexes
• performance cost of inserting new index entries
• performance cost of deleting index entries when a row is deleted
• etc

But I can tell you that option 2 is probably a poor choice. Option 1 will definitely work with the SQL you showed us, but we can't tell if there are any other SQL statements besides these two. If these are the only two, then option 1 is fine.

8. Registered User
Join Date
Jan 2011
Posts
24
Yes, I understood what you said. If all 4 columns are always supplied for EVERY SINGLE SQL statement, then the order of the columns in the index doesn't make much difference. What DB2 manual says about ordering the column in the index from the most distinct to least distinct won't provide much benefit if all 4 columns are always provided for every sql.

If I have an index on (col1,col2,col3) and query: select ... from table where col2 = x.

Can DB2 use this index?

Thank you.

9. Registered User
Join Date
Dec 2007
Location
Richmond, VA
Posts
1,328
First off, the order of those 4 always supplied values does matter, refer to Pat's first explanation. Second, the query in your view only supplies those four columns for one of your 2 tables, so one of them will be a tablespace scan, unless in the use of the view there are actual values supplied for those columns.
Lastly, your newest question. It is possible for DB2 to choose a non matching index scan of the index you describe. Which means that DB2 will scan every row in the index to see if col2 = x. If you want matching index access, meaning go directly to the rows where col2 = x, then col2 has to be your leading index.

Dave Nance

10. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Originally Posted by Mikky
What DB2 manual says about ordering the column in the index from the most distinct to least distinct won't provide much benefit if all 4 columns are always provided for every sql.
Actually that's not true. See my first post, check the results of EXPLAIN for both index structures, and measure the elapsed time with some (several thousand rows) complex data. The order of the index columns does matter even if all columns are included in the search.

Originally Posted by Mikky
If I have an index on (col1,col2,col3) and query: select ... from table where col2 = x.
Based on the query you provided, no it can't. There can be more complex cases where this index will make a difference in some versions of DB2, but not in the example that you posted.

-PatP

11. Registered User
Join Date
Jul 2013
Location
Moscow, Russia
Posts
666
Originally Posted by Mikky
If I have an index on (col1,col2,col3) and query: select ... from table where col2 = x.

Can DB2 use this index?
Hi,

it might be possible with db2 10.1 and newer.
This technique is called jump scan.
You can read this blog post on this as well.

12. Registered User
Join Date
May 2003
Location
USA
Posts
5,737
Originally Posted by dav1mo
First off, the order of those 4 always supplied values does matter, refer to Pat's first explanation.
It won't matter enough to even measure the difference. Try it.

13. Registered User
Join Date
Dec 2007
Location
Richmond, VA
Posts
1,328
The difference is very measurable in some cases. Try a 100+ mil row table where the column distribution is something like:
col1 98000000
col2 254
col3 12
col4 4

The difference will shock you, since you dont think there is any.

14. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Originally Posted by Marcus_A
It won't matter enough to even measure the difference. Try it.
Actually I already did, and the difference was small (262 ms using spufi) on a largish data set (about 10M rows) using the same breakdown that was originally posted (250, 100, 10, 1). Each query returned 40 identical rows, and the query with the most selective column first was the fastest.

-PatP

15. Registered User
Join Date
May 2003
Location
USA
Posts
5,737
Originally Posted by dav1mo
The difference is very measurable in some cases. Try a 100+ mil row table where the column distribution is something like:
col1 98000000
col2 254
col3 12
col4 4

The difference will shock you, since you dont think there is any.
Can you post the DDL and SQL statement you used?

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•