If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > DB2 > Distribution Key Change

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 11-04-10, 00:44
getback0 getback0 is offline
Registered User
 
Join Date: Oct 2010
Posts: 73
Distribution Key Change

Hello!

It's me again .... with my questions on distribution !

We have a massive table (as I see it) that is distributed using column Y. It does participate in some joins using that column. However, most of the joins it is involved in use column X - causing inter-partition scans (yieeks!); also, column X has a one-to-one cardinality with Y.

My question - would it be a win-win for both kinds of joins to have the distribution key changed to (X,Y)?

What would be the impact on the loads to the tables? like noticeable degradation? X and Y are both regular integers (max 2^ 32).


Your help is very much appreciated!

G
Reply With Quote
  #2 (permalink)  
Old 11-04-10, 08:17
n_i n_i is offline
:-)
 
Join Date: Jun 2003
Location: Toronto, Canada
Posts: 4,449
Quote:
Originally Posted by getback0 View Post

My question - would it be a win-win for both kinds of joins to have the distribution key changed to (X,Y)?
Obviously not, unless you make it so for _all_ tables. Distribution is based not on the key value but on the value of a hash function computed from that key. Hash(x), hash(y), and hash(x,y) will all have different values and will distribute rows differently as a result.
Reply With Quote
  #3 (permalink)  
Old 11-04-10, 11:24
getback0 getback0 is offline
Registered User
 
Join Date: Oct 2010
Posts: 73
Thanks n_i!

How about if I leverage the fact that X and Y have a one-to-one cardinality and therefor change the joins to use X and Y all the time (if it's possible)?

Would that then benefit from the combined distribution key?

Also, how bad do you think the ETL would get impacted because of a larger distribution key?

G
Reply With Quote
  #4 (permalink)  
Old 11-04-10, 11:59
n_i n_i is offline
:-)
 
Join Date: Jun 2003
Location: Toronto, Canada
Posts: 4,449
Quote:
Originally Posted by getback0
change the joins to use X and Y all the time
How you write your queries is dictated by business logic, not your desire to collocate joins. Tables will be collocated only if they have absolutely identical distribution keys.

Quote:
Originally Posted by getback0
how bad do you think the ETL would get impacted
About 3.5
Reply With Quote
  #5 (permalink)  
Old 11-04-10, 13:56
getback0 getback0 is offline
Registered User
 
Join Date: Oct 2010
Posts: 73
Thanks again! Quick questions -

Quote:
How you write your queries is dictated by business logic, not your desire to collocate joins. Tables will be collocated only if they have absolutely identical distribution keys.
I am not sure I understand - if X and Y have a 1-to-1 cardinality, even if business logic mentions using one instead of the other (perhaps to avoid redundancy) - why would it hurt to include both X and Y?

Quote:
About 3.5
You mean 3.5 times slower???? so the hash function on 2 integer columns is 3.5 times slower than that on one????
Reply With Quote
  #6 (permalink)  
Old 11-04-10, 15:34
n_i n_i is offline
:-)
 
Join Date: Jun 2003
Location: Toronto, Canada
Posts: 4,449
Quote:
Originally Posted by getback0 View Post
why would it hurt to include both X and Y?
And what would be the purpose of it?

Quote:
Originally Posted by getback0 View Post

You mean 3.5 times slower???? so the hash function on 2 integer columns is 3.5 times slower than that on one????
You didn't ask about the hash function; you asked about your ETL process. Calculating the hash function value from two columns will, of course, be marginally slower than from just one column, but only you can tell if it will affect your ETL process and if yes, how much.
Reply With Quote
  #7 (permalink)  
Old 11-04-10, 15:52
getback0 getback0 is offline
Registered User
 
Join Date: Oct 2010
Posts: 73
Quote:
Quote:
Originally Posted by getback0
why would it hurt to include both X and Y?

And what would be the purpose of it?
Let me try again -

Let's assume the following -
- there are 50 queries that need to pull attributes from table T based on column X
- there are ~ 50 queries that access attributes from table T based on column Y
- above 2 categories completely describe all kinds of queries run against the table
- also all queries have both X and Y available (query type #1 above currently scans table T on X alone - it does know the corresponding Y and therefore can include Y in its predicate; similarly query type #2 knows the value of X for the argument Y)

In light of the above, if we were to modify all queries to include both X and Y every time we query the table and modify the distribution key on T to be (X,Y) instead of X alone that it is currently; would we not benefit from distribution? The original intent of both the query types above will stay the same - but allowing for the redundancy (including Y for argument X and vice versa) would allow both categories to leverage distribution; wouldn't it?

Also, what was the number 3.5??? could you elaborate?

Thanks for taking the time out n_i!

G
Reply With Quote
  #8 (permalink)  
Old 11-04-10, 16:22
n_i n_i is offline
:-)
 
Join Date: Jun 2003
Location: Toronto, Canada
Posts: 4,449
Quote:
Originally Posted by getback0 View Post
- there are 50 queries that need to pull attributes from table T based on column X
- there are ~ 50 queries that access attributes from table T based on column Y
- above 2 categories completely describe all kinds of queries run against the table
- also all queries have both X and Y available (query type #1 above currently scans table T on X alone - it does know the corresponding Y and therefore can include Y in its predicate; similarly query type #2 knows the value of X for the argument Y)

In light of the above, if we were to modify all queries to include both X and Y every time we query the table and modify the distribution key on T to be (X,Y) instead of X alone that it is currently; would we not benefit from distribution?
It likely would.

Quote:
Originally Posted by getback0 View Post
Also, what was the number 3.5???
I just made that up; I don't really know how your ETL will be affected.
Reply With Quote
  #9 (permalink)  
Old 11-04-10, 17:04
getback0 getback0 is offline
Registered User
 
Join Date: Oct 2010
Posts: 73
Very helpful!
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On