Results 1 to 9 of 9
  1. #1
    Join Date
    Oct 2010
    Posts
    94

    Unanswered: Distribution Key Change

    Hello!

    It's me again .... with my questions on distribution !

    We have a massive table (as I see it) that is distributed using column Y. It does participate in some joins using that column. However, most of the joins it is involved in use column X - causing inter-partition scans (yieeks!); also, column X has a one-to-one cardinality with Y.

    My question - would it be a win-win for both kinds of joins to have the distribution key changed to (X,Y)?

    What would be the impact on the loads to the tables? like noticeable degradation? X and Y are both regular integers (max 2^ 32).


    Your help is very much appreciated!

    G

  2. #2
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by getback0 View Post

    My question - would it be a win-win for both kinds of joins to have the distribution key changed to (X,Y)?
    Obviously not, unless you make it so for _all_ tables. Distribution is based not on the key value but on the value of a hash function computed from that key. Hash(x), hash(y), and hash(x,y) will all have different values and will distribute rows differently as a result.

  3. #3
    Join Date
    Oct 2010
    Posts
    94
    Thanks n_i!

    How about if I leverage the fact that X and Y have a one-to-one cardinality and therefor change the joins to use X and Y all the time (if it's possible)?

    Would that then benefit from the combined distribution key?

    Also, how bad do you think the ETL would get impacted because of a larger distribution key?

    G

  4. #4
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by getback0
    change the joins to use X and Y all the time
    How you write your queries is dictated by business logic, not your desire to collocate joins. Tables will be collocated only if they have absolutely identical distribution keys.

    Quote Originally Posted by getback0
    how bad do you think the ETL would get impacted
    About 3.5

  5. #5
    Join Date
    Oct 2010
    Posts
    94
    Thanks again! Quick questions -

    How you write your queries is dictated by business logic, not your desire to collocate joins. Tables will be collocated only if they have absolutely identical distribution keys.
    I am not sure I understand - if X and Y have a 1-to-1 cardinality, even if business logic mentions using one instead of the other (perhaps to avoid redundancy) - why would it hurt to include both X and Y?

    About 3.5
    You mean 3.5 times slower???? so the hash function on 2 integer columns is 3.5 times slower than that on one????

  6. #6
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by getback0 View Post
    why would it hurt to include both X and Y?
    And what would be the purpose of it?

    Quote Originally Posted by getback0 View Post

    You mean 3.5 times slower???? so the hash function on 2 integer columns is 3.5 times slower than that on one????
    You didn't ask about the hash function; you asked about your ETL process. Calculating the hash function value from two columns will, of course, be marginally slower than from just one column, but only you can tell if it will affect your ETL process and if yes, how much.

  7. #7
    Join Date
    Oct 2010
    Posts
    94
    Quote:
    Originally Posted by getback0
    why would it hurt to include both X and Y?

    And what would be the purpose of it?
    Let me try again -

    Let's assume the following -
    - there are 50 queries that need to pull attributes from table T based on column X
    - there are ~ 50 queries that access attributes from table T based on column Y
    - above 2 categories completely describe all kinds of queries run against the table
    - also all queries have both X and Y available (query type #1 above currently scans table T on X alone - it does know the corresponding Y and therefore can include Y in its predicate; similarly query type #2 knows the value of X for the argument Y)

    In light of the above, if we were to modify all queries to include both X and Y every time we query the table and modify the distribution key on T to be (X,Y) instead of X alone that it is currently; would we not benefit from distribution? The original intent of both the query types above will stay the same - but allowing for the redundancy (including Y for argument X and vice versa) would allow both categories to leverage distribution; wouldn't it?

    Also, what was the number 3.5??? could you elaborate?

    Thanks for taking the time out n_i!

    G

  8. #8
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by getback0 View Post
    - there are 50 queries that need to pull attributes from table T based on column X
    - there are ~ 50 queries that access attributes from table T based on column Y
    - above 2 categories completely describe all kinds of queries run against the table
    - also all queries have both X and Y available (query type #1 above currently scans table T on X alone - it does know the corresponding Y and therefore can include Y in its predicate; similarly query type #2 knows the value of X for the argument Y)

    In light of the above, if we were to modify all queries to include both X and Y every time we query the table and modify the distribution key on T to be (X,Y) instead of X alone that it is currently; would we not benefit from distribution?
    It likely would.

    Quote Originally Posted by getback0 View Post
    Also, what was the number 3.5???
    I just made that up; I don't really know how your ETL will be affected.

  9. #9
    Join Date
    Oct 2010
    Posts
    94
    Very helpful!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •