Results 1 to 11 of 11
  1. #1
    Join Date
    Apr 2009
    Posts
    42

    Unanswered: create index on which column?

    Hi, have a question about how to create index on which column?
    for example, below sql:

    Code:
    select
        c1
        ,d1
        ,c4
        ,d5
    from
        tableC
        inner join tableD
        on( c1 = d1 
              and c2 = d2 
              and c3 = d3 )
    where
        c4 = ?
        and d4 = ?
    want to build index(s) on tableC or tableD to improve performance.

    Suppose:
    1) tableC lots of records, tabldD few records

    I think it is definitely to create a index on tableC.
    But on which column?
    all columns in the join predicate and in where clause?
    like:
    a)create index idx.c on tableC( c1, c2, c3, c4 )
    b)create index idx.c on tableC( c2, c3, c1, c4 )

    would a) and b) be the same?

    As I know, decide which column to be the first column in the index creation command is very important. But the 2nd place, 3rd place... dont matter.
    That means, a) and b) is different.

    IS MY UNDERSTANDING CORRECT?

    So, how to decide which column to be the 1st place in the index creation command?

    In my understanding, if a column contains the most different values, then this column shoud be the 1st in index.
    for example, if c1 is containing the customer number, c2 only containning the values of male or female.

    I think the 1st column shoule be c1, right?

    Please advise if I am correct on this!

    2)tableC lots of records, tableD lots of records.
    I believe i need to build two indexs, one on tableC and one tableD following previous assumption.
    If my previous assumption is correct, I think its quite easy to build these two indexes.

  2. #2
    Join Date
    May 2009
    Posts
    508
    Provided Answers: 1
    wilsonsv, When I create Indexes I usually start with the Equal predicates in the Where Clause. These will get exact matches with an index lookup. I also consider Sort columns (ORDER BY, GROUP BY, DISTINCT). Join predicates are also important.

    Note: the order of C1, C2, C3 would be from most distinct values to least distinct values. Example, a 1,000 row table with 1,000 distinct values would be the most distinct while 1 distinct value would be least distinct.

    I would try indexes on:

    C4, C1, C2, C3
    D4, D1, D2, D3

    If C1, C2, C3 and/or D1, D2, D3 are Unique, I would try:

    C1, C2, C3, C4
    D1, D2, D3, D4

    You can try 2 indexes on each table with:

    C4
    C1, C2, C3
    D4
    D1, D2, D3

    This may take advantage of RID anding / oring to filter rows.

    One thing you can try, is to create many different indexes, run an Explain and see what DB2 uses.

  3. #3
    Join Date
    Dec 2005
    Posts
    273
    I'd suggest indexes on

    C4, C1, C2, C3
    D4, D1, D2, D3, D5

    that allows DB2 index-only-access on both tables.

    Column C4 resp. D4 in first place.
    Because for one of the two tables only
    c4 = ? resp.
    d4 = ?
    can be used to match the index.

    Indexes in both tables allows the optimizer to choose either of the tables as the outer table and still have matching index access.

  4. #4
    Join Date
    Jan 2007
    Location
    Jena, Germany
    Posts
    2,721
    And I suggest that you run you workload through the Design Advisor.
    Knut Stolze
    IBM DB2 Analytics Accelerator
    IBM Germany Research & Development

  5. #5
    Join Date
    Apr 2009
    Posts
    42
    Quote Originally Posted by stolze
    And I suggest that you run you workload through the Design Advisor.
    I agree Design Advisor is a great help to newbie.
    But I doubt it would give the most efficient indexes.
    Cause I am the one who knows exactly which column would store the most distinct values and which is not.
    So...here is my post asking how to build the indexes.

    Anyway, thx very much for all your sharing.

    Suppose:
    Code:
    select
        c1
        ,d1
        ,c4
        ,d5
    from
        tableC
        inner join tableD
        on(  c2 = d2 
              and c3 = d3 )
    where
        c1 = ?
        and c4 = ?
        and d1 = ?
        and d4 = ?
    There are two predicates for each table.

    As I know, DB2 will ALWAYS filter the rows using the where clause predicates before joining tables. Thus, I believe below sequence it's correct.

    c1, c4, c2, c3 (considering c1 the most distinct and c4 the least)

  6. #6
    Join Date
    May 2009
    Posts
    508
    Provided Answers: 1
    As I know, DB2 will ALWAYS filter the rows using the where clause predicates before joining tables.
    wilsonfv, I would not put ALWAYS in that statement. In general, what DB2 does is:

    Filter rows on one table.
    Join to the other table (probably with a Nested Loop Join)
    Check filtering on other table.

    It uses Statistics on the tables / columns to determine the first table.

    There are some cases where DB2 will:

    Filter on on table
    Filter on the other table
    Do a Merge Scan (or Hybrid) Join of the two filtered results.

    The Nested Loop Join is more common than the Merge Scan Join (from what I have experienced). There are other types of joins (especially in the LUW world) that may filter on both tables first and then Join.

  7. #7
    Join Date
    Jan 2007
    Location
    Jena, Germany
    Posts
    2,721
    Quote Originally Posted by wilsonfv
    I agree Design Advisor is a great help to newbie.
    But I doubt it would give the most efficient indexes.
    I agree. However, such a question from the OP suggests that the basics in understanding how indexes work, are not there yet. So it is good to have a starting point _and_ then understand how indexes work.
    Knut Stolze
    IBM DB2 Analytics Accelerator
    IBM Germany Research & Development

  8. #8
    Join Date
    Apr 2009
    Posts
    42
    Quote Originally Posted by Stealth_DBA
    The Nested Loop Join is more common than the Merge Scan Join (from what I have experienced). There are other types of joins (especially in the LUW world) that may filter on both tables first and then Join.
    Ya, as I know, there are three kinds of operation for joining tables.
    Code:
    Nested-Loop 
    Join A nested-loop join is performed in one of the following two ways:
    
    • Scanning the inner table for each accessed row of the outer table
    • Performing an index lookup on the inner table for each accessed row of the outer table
    Merge Join Merge join, sometimes known as merge scan join or sort merge join, requires a predicate of the form table1.column = table2.column. This is called an equality join predicate. Merge join requires ordered input on the joining columns, either through index access or by sorting. A merge join cannot be used if the join column is a LONG field column or a large object (LOB) column. Hash Join A hash join requires one or more predicates of the form table1.columnX = table2.columnY, for which the column types are the same. For columns of type CHAR, the length must be the same. For columns of type DECIMAL, the precision and scale must be the same. The column type cannot be a LONG field column, or a large object (LOB) column.
    Which is better, i cant tell.
    But as I know, better to avoid hash join.
    (PS: the color for hash join in SQL Explain is brown, I think DB2 is giving a message that this operation is bad. )

    Now, I have another question.
    Once DB2 have filtered records and joined the tables with indexes scan, before return the result, I think DB2 still need to access the tables (consider not all data is in the index) to retrieve the corresponding records' columns, RIGHT?
    Could this access be called as 'table scan'?

    If this is true, could I say that table space is unavoidable?
    (Why I am asking such a question is because my project has to follow a SQL coding standard which says that all table scan must eliminate in SQL Explain. But I think table scan is unavoidable, DB2 dont care about this is a table scan or index scan as long as this operation is faster, then DB2 will use it.)
    Last edited by wilsonfv; 10-27-09 at 13:27.

  9. #9
    Join Date
    May 2009
    Posts
    508
    Provided Answers: 1
    wilsonfv, If an Index is used, you don't have a table space scan. If access to the table is needed, the Index contains the RID of the row and gets the page the row is on.

    You only have a table space scan when an index on that table is not used.

    By the way, while it is a good practice to eliminate as many table space scans as possible, there are times when you can't (and shouldn't) eliminate them.

    And example of a can't is a table with a large number of rows and you are going to return a very large percentage of the rows. In this case, DB2 can decide that using an index would be more work since most of the rows are being accessed anyway. (this is a very general statement and there are other factors involved. Just trying to keep it simple for the explanation.)

    An example of a shouldn't is a very small table. If there is only one page that contains rows and there is an Index on this table (probably just to enforce uniqueness), it would require 2 or 3 get pages to use the Index (Index pages and table page) while a table space scan (of 1 page) only requires 1 get page.

  10. #10
    Join Date
    Feb 2009
    Posts
    114
    Quote Originally Posted by stolze
    And I suggest that you run you workload through the Design Advisor.
    Does BLINK use Design Advisor?

  11. #11
    Join Date
    Jan 2007
    Location
    Jena, Germany
    Posts
    2,721
    Quote Originally Posted by db2dummy1
    Does BLINK use Design Advisor?
    No need for that. IBM Smart Analytics Optimizer (aka DWA aka BLINK) doesn't have indexes or MQTs or different tablespaces or ... You merely identify the tables and the references between those tables (references are comparable to foreign keys and define the join paths). You create the mart with those basic definitions, load the data and that's it.
    Knut Stolze
    IBM DB2 Analytics Accelerator
    IBM Germany Research & Development

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •