Results 1 to 6 of 6
  1. #1
    Join Date
    Sep 2013
    Posts
    2

    Unanswered: DB2 Index Hell - 140 Million rows

    Hi, I have spent all day coming up with an index strategy to this problem. We have a table of about 140 million rows - each row has a CREATE_TS and STATUS. The STATUS is NULL initially but it will change as we process records. There are about 4 different statuses.

    I want to be able to get to documents based on timestamp and status. Here are the 4 scenarios. All queries are run with OPTIMIZE FOR 1000 ROWS
    1. WHERE STATUS IS NULL ORDER BY CREATED_TS ASC
    2. WHERE STATUS IS NULL AND CREATED_TS > timestamp(...) ORDER BY CREATED_TS ASC
    3. WHERE STATUS = 'XXX' AND CREATED_TS < timestamp(...) ORDER BY CREATED_TS ASC
    4. WHERE STATUS = 'XXX' AND CREATED_TS < timestamp(...) AND CREATED_TS > timestamp(...) ORDER BY CREATED_TS ASC


    Attempt 1: Composite Index (STATUS, CREATED_TS)
    DB2 refused to use this index atleast initially (maybe because STATUS column has low cardinality?)

    Attempt 2: Composite Index (CREATED_TS, STATUS)
    This index is being user by all queries - Query 1 performs well, Query 3 performs very slow even though there are no rows returned. Is this because it is sorting by CREATED_TS and then scanning all the rows until it found the records with a particular STATUS? I feel like this can get worse once the statuses change and we are looking for a particular STATUS.

    Attempt 3: Two separate indexes one on STATUS and one on CREATED_TS
    Surprisingly, Query 1 took forever to return.

    Any suggestions appreciated.
    Thanks
    Last edited by javadude; 09-25-13 at 17:51. Reason: formatting

  2. #2
    Join Date
    Apr 2012
    Posts
    1,035
    Provided Answers: 18
    db2 version and fixpack?
    operating system platform ?
    what runstats options used after each index created?
    table is partitioned-by-range?
    table is distributed-by-hash?

  3. #3
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Additional to db2mor's questions,

    I'm interested in ...

    What was the DDL for the table? What index(es) did you created?

    Did you inspect explain result?
    For example:
    In the following situation,
    what access path was chosen?
    where in the access path did took most time?
    Attempt 2: Composite Index (CREATED_TS, STATUS)
    This index is being user by all queries - Query 1 performs well, Query 3 performs very slow even though there are no rows returned.
    The difference from Query 4 with Query 3 was additional condition "AND CREATED_TS > timestamp(...)" only.
    That might influence on selecting access path.
    But, once an access path was chosen, execution performance of similar access path(s) might be not so different except filter factors.
    So, it might be a good practice to see/compare each filter factors( or number of subject rows(or may be selected rows? I'm not sure) ) in each access.
    Last edited by tonkuma; 09-25-13 at 19:28.

  4. #4
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    What DB2 version (and platform) are you using now and what it was when the indexes were created? If we're talking DB2 LUW, the ALLOW REVERSE SCANS became the default since DB2 9.7 I believe, until then it was an option. Enabling reverse scans should help with the "less than" index scan.
    ---
    "It does not work" is not a valid problem statement.

  5. #5
    Join Date
    Sep 2013
    Posts
    2
    Hey guys, DB2 version is v9.7.0.5 and fixpack 5 running on AIX. The table is not partitioned.

    All indexes are created using COLLECT DETAILED STATISTICS COMPRESS NO ALLOW REVERSE SCANS;

    We did look at the explain plans. All plans are following a similar path below (below is for query 1)

    Code:
    Total Cost:             4.73144e+07
            Query Degree:           1
    
                      Rows
                     RETURN
                     (   1)
                      Cost
                       I/O
                       |
                   3.59168e+07
                     FETCH
                     (   2)
                   4.73144e+07
                   8.90812e+06
                  /----+-----\
            3.59168e+07    7.57594e+07
              IXSCAN     TABLE: OS04USER
              (   3)       DOCVERSION
              302807           Q4
              113445
                |
            7.57594e+07
          INDEX: OS04USER
         IDX_CREATE_TS_STS
                Q4
    Here's the behavior we noticed.
    Let's imagine oldest million docs have a STATUS of X and the rest of them have Y:

    WHERE STATUS='X' ORDER BY CRE_TS ASC OPTIMIZE FOR 1000 ROWS - takes about 0.2 seconds
    WHERE STATUS='Y' ORDER BY CRE_TS ASC OPTIMIZE FOR 1000 ROWS - takes a little over 1 second

    This behavior scares me because - when we update other 10 mill documents with X (oldest to newest) - Im worried the query is going to degrade in performance.

    Now, again the index in the explain plan above is (CREATE_TS, STATUS). It is not using the (STATUS, CREATE_TS) index although i would think it would perform better with the latter - (as we always filter by status and sort by CREATE_TS)

    Thanks
    Last edited by javadude; 09-26-13 at 17:46.

  6. #6
    Join Date
    Dec 2007
    Location
    Richmond, VA
    Posts
    1,328
    Provided Answers: 5
    The index starting with STATUS will not be helpful in a lot of cases, but really depends on your data. I have created indexes on fields that only have 2 values(Y/N, T/F,ETC...), but the data was heavily skewed. For instance, I only want to see the N for "not processed" and less than 1% of my data currently is in that state.
    [QUOTE]This behavior scares me because - when we update other 10 mill documents with X (oldest to newest) - Im worried the query is going to degrade in performance./QUOTE]
    This statement scares me. Why would you want to select 11 million rows? You are correct, your performance will degrade. You may not want to continue optimizing for 1000 rows, best to let DB2 know that you want millions. or change the SQL so that you only get 1000, "FETCH FIRST 1000 ROWS ONLY"

    Dave

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •