Results 1 to 6 of 6
  1. #1
    Join Date
    Aug 2010
    Posts
    17

    Unanswered: SubQuery Performance

    Hi All,

    I have a poor performance SQL as following. The table has about 200M records, each employee have average 100 records. The query lasts about 3 hours. All I want is to update the flag for highest version of each client's record. Any suggestion is welcome!

    Thanks,

    Mike


    ====SQL===========
    update empTbl A
    set flag=1
    where
    rec_ver =
    ( select max(rec_ver)
    from empTbl
    where empNo = A.empNo)



    ===Table empTbl=====

    empTbl
    {
    int empNo;
    int flag;
    char[256] empDesc;
    int rec_ver;
    }

  2. #2
    Join Date
    May 2009
    Posts
    508
    Provided Answers: 1
    mail4mz, try this (it works in DB2 V9.7)
    Code:
    UPDATE EMPTBL
    SET FLAG = 1
    WHERE (EMPNO, REC_VER) IN(SELECT EMPNO, MAX(REC_VER)
                              FROM EMPTBL
                              GROUP BY EMPNO
                             )
    If you have an index on EMPNO, REC_VER, it would also help.

  3. #3
    Join Date
    Aug 2010
    Posts
    17
    Actually, The subquery "group by" itself takes hour to complete.

    SELECT EMPNO, MAX(REC_VER)
    FROM EMPTBL
    GROUP BY EMPNO

    Why do you think the "group by" subquery will be faster? Is it because my previous subquery is correlated query?


    By the way, I simplized the structure of the table empTbl. There are many more columns in the table. Both EMPNO and REC_VER are part of the primary key and index.

    Thanks!

  4. #4
    Join Date
    May 2009
    Posts
    508
    Provided Answers: 1
    With the correlated subquery, the Max is processed for every row in the table.
    i.e read row 1 and get the Max value for that row's EMPNO.
    read row 2 and get Max value for that row's EMPNO
    repeat until end of table.
    NOTE: I believe DB2 can shortcut this a little if the EMPNO value doesn't change between rows and can use the already generated max value.

    With the non-correlated subquery, the max is processed once.

    Generally, I have found the correlated sub-query work better in most cases since it usually can avoid a sort. A non-correlated sub-query with an IN clause will do a sort to remove duplicates (if there isn't a unique index that can be used to avoid this). If one method doesn't work well, I try the other to see if it would work better.

    PS Depending on the things you 'simplified' out, I am very surprised to hear it takes an hour to process a col, MAX() where both columns are the primary key. However, you do have approximately 1 million distinct EMPNO values (200 million rows divided by 100 rows per EMPNO).

  5. #5
    Join Date
    Aug 2010
    Posts
    3
    Yes. Stealth_DBA is correct.

    Also as said earlier, if columns mentioned in where clause are part of index then it would help much. Make sure, if possible, to use full index instead of partial index.

  6. #6
    Join Date
    Apr 2009
    Posts
    42
    take a look at the following case:
    if you update two columns in a single update and col1 sits far away from col2, meaning in the table creation script, between col1 and col2 there are many columns. in such case, the update will generate much more log than that if col1 sits just besides column 2.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •