Results 1 to 8 of 8
  1. #1
    Join Date
    Apr 2004
    Posts
    64

    Unanswered: Removing duplicates from table based on two columns

    db2 V8 z/os

    Hi,

    I have a requirement to select latest rows from a set of duplicates.

    Table structer is,

    TB1
    (key_A not null
    key_B not null
    ins_ts not null
    upd_ts nullable
    )

    rows are inserted in this table with upd_ts as null but with valid ins_ts(insert timestamp)

    When row is updated, the updated row is inserted again, with upd_ts as updated timestamp. The orignial row still stays in table. So now there are two rows. Assume, even after update, columns key_a and key_b are still same in both rows.

    The requirement is select the latest updated/inserted rows from table. So basically we will use the updated timestamp(upd_ts) to find latest rows and when null, will use the inserted timestamp(ins_ts).

    We can't use the db2 V9 row_number function here since its still in V8. So I used the following query. Since the table is very large(25 GB approx), this doesn't look like an efficient query as several intermediate result tables are used here. Any ideas if any better query is possible?

    SELECT TB.KEY_A, TB.KEY_B, TB.INS_TS,
    TB.TMS FROM
    (SELECT KEY_A, KEY_B,
    ins_ts, COALESCE(UPD_TS,INS_TS) TMS
    FROM creator.table ) TB

    WHERE TB.TMS=(SELECT MAX(TC.TMS)
    FROM
    (SELECT KEY_A, KEY_B,
    COALESCE(UPD_TS,INS_TS) TMS
    FROM creator.table ) TC
    where TC.KEY_A = TB.KEY_A
    and TC.KEY_B = TB.KEY_B
    ) WITH UR;

    Thanks in advance.
    Amar

  2. #2
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Try
    Code:
    SELECT key_a
         , key_b
         , ins_ts
         , upd_ts
     FROM  creator.table t
     WHERE NOT EXISTS
           (SELECT 0
             FROM  creator.table s
             WHERE s.key_a = t.key_a
               AND s.key_b = t.key_b
               AND COALESCE(s.upd_ts , s.ins_ts)
                 > COALESCE(t.upd_ts , t.ins_ts)
           )
     WITH UR
    ;
    Index ( key_a , key_b , upd_ts DESC , ins_ts ) must be useful.
    Note: I'm not sure that DESC and sequence of upd_ts and ins_ts may be effective in performance.

  3. #3
    Join Date
    Apr 2004
    Posts
    64
    Thank you Tonkuma, that is a nice query!

  4. #4
    Join Date
    Apr 2004
    Posts
    64
    Sorry to be bumping this old thread but the query is not working as expected.

    If there are say 10 rows with same key values, I am getting all 10 in output. Not the one with higest timestamp. Query given by Tonkuma above looks right theoretically but not sure why doesn't work as expected. Any ideas?

  5. #5
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    If there are say 10 rows with same key values, I am getting all 10 in output. Not the one with higest timestamp.
    Please publish sample data, your query and results.

  6. #6
    Join Date
    Apr 2004
    Posts
    64
    here sample output,

    key_a, key_b, ins_ts, upd_ts

    A, B, 2011-06-10..., NULL
    A, B, 2011-06-11..., NULL
    A, B, 2011-06-12..., NULL

    I was expecting only the last row with 2011-06-12

    I tried removing coalesce and not using upd_ts and just comparing ins_ts. But still get all teh duplicates. Not sure timestamp compare is not working.

    Just to add, the table is actually a view that UNION ALLs three tables. That should not affect it but just FYI.
    Last edited by xamar; 08-23-12 at 09:52.

  7. #7
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    It worked on DB2 9.7.5 on Windows.

    Code:
    ------------------------------ Commands Entered ------------------------------
    WITH
     sample_data
    ( key_a , key_b , ins_ts , upd_ts ) AS (
    VALUES
      ( 'A' , 'B' , TIMESTAMP('2011-06-10') , CAST(NULL AS TIMESTAMP) )
    , ( 'A' , 'B' , TIMESTAMP('2011-06-11') , CAST(NULL AS TIMESTAMP) )
    , ( 'A' , 'B' , TIMESTAMP('2011-06-12') , CAST(NULL AS TIMESTAMP) )
    )
    SELECT key_a
         , key_b
         , ins_ts
         , upd_ts
     FROM  sample_data t
     WHERE NOT EXISTS
           (SELECT 0
             FROM  sample_data s
             WHERE s.key_a = t.key_a
               AND s.key_b = t.key_b
               AND COALESCE(s.upd_ts , s.ins_ts)
                 > COALESCE(t.upd_ts , t.ins_ts)
           )
     WITH UR
    ;
    ------------------------------------------------------------------------------
    
    KEY_A KEY_B INS_TS                     UPD_TS                    
    ----- ----- -------------------------- --------------------------
    A     B     2011-06-12-00.00.00.000000 -                         
    
      1 record(s) selected.
    I want to see your CREATE TABLE statement and INSERT statement of sample data for the table.


    By the way,
    you wrote
    When row is updated, the updated row is inserted again, with upd_ts as updated timestamp. The orignial row still stays in table. So now there are two rows.
    Assume, even after update, columns key_a and key_b are still same in both rows.
    Why are there more than one NULL upd_ts for same key_a and key_b?

  8. #8
    Join Date
    Apr 2004
    Posts
    64
    Quote Originally Posted by tonkuma View Post
    It worked on DB2 9.7.5 on Windows.

    Code:
    ------------------------------ Commands Entered ------------------------------
    WITH
     sample_data
    ( key_a , key_b , ins_ts , upd_ts ) AS (
    VALUES
      ( 'A' , 'B' , TIMESTAMP('2011-06-10') , CAST(NULL AS TIMESTAMP) )
    , ( 'A' , 'B' , TIMESTAMP('2011-06-11') , CAST(NULL AS TIMESTAMP) )
    , ( 'A' , 'B' , TIMESTAMP('2011-06-12') , CAST(NULL AS TIMESTAMP) )
    )
    SELECT key_a
         , key_b
         , ins_ts
         , upd_ts
     FROM  sample_data t
     WHERE NOT EXISTS
           (SELECT 0
             FROM  sample_data s
             WHERE s.key_a = t.key_a
               AND s.key_b = t.key_b
               AND COALESCE(s.upd_ts , s.ins_ts)
                 > COALESCE(t.upd_ts , t.ins_ts)
           )
     WITH UR
    ;
    ------------------------------------------------------------------------------
    
    KEY_A KEY_B INS_TS                     UPD_TS                    
    ----- ----- -------------------------- --------------------------
    A     B     2011-06-12-00.00.00.000000 -                         
    
      1 record(s) selected.
    I want to see your CREATE TABLE statement and INSERT statement of sample data for the table.


    By the way,
    you wrote

    Why are there more than one NULL upd_ts for same key_a and key_b?
    The table is actually a view that is UNION ALL of three different tables. Its user data so I have no idea why they have more rows with same key values but no upd_ts.

    I can't replicate the problem in another system. Even comparing only ins_ts I get duplicate values in source system. But not in my test system where I have created a simple table. In my simple table in test system the query works fine.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •