Page 1 of 2 12 LastLast
Results 1 to 15 of 21
  1. #1
    Join Date
    Jul 2010
    Posts
    21

    Unanswered: Display duplicate records

    Hi all,

    I have a table which looks like this:

    id | t_stamp
    --------------------------------
    1110 | 2010-05-16 12:02:18
    1111 | 2010-05-16 12:12:45
    1122 | 2010-05-16 12:12:06
    1122 | 2010-05-16 12:13:45
    1122 | 2010-05-16 12:17:52
    1122 | 2010-05-16 12:18:08
    1121 | 2010-05-16 12:20:33
    1121 | 2010-05-16 12:21:23
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    1121 | 2010-05-16 12:33:49
    1110 | 2010-05-16 12:35:51
    1136 | 2010-05-16 12:47:10
    1133 | 2010-05-16 12:48:09
    1137 | 2010-05-16 12:49:10
    1133 | 2010-05-16 12:53:09

    Assume that this data is captured for interval one hour, from 12:00:00 to 13:00:00.
    first, i want select all duplicate id for interval one hour, for every 30mins, and for every 15mins.

    For one hour, the data should look like this:

    id | t_stamp
    --------------------------------
    1110 | 2010-05-16 12:02:18
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    1110 | 2010-05-16 12:35:51
    1121 | 2010-05-16 12:20:33
    1121 | 2010-05-16 12:21:23
    1121 | 2010-05-16 12:33:49
    1122 | 2010-05-16 12:12:06
    1122 | 2010-05-16 12:13:45
    1122 | 2010-05-16 12:17:52
    1122 | 2010-05-16 12:18:08
    1133 | 2010-05-16 12:48:09
    1133 | 2010-05-16 12:53:09

    For every 30 minutes, select duplicate id.

    id | t_stamp
    --------------------------------
    1110 | 2010-05-16 12:02:18 }
    1110 | 2010-05-16 12:21:36 }
    1110 | 2010-05-16 12:23:56 }
    1121 | 2010-05-16 12:20:33 }
    1121 | 2010-05-16 12:21:23 } 1st 30 minutes (1-30)
    1122 | 2010-05-16 12:12:06 }
    1122 | 2010-05-16 12:13:45 }
    1122 | 2010-05-16 12:17:52 }
    1122 | 2010-05-16 12:18:08 }
    ---------------------------------
    1133 | 2010-05-16 12:48:09 } 2nd 30 minutes (30-60)
    1133 | 2010-05-16 12:53:09 }

    For every 15 minutes, select duplicate id

    id | t_stamp
    --------------------------------
    1122 | 2010-05-16 12:12:06 } 1st 15 minutes (1-15)
    1122 | 2010-05-16 12:13:45 }
    ---------------------------------
    1122 | 2010-05-16 12:17:52 }
    1122 | 2010-05-16 12:18:08 }
    1110 | 2010-05-16 12:21:36 }
    1110 | 2010-05-16 12:23:56 } 2nd 15 minutes (15-30)
    1121 | 2010-05-16 12:20:33 }
    1121 | 2010-05-16 12:21:23 }
    ---------------------------------
    } 3rd 15 minutes (30-45)
    ---------------------------------
    1133 | 2010-05-16 12:48:09 } 4th 15 minutes (45-60)
    1133 | 2010-05-16 12:53:09 }


    for 1 hour, i can do with:
    Code:
    SELECT id, t_stamp FROM log
    WHERE id IN (SELECT id FROM log GROUP BY id HAVING COUNT(*)>1) AND t_stamp
    BETWEEN date_trunc('hour',current_timestamp - interval '1 hour')
    AND date_trunc('hour',current_timestamp)
    Now, i try to do for every 30minutes and 15 minutes but if i do something like -interval '30 minutes', this will only show data for the first 30 minutes.

    Help me to figure this out. thank u very much.

    Regards,
    nbtet
    Last edited by nbtet; 07-27-10 at 23:36.

  2. #2
    Join Date
    Dec 2007
    Location
    Richmond, VA
    Posts
    1,328
    Provided Answers: 5
    instead of doing the math on current_timestamp, how about doing it with your tstamp column?
    Dave

  3. #3
    Join Date
    Jul 2010
    Posts
    21
    Hi Dave,

    thanks for the reply. I'm not familiar with the date time function. So far, i can only display duplicate for 1 hour by using the sql statement i stated earlier. and i'm not sure how to do it with the t_stamp column.
    my data is actually synchronized for every one hour from other server and copied into postgres. by the interval 1 hour, i should perform checking on duplicate id for every 30minutes and 15minutes.
    I really need help on the SQL statement for this. thanks.


    Regards,
    nbtet

  4. #4
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Not sure I understand your examples, but what about this:

    Code:
    select id, t_stamp
    from log
    where id in (select id 
                 from log 
                 group by id, extract(minute from t_stamp)::int / 30
                 having count(*) > 1)
    order by id, t_stamp;
    If you need to group by 15 minutes just change the divider in the sub-query

    And please for your next post:
    use [code] tags to format sample data and your queries. Thank you

  5. #5
    Join Date
    Jul 2010
    Posts
    21
    Hi Shammat,

    Assume that I have a table looks like this:
    Code:
    id   | t_stamp
    -----+---------------------
    1137 | 2010-05-16 11:49:31
    1110 | 2010-05-16 12:02:18
    1111 | 2010-05-16 12:12:45
    1122 | 2010-05-16 12:12:06
    1122 | 2010-05-16 12:13:45
    1122 | 2010-05-16 12:17:52
    1122 | 2010-05-16 12:18:08
    1121 | 2010-05-16 12:20:33
    1121 | 2010-05-16 12:21:23
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    1121 | 2010-05-16 12:33:49
    1110 | 2010-05-16 12:35:51
    1136 | 2010-05-16 12:47:10
    1133 | 2010-05-16 12:48:09
    1137 | 2010-05-16 12:49:10
    1133 | 2010-05-16 12:53:09
    1133 | 2010-05-16 13:05:33
    1133 | 2010-05-16 13:23:57
    1127 | 2010-05-16 13:44:00

    Lets consider the current_timestamp is '2010-05-16 13:44:00'. Now, I would like to display duplicate id for:
    1) interval 1 hour (12:00:00 - 13:00:00)
    2) Time period of every 30 minutes (0mins-30mins, 30mins-60mins) within 12:00:00 - 13:00:00
    3) TIme period of every 15 minutes (0mins-15mins, 15mins-30mins, 30mins-45mins and 45mins-60mins) within 12:00:00 - 13:00:00

    For (1), I have no problem to select the duplicate id.
    For (2) and (3), I have no idea how to do this. The data should be seen like this (order by id):


    (2).
    Code:
    id   | t_stamp
    -----+---------------------
    1110 | 2010-05-16 12:02:18
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    1121 | 2010-05-16 12:20:33
    1121 | 2010-05-16 12:21:23
    1122 | 2010-05-16 12:12:06
    1122 | 2010-05-16 12:13:45
    1122 | 2010-05-16 12:17:52
    1122 | 2010-05-16 12:18:08
    
    1133 | 2010-05-16 12:48:09
    1133 | 2010-05-16 12:53:09

    (3).
    Code:
    id   | t_stamp
    -----+---------------------
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    
    1121 | 2010-05-16 12:20:33
    1121 | 2010-05-16 12:21:23
    
    1122 | 2010-05-16 12:12:06
    1122 | 2010-05-16 12:13:45
    
    1122 | 2010-05-16 12:17:52
    1122 | 2010-05-16 12:18:08
    
    1133 | 2010-05-16 12:48:09
    1133 | 2010-05-16 12:53:09

    Notice that id 1110 actually got 4 duplicates in 1 hour, but for time period of 15 mins, only two will be considered as duplicate.

    Code:
    id   | t_stamp
    -----+---------------------
    1110 | 2010-05-16 12:02:18 >> omitted since not within the minute of 15-30
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    1110 | 2010-05-16 12:35:51 >> omitted since not within the minute of 15-30
    I have tried with the your code but i got syntax error. So, i try to do something like this:
    Code:
    SELECT id, t_stamp FROM log
    WHERE id IN (SELECT id FROM log 
                 WHERE t_stamp BETWEEN DATE_TRUNC('hour',timestamp '2010-05-16 13:44:00' - interval '1 hour') 
                 AND DATE_TRUNC('hour', timestamp '2010-05-16 13:44:00')
                 GROUP BY id, 
                 (SELECT(DATE_TRUNC('hour',timestamp '2010-05-16 13:44:00') -date_trunc('hour', timestamp '2010-05-16 13:44:00' - interval '1 hour'))/4) 
                 having count(*) > 1) 
    order by id;
    So far, this code works well, but it doesn't give what i want. perhaps, you could suggest me a better solution. thanks again!


    Regards,
    nbtet
    Last edited by nbtet; 07-28-10 at 04:07.

  6. #6
    Join Date
    Jul 2010
    Posts
    21
    For better understanding:

    Select id, t_stamp From log
    Where id has duplicate [count(id)>1] and within a time period of every 15mins
    Code:
    	    12:00:00		              13:00:00
    Time Period   .00-----.15-----.30------.45------.60
    		|1110	| 1122 	| 1121 	|  1136  | 
    		|1111	| 1122 	| 1110 	|  1133  | 
    		|1122	| 1121 	|	|  1137  | 
    		|1122	| 1121 	| 	|  1133  | 
    		|       | 1110 	| 	|        | 
    		|       | 1110 	|       |        |

    Red indicates duplicate id within time period from 12:00:00 to 13:00:00.

    Should only select the red color:
    Code:
    id | t_stamp
    --------------------------------
    1110 | 2010-05-16 12:02:18
    1111 | 2010-05-16 12:12:45
    1122 | 2010-05-16 12:12:06
    1122 | 2010-05-16 12:13:45
    
    1122 | 2010-05-16 12:17:52
    1122 | 2010-05-16 12:18:08
    1121 | 2010-05-16 12:20:33
    1121 | 2010-05-16 12:21:23
    1110 | 2010-05-16 12:21:36
    1110 | 2010-05-16 12:23:56
    
    1121 | 2010-05-16 12:33:49
    1110 | 2010-05-16 12:35:51
    1136 | 2010-05-16 12:47:10
    1133 | 2010-05-16 12:48:09
    1137 | 2010-05-16 12:49:10
    1133 | 2010-05-16 12:53:09
    Last edited by nbtet; 07-28-10 at 01:04.

  7. #7
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by nbtet View Post
    Perhaps, you could suggest me a better solution. thanks again!
    So my query does not what you want?

  8. #8
    Join Date
    Jul 2010
    Posts
    21
    Hi again,

    Code:
    select id, t_stamp
    from log
    where id in (select id 
                 from log 
                 group by id, extract(minute from timestamp '2010-05-18 13:44:00')::int / 30
                 having count(*) > 1) 
    and t_stamp between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
    and date_trunc('hour', timestamp '2010-05-18 13:44:00')
    order by id, t_stamp;
    I modified your query to this and it display non-duplicate id as well. So, now i'm trying to do something like this:
    Code:
    select id, t_stamp 
    from log
    where id in (select id from log 
                  where t_stamp between date_trunc('hour', timestamp '2010-05-18 13:44:00' - interval '1 hour') 
                  and (date_trunc('hour', timestamp '2010-05-18 13:44:00' - interval '1 hour') + interval '15 minutes')
                  group by id having count(*) > 1) 
    and t_stamp BETWEEN date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
    AND date_trunc('hour', timestamp '2010-05-18 13:44:00') order by id;
    However, this query only select duplicate id between 12.00 to 12.15. I also want to select the duplicate id within 12.15-12.30, 12.30-12.45 and 12.45 to 13.00. Any suggestion?

  9. #9
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by nbtet View Post
    However, this query only select duplicate id between 12.00 to 12.15. I also want to select the duplicate id within 12.15-12.30, 12.30-12.45 and 12.45 to 13.00. Any suggestion?
    Try this
    Code:
    select id, t_stamp
    from (
      select id, 
             t_stamp, 
             count(*) over (partition by id, extract(minute from t_stamp)::int / 15) as num_dupes
      from log
    ) t
    where num_dupes > 1
    order by t_stamp

  10. #10
    Join Date
    Jul 2010
    Posts
    21
    Hi,

    The query returns the same result as before
    anyway, can you do something like grouping the 1 hour into every 15minutes? there should be 4 group i guess:
    0-15mins, 15-30mins, 30-45mins and 45-60mins. I think this is the main reason why i failed.
    for getting the duplicate id is not a problem, but how to get the duplicate id which is both within the time period.


    Thanks.
    Last edited by nbtet; 07-28-10 at 06:57.

  11. #11
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by nbtet View Post
    The query returns the same result as before
    Hmm, I tested that with your sample data and it returned exactly what you wanted.
    Code:
    postgres=> select * from log;
      id  |       t_stamp
    ------+---------------------
     1110 | 2010-05-16 12:02:18
     1111 | 2010-05-16 12:12:45
     1122 | 2010-05-16 12:12:06
     1122 | 2010-05-16 12:13:45
     1122 | 2010-05-16 12:17:52
     1122 | 2010-05-16 12:18:08
     1121 | 2010-05-16 12:20:33
     1121 | 2010-05-16 12:21:23
     1110 | 2010-05-16 12:21:36
     1110 | 2010-05-16 12:23:56
     1121 | 2010-05-16 12:33:49
     1110 | 2010-05-16 12:35:51
     1136 | 2010-05-16 12:47:10
     1133 | 2010-05-16 12:48:09
     1137 | 2010-05-16 12:49:10
     1133 | 2010-05-16 12:53:09
    (16 rows)
    
    postgres=> select id, t_stamp
    postgres-> from (
    postgres(>   select id,
    postgres(>          t_stamp,
    postgres(>          count(*) over (partition by id, extract(minute from t_stamp)::int / 15) as num_dupes
    postgres(>   from log
    postgres(> ) t
    postgres-> where num_dupes > 1
    postgres-> order by t_stamp
    postgres-> ;
      id  |       t_stamp
    ------+---------------------
     1122 | 2010-05-16 12:12:06
     1122 | 2010-05-16 12:13:45
     1122 | 2010-05-16 12:17:52
     1122 | 2010-05-16 12:18:08
     1121 | 2010-05-16 12:20:33
     1121 | 2010-05-16 12:21:23
     1110 | 2010-05-16 12:21:36
     1110 | 2010-05-16 12:23:56
     1133 | 2010-05-16 12:48:09
     1133 | 2010-05-16 12:53:09
    (10 rows)
    
    postgres=>
    These are exactly the rows that you marked red in post #6

  12. #12
    Join Date
    Jul 2010
    Posts
    21
    Well, i actually modify the code u gave me. If i run exactly the same code,
    Code:
    select user_id, t_stamp
    from (
      select user_id, 
             t_stamp, 
             count(*) over (partition by user_id, extract(minute from t_stamp)::int / 15) as num_dupes
      from click_log
    ) t
    where t_stamp between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
    and date_trunc('hour', timestamp '2010-05-18 13:44:00') and num_dupes > 1 
    order by t_stamp;
    i got this error:
    ERROR: function pg_catalog.date_part("unknown", text) does not exist
    LINE 5: count(*) over (partition by user_id, extract(minute...
    ^
    HINT: No function matches the given name and argument types. You may need to add explicit type casts.
    Last edited by nbtet; 07-28-10 at 07:07.

  13. #13
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by nbtet View Post
    Well, i actually modify the code u gave me
    Then don't say it's not working

    i got this error
    I guess you are using an old Postgres version then.
    As you did not state your version I assumed you are using the most recent one (8.4)

    Sorry, I have not idea how to do that with an old Postgres version.

  14. #14
    Join Date
    Jul 2010
    Posts
    21
    Quote Originally Posted by nbtet View Post
    So far, this code works well, but it doesn't give what i want. perhaps, you could suggest me a better solution. thanks again!
    I'm not saying your code is not working. its only that i cant get what it supposed to be.
    Can we try this once again.
    the data is captured from 2010-05-18 11:41:20 to 2010-05-18 13:51:24. I need to get duplicate id from 12:00:00 to 13:00:00 only, and duplicates should be within every 15minutes. I really appreciate if you can help.

    Code:
    CREATE TABLE click_log (user_id integer, t_stamp timestamp without time zone);
    btw, i'm using greenplum database.
    thanks.
    Last edited by nbtet; 07-30-10 at 06:49.

  15. #15
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by nbtet View Post
    i'm using greenplum database.
    To which Postgres version does this match?

    My solution with analytical function returns exactly what you want, but I have no idea how to do that without.

    Your table definition "user_id integer" does not match your sample data (e.g. "ea4855b32dfe7547bd829ee5717cfef7") so I changed the table definition.

    Here is the result when I run it on Postgres 8.4
    Code:
    postgres=>
    postgres=> select user_id, t_stamp
    postgres-> from (
    postgres(>   select user_id,
    postgres(>          t_stamp,
    postgres(>          count(*) over (partition by user_id, extract(minute from t_stamp)::int / 15) as num_dupes
    postgres(>   from click_log
    postgres(> ) t
    postgres-> where t_stamp
    postgres->    between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour')
    postgres->            and date_trunc('hour', timestamp '2010-05-18 13:44:00')
    postgres-> and num_dupes > 1
    postgres-> order by t_stamp;
                 user_id              |       t_stamp
    ----------------------------------+---------------------
     8b842d2c5ca36e194d7b9485f5f00d90 | 2010-05-18 12:01:28
     ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:01:49
     ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:06:14
     ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:10:33
     8b842d2c5ca36e194d7b9485f5f00d90 | 2010-05-18 12:12:33
     d184bd64be648b63dc755ff779482206 | 2010-05-18 12:31:50
     d184bd64be648b63dc755ff779482206 | 2010-05-18 12:31:55
     02007f0b1e6f70a06d6fed9314d13726 | 2010-05-18 12:41:49
     470c3a0c10089276cad4f3e491f78cf4 | 2010-05-18 12:45:40
     470c3a0c10089276cad4f3e491f78cf4 | 2010-05-18 12:45:46
    (10 rows)
    As far as I can tell, this is exactly what you want.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •