If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > PostgreSQL > Display duplicate records

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 07-27-10, 05:23
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Display duplicate records

Hi all,

I have a table which looks like this:

id | t_stamp
--------------------------------
1110 | 2010-05-16 12:02:18
1111 | 2010-05-16 12:12:45
1122 | 2010-05-16 12:12:06
1122 | 2010-05-16 12:13:45
1122 | 2010-05-16 12:17:52
1122 | 2010-05-16 12:18:08
1121 | 2010-05-16 12:20:33
1121 | 2010-05-16 12:21:23
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56
1121 | 2010-05-16 12:33:49
1110 | 2010-05-16 12:35:51
1136 | 2010-05-16 12:47:10
1133 | 2010-05-16 12:48:09
1137 | 2010-05-16 12:49:10
1133 | 2010-05-16 12:53:09

Assume that this data is captured for interval one hour, from 12:00:00 to 13:00:00.
first, i want select all duplicate id for interval one hour, for every 30mins, and for every 15mins.

For one hour, the data should look like this:

id | t_stamp
--------------------------------
1110 | 2010-05-16 12:02:18
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56
1110 | 2010-05-16 12:35:51
1121 | 2010-05-16 12:20:33
1121 | 2010-05-16 12:21:23
1121 | 2010-05-16 12:33:49
1122 | 2010-05-16 12:12:06
1122 | 2010-05-16 12:13:45
1122 | 2010-05-16 12:17:52
1122 | 2010-05-16 12:18:08
1133 | 2010-05-16 12:48:09
1133 | 2010-05-16 12:53:09

For every 30 minutes, select duplicate id.

id | t_stamp
--------------------------------
1110 | 2010-05-16 12:02:18 }
1110 | 2010-05-16 12:21:36 }
1110 | 2010-05-16 12:23:56 }
1121 | 2010-05-16 12:20:33 }
1121 | 2010-05-16 12:21:23 } 1st 30 minutes (1-30)
1122 | 2010-05-16 12:12:06 }
1122 | 2010-05-16 12:13:45 }
1122 | 2010-05-16 12:17:52 }
1122 | 2010-05-16 12:18:08 }
---------------------------------
1133 | 2010-05-16 12:48:09 } 2nd 30 minutes (30-60)
1133 | 2010-05-16 12:53:09 }

For every 15 minutes, select duplicate id

id | t_stamp
--------------------------------
1122 | 2010-05-16 12:12:06 } 1st 15 minutes (1-15)
1122 | 2010-05-16 12:13:45 }
---------------------------------
1122 | 2010-05-16 12:17:52 }
1122 | 2010-05-16 12:18:08 }
1110 | 2010-05-16 12:21:36 }
1110 | 2010-05-16 12:23:56 } 2nd 15 minutes (15-30)
1121 | 2010-05-16 12:20:33 }
1121 | 2010-05-16 12:21:23 }
---------------------------------
} 3rd 15 minutes (30-45)
---------------------------------
1133 | 2010-05-16 12:48:09 } 4th 15 minutes (45-60)
1133 | 2010-05-16 12:53:09 }


for 1 hour, i can do with:
Code:
SELECT id, t_stamp FROM log
WHERE id IN (SELECT id FROM log GROUP BY id HAVING COUNT(*)>1) AND t_stamp
BETWEEN date_trunc('hour',current_timestamp - interval '1 hour')
AND date_trunc('hour',current_timestamp)
Now, i try to do for every 30minutes and 15 minutes but if i do something like -interval '30 minutes', this will only show data for the first 30 minutes.

Help me to figure this out. thank u very much.

Regards,
nbtet

Last edited by nbtet; 07-27-10 at 22:36.
Reply With Quote
  #2 (permalink)  
Old 07-27-10, 08:14
dav1mo dav1mo is offline
Registered User
 
Join Date: Dec 2007
Location: Richmond, VA
Posts: 782
instead of doing the math on current_timestamp, how about doing it with your tstamp column?
Dave
Reply With Quote
  #3 (permalink)  
Old 07-27-10, 10:28
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Hi Dave,

thanks for the reply. I'm not familiar with the date time function. So far, i can only display duplicate for 1 hour by using the sql statement i stated earlier. and i'm not sure how to do it with the t_stamp column.
my data is actually synchronized for every one hour from other server and copied into postgres. by the interval 1 hour, i should perform checking on duplicate id for every 30minutes and 15minutes.
I really need help on the SQL statement for this. thanks.


Regards,
nbtet
Reply With Quote
  #4 (permalink)  
Old 07-27-10, 11:00
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 2,407
Not sure I understand your examples, but what about this:

Code:
select id, t_stamp
from log
where id in (select id 
             from log 
             group by id, extract(minute from t_stamp)::int / 30
             having count(*) > 1)
order by id, t_stamp;
If you need to group by 15 minutes just change the divider in the sub-query

And please for your next post:
use [code] tags to format sample data and your queries. Thank you
Reply With Quote
  #5 (permalink)  
Old 07-27-10, 22:26
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Hi Shammat,

Assume that I have a table looks like this:
Code:
id   | t_stamp
-----+---------------------
1137 | 2010-05-16 11:49:31
1110 | 2010-05-16 12:02:18
1111 | 2010-05-16 12:12:45
1122 | 2010-05-16 12:12:06
1122 | 2010-05-16 12:13:45
1122 | 2010-05-16 12:17:52
1122 | 2010-05-16 12:18:08
1121 | 2010-05-16 12:20:33
1121 | 2010-05-16 12:21:23
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56
1121 | 2010-05-16 12:33:49
1110 | 2010-05-16 12:35:51
1136 | 2010-05-16 12:47:10
1133 | 2010-05-16 12:48:09
1137 | 2010-05-16 12:49:10
1133 | 2010-05-16 12:53:09
1133 | 2010-05-16 13:05:33
1133 | 2010-05-16 13:23:57
1127 | 2010-05-16 13:44:00

Lets consider the current_timestamp is '2010-05-16 13:44:00'. Now, I would like to display duplicate id for:
1) interval 1 hour (12:00:00 - 13:00:00)
2) Time period of every 30 minutes (0mins-30mins, 30mins-60mins) within 12:00:00 - 13:00:00
3) TIme period of every 15 minutes (0mins-15mins, 15mins-30mins, 30mins-45mins and 45mins-60mins) within 12:00:00 - 13:00:00

For (1), I have no problem to select the duplicate id.
For (2) and (3), I have no idea how to do this. The data should be seen like this (order by id):


(2).
Code:
id   | t_stamp
-----+---------------------
1110 | 2010-05-16 12:02:18
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56
1121 | 2010-05-16 12:20:33
1121 | 2010-05-16 12:21:23
1122 | 2010-05-16 12:12:06
1122 | 2010-05-16 12:13:45
1122 | 2010-05-16 12:17:52
1122 | 2010-05-16 12:18:08

1133 | 2010-05-16 12:48:09
1133 | 2010-05-16 12:53:09

(3).
Code:
id   | t_stamp
-----+---------------------
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56

1121 | 2010-05-16 12:20:33
1121 | 2010-05-16 12:21:23

1122 | 2010-05-16 12:12:06
1122 | 2010-05-16 12:13:45

1122 | 2010-05-16 12:17:52
1122 | 2010-05-16 12:18:08

1133 | 2010-05-16 12:48:09
1133 | 2010-05-16 12:53:09

Notice that id 1110 actually got 4 duplicates in 1 hour, but for time period of 15 mins, only two will be considered as duplicate.

Code:
id   | t_stamp
-----+---------------------
1110 | 2010-05-16 12:02:18 >> omitted since not within the minute of 15-30
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56
1110 | 2010-05-16 12:35:51 >> omitted since not within the minute of 15-30
I have tried with the your code but i got syntax error. So, i try to do something like this:
Code:
SELECT id, t_stamp FROM log
WHERE id IN (SELECT id FROM log 
             WHERE t_stamp BETWEEN DATE_TRUNC('hour',timestamp '2010-05-16 13:44:00' - interval '1 hour') 
             AND DATE_TRUNC('hour', timestamp '2010-05-16 13:44:00')
             GROUP BY id, 
             (SELECT(DATE_TRUNC('hour',timestamp '2010-05-16 13:44:00') -date_trunc('hour', timestamp '2010-05-16 13:44:00' - interval '1 hour'))/4) 
             having count(*) > 1) 
order by id;
So far, this code works well, but it doesn't give what i want. perhaps, you could suggest me a better solution. thanks again!


Regards,
nbtet

Last edited by nbtet; 07-28-10 at 03:07.
Reply With Quote
  #6 (permalink)  
Old 07-28-10, 00:00
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
For better understanding:

Select id, t_stamp From log
Where id has duplicate [count(id)>1] and within a time period of every 15mins
Code:
	    12:00:00		              13:00:00
Time Period   .00-----.15-----.30------.45------.60
		|1110	| 1122 	| 1121 	|  1136  | 
		|1111	| 1122 	| 1110 	|  1133  | 
		|1122	| 1121 	|	|  1137  | 
		|1122	| 1121 	| 	|  1133  | 
		|       | 1110 	| 	|        | 
		|       | 1110 	|       |        |

Red indicates duplicate id within time period from 12:00:00 to 13:00:00.

Should only select the red color:
Code:
id | t_stamp
--------------------------------
1110 | 2010-05-16 12:02:18
1111 | 2010-05-16 12:12:45
1122 | 2010-05-16 12:12:06
1122 | 2010-05-16 12:13:45

1122 | 2010-05-16 12:17:52
1122 | 2010-05-16 12:18:08
1121 | 2010-05-16 12:20:33
1121 | 2010-05-16 12:21:23
1110 | 2010-05-16 12:21:36
1110 | 2010-05-16 12:23:56

1121 | 2010-05-16 12:33:49
1110 | 2010-05-16 12:35:51
1136 | 2010-05-16 12:47:10
1133 | 2010-05-16 12:48:09
1137 | 2010-05-16 12:49:10
1133 | 2010-05-16 12:53:09

Last edited by nbtet; 07-28-10 at 00:04.
Reply With Quote
  #7 (permalink)  
Old 07-28-10, 02:15
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 2,407
Quote:
Originally Posted by nbtet View Post
Perhaps, you could suggest me a better solution. thanks again!
So my query does not what you want?
Reply With Quote
  #8 (permalink)  
Old 07-28-10, 04:57
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Hi again,

Code:
select id, t_stamp
from log
where id in (select id 
             from log 
             group by id, extract(minute from timestamp '2010-05-18 13:44:00')::int / 30
             having count(*) > 1) 
and t_stamp between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
and date_trunc('hour', timestamp '2010-05-18 13:44:00')
order by id, t_stamp;
I modified your query to this and it display non-duplicate id as well. So, now i'm trying to do something like this:
Code:
select id, t_stamp 
from log
where id in (select id from log 
              where t_stamp between date_trunc('hour', timestamp '2010-05-18 13:44:00' - interval '1 hour') 
              and (date_trunc('hour', timestamp '2010-05-18 13:44:00' - interval '1 hour') + interval '15 minutes')
              group by id having count(*) > 1) 
and t_stamp BETWEEN date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
AND date_trunc('hour', timestamp '2010-05-18 13:44:00') order by id;
However, this query only select duplicate id between 12.00 to 12.15. I also want to select the duplicate id within 12.15-12.30, 12.30-12.45 and 12.45 to 13.00. Any suggestion?
Reply With Quote
  #9 (permalink)  
Old 07-28-10, 05:14
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 2,407
Quote:
Originally Posted by nbtet View Post
However, this query only select duplicate id between 12.00 to 12.15. I also want to select the duplicate id within 12.15-12.30, 12.30-12.45 and 12.45 to 13.00. Any suggestion?
Try this
Code:
select id, t_stamp
from (
  select id, 
         t_stamp, 
         count(*) over (partition by id, extract(minute from t_stamp)::int / 15) as num_dupes
  from log
) t
where num_dupes > 1
order by t_stamp
Reply With Quote
  #10 (permalink)  
Old 07-28-10, 05:49
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Hi,

The query returns the same result as before
anyway, can you do something like grouping the 1 hour into every 15minutes? there should be 4 group i guess:
0-15mins, 15-30mins, 30-45mins and 45-60mins. I think this is the main reason why i failed.
for getting the duplicate id is not a problem, but how to get the duplicate id which is both within the time period.


Thanks.

Last edited by nbtet; 07-28-10 at 05:57.
Reply With Quote
  #11 (permalink)  
Old 07-28-10, 05:54
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 2,407
Quote:
Originally Posted by nbtet View Post
The query returns the same result as before
Hmm, I tested that with your sample data and it returned exactly what you wanted.
Code:
postgres=> select * from log;
  id  |       t_stamp
------+---------------------
 1110 | 2010-05-16 12:02:18
 1111 | 2010-05-16 12:12:45
 1122 | 2010-05-16 12:12:06
 1122 | 2010-05-16 12:13:45
 1122 | 2010-05-16 12:17:52
 1122 | 2010-05-16 12:18:08
 1121 | 2010-05-16 12:20:33
 1121 | 2010-05-16 12:21:23
 1110 | 2010-05-16 12:21:36
 1110 | 2010-05-16 12:23:56
 1121 | 2010-05-16 12:33:49
 1110 | 2010-05-16 12:35:51
 1136 | 2010-05-16 12:47:10
 1133 | 2010-05-16 12:48:09
 1137 | 2010-05-16 12:49:10
 1133 | 2010-05-16 12:53:09
(16 rows)

postgres=> select id, t_stamp
postgres-> from (
postgres(>   select id,
postgres(>          t_stamp,
postgres(>          count(*) over (partition by id, extract(minute from t_stamp)::int / 15) as num_dupes
postgres(>   from log
postgres(> ) t
postgres-> where num_dupes > 1
postgres-> order by t_stamp
postgres-> ;
  id  |       t_stamp
------+---------------------
 1122 | 2010-05-16 12:12:06
 1122 | 2010-05-16 12:13:45
 1122 | 2010-05-16 12:17:52
 1122 | 2010-05-16 12:18:08
 1121 | 2010-05-16 12:20:33
 1121 | 2010-05-16 12:21:23
 1110 | 2010-05-16 12:21:36
 1110 | 2010-05-16 12:23:56
 1133 | 2010-05-16 12:48:09
 1133 | 2010-05-16 12:53:09
(10 rows)

postgres=>
These are exactly the rows that you marked red in post #6
Reply With Quote
  #12 (permalink)  
Old 07-28-10, 06:02
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Well, i actually modify the code u gave me. If i run exactly the same code,
Code:
select user_id, t_stamp
from (
  select user_id, 
         t_stamp, 
         count(*) over (partition by user_id, extract(minute from t_stamp)::int / 15) as num_dupes
  from click_log
) t
where t_stamp between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
and date_trunc('hour', timestamp '2010-05-18 13:44:00') and num_dupes > 1 
order by t_stamp;
i got this error:
Quote:
ERROR: function pg_catalog.date_part("unknown", text) does not exist
LINE 5: count(*) over (partition by user_id, extract(minute...
^
HINT: No function matches the given name and argument types. You may need to add explicit type casts.

Last edited by nbtet; 07-28-10 at 06:07.
Reply With Quote
  #13 (permalink)  
Old 07-28-10, 06:15
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 2,407
Quote:
Originally Posted by nbtet View Post
Well, i actually modify the code u gave me
Then don't say it's not working

Quote:
i got this error
I guess you are using an old Postgres version then.
As you did not state your version I assumed you are using the most recent one (8.4)

Sorry, I have not idea how to do that with an old Postgres version.
Reply With Quote
  #14 (permalink)  
Old 07-28-10, 06:35
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 21
Quote:
Originally Posted by nbtet View Post
So far, this code works well, but it doesn't give what i want. perhaps, you could suggest me a better solution. thanks again!
I'm not saying your code is not working. its only that i cant get what it supposed to be.
Can we try this once again.
the data is captured from 2010-05-18 11:41:20 to 2010-05-18 13:51:24. I need to get duplicate id from 12:00:00 to 13:00:00 only, and duplicates should be within every 15minutes. I really appreciate if you can help.

Code:
CREATE TABLE click_log (user_id integer, t_stamp timestamp without time zone);
btw, i'm using greenplum database.
thanks.

Last edited by nbtet; 07-30-10 at 05:49.
Reply With Quote
  #15 (permalink)  
Old 07-28-10, 06:43
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 2,407
Quote:
Originally Posted by nbtet View Post
i'm using greenplum database.
To which Postgres version does this match?

My solution with analytical function returns exactly what you want, but I have no idea how to do that without.

Your table definition "user_id integer" does not match your sample data (e.g. "ea4855b32dfe7547bd829ee5717cfef7") so I changed the table definition.

Here is the result when I run it on Postgres 8.4
Code:
postgres=>
postgres=> select user_id, t_stamp
postgres-> from (
postgres(>   select user_id,
postgres(>          t_stamp,
postgres(>          count(*) over (partition by user_id, extract(minute from t_stamp)::int / 15) as num_dupes
postgres(>   from click_log
postgres(> ) t
postgres-> where t_stamp
postgres->    between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour')
postgres->            and date_trunc('hour', timestamp '2010-05-18 13:44:00')
postgres-> and num_dupes > 1
postgres-> order by t_stamp;
             user_id              |       t_stamp
----------------------------------+---------------------
 8b842d2c5ca36e194d7b9485f5f00d90 | 2010-05-18 12:01:28
 ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:01:49
 ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:06:14
 ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:10:33
 8b842d2c5ca36e194d7b9485f5f00d90 | 2010-05-18 12:12:33
 d184bd64be648b63dc755ff779482206 | 2010-05-18 12:31:50
 d184bd64be648b63dc755ff779482206 | 2010-05-18 12:31:55
 02007f0b1e6f70a06d6fed9314d13726 | 2010-05-18 12:41:49
 470c3a0c10089276cad4f3e491f78cf4 | 2010-05-18 12:45:40
 470c3a0c10089276cad4f3e491f78cf4 | 2010-05-18 12:45:46
(10 rows)
As far as I can tell, this is exactly what you want.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On