If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > PostgreSQL > Display duplicate records

Reply
 
LinkBack Thread Tools Display Modes
  #16 (permalink)  
Old 07-28-10, 06:54
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 11
Hi,

as far as i know, use of the older PostgreSQL version should not be an issue in greenplum and it is an open-source PostgreSQL platform.
I will try your code and will update anything later.


thank you very much, Shammat

regards,
nbtet
Reply With Quote
  #17 (permalink)  
Old 07-28-10, 10:48
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 11
Quote:
Originally Posted by shammat View Post
Code:
             user_id              |       t_stamp
----------------------------------+---------------------
 8b842d2c5ca36e194d7b9485f5f00d90 | 2010-05-18 12:01:28
 ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:01:49
 ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:06:14
 ac582dc750cc4e9a6cbbc08a3b9d71bc | 2010-05-18 12:10:33
 8b842d2c5ca36e194d7b9485f5f00d90 | 2010-05-18 12:12:33
 d184bd64be648b63dc755ff779482206 | 2010-05-18 12:31:50
 d184bd64be648b63dc755ff779482206 | 2010-05-18 12:31:55
 02007f0b1e6f70a06d6fed9314d13726 | 2010-05-18 12:41:49
 470c3a0c10089276cad4f3e491f78cf4 | 2010-05-18 12:45:40
 470c3a0c10089276cad4f3e491f78cf4 | 2010-05-18 12:45:46
(10 rows)
As far as I can tell, this is exactly what you want.
yes, this is working and exactly what i want. Just for one thing, if you notice the red mark inside the data, this should not be included i guess since it is obvious that there is no duplicate for this user id. why this is happening actually? the rest is just fine.

thanks.
Reply With Quote
  #18 (permalink)  
Old 07-29-10, 03:02
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 1,598
Quote:
Originally Posted by nbtet View Post
yes, this is working and exactly what i want. Just for one thing, if you notice the red mark inside the data, this should not be included i guess since it is obvious that there is no duplicate for this user id. why this is happening actually? the rest is just fine.
It's because of my sloppy usage of the "partition by" expression.
There are three rows for that user_id, the other two have a different hour but the same "minute interval". As the partition is evaluated over the whole result, these are take into account.

There are two solutions to this: either include the full date (without minutes) in the partition like this:
count(*) over (partition by user_id, to_char(t_stamp, 'yyyymmddhh24'), extract(minute from t_stamp)::int / 15) as num_dupes

or (which is probably more efficient) push the where condition that limits the time frame into the inner select in order to make the partition only work over the selected time frame:
Code:
select * from 
(
  select user_id, 
         t_stamp, 
         count(*) over (partition by user_id, extract(minute from t_stamp)::int / 15) as num_dupes
  from click_log
  where t_stamp between date_trunc('hour',timestamp '2010-05-18 13:44:00' - interval '1 hour') 
                        and date_trunc('hour', timestamp '2010-05-18 13:44:00') 
) t
where num_dupes > 1;
Of course both things can be combined, to make the query more robust.
Reply With Quote
  #19 (permalink)  
Old 07-29-10, 10:29
futurity futurity is offline
Registered User
 
Join Date: May 2008
Posts: 108
Quote:
Originally Posted by nbtet View Post
as far as i know, use of the older PostgreSQL version should not be an issue in greenplum and it is an open-source PostgreSQL platform.
The problem is that the window function shammat uses (count(*) over (...)) was introduced in PostgreSQL 8.4. If greenplum is built upon an older version of PostgreSQL, then this function won't be present. This is why his code isn't working for you.

Try this:
Code:
select
    id,
    t_stamp
from click_log
where
    (id, extract(hour from t_stamp), extract(minute from t_stamp)::int / 15)
        in (
            select
                id,
                hr,
                intvl
            from (
                    select
                        id,
                        extract(hour from t_stamp) as hr,
                        extract(minute from t_stamp)::int / 15 as intvl
                    from click_log
                ) as intervals
            group by id, hr, intvl
            having count(hr || ':' || intvl) > 1
        )
order by id, t_stamp
You'll need to change the divisors to the appropriate interval.

Last edited by futurity; 07-29-10 at 11:06.
Reply With Quote
  #20 (permalink)  
Old 07-29-10, 11:09
shammat shammat is offline
Registered User
 
Join Date: Nov 2003
Posts: 1,598
That's a pretty cool solution!
Reply With Quote
  #21 (permalink)  
Old 07-29-10, 11:26
nbtet nbtet is offline
Registered User
 
Join Date: Jul 2010
Posts: 11
Futurity, the code is great! thanks for the solution. I really appreciate both of your efforts to help me. I learn so much things from you guys.

Regards,
nbtet
Reply With Quote
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On