Results 1 to 3 of 3
  1. #1
    Join Date
    Jan 2012
    Posts
    7

    Unanswered: keep duplicate with highest score fuzzy grouping

    Hi i have recently decided to dedupe my data but i am having a problem after running fuzzy grouping with the query on updating which duplicate to keep

    _key_in is unique, _key_out is the duplicates so for example:

    _key_in , _key_out , name , score , dedupe
    1 , 1 , ron , 10 , purge
    2 , 1 , ronn , 15 , keep
    3 , 3 , john , 5 , keep
    4 , 4 , matt , 15 , keep
    5 , 4 , mat , 10 , purge
    6 , 4 , matt , 15 , purge

    I want to keep the _key_out with the higher score by setting the field de_dupe to 'keep' and the remainder to 'purge'. The score can also be the same within a duplicate so in the case it is the same i just need to keep one it doesnt matter which one. The query i have below nearly works but it marks duplicates with the same score as keep.

    If someone could please take a look at it and help me fix it up it would be greatly appreciated

    Code:
    UPDATE 	b
    SET 	b.dedupe_result = 'keep'
    FROM 	
    	[BusinessListings].[dbo].[MongoOrganisationACTM1Destination] b
    INNER JOIN
    (
    	SELECT 
               Distinct(_key_out) AS _key_out,
               quality_score 
    	FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1
    	WHERE quality_score = 
    		(
    			SELECT MAX(quality_score) 
    			FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] 
    			WHERE _key_out = t1._key_out)
    		)
    ) d ON d._key_out = b._key_out AND d.quality_score = b.quality_score

  2. #2
    Join Date
    Jan 2012
    Posts
    84
    Hi,

    it would be really helpful if you provided us CREATE TABLE statements and also INSERTs statemens with sample data,
    here is an example:
    Code:
    go
    create table Mongo(
    _key_in int primary key,
    _key_out int,
    name varchar(100), 
    score int,
    dedupe varchar(100)
    );
    go
    insert into mongo values(1 , 1 , 'ron' , 10 , 'purge' );
    insert into mongo values(2 , 1 , 'ronn' , 15 , 'keep' );
    insert into mongo values(3 , 3 , 'john' , 5 , 'keep' );
    insert into mongo values(4 , 4 , 'matt' , 15 , 'keep' );
    insert into mongo values(5 , 4 , 'mat' , 10 , 'purge' );
    insert into mongo values(6 , 4 , 'matt' , 15 , 'purge' );



    Try this query it should meet your requirements:
    Code:
    go
    merge into mongo mon
    using (
    	select _key_in,
    	case ROW_NUMBER() OVER (PARTITION BY _key_out ORDER BY score desc) 
    		 when 1 then 'keep'
    		 else 'purge'
    	end new_dedupe
    	from Mongo
    ) new
    ON (mon._key_in = new._key_in )
    WHEN MATCHED THEN UPDATE
       SET mon.dedupe = new.new_dedupe
    ;

    Code:
    go
    select * from Mongo;
    
    _key_in     _key_out    name                   score       dedupe
    ----------- ----------- ---------------------- ----------- ---------
    1           1           ron                    10          purge
    2           1           ronn                   15          keep
    3           3           john                   5           keep
    4           4           matt                   15          keep
    5           4           mat                    10          purge
    6           4           matt                   15          purge
    
    (6 row(s) affected)
    Last edited by kordirko; 01-22-12 at 20:59.

  3. #3
    Join Date
    Jan 2012
    Posts
    7
    thanks kordirko that worked perfectly

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •