Results 1 to 4 of 4

Thread: Dedupe Query

  1. #1
    Join Date
    Oct 2003
    Posts
    268

    Unanswered: Dedupe Query

    Below, I have some T-SQL to create a table, add some sample records and do a dedupe query. This all works but performance is poor on large data sets and I was wondering if someone had any optimization tips.

    Thanks to blindman for helping me develop this version.

    Some background: the staging table has undeduped records about users. I want to dedupe and get the record with the most fields. I don't want to mix fields from different records. The staging table is currently unindexed but that can be changed.

    Code:
    CREATE TABLE StagingRecords
    (
    	EmailAddress VARCHAR(75) NULL,
    	FirstName VARCHAR(255) NULL,
    	LastName VARCHAR(255) NULL,
    	StreetAddress VARCHAR(255) NULL,
    	City VARCHAR(255) NULL,
    	State VARCHAR(255) NULL,
    	ZipCode VARCHAR(255) NULL,
    	RecordID INT IDENTITY NOT NULL
    )
    
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('usera@hotmail.com', 'John', 'Doe', NULL, NULL, NULL, NULL)
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('usera@hotmail.com', 'Abe', 'Abelman', NULL, NULL, 'MI', NULL)
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('usera@hotmail.com', 'Zach', 'Zedcynsky', NULL, NULL, 'TX', NULL)
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('usera@hotmail.com', 'Mary', 'Jane', NULL, NULL, NULL, NULL)
    
    
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('zzz@yahoo.com', 'Cletus', 'Van Damme', NULL, NULL, NULL, NULL)
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('zzz@yahoo.com', 'Alfonse', 'Ackbar', NULL, NULL, 'AL', '12345')
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('zzz@yahoo.com', 'Zoom', 'Zuckerman', NULL, NULL, 'NJ', '54321')
    INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
    	VALUES ('zzz@yahoo.com', 'Mary', 'Jane', NULL, 'Springfield', NULL, NULL)
    
    
    SELECT
    	NULLCountTable.MinNULLCount, IDTable.TargetRecordID,
    	StagingRecords.EmailAddress, StagingRecords.FirstName, StagingRecords.LastName, StagingRecords.StreetAddress, StagingRecords.City, StagingRecords.State, StagingRecords.ZipCode
    FROM
    	(SELECT EmailAddress
    		, MIN(CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END) AS MinNULLCount
    	FROM StagingRecords
    	GROUP BY EmailAddress) AS NULLCountTable
    INNER JOIN
    	(SELECT Min(RecordID) AS TargetRecordID, EmailAddress
    		, (CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END) AS NULLCount
    	FROM StagingRecords
    	GROUP BY EmailAddress, (CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END)) AS IDTable
    		ON (NULLCountTable.EmailAddress = IDTable.EmailAddress AND NULLCountTable.MinNULLCount = IDTable.NULLCount)
    INNER JOIN StagingRecords ON (StagingRecords.EmailAddress = IDTable.EmailAddress AND StagingRecords.RecordID = IDTable.TargetRecordID)

  2. #2
    Join Date
    Oct 2003
    Posts
    268
    This has been running on an undeduped table of 170 million records for over 2.5 hours and has yet to output a single row.

    Recently, people in this forum said I should use set based solutions such as this over cursors. I should get dramatic/exponential performance gains. And if I wasn't seeing that, then I wasn't doing it right. Well, I must not be doing this right so I'm asking for help.

    This seems like it must do extra logic and sorting that the cursor based code doesn't have to do. For example, the cursor code doesn't need unique staging record IDs and never has to join on them or perform a fraction of the joining of this query.

    The provided SQL should be say to run and experiment with on a tempdb.

    Any help is much appreciated.

  3. #3
    Join Date
    Jul 2004
    Posts
    52
    There are probably more efficient solutions, but this approach is about 4 times faster than blindmans more complex query
    if you build the view and indexes (which may make the whole thing a wash in the end)

    Also, you suggested that your cursor solution appeared faster. It may return some rows faster than blindmans query, but
    his suggestion is still fairly efficient. It will read the table 2-3 times but iterating through each row with a cursor,
    even if you read each row only once, will still be MUCH SLOWER to complete the entire process.
    Code:
    SET CONCAT_NULL_YIELDS_NULL ON
    SET ARITHABORT ON
    
    --Create a view with a calculated colum for count of null records
    CREATE VIEW v_StagingRecords WITH SCHEMABINDING AS 
    SELECT EmailAddress,
    	  CASE WHEN FirstName IS NULL THEN 1 ELSE 0 END
    	+ CASE WHEN LastName IS NULL THEN 1 ELSE 0 END
    	+ CASE WHEN StreetAddress IS NULL THEN 1 ELSE 0 END
    	+ CASE WHEN City IS NULL THEN 1 ELSE 0 END
    	+ CASE WHEN State IS NULL THEN 1 ELSE 0 END
    	+ CASE WHEN ZipCode IS NULL THEN 1 ELSE 0 END NullCount, RecordID
    FROM dbo.StagingRecords
    
    --3 seconds. 1700 reads
    CREATE UNIQUE CLUSTERED INDEX vSR_IDX on v_StagingRecords (RecordID)
    CREATE INDEX v_EN_IDX on v_StagingRecords (EmailAddress, NullCount)
    
    --100,000 row table. 0.6 seconds, 800 reads vs 2.8 seconds, 2200 reads for original query.
    SELECT * FROM v_StagingRecords WHERE RecordID IN
    (
    	SELECT (SELECT TOP 1 RecordID FROM v_StagingRecords WITH (NOEXPAND) 
    		WHERE EmailAddress = o.EmailAddress Order BY NullCount ASC) RecordID
    	FROM v_StagingRecords o WITH (NOEXPAND)
    	GROUP BY EmailAddress
    )
    ORDER BY EmailAddress

  4. #4
    Join Date
    Jul 2004
    Posts
    52
    I loaded 17 million rows into a test table and built the view, indexes, and ran the query. It took 16 minutes in total, 101 seconds for the query itself. I tried blindmans query on the same data it it took 113, seconds so scrap my suggestion. I might play with it a little more and see if I can come up with anything better.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •