Results 1 to 10 of 10
  1. #1
    Join Date
    Oct 2003
    Posts
    268

    Unanswered: Optimize Dedup TSQL?

    The following is fairly basic cursor based merge code. Order by UserName and DomainID; INSERT a single merged record for every distinct combination.

    I had originally written this as a single INSERT statement with subqueries. Avoiding cursors is supposed to be faster but comparing the execution plans for both strategies suggested the cursor based code was MUCH faster, so this is what I'm going to try and use.

    If anyone can think of any optimizations to this I would be very grateful. Thanks!

    Code:
    DECLARE @emailUser VARCHAR(50)
    DECLARE @domainID INT
    DECLARE @firstName VARCHAR(24)
    DECLARE @lastName VARCHAR(24)
    DECLARE @streetAddress VARCHAR(32)
    DECLARE @city VARCHAR(24)
    DECLARE @state VARCHAR(24)
    DECLARE @postal VARCHAR(10)
    DECLARE @sourceID INT
    DECLARE @numNameNULLs INT
    DECLARE @numAddressNULLs INT
    
    DECLARE @rowEmailUser VARCHAR(50)
    DECLARE @rowDomainID INT
    DECLARE @rowFirstName VARCHAR(24)
    DECLARE @rowLastName VARCHAR(24)
    DECLARE @rowStreetAddress VARCHAR(32)
    DECLARE @rowCity VARCHAR(24)
    DECLARE @rowState VARCHAR(24)
    DECLARE @rowPostal VARCHAR(10)
    DECLARE @rowSourceID INT
    DECLARE @rowNumNameNULLs INT
    DECLARE @rowNumAddressNULLs INT
    
    DECLARE stagingCursor CURSOR FOR
    	SELECT
    		UserName, DomainID, First, Last, StreetAddress, City, State, Postal, SourceID
    		, 	(CASE WHEN Stages.[First] IS NULL THEN 1 ELSE 0 END
    			+ CASE WHEN Stages.[Last] IS NULL THEN 1 ELSE 0 END) AS NumNameNULLs
    		,	(CASE WHEN Stages.[StreetAddress] IS NULL THEN 1 ELSE 0 END
    			+ CASE WHEN Stages.[City] IS NULL THEN 1 ELSE 0 END
    			+ CASE WHEN Stages.[State] IS NULL THEN 1 ELSE 0 END
    			+ CASE WHEN Stages.[Postal] IS NULL THEN 1 ELSE 0 END) AS NumAddressNULLs
    	FROM Stages
    	WHERE NOT EXISTS
    		(SELECT UserName FROM Recipients
    		WHERE Stages.UserName = Recipients.UserName
    		AND Stages.DomainID = Recipients.DomainID)
    	ORDER BY UserName, DomainID
    
    OPEN stagingCursor
    FETCH NEXT FROM stagingCursor INTO
    	@rowEmailUser, @rowDomainID, @rowFirstName, @rowLastName, @rowStreetAddress, @rowCity, @rowState, @rowPostal, @rowSourceID, @rowNumNameNULLs, @rowNumAddressNULLs
    WHILE @@FETCH_STATUS = 0
    BEGIN
    	IF (@emailUser = @rowEmailUser AND @domainID = @rowDomainID) BEGIN
    		-- Merge a consecutive row for the current UserName/DomainID combination
    		IF (@rowNumNameNULLs < @numNameNULLs) BEGIN
    			SET @numNameNULLs = @rowNumNameNULLs
    			SET @firstName = @rowFirstName
    			SET @lastName = @rowLastName
    		END
    		IF (@rowNumAddressNULLs < @numAddressNULLs) BEGIN
    			SET @streetAddress = @rowStreetAddress
    			SET @city = @rowCity
    			SET @state = @rowState
    			SET @postal = @rowPostal
    			SET @numAddressNULLs = @rowNumAddressNULLs
    		END
    	END ELSE BEGIN
    		IF (@emailUser IS NOT NULL) BEGIN
    			-- Finished iterating 1+ records of a UserName/DomainID combination. INSERT the merged record.
    			INSERT INTO Recipients(UserName, DomainID, First, Last, StreetAddress, City, State, Postal, SourceID)
    				VALUES (@emailUser, @domainID, @firstName, @lastName, @streetAddress, @city, @state, @postal, @sourceID)
    		END
    
    		-- Reached a new UserName/DomainID combination. Set current data.
    		SET @emailUser = @rowEmailUser
    		SET @domainID = @rowDomainID
    		SET @firstName = @rowFirstName
    		SET @lastName = @rowLastName
    		SET @streetAddress = @rowStreetAddress
    		SET @city = @rowCity
    		SET @state = @rowState
    		SET @postal = @rowPostal
    		SET @sourceID = @rowSourceID
    		SET @numNameNULLs = @rowNumNameNULLs
    		SET @numAddressNULLs = @rowNumAddressNULLs
    	END
    
    	FETCH NEXT FROM stagingCursor INTO
    		@rowEmailUser, @rowDomainID, @rowFirstName, @rowLastName, @rowStreetAddress, @rowCity, @rowState, @rowPostal, @rowSourceID, @rowNumNameNULLs, @rowNumAddressNULLs
    END
    CLOSE stagingCursor
    DEALLOCATE stagingCursor
    
    IF (@emailUser IS NOT NULL) BEGIN
    	-- Finished iterating 1+ records of a UserName/DomainID combination. INSERT the merged record.
    	INSERT INTO Recipients(UserName, DomainID, First, Last, StreetAddress, City, State, Postal, SourceID)
    		VALUES (@emailUser, @domainID, @firstName, @lastName, @streetAddress, @city, @state, @postal, @sourceID)
    END
    A coworker has a C++ solution that uses FAST INSERT operations (directly through the ADODB API; like bcp or BULK INSERT). He claims to get twice the performance over the above T-SQL although I can't use his application due to bizarre errors. The above T-SQL is obviously using plain old INSERT statements although I hope the looping and everything would be faster since its 100% native database code. Is there anyway I can get FAST/BULK INSERT like performance through the above T-SQL?
    Last edited by RogerWilco; 05-30-04 at 23:57.

  2. #2
    Join Date
    Apr 2004
    Location
    Kansas City, MO
    Posts
    734
    I hate to disappoint you on this, but cursors are "never" faster in SQL then set-based operations. Your best bet is to try to fit this into one operation. The reason the execution plan looks so much better is because it only shows one iteration of the cursor in the plan. As I recall, you would be cycling through this several million times. That's REALLY bad.
    MeanOldDBA
    derrickleggett@hotmail.com
    When life gives you a lemon, fire the DBA.

  3. #3
    Join Date
    Oct 2003
    Posts
    268
    Yes, that is the theory but the execution plan indicated that the single INSERT statement (shown below) is doing recursive index scans for each of the subqueries for each source row!!

    UPDATE: On a 10 million record test run on my dev system:

    CURSOR approach: completed in 100 minutes
    INSERT w/subqueries approach: killed job after 120 minutes when it wasn't near completion (I needed my dev system back).

    The cursor based strategy has been running on the production server for almost 18 hours and has output only 22 million deduped records. The source table has 500 million records and I anticipate approximately 250 million deduped records. At this rate, this process will take weeks!

    Any suggestions for optimizing either of these dedup strategies is greatly appreciated.

    Possibilities:
    - Study with SQL Profiler to find avenues for optimization
    - Stop process, delete all undeduped data that has been processed (if it matches one of the output 22 million deduped records), and restart. Would this make things better or worse?
    - Remove index from target table and add index once the dedup process has completed. Obviously, this would speed the INSERTs but add an extra step later.
    - BULK/FAST table loading technology???
    - Improve/Optimize INSERT statement??? Those subqueries just kill performance but I don't see how to avoid them.

    Any suggestions on how to determine approximate progress of these dedup processes? They will take easily over 24 hours to run and I'd like to be able to give good progress figures to management. Ideally, I'd like to select the last row added to the deduped table and approximate progress by alphabetical position; since the table is indexed by UserName, that should be a good approximation. Is there any way to do that without doing a table scan or a sort operation (don't want to consume lots of time and resources doing a progress check)? I can get the first row by doing SELECT TOP 1 * obviously. I wish there was a SELECT BOTTOM 1 *. The obvious SELECT TOP with an "ORDER BY" clause results in a huge sort operation which I can't afford.

    Code:
    INSERT INTO Recipients (UserName, DomainID, First, Last, StreetAddress, City, State, Postal, SourceID)
    SELECT
    	MainStages.UserName, MainStages.DomainID
    	, (SELECT TOP 1 [First] FROM Stages AS InnerStages
    		WHERE InnerStages.UserName = MainStages.UserName
    		AND InnerStages.DomainID = MainStages.DomainID
    		ORDER BY (CASE WHEN InnerStages.[First] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[Last] IS NULL THEN 1 ELSE 0 END) ASC) AS FirstName
    	, (SELECT TOP 1 [Last] FROM Stages AS InnerStages
    		WHERE InnerStages.UserName = MainStages.UserName
    		AND InnerStages.DomainID = MainStages.DomainID
    		ORDER BY (CASE WHEN InnerStages.[First] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[Last] IS NULL THEN 1 ELSE 0 END) ASC) AS LastName
    	, (SELECT TOP 1 [StreetAddress] FROM Stages AS InnerStages
    		WHERE InnerStages.UserName = MainStages.UserName
    		AND InnerStages.DomainID = MainStages.DomainID
    		ORDER BY (CASE WHEN InnerStages.[StreetAddress] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[City] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[State] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[Postal] IS NULL THEN 1 ELSE 0 END) ASC) AS StreetAddress
    	, (SELECT TOP 1 [City] FROM Stages AS InnerStages
    		WHERE InnerStages.UserName = MainStages.UserName
    		AND InnerStages.DomainID = MainStages.DomainID
    		ORDER BY (CASE WHEN InnerStages.[StreetAddress] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[City] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[State] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[Postal] IS NULL THEN 1 ELSE 0 END) ASC) AS City
    	, (SELECT TOP 1 [State] FROM Stages AS InnerStages
    		WHERE InnerStages.UserName = MainStages.UserName
    		AND InnerStages.DomainID = MainStages.DomainID
    		ORDER BY (CASE WHEN InnerStages.[StreetAddress] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[City] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[State] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[Postal] IS NULL THEN 1 ELSE 0 END) ASC) AS State
    	, (SELECT TOP 1 [Postal] FROM Stages AS InnerStages
    		WHERE InnerStages.UserName = MainStages.UserName
    		AND InnerStages.DomainID = MainStages.DomainID
    		ORDER BY (CASE WHEN InnerStages.[StreetAddress] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[City] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[State] IS NULL THEN 1 ELSE 0 END
    		+ CASE WHEN InnerStages.[Postal] IS NULL THEN 1 ELSE 0 END) ASC) AS Postal
    	, MIN(MainStages.SourceID) AS SourceID
    FROM Stages As MainStages
    WHERE NOT EXISTS
    	(SELECT UserName FROM Recipients
    	WHERE MainStages.UserName = Recipients.UserName
    	AND MainStages.DomainID = Recipients.DomainID)
    GROUP BY MainStages.UserName, MainStages.DomainID
    Last edited by RogerWilco; 05-31-04 at 18:05. Reason: Forgot to add two more possibilities

  4. #4
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    You don't need a cursor to do this.

    You don't need all those ineffecient subselects with TOP statements.

    You DO need to think about your query results. For instance, given these two records:

    UserName|DomainID|First|Last
    DupUser|DupDomain|Abe|Zedner
    DupUser|DupDomain|Zach|Abner

    You could very well end up inserting this record into Recipients:
    UserName|DomainID|First|Last
    DupUser|DupDomain|Abe|Abner

    ...even though "Abe Abner" does not exist. Same goes for the address info, and more so. You could also end up listing a First/Last combo with an address where that person has never lived.

    This is a good SQL recipe for scrambled data.

    Take a moment to think about and post your criteria for selecting records. What does you data look like?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  5. #5
    Join Date
    Oct 2003
    Posts
    268

    Smile

    Quote Originally Posted by blindman
    UserName|DomainID|First|Last
    DupUser|DupDomain|Abe|Zedner
    DupUser|DupDomain|Zach|Abner

    You could very well end up inserting this record into Recipients:
    UserName|DomainID|First|Last
    DupUser|DupDomain|Abe|Abner
    The cursor code will never mix records like that; this can be proven. The INSERT with sub-select code shouldn't do that either and it has been tested although I can't quite prove it conceptually. The "first name"/last name" subqueries should choose the same record since they use the same ORDER BY although that behavior may not be 100% guranteed.

    I developed some fairly exhaustive test suites that stressed many combinations of dedup/merge scenarios. These suites explicitly tested for such record mixing which would be bad. The two approaches that I've mentioned have passed these suites perfectly

    Mixing name data or address data from different records, as in your exmple, is unacceptable. However, mixing name data from one record and address data from a separate record is OK and often required since often the two will exist in separate source records.

    Quote Originally Posted by blindman
    You don't need a cursor to do this.

    You don't need all those ineffecient subselects with TOP statements.
    I need to do something, right

    I'm really hoping to find a better solution. If you suggested one, I would be very happy.

    What I have is merging records properly; it's just going way to slow; been running for 15 hours and only created 20 million records (and apparently getting slower). Out of the source 500 million records I expect approximately 250 million output records. At the current rate, this will take weeks!

  6. #6
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    The INSERT statement's subqueries order by:

    CASE WHEN InnerStages.[First] IS NULL THEN 1 ELSE 0 END + CASE WHEN InnerStages.[Last] IS NULL THEN 1 ELSE 0 END

    There are only three possible results from this formula: 0, 1, and 2. There is nothing that prevents two records from having the identical UserName and DomainID values but different First and Last values, both of which would evaluate to "0" in the CASE statement you are using. If you aren't running into this problem then it's the result of luck, not logic, and that's not a safe way to design programs ("although that behavior may not be 100% guranteed"???).

    If your cursor code follows this logic, then it is subject to the same danger.

    If your data is such that you KNOW you don't have to worry about the sample data I gave in my previous post (I.E., duplicate records may have different degrees of completeness, but they never conflict with eachother) then you can merge your records very easily using functions such as COALESCE or MAX.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  7. #7
    Join Date
    Oct 2003
    Posts
    268
    Quote Originally Posted by blindman
    There are only three possible results from this formula: 0, 1, and 2. There is nothing that prevents two records from having the identical UserName and DomainID values but different First and Last values, both of which would evaluate to "0" in the CASE statement you are using. If you aren't running into this problem then it's the result of luck, not logic, and that's not a safe way to design programs ("although that behavior may not be 100% guranteed"???).
    Yes, I already pointed this out in my last post. To make the code 100% guaranteed and safe, I could add "First, Last" into the "ORDER BY" clause of the name subqueries (and a corresponding change to the address subqueries). However, the whole thing would remain absolutely useless since it runs way too slowly.

    I'm not using that INSERT statement at all (although I would if I could get it to go faster). I'm currently using the CURSOR code I posted earlier in this thread. And if you look at that code, you will see it is absolutely impossible for it to mix data from different source records. The first name/last name fields are always assigned together so it is absolutely, 100% impossible for that data to come from separate source records.

    I definitely DO have to worry about scenarios like the example that you posted so a simple COALESCE or MAX is out of the question.

    So if anyone could think of anything I can do to speed things up; I would be very grateful...
    Last edited by RogerWilco; 05-31-04 at 20:02.

  8. #8
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    For maximum speed, add a composite index to the temporary tables on the Username/DomainID/Priority columns.

    This would be a lot easier if your Stages table had a primary key. What the heck is SourceID?

    --Get a list of Names for each UserName/DomainID and assign a priority value.
    select distinct UserName, DomainID, CASE WHEN First IS NULL THEN 0 ELSE 1 END + CASE WHEN Last IS NULL THEN 0 ELSE 1 END + isnull(Last, '') + isnull(First, '') as Priority, First, Last
    into #UniqueNames
    from Stages

    --Get a list of Addresses for each UserName/DomainID and assign a priority value.
    select distinct UserName, DomainID, CASE WHEN StreetAddress IS NULL THEN 0 ELSE 1 END + CASE WHEN City IS NULL THEN 0 ELSE 1 END + CASE WHEN State IS NULL THEN 0 ELSE 1 END + CASE WHEN Postal IS NULL THEN 0 ELSE 1 END + isnull(State, '') + isnull(City, '') + isnull(StreetAddress, '') + isnull(Postal, '') as Priority, StreetAddress, City, State, Postal
    into #UniqueAddresses
    from Stages

    --Use Stages to combine the temporary tables, get the min(SourceID), and exclude records already present in the Recipients table
    INSERT INTO Recipients (UserName, DomainID, First, Last, StreetAddress, City, State, Postal, SourceID)
    select Stages.UserName, Stages.DomainID, BestNames.First, BestNames.Last, BestAddresses.StreetAddress, BestAddresses.City, BestAddresses.State, BestAddresses.Postal, min(Stages.SourceID)
    from Stages
    inner join
    (select UniqueNames.UserName, UniqueNames.DomainID, UniqueNames.First, UniqueNames.Last
    from #UniqueNames UniqueNames
    inner join (select UserName, DomainID, max(Priority) as Priority from #UniqueNames group by UserName, DomainID) MaxPriorities
    on UniqueNames.UserName = MaxPriorities.UserName and UniqueNames.DomainID = MaxPriorities.DomainID and UniqueNames.Priority = MaxPriorities.Priority) BestNames
    on Stages.UserName = BestNames.UserName and Stage.DomainID = BestNames.DomainID
    inner join
    (select UniqueAddresses.Username, UniqueAddresses.DomainID, UniqueAddresses.StreetAddress, UniqueAddresses.City, UniqueAddresses.State, UniqueAddresses.Postal
    from #UniqueAddresses UniqueAddresses
    inner join (select UserName, DomainID, max(Priority) as Priority from #UniqueAddresses group by UserName, DomainID) MaxPriorities
    on UniqueAddresses.UserName = MaxPriorities.UserName and UniqueAddresses.DomainID = MaxPriorities.DomainID and UniqueAddresses.Priority = MaxPriorities.Priority) BestAddresses
    on Stages.UserName = BestAddresses.UserName and Stage.DomainID = BestAddresses.DomainID
    left outer join Recipients on Stages.UserName = Recipients.UserName and Stage.DomainID = Recipients.DomainID
    where Recipients.UserName is null
    group by Stages.UserName, Stages.DomainID
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  9. #9
    Join Date
    Oct 2003
    Posts
    268
    I don't see the point of the temp tables; using your example, both "Abe Zedner" and "Zach Abner" rows get inserted in there.

    I had to fix a Stages typo of "Stage" and I had to add MIN function calls around all the main SELECT columns other than UserName/DomainID (even though there should only be one and the MIN should be effectively useless). After that, it ran, but returned zero rows on my test data, which isn't right, so there was something else I couldn't fix.

    It looks functionally very close to my original INSERT statement except that yours is more efficient in that there are only two main subqueries (which have their own subquery) rather than doing a separate subquery per column.

    Although, I'm not sure if that will be enough to outperform the cursor code, it is a significant optimization. I really appreciate you putting that much time into writing all of that.

    thank you!

  10. #10
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Yes, both Abe and Zach are inserted into the temporary tables for prioritizing, but only one (complete) record is returned by the later subquery. The logic I used sorts not only by completeness, but then by last and first under the assumption that if you have one records with just a last name, and one record with just a first name, the last name is better to keep. Similiar prioritizing was used for the address fields.

    There are several advantages to breaking the process up into several parts (using temporary tables). It is easier to understand the code, the system requirements should be lower (at least for your example), and it is easier to troubleshoot the process to find out why you are getting unexpected results.

    Good luck.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •