  1. #1
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6

    The fastest possible way to put a few hundred million strings in proper case

    First off, I know this is a presentation issue. Second, no, I can't force a change on my source systems.

    Some of the systems that send my BI application data, send that data in all upper case like so "JOHN DOE". We have this horrible SQL function that goes through and makes sure that the first letter in a word is always uppercase and the rest of the letters are lower case. So my results are "John Doe". That wonderful piece of code someone left me is below.

    As you can imagine this is dreadfully slow when executed a couple of hundred million times, but what are my options?

    I have not used Data Quality Services yet, but the chart in BOL says a DQS SSIS cleansing task can do 1 million records in 2 hours on a given set of hardware. That is still pretty horrible.

    I suppose I could cobble together a Script task in SSIS, but I am pretty sure clumsy .NET is not going to be much faster. Does anyone have any black magic for this problem?

    CREATE FUNCTION [dbo].[udf_ProperCase](@UnCased varchar(max))
    RETURNS varchar(max)
    AS
    BEGIN
        DECLARE @Reset bit;
        DECLARE @Ret varchar(max);
        DECLARE @i int;
        DECLARE @c char(1);

        SELECT @Reset = 1, @i = 1, @Ret = '';

        WHILE (@i <= LEN(@UnCased))
            SELECT @c = SUBSTRING(@UnCased, @i, 1),
                   @Ret = @Ret + CASE WHEN @Reset = 1 THEN UPPER(@c) ELSE LOWER(@c) END,
                   @Reset = CASE WHEN @c LIKE '[a-zA-Z]' THEN 0 ELSE 1 END,
                   @i = @i + 1;

        RETURN @Ret;
    END
    “If one brings so much courage to this world the world has to kill them or break them, so of course it kills them. The world breaks every one and afterward many are strong at the broken places. But those that will not break it kills. It kills the very good and the very gentle and the very brave impartially. If you are none of these you can be sure it will kill you too but there will be no special hurry.” Ernest Hemingway, A Farewell to Arms.

  2. #2
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    I would question the users to see if they really care about this "proper case" business. Hopefully one of the folks consuming this data is a MacMaster, or a DeSouza. If they don't care, the problem mercifully goes away.

    As for rewriting the function, there is not a lot of hope for it. I assume you have already considered something along the lines of removing that while loop, and replacing it with something that
    1) LowerCases the whole string
    2) UpperCases the first character after a space or hyphen

    It may go slightly faster than the pseudo array walk across each string.
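    To make the two-step idea concrete, here is its logic sketched in Python - purely illustrative, since the real thing would be set-based T-SQL or a script task, and the function name is my own invention:
    Code:
    ```python
    # Sketch of MCrowley's two-step approach:
    #   1) lowercase the whole string
    #   2) uppercase the first character, and any character
    #      that follows a space or hyphen
    def proper_case(s: str) -> str:
        s = s.lower()                          # step 1
        chars = list(s)
        for i, c in enumerate(chars):
            if i == 0 or chars[i - 1] in (' ', '-'):
                chars[i] = c.upper()           # step 2
        return ''.join(chars)

    print(proper_case("JOHN DOE"))             # John Doe
    print(proper_case("MARY-JANE O'BRIEN"))
    ```
    Note this deliberately does nothing special for the MacMasters and DeSouzas of the world - same as the original UDF.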

  3. #3
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    Scalar functions don't scale (pun intended). They perform RBAR (row-by-agonizing-row) operations, and are to be avoided.

    MCrowley's approach will sort you out in a much more set-based manner.


    Air-code algorithm (room for vast improvement, I'm sure):
    Code:
    DECLARE @current_id int     = 0
          , @last_id    int     = 0
          , @batch_size int     = 50000
          , @letter     char(1) = 'A'
    ;
    
    SELECT @current_id = Min(id)
         , @last_id    = Max(id)
    FROM   your_table
    ;
    
    WHILE @current_id <= @last_id
      BEGIN
        UPDATE your_table
        SET    that_field = Lower(that_field)
        WHERE  id >= @current_id
        AND    id <  @current_id + @batch_size
        ;
        
        SET @letter = 'A';
        
        WHILE Ascii(@letter) <= Ascii('Z')
          BEGIN
            UPDATE your_table
            SET    that_field = @letter + SubString(that_field, 2, 8000)
            WHERE  id >= @current_id
            AND    id <  @current_id + @batch_size
            AND    that_field LIKE @letter + '%' -- first character
            ;
            
            UPDATE your_table
            -- relies on a case-insensitive default collation: the search pattern
            -- matches the lowercase letter, while the replacement writes @letter in caps
            SET    that_field = Replace(Replace(that_field, ' ' + @letter, ' ' + @letter), '-' + @letter, '-' + @letter)
            WHERE  id >= @current_id
            AND    id <  @current_id + @batch_size
            AND    that_field LIKE '%[ -]' + @letter + '%' -- preceding space or hyphen
            ;
    
            SET @letter = Char(Ascii(@letter) + 1);
          END
        ;
      
        SET @current_id += @batch_size;
      END
    ;
    Strangely, this is something I am probably going to have to look at doing soon with some of our data
    Last edited by gvee; 09-18-15 at 05:06. Reason: Substring(..., 1, 8000) changed to Substring(..., 2, 8000)
    George
    Home | Blog

  4. #4
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    Tested against a 125M-row dataset on a single column, on a dev box, and it took a few hours.

    Obviously this is a one-off task, so I'm pretty happy with that performance level. If time was more of a factor you could implement some logging to allow you to stop it and pick it up from where it left off.

  5. #5
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6
    Quote Originally Posted by gvee View Post
    Tested against 125M row dataset on a single column, on a dev box, and it took a few hours.

    Obviously this is a one-off task, so I'm pretty happy with that performance level. If time was more of a factor you could implement some logging to allow you to stop it and pick it up from where it left off.
    That is certainly faster than the performance DQS was promising in the white papers. This ETL is written in a very stupid way for the amount of data it deals with: any time a new version of the source data shows up, they delete and reload the last 18 months. It is on my to-do list, and in the sprint plans, to move them towards a more incremental load approach, so unfortunately this will not be a one-time operation until I get that done.

    Thanks for the code. I am having an Oompa Loompa test out a few things, and this is close to one of the approaches we talked about.

  6. #6
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    The second re-run took 2h31m. I didn't keep the times from the first run, as I had a mistake in the air-code (the SubString start position was off by 1). Worth mentioning that the dev environment I ran against is a SQL Server 2014 instance on a middle-of-the-road spec VM. If you're massively interested in the spec, give me a shout.

    You may also wish to extend it to include a test for slashes, as I had a small handful of them in my data too. Not sure what else is worth expanding it to cover - but if you think of anything, post back so I can add it to my script.

    P.S. I honestly didn't realise who was asking the question until now - so apologies for teaching you to suck eggs re: scalar functions.

    P.P.S. to suck an egg you first need to purse your lips...

  7. #7
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    We naturally will contact Pootle to grant you the appropriate number of demerits.

    I was looking at splitting the names apart in a recursive CTE, but I was having trouble coming up with a way to collapse the results back together.

    If this is a day to day thing, it may very well be worth creating a CLR function for this. CLR is traditionally better at string operations, so this would be a pretty good test of that, as well.
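    For what it's worth, the single pass a CLR function would make over each string is the same state machine the original UDF walks character by character. Here it is sketched in Python just to show the logic a C# implementation would carry (the function name is mine, not an actual CLR signature):
    Code:
    ```python
    # Single-pass proper-casing, mirroring the original UDF's @Reset flag:
    # uppercase a letter whenever the previous character was not a letter.
    def proper_case_single_pass(s: str) -> str:
        out = []
        reset = True                      # equivalent of @Reset = 1
        for c in s:
            out.append(c.upper() if reset else c.lower())
            reset = not c.isalpha()       # next letter starts a new word
        return ''.join(out)
    ```
    Unlike the UDF, a compiled single pass avoids the quadratic string concatenation cost of building @Ret one character at a time.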

  8. #8
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I tested this against a pretty beefy table that I can't publicize, but there's a smidgeon of test data here. One of you bright energetic types can try this against your data and let me know how this compares.
    Code:
    DECLARE @pat NVARCHAR(99) = '%[^0-9A-Z^a-z][a-z]%' -- a non-letter/digit followed by a lowercase letter
    
    DECLARE @t TABLE
    (  junque       NVARCHAR(30)        NOT NULL
    )
    
    INSERT INTO @t
       VALUES ('JOHN DOE'), ('JANE SMITH')
    ,     ('ALFRED E. NEWMAN'), ('SARAH JESSICA PARKER')
    ,     ('lower case ringer!')
    
    UPDATE @t
       SET junque = Upper(Left(junque, 1)) + Lower(SubString(junque, 2, 4000))
    
    -- repeat until a pass changes no rows: uppercase the lowercase letter that
    -- follows a non-letter; the binary collation keeps [a-z] case-sensitive
    WHILE (0 < @@rowcount)
       UPDATE @t
          SET junque = Stuff(junque, 1 + PatIndex(@pat, junque COLLATE Latin1_General_BIN)
    ,        1, Upper(SubString(junque, 1 + PatIndex(@pat, junque COLLATE Latin1_General_BIN), 1)))
          WHERE  0 < PatIndex(@pat, junque COLLATE Latin1_General_BIN)
    
    SELECT *
       FROM @t
    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  9. #9
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    As a side note, my query performs pretty well as pure SQL, but it causes the database log file to explode in size. An SSIS package that implemented the same logic in a single pass, with a single update, would minimize log usage and could save fifty percent or more of the log file growth.

    -PatP

  10. #10
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6
    I had thought about CLR or a Script task in SSIS, and I had been leaning that way, because I thought the orthodoxy on this was that application code (i.e. C#, VB.Net, etc.) would be faster at this sort of thing than SQL.

    I am going to make the new girl test both gvee's and Pat's approaches, plus either a CLR function or a script task, and whichever version wins, I will award the BEST ANSWER merit badge.

    How much fun would flump be having with this whole best answer thing?

  11. #11
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Nah, His Flump-nicity was really focused more on minions and demerits than on the more straightforward "Best Answer" approach. This might interest him if he could acquire another minion dedicated to computing demerits per contribution and ranking contenders that way though.

    As an observation, my algorithm could also easily be implemented as an SSIS package. For very large datasets you could keep a "working set" and only consider rows changed in the current pass as candidates for additional passes, but that would add extra overhead that works against you in Transact-SQL.

    -PatP

  12. #12
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    As I recall, Pootle valued an application with a bit of soul, rather than cold utilitarian things like performance. Still, he would probably come up with a different way.

  13. #13
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Greetings Dweebs

    My name is Minion #937, the third of his name.

    His Supremeness instructed me to respond on his behalf because he can't be personally troubled by your "piddly-ass problems" (end quote). It is my tremendous honour to be released from the Codings Factory (TM) for up to a maximum of 10 minutes to represent the views of our Benevolent Leader. He is far too busy being productive (attending meetings, deconstructing then reconstructing his lunch, requesting 100 page technical proposals and opening the preliminary review with "Now, I'm confident your paper is not fit for purpose, so this is what we are going to do" etc.)

    String manipulation of this nature within SQL Server can be achieved efficiently through the use of CLR. T-SQL, even utilising recursive set based algorithms, is a poor performer on anything other than the most trivial of data volumes. Of course, the most efficient and scalable method of all is to bind a sufficient number of minions to eternal servitude, sit them in a dark, damp room and assign each of them 1 row to manually update but only a few possess the intellect necessary for such out of the blue box flagpole synergistic insight.

    I have downgraded all of your responses. Please reciprocate by marking my answer as the best.

    It is true that we have made great progress in Project Soul, solutioneering rapid and fully scalable backwards compatible agile collaborative application development soul frameworks. You can follow our pioneering work by subscribing to the hashtag #solutioneeringrapidandfullyscalablebackwardscompatibleagilecollaborativeapplicationdevelopmentsoulframeworks. It is totally trending right now.

    His Excellency finally requests that I close this post by wishing you all well, whoever the hell you are.

    Kind Regards,
    Minion #937
    on behalf of The Adorable Lord Chief Pootle of Flump

  14. #14
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6
    I needed that. Thanks.

    Mention his name in 3 consecutive posts, and he appears.

    I have no detailed results. I have not been in the office because the car is in the shop, but she emailed me this morning. My minion did not try out the CLR; she opted for a script task in SSIS instead. She said her script task was the hands-down winner, doing her test set in minutes rather than hours on a cold cache.

    I told her to run with it, because I have bigger fires to either flame or fight at the minute.

  15. #15
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by Thrasymachus View Post
    Mention his name in 3 consecutive posts, and he appears.
    That third mention thing almost seems predictable.

    It is good to know that Poots still dispatches minions to amuse us.

    -PatP
