  1. #1
    Join Date
    Oct 2003
    Posts
    268

    Unanswered: Not Again - Slow T-SQL performance

I'm running the following statements. They have been running for 48 hours and counting... In Performance Monitor, I can see the system is steadily 100% disk bound. Any idea how to get a status or ETA on these? Any idea how I can speed up such actions? This seems unusually slow. There are no errors in the error log. Do other DBMS systems (such as Oracle) handle such scenarios more quickly or with better status information?

    Code:
    INSERT INTO Domains (Domain)
    	SELECT DISTINCT Domain FROM Stages WHERE Domain NOT IN (SELECT Domain FROM Domains)
    
    UPDATE Stages SET DomainID = (SELECT Domains.[ID] FROM Domains WHERE Domains.Domain = Stages.Domain)
    Stages is 173 million records
    Domains is 2.4 million records

    This is running on pretty decent hardware:
    Windows 2000 SP5
    SQL Server 2000 Enterprise Edition SP3a
    Dual 3.06 GHz Xeons with HyperThreading enabled (4 virtual CPUs)
    4GB RAM (OS 3GB switch is enabled to give SQL Server 3GB of RAM)
    70GB SCSI boot drive
    Data/Transaction Log is on a RAID 5, 30 spindle, fibrechannel SAN
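For readers following along, here is a toy reproduction of the two statements in Python with SQLite (schema inferred from the post; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Domains (ID INTEGER PRIMARY KEY, Domain TEXT UNIQUE)")
cur.execute("CREATE TABLE Stages (StagesID INTEGER PRIMARY KEY, Domain TEXT, DomainID INTEGER)")

cur.execute("INSERT INTO Domains (Domain) VALUES ('a.com')")
cur.executemany("INSERT INTO Stages (Domain) VALUES (?)",
                [("a.com",), ("b.com",), ("b.com",), ("c.com",)])

# Statement 1: add domains seen in Stages but missing from Domains
cur.execute("""INSERT INTO Domains (Domain)
               SELECT DISTINCT Domain FROM Stages
               WHERE Domain NOT IN (SELECT Domain FROM Domains)""")

# Statement 2: backfill the foreign key via a correlated subquery
cur.execute("""UPDATE Stages SET DomainID =
               (SELECT ID FROM Domains WHERE Domains.Domain = Stages.Domain)""")

print(cur.execute("SELECT COUNT(*) FROM Domains").fetchone()[0])  # 3
print(cur.execute("SELECT COUNT(*) FROM Stages WHERE DomainID IS NULL").fetchone()[0])  # 0
```

On the toy data the missing domains are inserted and every Stages row ends up with a non-NULL DomainID; the question in the thread is why the same pattern crawls at 173 million rows.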

  2. #2
    Join Date
    Aug 2004
    Posts
    8
    First question:

    Do you have an index on Stages.Domain?
Do you have a PK (clustered index) on the Stages table?
    Also, you should have a clustered index on Domains.Domain.
    My suggestion is to try a "divide and conquer" approach:

Create a temporary table ##tempDomains with a Domain column. Make sure you have a clustered index on this column.

Then:

declare @i int, @currStages int, @maxStages int --here I presume you have an int primary key on Stages, e.g. StagesID

set @i = 500000
set @maxStages = (select max(StagesID) from Stages)
set @currStages = (select min(StagesID) from Stages)

while @currStages <= @maxStages
begin
    insert into ##tempDomains
    select distinct Domain from Stages
    where StagesID >= @currStages and StagesID < @currStages + @i

    set @currStages = @currStages + @i
end

/*
This iteration should be much faster than looking for Domains in the whole table.
At the end of this while loop you should have a much smaller table, clustered on Domain, ready to be compared with your Domains table.

For the update of the Stages table you may use a similar approach: make updates only on segments of the table, and rely only on clustered indexes.
*/
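The same range-batching idea can be sketched in Python against SQLite (names taken from the snippet above; the 500,000 batch size is scaled down for toy data, and INSERT OR IGNORE stands in for deduplicating across batches, which the T-SQL version leaves to the final comparison):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Stages (StagesID INTEGER PRIMARY KEY, Domain TEXT)")
cur.executemany("INSERT INTO Stages (StagesID, Domain) VALUES (?, ?)",
                [(i, "d%d.com" % (i % 7)) for i in range(1, 101)])
# stands in for ##tempDomains; the PRIMARY KEY plays the role of the clustered index
cur.execute("CREATE TEMP TABLE tempDomains (Domain TEXT PRIMARY KEY)")

batch = 25  # @i in the T-SQL; 500000 in the original
lo = cur.execute("SELECT MIN(StagesID) FROM Stages").fetchone()[0]
hi = cur.execute("SELECT MAX(StagesID) FROM Stages").fetchone()[0]

# walk the key range one slice at a time, deduplicating as we go
while lo <= hi:
    cur.execute("""INSERT OR IGNORE INTO tempDomains
                   SELECT DISTINCT Domain FROM Stages
                   WHERE StagesID >= ? AND StagesID < ?""", (lo, lo + batch))
    lo += batch

print(cur.execute("SELECT COUNT(*) FROM tempDomains").fetchone()[0])  # 7
```

Each slice touches only a contiguous range of the key, so the working set per statement stays small.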

  3. #3
    Join Date
    Nov 2002
    Location
    Jersey
    Posts
    10,322
Do we have to revisit the work-in-batches thing again?

Also, don't you want to see where the ID is NULL, so you don't have to do all of them?
    Brett
    8-)

    It's a Great Day for America everybody!

    dbforums Yak CorralRadio 'Rita
    dbForums Member List
    I'm Good Once as I ever was

    The physical order of data in a database has no meaning.

  4. #4
    Join Date
    Aug 2004
    Posts
    8
To keep it simple, I presumed there are no NULLs (I mean Stages.Domain does not allow NULLs). If this column is nullable, then everything should be faster.

    What's wrong with working in batches, as long as you get a 2000% performance improvement?

  5. #5
    Join Date
    Nov 2002
    Location
    Jersey
    Posts
    10,322
    Quote Originally Posted by cmihaila
What's wrong with working in batches, as long as you get a 2000% performance improvement?

Absolutely nothin...

    say it again....*

And I figure the update is because they insert the nonexistent "domains" and want to update the "key" with the generated identity column....

    There should be no need to update them all...

    Just a guess....



    * (we've already covered this ground several times)

  6. #6
    Join Date
    Oct 2003
    Posts
    268
    Thanks for the response!

    Yes, I have two indexes on Domains:
    PK_Domains PRIMARY KEY CLUSTERED (ID)
    IX_Domains UNIQUE NONCLUSTERED (Domain)

    I have a single index on Stages as well. There is no clustered index or primary key. In hindsight this index shouldn't be there since the UPDATE will have to update the index.
    IX_Stages NONCLUSTERED (DomainID, UserName)

I see the approach you are recommending. The temporary table with a Domain clustered index is an interesting one. The partitioning approach seems less promising (from my gut optimization feel). Shouldn't SQL Server be able to make such simple optimizations internally? Why would manually breaking the UPDATE into batches speed things up?

Quote Originally Posted by cmihaila
Do you have an index on Stages.Domain?
Do you have a PK (clustered index) on the Stages table?
Also, you should have a clustered index on Domains.Domain.
My suggestion is to try a "divide and conquer" approach ...

  7. #7
    Join Date
    Oct 2003
    Posts
    268
    Quote Originally Posted by Brett Kaiser
Do we have to revisit the work-in-batches thing again?

Also, don't you want to see where the ID is NULL, so you don't have to do all of them?
    There shouldn't be an ID of NULL. Each unique domain string will always have a unique ID. The INSERT statement creates new IDs. The UPDATE statement is guaranteed to have non-NULL values.

The work-in-batches approach: is this the horizontal partitioning approach you recommended earlier?

    From a conceptual level I have trouble seeing how these would optimize anything that shouldn't be done by SQL Server internally. If something can be tested quickly (and any schema changes easily undone), I'm eager to try another approach.

  8. #8
    Join Date
    Oct 2003
    Posts
    268
    Quote Originally Posted by cmihaila
To keep it simple, I presumed there are no NULLs (I mean Stages.Domain does not allow NULLs). If this column is nullable, then everything should be faster.

    What's wrong with working in batches, as long as you get a 2000% performance improvement?
That's correct: Stages.Domain doesn't allow NULLs.

    A 2000% performance improvement would be mighty nice (as would a 50% performance improvement). Can you explain how the batching improves things at a conceptual level?

  9. #9
    Join Date
    Aug 2004
    Posts
    8
Simple:
The size of the index to be scanned is a hundred times smaller.
The function timeOfScan(size of index) is not a linear one but rather exponential (I've tested this many times).
So, 10 scans of a 100K index is much, much faster than 1 scan of a 1-million-record index.
Again, I've learned this the hard way....

(Now, maybe you've got a supermachine there, with tons of gigs of RAM and super-speed processors and HDs. Then this cycling may be avoided...)

  10. #10
    Join Date
    Oct 2003
    Posts
    268
Quote Originally Posted by cmihaila
The function timeOfScan(size of index) is not a linear one but rather exponential (I've tested this many times). ...
The lookup cost on an index (binary tree) is logarithmic, not exponential. One scan of a million-element binary tree is faster than ten scans of a 100K tree. This is basic computer science theory, and I've seen it in C++ apps all the time.

However, I can't say the empirical evidence in front of me disagrees with you. 50 hours is a LONG time for a single operation to be running. 170 million records is large, but not THAT large. And this server has some heavy hardware to work with.
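For what it's worth, the back-of-envelope comparison counts support the logarithmic view; a quick check in Python (the batching win in practice usually comes from the working set fitting in memory, not from lookup depth):

```python
import math
from bisect import bisect_left

# worst-case comparisons for one binary search over N sorted keys ~ ceil(log2(N))
one_big = math.ceil(math.log2(1_000_000))       # one lookup in a 1M-key structure
ten_small = 10 * math.ceil(math.log2(100_000))  # ten lookups in 100K-key structures

print(one_big, ten_small)  # 20 170

# sanity check: binary search really does find a key in a million-element list
keys = list(range(1_000_000))
assert keys[bisect_left(keys, 123_456)] == 123_456
```

So per-lookup cost alone cannot explain a 48-hour runtime; something else (I/O, index maintenance, sorting) has to be dominating.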

  11. #11
    Join Date
    Aug 2004
    Posts
    8
You are right... The theory says you should get a logarithmic function.
The practice is that you get exponential behaviour.
I had your problem a long time ago, and I tried to follow the "theoretical" path, waiting a couple of days to get the result.

The solution I gave you is the only one I know, and it works. Maybe somebody more experienced can give you better (and more practical) advice.
    Cheers,
    Cezar

  12. #12
    Join Date
    Mar 2004
    Location
    Toronto
    Posts
    28
    What if you replace the original update

    UPDATE Stages SET DomainID = (SELECT Domains.[ID] FROM Domains WHERE Domains.Domain = Stages.Domain)

    with this update

    UPDATE S
    SET S.DomainID = D.ID
    FROM Stages S
    INNER JOIN Domains D ON D.Domain = S.Domain

    Haven't tested it, but it uses a join instead of a nested select.

  13. #13
    Join Date
    Nov 2002
    Location
    Jersey
    Posts
    10,322
That's not true...

You're inserting into the Domains table the domains that don't exist there yet.
Then you go process every row with an update.....

    Is the query still running?

    UPDATE S
    SET S.DomainID = D.ID
    FROM Stages S
    RIGHT JOIN Domains D
    ON D.Domain = S.Domain
    WHERE S.DomainID IS NULL
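The NULL filter also makes the update restartable: if a run is killed, the next run only touches the rows still unset. A sketch of that pattern in Python with SQLite (using a correlated subquery rather than SQL Server's joined UPDATE, since SQLite's UPDATE ... FROM support is much newer; table names from the thread, data made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Domains (ID INTEGER PRIMARY KEY, Domain TEXT UNIQUE)")
cur.execute("CREATE TABLE Stages (StagesID INTEGER PRIMARY KEY, Domain TEXT, DomainID INTEGER)")
cur.executemany("INSERT INTO Domains (Domain) VALUES (?)", [("a.com",), ("b.com",)])
# one row already has its DomainID filled in, as if a previous run got that far
cur.executemany("INSERT INTO Stages (Domain, DomainID) VALUES (?, ?)",
                [("a.com", 1), ("a.com", None), ("b.com", None)])

# only touch rows whose key is still unset, so a rerun skips finished work
cur.execute("""UPDATE Stages
               SET DomainID = (SELECT ID FROM Domains WHERE Domains.Domain = Stages.Domain)
               WHERE DomainID IS NULL""")

changed = cur.execute("SELECT changes()").fetchone()[0]
print(changed)  # 2 rows updated; the pre-filled row was skipped
```

Combined with batching, this means each batch commits its progress and a cancel never forfeits completed work.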

  14. #14
    Join Date
    Oct 2003
    Posts
    268
Quote Originally Posted by homer37
What if you replace the original update ... with this update ... Haven't tested it, but it uses a join instead of a nested select.
The estimated execution plans for the two statements differ only very slightly and show identical performance values.

I noticed that 75% of the execution time is a massive sort after the JOIN but before an index update. I think my big mistake was leaving that index on Stages. I can see how that gets dramatically worse with table size, and how partitioning or updating in blocks would help as well.

    However, if I cancel now, I incur potentially massive rollback penalties...
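Given that observation, one mitigation for a future run is to drop IX_Stages before the bulk UPDATE and rebuild it once afterward, so the index is built in a single pass instead of being maintained row by row. A toy sketch of the idea in Python with SQLite (data made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Stages (StagesID INTEGER PRIMARY KEY,
                                    Domain TEXT, DomainID INTEGER, UserName TEXT)""")
cur.executemany("INSERT INTO Stages (Domain, UserName) VALUES (?, ?)",
                [("d%d.com" % (i % 5), "user%d" % i) for i in range(50)])
cur.execute("CREATE INDEX IX_Stages ON Stages (DomainID, UserName)")

# drop the index the bulk UPDATE would otherwise have to maintain per row
cur.execute("DROP INDEX IX_Stages")
cur.execute("UPDATE Stages SET DomainID = 1")  # stand-in for the real backfill
# rebuild once, in a single sorted pass over the finished data
cur.execute("CREATE INDEX IX_Stages ON Stages (DomainID, UserName)")

print(cur.execute("SELECT COUNT(*) FROM Stages WHERE DomainID = 1").fetchone()[0])  # 50
```

Whether the drop-and-rebuild wins depends on table size and how much of the table the UPDATE touches; for an update that hits every one of 173 million rows, it usually does.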

  15. #15
    Join Date
    Oct 2003
    Posts
    268
    Quote Originally Posted by Brett Kaiser
    That's not true...
Ah, I see what you are saying. Basically, skip those that already have their DomainID columns set and only deal with those that don't. All 170 million of the Stages records have NULL DomainIDs; or at least they did before the UPDATE command started. These records were just bulk loaded into that table.
