  1. #1
    Join Date
    Sep 2009
    Posts
    2

    Unanswered: Iterating over a result set and executing in batches

    I am trying to iterate over a result set from a single-table query and then perform a delete on the results in batches, with a 'waitfor' after each batch. We need it performed in batches because we have a large replication infrastructure, and there have been times in the past when an update on a large data set filled the replication queues and crashed replication.

    I initially thought of cursors, but they do not support batch deletes (based on the docs). I then thought of just using temp tables and row counts, but that seems problematic as well.

    Here is an example of our situation: we have MyTable, which has two columns (uid, value) and 10 million records. uid is indexed but value is not. Of these 10 million records we will be deleting about 3 million, selected on the value column. Ideally we want to delete 10,000 records in a transaction, then waitfor 2 seconds to give the replication queues time to clear, then do the next batch of 10,000 in a new transaction, and so on until all have been processed.
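
    In pseudo-T-SQL, the behaviour we are after is roughly this (a sketch only; '...' stands for the real predicate on value, which I have left out):
    Code:
    SET ROWCOUNT 10000               -- batch size from the requirement above
    WHILE (1=1)
        BEGIN
        BEGIN TRAN
        DELETE MyTable
            WHERE value = '...'      -- '...' = the real predicate
        IF @@rowcount < 10000        -- short batch: nothing left to delete
            BEGIN
            COMMIT TRAN
            BREAK
            END
        COMMIT TRAN
        WAITFOR DELAY '00:00:02'     -- give the replication queues time to clear
        END
    SET ROWCOUNT 0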

    I am open to alternative solutions that meet our 'ensure replication queue isn't filled' requirement.

  2. #2
    Join Date
    Mar 2009
    Location
    Sydney, Australia
    Posts
    258
    1 Where you need to update or delete masses of data, it is good practice to break the work into batches, not just to avoid replication issues, but to avoid transaction log growth and lock contention as well. Small transactions make for sociable code; large transactions choke the server and cause delays for all other users. Even 10,000 is too large; we use 500 or 1,000 max for high-performance code. The delay, of course, is only necessary for the replication issue.

    2 Yes, cursors are one-row-at-a-time, they cannot handle batches.

    3 Temp tables are to be avoided unless absolutely necessary, and here they are certainly not necessary. All the #table detour would add is 1 x additional reads and 6 x additional writes. I will identify the additional, unnecessary work in parentheses:
    • read from MyTable
        write to #table (100% write)
            which involves logging (200% write)
    • read from #table (100% read)
        update/delete from MyTable
    • delete from #table (100% write)
        which involves logging (200% write)
    For the 3 million rows you plan to delete, that is roughly 3 million extra reads and 18 million extra logical writes. Therefore avoid #tables unless absolutely necessary.

    4 Of course the best design is to avoid the intermediate result set as well, along the same lines as avoiding the #table.

    Here's the code. Note that you do not need a delay if you use small transactions (batch size of 500 or 1,000, not 10,000); the delay is only required to give RepServer a chance to read and process a massive "transaction" in the transaction log. I have stuck it in there anyway, so that if it is required, you know where it goes. Deletes are expensive (I have not identified that expense in [3], just the logical delete). Here, since value is not indexed, the search will be slow as well, which should be acceptable as it is a background batch job. I do not recommend adding an index on value: although the implicit read to find the qualifying rows would be fast, the delete itself would be even slower.
    Code:
    SET ROWCOUNT 500
    WHILE (1=1)
        BEGIN
        DELETE MyTable
            WHERE value = '...'
        IF @@rowcount != 500         -- short batch: all qualifying rows are gone
            BREAK
        WAITFOR DELAY '00:00:02'     -- the delay: only needed for replication
        END
    SET ROWCOUNT 0

    Here's another post that should give you additional context: the construct is a generic design for any verb (à la standard good practice). If used with SET ROWCOUNT 1, it replaces a cursor, and has less overhead (no cursor-class locks).
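
    For example, with a rowcount of 1 the same loop touches exactly one row per pass, standing in for a cursor. A sketch; the UPDATE action and its predicate are hypothetical:
    Code:
    SET ROWCOUNT 1
    WHILE (1=1)
        BEGIN
        UPDATE MyTable
            SET value = 'clean'      -- hypothetical per-row action
            WHERE value = 'bad'      -- hypothetical qualification
        IF @@rowcount != 1           -- no row qualified: done
            BREAK
        END
    SET ROWCOUNT 0
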
    Regards
    Derek Asirvadem (Formerly DerekA)
    Information Architect / Senior Sybase DBA
    Copyright 2009 Software Gems Pty Ltd

    I answer questions from the Original Poster only. If you have a genuine question, as the moderators have requested, start a new thread.

    http://www.softwaregems.com.au

  3. #3
    Join Date
    Sep 2009
    Posts
    2
    Thanks for the reply, Derek.

    I did previously consider your option 4, but I am concerned it will take forever to complete, given our large data set and the lack of an index. This isn't a continually running batch job; it is a release-weekend maintenance task to clean up bad data, so it needs to finish in a few hours.

    The temp table seems like a better option due to the lack of index. The problem I have with that approach is how to 'batch' the #table delete and the MyTable delete together. I am worried that with a plain ROWCOUNT loop there is no way to guarantee my data stays consistent between #table selections, e.g.:

    Code:
    SET ROWCOUNT 500
    WHILE (1=1)
        BEGIN
        delete MyTable where uid in (select uid from #table)
        delete #table where uid in (select uid from #table)
        IF @@rowcount != 500
            BREAK
        END
    SET ROWCOUNT 0
    Maybe I can add a second temp table, ugh, to do it like:

    Code:
    SET ROWCOUNT 500
    select * into #processingTable from #table where 1 = 0   -- create the empty copy once
    WHILE (1=1)
        BEGIN
        insert #processingTable select * from #table          -- grab the next 500 rows
        delete MyTable where uid in (select uid from #processingTable)
        delete #table where uid in (select uid from #processingTable)
        delete #processingTable
        IF @@rowcount != 500
            BREAK
        END
    SET ROWCOUNT 0
    Man, that is a mess, but it seems like it might be the fastest option for me, because the non-indexed query is only run once.

  4. #4
    Join Date
    Mar 2009
    Location
    Sydney, Australia
    Posts
    258
    Quote: "The temp table seems like a better option due to the lack of index."
    3 You appear to have missed my point re temp tables. If the pure delete is fast, the #table approach adds 100% (read) and 600% (write), so it will be much slower. If the pure delete is slow, the #table approach still adds 100% (read) and 600% (write), so it will be much, much slower.

    Since one additional #table adds so much overhead, I do not see the point in adding a second #table. That's Ugh Squared!

    If you have only a small, one-time maintenance window, then I do not see the point in adding a known 100% read plus 600% write to the job. That's an Ugh!

    3.1 Personally, I would not run it as a one-time maintenance task; I would run it as a background task at lower priority, in production (it was the production users, given the lack of data integrity control, who created the bad data). That way it could take hours without interfering with production performance.
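
    One way to get the lower priority in ASE is an execution class. A sketch only, from memory, so verify the syntax against your ASE version's docs; 'batch_maint' is a hypothetical login, and EC3 is the predefined low-priority class:
    Code:
    -- Bind a dedicated maintenance login to the predefined low-priority
    -- execution class EC3, so the cleanup runs below normal user priority.
    exec sp_bindexeclass 'batch_maint', 'LG', NULL, 'EC3'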

    4.1 There is no problem guaranteeing consistency between the two tables if you wrap the pair of deletes in a transaction (and there is no guarantee if you do not):
    Code:
    DECLARE @error INT, @rowcount INT

    SET ROWCOUNT 500
    WHILE (1=1)
        BEGIN
        BEGIN TRAN LANCE
        delete MyTable where uid in (select uid from #table)
        delete #table where uid in (select uid from #table)
        SELECT @error = @@error, @rowcount = @@rowcount   -- capture both before the next statement resets them
        IF @error != 0
            BEGIN
            ROLLBACK TRAN LANCE
            BREAK
            END
        COMMIT TRAN LANCE
        IF @rowcount != 500
            BREAK
        END
    SET ROWCOUNT 0
    5 The lack of an index on value slows down the search (finding the rows where value = "xxx"). It has no effect on the speed of the delete itself.

    The delete itself is slow (as slow as an insert) because it has to delete the row plus every index entry: an APL (allpages-locked) table with a clustered index and 3 nonclustered indices means 4 logical writes, not counting page deallocations, etc. The same DPL/DRL (datapages/datarows-locked) table would be 5 logical writes. DPL/DRL may be a tiny bit faster for the delete itself, but it causes massive fragmentation (which slows down all accesses) and requires reorg rebuild.
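
    To see the split between the search cost and the delete cost on your own data, the session statistics are enough. A sketch, for one small batch; the predicate is elided as in the code above:
    Code:
    SET STATISTICS IO ON      -- report logical/physical reads and writes
    SET STATISTICS TIME ON    -- report parse/compile and execution times
    SET ROWCOUNT 500
    DELETE MyTable
        WHERE value = '...'   -- '...' = the real predicate
    SET ROWCOUNT 0
    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF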

    6 If you want maximum speed for the delete task, for the duration of the run:
    • drop all indices except the one on (uid)
    • add an index on MyTable (value)
    • assuming MyTable is in clustered/placement order of (uid), ensure the #table is populated via select-into in uid order
    • do not create an index on #table (uid)
    • use a rowcount of 1,000
    Dropping and adding indices other than the clustered/placement index is very fast; a sketch of the bracketing DDL follows.
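
    In DDL terms, the bracketing steps might look like this; the index names are hypothetical, and the clustered/placement index on (uid) stays put:
    Code:
    -- Before the run: shed the secondary indices, add one on value.
    drop index MyTable.ix_other                 -- hypothetical secondary index
    create index ix_value on MyTable (value)

    -- ... run the batched delete here ...

    -- After the run: remove the helper index and restore the originals.
    drop index MyTable.ix_value
    -- create index ix_other on MyTable (<original column list>)
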
    Regards
    Derek Asirvadem (Formerly DerekA)
    Information Architect / Senior Sybase DBA
    Copyright 2009 Software Gems Pty Ltd

    I answer questions from the Original Poster only. If you have a genuine question, as the moderators have requested, start a new thread.

    http://www.softwaregems.com.au
