  1. #1
    Join Date
    Nov 2002
    Posts
    15

    Unanswered: HELP - Trying to upload file contents to database on a nightly basis

    Hi,

    I have created a Java program that runs each night at approximately
    7 PM and picks up all files residing in a local directory in a
    pre-defined format. The format is basically part,quantity, with the
    file name being [id].txt.

    From this, I create two PreparedStatements.

    1 - INSERT INTO [TABLE] ([id], [part], [quantity])
    AND
    2 - UPDATE [TABLE] SET QUANTITY = ? WHERE ID = [id] and PART = part

    The file contains approximately 110,000 lines and is assumed to be a
    full snapshot. I loop through the file calling PreparedStatement (2)
    to update each part with the new quantity. If I hit an error because
    the part does not exist, I insert the part and quantity via
    PreparedStatement (1).
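
    In rough outline, the logic looks like this (a simplified sketch; the real table, column, and variable names differ, and conn/siteId/uploadFile stand in for my actual objects):

    Code:
    // Sketch only: needs java.io.* and java.sql.*; names are placeholders.
    PreparedStatement insertStmt = conn.prepareStatement(
        "INSERT INTO PARTS_TABLE (ID, PART, QUANTITY) VALUES (?, ?, ?)");
    PreparedStatement updateStmt = conn.prepareStatement(
        "UPDATE PARTS_TABLE SET QUANTITY = ? WHERE ID = ? AND PART = ?");

    BufferedReader in = new BufferedReader(new FileReader(uploadFile));
    String line;
    while ((line = in.readLine()) != null) {
        String[] fields = line.split(",");              // each line is part,quantity
        updateStmt.setInt(1, Integer.parseInt(fields[1]));
        updateStmt.setString(2, siteId);                // id comes from the file name
        updateStmt.setString(3, fields[0]);
        if (updateStmt.executeUpdate() == 0) {          // no row matched: the part is new
            insertStmt.setString(1, siteId);
            insertStmt.setString(2, fields[0]);
            insertStmt.setInt(3, Integer.parseInt(fields[1]));
            insertStmt.executeUpdate();
        }
    }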

    The problem is that this program generally takes 8-9 hours to
    complete, which is far too long. We are looking to expand it to
    update many files, but at this rate it will take weeks to complete
    them all!

    The format of the file cannot be changed, as it is sent to us from
    another company and they have defined the format.

    Is there a quicker, better way?

    Please help, I'm at a loss to figure out how to speed this up.

    Shannon

  2. #2
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    I am not sure exactly how you are "preparing" the SQL statements. Do you repeat the prepare for each record in the input file, then perform the execute for the update/insert? Or is there one prepare for the entire input file and then multiple executes, one for each record in the file? The first option (a prepare for each record in the input file) is very inefficient. Static SQL would be even better.

    The other question I have is how many inserts are there per day, and are they inserted at the end of the table or in middle of the table (this may depend on the definition of a clustering index and the new part number sequence that is being inserted). If you are doing lots of inserts in the middle of the table, and there is not sufficient free space, then performance will suffer.

    Other things to consider:

    - frequent commits, maybe every 100 input records processed
    - significantly increase size of log buffer
    - significantly increase size of database buffer pool (the total of all buffer pools that are active at any one time should usually be at least 1/4 - 1/2 of total system memory)
    - using separate tablespaces for each table
    - using separate tablespaces for indexes
    - define tables and indexes with sufficient freespace to handle inserts in-between reorgs.

    On second thought, if the input file contains the complete set of data that ends up in the DB2 table (i.e., every existing row in the DB2 table has an input record with updated quantity, and there may also be some new rows to be inserted), then you might do better with a load replace.
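
    For example, if the nightly data were massaged into a delimited file containing all three columns, something along these lines would rebuild the table in one pass (file and table names are made up; see the Command Reference for the full syntax):

    Code:
    -- Sketch only: file and table names are examples.
    -- parts.del would contain one delimited record per row: id,part,quantity
    LOAD FROM parts.del OF DEL
        REPLACE INTO parts_table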

  3. #3
    Join Date
    Nov 2002
    Posts
    15
    Marcus,

    Thanks for your response. In response to your questions:

    - I am preparing both statements at the beginning of the program and reading each line in through the following statement:

    BufferedReader bufferedInput = new BufferedReader(new FileReader(uploadFile));
    String line;

    while ((line = bufferedInput.readLine()) != null) {
        String[] fields = line.split(",");                // each line is part,quantity
        String part = fields[0];
        String partQty = fields[1];
        updateStmt.setString(1, part);                    // parameter indexes must match
        updateStmt.setInt(2, Integer.parseInt(partQty)); // the ?s in the prepared UPDATE
        updateStmt.executeUpdate();
    }

    - The rows are updated wherever they exist within the table or inserted at the end of the table.
    - The table had no indexes originally; adding indexes recently hasn't changed performance.
    - Autocommit = true, so very frequent commits. Would this slow things down?
    - will look in to this
    - will look in to this
    - I am using a separate tablespace purely for this upload, with a view in another tablespace that references it.
    - indexes are created in separate INDX tablespace
    - will look in to this as well

    Can you please point me towards a good guide on load replace? Also, can I specify an SQL statement in a load replace, e.g. can I add WHERE ID = ? to the end of the load?

    Regards
    Shannon
    Shannon Whitty
    reynolds.com.au

  4. #4
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Commits are very expensive if taken after each update/insert. Switch to every 100 updates. This change alone might significantly decrease elapsed time. A commit after each update/insert effectively prevents asynchronous writes (see my post in http://dbforums.com/t792798.html).

    I am not familiar with how autocommit works in a Java program, so I am not sure what your actual commit frequency is. But a single commit after all the updates (100,000+) is not good either.

    The creation of an index should speed the update of the quantity on existing rows (if not, then something is wrong) but it will slow down the insert process. If you defined one of the indexes as a clustering index, then DB2 will try to insert the data row into the same data page as the index order. DB2 can spend a lot of time looking for free space on the correct data page (or on nearby pages) that does not exist, so if a clustering index is defined, then freespace should be defined on the table. Freespace is only materialized after a reorg or load replace.

    Likewise, if index entries are not added at the end (as would occur if the new index entry is higher than all previous index entries), then DB2 can spend lots of time doing index page splits to keep the indexes in order. Indexes are always kept in exact order. If you have indexes during the insert process, they should probably have freespace defined (unless all new entries are added at the end because they are higher than all previous index entries).

    For best performance, the DB2 data rows should be in part number sequence (defined by making the index on part number the clustering index, which is enforced with reorg) and the input file should be sorted in the same part number sequence. This would cause each sequential quantity update to usually hit the same DB2 physical 4K page as the previous update.
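
    In DDL terms, setting that up would look roughly like this (table and index names are made up; the freespace is only materialized by the reorg):

    Code:
    -- Sketch only: names are placeholders.
    ALTER TABLE parts_table PCTFREE 15;                           -- leave room on data pages
    CREATE INDEX parts_clix ON parts_table (part_number) CLUSTER;
    REORG TABLE parts_table INDEX parts_clix;                     -- re-sequence rows, materialize freespace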

    The load utility is discussed in the Command Reference Guide. The replace option would completely reload the table from a sequential input file and rebuild the indexes. But whether this is best for you, might depend on what percent of the records get inserted each night (as opposed to just an update of the quantity).

  5. #5
    Join Date
    Nov 2002
    Posts
    15
    Marcus,

    I turned off AutoCommit and committed every 500 records. This had a drastic effect: the upload time for 103,826 records is now down to 3 hours 26 minutes, but I'm sure there is a lot more time to be saved somewhere else.
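
    For anyone else following along, the JDBC change boils down to this (sketch only; variable names differ in my actual code):

    Code:
    conn.setAutoCommit(false);                 // take over commit control from the driver
    int processed = 0;
    while ((line = bufferedInput.readLine()) != null) {
        // ... set parameters and executeUpdate() as before ...
        if (++processed % 500 == 0) {
            conn.commit();                     // commit every 500 input records
        }
    }
    conn.commit();                             // commit the final partial batch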

    The table is a basic one with the default allocation of freespace etc. and three columns (site_id, part_number, quantity). I have an index on site_id which is clustered and allows reverse scans, and an index on part_number. I placed the cluster on site_id because each part number will be selected based on the site_id. Should I convert the cluster, as you mentioned, to the part_number field?

    Realistically, normal files contain anywhere between 5,000-10,000 rows; it's just one that contains 100,000 and a couple of others up around the 30,000 mark. There are roughly 200,000 rows in the entire table, and I have separated the indexes into the INDX tablespace.

    Roughly how much freespace should I be allocating to the parts table before performing a reorg? I'm an Oracle guy who's just recently converted to DB2 and I am still learning the ropes. Any advice on table set-up and allocation would be greatly appreciated.
    Shannon Whitty
    reynolds.com.au

  6. #6
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    I do not understand what "site_id" is and how many unique values it has. Not sure if an index on that column is useful. Is the combination of site_id and part_number unique, or is part_number unique by itself?

    I would consider dropping the index on site_id and checking update and query results. You might be better off with no clustering index and DB2 will just do inserts at the end of the table.

    It's hard for me to tell you because I don't really understand the data. But based on the following update statement, you definitely want an index on part_number:

    update table
    set quantity = quantity_host_var
    where part_number = part_number_host_var;

    It would help if the table and the input file were in the same physical sequence: part_number, or (site_id and part_number), etc. This is so that the part_number being updated is on the same 4K tablespace page as the previous update transaction. DB2 does not store one row at a time; it stores rows physically in 4K pages.

    The amount of free space needed depends on how many inserts are done in-between reorgs. If you reorg after each update/insert, it might be better to not have clustering index or freespace.

  7. #7
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    I don't recall if this has been mentioned, but when you create an index, you need to reorg the table/indexes and then run runstats. This will help DB2 to decide whether to use any indexes you have created.
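
    That is, something like the following (the schema and table names are examples only):

    Code:
    REORG TABLE myschema.parts_table;
    RUNSTATS ON TABLE myschema.parts_table AND INDEXES ALL;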

  8. #8
    Join Date
    Nov 2002
    Posts
    15
    Marcus,

    An example of what the table contents would look like are as follows:

    siteId, partNumber, quantity
    100,part1,1
    100,part2,7
    100,part3,5
    100,part4,8
    200,part1,1
    200,part2,7
    200,part3,5
    200,part4,8
    300,part1,1
    300,part2,7
    300,part3,5
    300,part4,8
    400,part1,1
    400,part2,7
    400,part3,5
    400,part4,8

    So with this in mind, the siteId and password are unique and the partNumber will not be unique.

    Thanks
    Shannon
    Shannon Whitty
    reynolds.com.au

  9. #9
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    First, I assume that when you said "password" you meant part_number so that the site_id and part_number "together" are unique in the table.

    OK, then I would assume that you would NOT have the following update statement:

    update table
    set quantity = quantity_host_var
    where part_number = part_number_host_var;

    since that update statement would update multiple rows (the same part_number for every site_id).

    Rather you WOULD have the following update statement:

    update table
    set quantity = quantity_host_var
    where part_number = part_number_host_var
    and site_id = site_id_host_var;

    If I am wrong, please explain.

    Therefore, I would suggest one composite index (defined as unique) on part_number and site_id together. I suspect that it would be better to have the part_number first, but not sure because I don't know how the data is accessed. If you have a query which shows all the site_ids and quantities for a particular part_number, then part number should be first in the index.

    This may sound strange, but since part_number and site_id will be in the index, and the only other column is quantity, it would make sense to put quantity as the last column in the index so that all queries can use an "index only" access path. This will not speed up the updates (nor slow them down), but it will speed up the queries.

    So the unique index would be:

    part_number,
    site_id,
    quantity

    in that order. As previously mentioned, have the sequential input file sorted in the same order. Do a reorg and runstats after the index has been created.
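
    The DDL would be something like this (names are made up; note that uniqueness is then enforced on all three columns together):

    Code:
    CREATE UNIQUE INDEX parts_ux
        ON parts_table (part_number, site_id, quantity);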

  10. #10
    Join Date
    Nov 2002
    Posts
    15
    Thanks Marcus,

    Further clarification on your points:

    - Yes, you were correct in assuming that by password I meant partNumber (input error).

    - The update statement is exactly as follows:
    Code:
    UPDATE BATCH_PARTS_TB
    SET QUANTITY = &quantity 
    WHERE SITE_ID = &siteId 
    AND PART_NUMBER = &partNumber
    - The only other SQL statement is the insert which is as follows:
    Code:
    INSERT INTO BATCH_PARTS_TB VALUES (&siteId , &partNumber , &quantity)
    - No, we will never display/search all sites at one time; it will only be a part number on a per-site basis (each site is a retailer, each part is what they currently stock, and we will always display comparatively what each retailer stocks).

    The String I use to access the data is exactly the same every time and is as follows:

    Code:
    SELECT QUANTITY FROM BATCH_PARTS
    WHERE SITE_ID = &siteId
    AND UCASE(PART_NUMBER) LIKE UCASE('% &part %')
    Can I still assume that your suggestions are valid?

    Thanks for your continual support on this issue.

    Regards
    Shannon Whitty
    reynolds.com.au

  11. #11
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Given what you said about the "like" statement (you threw a curve with that one), I would create one index with the following columns (in the following order):

    site_id,
    part_number,
    quantity

    If you feel uncomfortable about including quantity in the index, just create an index on:

    site_id
    part_number

    Whichever index you create, it should be defined as unique. Make sure you reorg the table and execute runstats, and let us know how it performs. Sort the input file in the same sequence as the index (one way to do this is sketched below).

    If you could change the SQL which accesses the data to only have the % at the end (if the user knew the beginning of the part_number), and not put a % at the beginning, it would run a lot faster. Obviously, this can only be done depending on application requirements.
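
    As for sorting the input file, one simple option is to sort it in the load program itself (a sketch only; it assumes the whole file fits in memory and needs java.util.* in addition to your existing imports):

    Code:
    // Read all lines, sort them, then run the existing update loop over the result.
    // Lines start with the part number, so a plain string sort gives part number order.
    List<String> lines = new ArrayList<String>();
    String line;
    while ((line = bufferedInput.readLine()) != null) {
        lines.add(line);
    }
    Collections.sort(lines);
    for (String sorted : lines) {
        // ... setString/setInt and executeUpdate as before ...
    }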

  12. #12
    Join Date
    Nov 2002
    Posts
    15
    I'm going to re-create the index the way you've explained, but currently there is no issue with the speed of accessing the data; it is really only with updating and inserting the 100K+ record file.

    What about Clustering and Reverse scans?

    Thanks for all your help.
    Shannon Whitty
    reynolds.com.au

  13. #13
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    I don't think you need reverse scans, based on the SQL access you said that you use. I am not exactly sure what the performance penalty for allowing reverse scans is, but I am sure there is some penalty, otherwise it would be the default. So don't enable it if you will never use it in queries.

    The issue of clustering and percent free (on the table and index) depends on what percent of the data you insert each night (vs. just updating the quantity). If 10-12 percent of the data is inserted, then 15% free space would be about right. But this assumes that you reorg after each nightly update to recreate the free space for the next day's updates/inserts.

    So another option is to not define any clustering and let the data be inserted at the end of the table. However, indexes are always kept in exact order, so percent free should be used on indexes, and reorgs should be done nightly to improve the update/insert performance of the next day (this also helps queries somewhat).
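
    For example, to leave headroom on the index pages (made-up names; the PCTFREE takes effect at reorg time):

    Code:
    CREATE UNIQUE INDEX parts_ux
        ON parts_table (site_id, part_number, quantity)
        PCTFREE 10;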

    I just reorg'ed a table with 16,000 rows and one index and it took about 5 seconds on a Pentium 3 running DB2 8.1. My database buffer pool size is 24 MB (defined as 6,000 4K pages). I would be interested in knowing how long it takes to reorg your 100,000 row table.

  14. #14
    Join Date
    May 2003
    Location
    Phoenix, US
    Posts
    39

    Re: HELP - Trying to upload file contents to database on a nightly basis

    Hi,

    Is there any reason why the load option was not considered?
    If you want to use the load option, I can suggest a method to perform this operation.

    Thanks
    Sateesh



  15. #15
    Join Date
    Nov 2002
    Posts
    15
    I removed all indexes on the table and put one index on siteId, partNumber & quantity. Last night's 100,000-row update/insert ran in approximately 5 minutes! I believe this is perfect for our needs.

    I think the two main contributing factors were:

    1. Changing from committing every row to committing every 500 rows.
    2. Removing all indexes from the table and replacing them with one unique, clustered index on siteId, partNumber & quantity.

    Thanks for all your help, you have definitely opened my eyes to better processes and methods.

    PS: The reorg on 170,000 rows took about 25 seconds.

    Shannon Whitty
    reynolds.com.au
