  1. #1
    Join Date
    Jun 2007
    Location
    Ohio, USA
    Posts
    142

    Unanswered: Looking for suggestions on multiple table import job.

    Ok, here's the current situation:

    We have about 60 tables in a DB2 mainframe DB, and every morning we copy a small slice of each of them over to an MS SQL DB. The process takes about an hour, and the availability window is only about 90 minutes. This all hinges on waiting for one particular table to be ready in DB2. We're also setting the DB to single user to keep people from modifying data during the load.

    Right now, the data flow goes something like this:
    1. Truncate the staging table in SQL.
    2. Import from the DB2 staging table to the SQL staging table.
    3. Delete all rows from the DB2 staging table.
    4. Delete rows from the SQL destination table that exist in the SQL staging table. (Matching on the PK.)
    5. Import all rows from the SQL staging table to the destination table.
    Right now, this basic construct is run table by table, spread out over a total of 4 jobs, 17 SSIS packages, tons of embedded SQL instead of sprocs, and mixed in with various DBREINDEX calls, SET RESTRICTED_USER statements, cache purges and other stuff. (Why, Ceiling Cat? Why?) A rough sketch of the per-table pattern is below.
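    To make that concrete, each table's load today looks roughly like this (table, column, and linked-server names are placeholders, not our real ones, and I'm using a stg schema here just for readability):

    -- Rough sketch of the existing per-table pattern; all names are made up for illustration.
    TRUNCATE TABLE stg.Customer;                                   -- step 1

    INSERT INTO stg.Customer (CustomerID, CustomerName)            -- step 2: pull from the DB2 staging table
    SELECT CustomerID, CustomerName
    FROM   DB2LINK.PRODLIB.APPSCHEMA.CUSTOMER_STG;                 -- assumes a linked server to DB2

    -- step 3 (clearing out the DB2 staging table) runs on the DB2 side

    DELETE d                                                       -- step 4: remove destination rows about to be replaced
    FROM   dbo.Customer AS d
    WHERE  EXISTS (SELECT 1 FROM stg.Customer AS s
                   WHERE  s.CustomerID = d.CustomerID);

    INSERT INTO dbo.Customer (CustomerID, CustomerName)            -- step 5: load the fresh rows
    SELECT CustomerID, CustomerName
    FROM   stg.Customer;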

    Here's what I was thinking: I'm proposing that we have three jobs in total.
    1. Master job - Calls the other two jobs, and performs any cleanup or maintenance required after they run. Handles logging, error reporting, and control of the other two jobs.
    2. DB2 Import job - Checks a log table in DB2 that will hold info on the readiness of the DB2 staging tables. When they are ready, they will be imported into the SQL staging tables.
    3. SQL Import job - Checks a log table on the SQL side that will hold info on the readiness of the SQL staging tables. When they're ready, they will be imported into the SQL destination tables.
    I'm thinking jobs 2 and 3 can be run simultaneously so that when a table has been imported from DB2, the SQL import job can import it to SQL without waiting for the rest of the tables to be brought over from DB2.
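    For the readiness log, I'm picturing something along these lines (table and column names are just placeholders at this point; the DB2 side would get an equivalent):

    CREATE TABLE etl.LoadStatus
    (
        TableName  sysname  NOT NULL PRIMARY KEY,
        IsReady    bit      NOT NULL DEFAULT (0),
        LoadedAt   datetime NULL
    );

    -- The SQL import job would poll for staging tables that are ready but not yet loaded...
    SELECT TableName
    FROM   etl.LoadStatus
    WHERE  IsReady = 1
      AND  LoadedAt IS NULL;

    -- ...load each one, then stamp it as done:
    UPDATE etl.LoadStatus
    SET    LoadedAt = GETUTCDATE()
    WHERE  TableName = N'Customer';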

    A few other questions come to mind.
    • Should I stick with the Delete/Insert logic in steps 4 & 5 of the existing process, or would an UPSERT be better? (There's a MERGE sketch of what I mean below this list.)
    • Is there a way I can execute multiple DB2 or SQL imports in parallel?
    • Should I have a separate staging database, or a separate staging schema in SQL? It's all in one big MDF right now, and the size of the imported data is not trivial, maybe 100 GB.
    • Would locking the tables individually be better than setting the DB to restricted_user?
    • Is this the best way to do this?
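    By UPSERT I mean something like the following (MERGE needs SQL Server 2008 or later, and the names are placeholders again):

    MERGE dbo.Customer AS dest
    USING stg.Customer AS src
          ON dest.CustomerID = src.CustomerID
    WHEN MATCHED THEN
        UPDATE SET dest.CustomerName = src.CustomerName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, CustomerName)
        VALUES (src.CustomerID, src.CustomerName);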
    As always, any help or pointers in the right direction are appreciated.
    David Maxwell
    Data Integrity? Yeah, I've heard of that...

  2. #2
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    How long is the current insert/delete taking? This doesn't sound like the bottleneck, but if it is then it could well be worth investigating.

    I doubt you'll be able to do this in single user mode...

    It is my personal preference to use schemas for this kind of work, but that's mainly from a maintainability point of view - all objects used in interface X are part of schema Y.

    I don't know the answer to the last point, but I feel it is a decision that must be made by your business rules. I guess one benefit of single user mode is that you don't have to contend for resources with other connections *shrug*.

    How long does each stage in the process currently take? Are there any indexes on the staging table? If so, consider dropping them prior to import and re-creating after...
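    For example, something along these lines around each staging load (the index here is just a made-up example, since I don't know your actual tables):

    DROP INDEX IX_stg_Customer_CustomerID ON stg.Customer;

    -- ...bulk load into stg.Customer here...

    CREATE NONCLUSTERED INDEX IX_stg_Customer_CustomerID
        ON stg.Customer (CustomerID);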
    George
    Home | Blog

  3. #3
    Join Date
    Jun 2007
    Location
    Ohio, USA
    Posts
    142
    Quote Originally Posted by georgev
    How long is the current insert/delete taking? This doesn't sound like the bottleneck, but if it is then it could well be worth investigating.
    The real bottleneck is a delay step in one of the jobs, and the fact that there are 4 separate jobs all running at different times throughout the 90 minute window. The jobs are scheduled depending on when it is assumed that the DB2 tables will be ready, with one exception:

    The delay step is the second step in job 3 of 4, and it waits for a file to be created before continuing to run the rest of the job. That file gets created when one of the DB2 jobs loads a table on its side. Meanwhile, there are several other tables that could be imported from DB2 to SQL while that load completes on the DB2 side. This usually creates a delay of around 10 to 15 minutes, and in a 90-minute window that makes me a bit nervous.

    The Delete/Insert vs. Upsert was just a thought I had, which I'm really not tied to. Otherwise, the Delete/Insert steps don't cause a significant performance impact.

    Quote Originally Posted by georgev
    I doubt you'll be able to do this in single user mode...
    No, but in restricted user we can. The users and application IDs don't have dbo, so that would effectively keep them out while still allowing any number of sa-level logins in.
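    For reference, this is roughly what the current jobs do (the database name is a placeholder):

    ALTER DATABASE OurDb SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE;  -- bumps non-privileged connections

    -- ...run the load...

    ALTER DATABASE OurDb SET MULTI_USER;                               -- open it back up afterwards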

    Quote Originally Posted by georgev
    It is my personal preference to use schemas for this kind of work, but that's mainly from a maintainability point of view - all objects used in interface X are part of schema Y.
    Good point. Another piece of info which I probably should have shared in the beginning is that backup space is an issue on this server. (As it is everywhere around here.) I was hoping that by moving all the rest of the tables to a staging database, I could reduce the amount of space and time it would take to back up the production DB. I could just truncate all the staging tables at the end of the load, but I've run into more than one situation where we had to reload some data from staging after the jobs had completed. Having a separate staging database would cut the production DB size significantly.

    Quote Originally Posted by georgev
    I don't know the answer to the last point, but I feel it is a decision that must be made by your business rules.
    I've talked things over with the business and dev teams, and they're pretty much in favor of this approach, though they're not the most technical of people. That's why I wanted to get some input from the folks on here.

    Quote Originally Posted by georgev
    How long does each stage in the process currently take? Are there any indexes on the staging table? If so, consider dropping them prior to import and re-creating after...
    Each stage depends mostly on the size of the data being imported, but even the largest tables don't take more than a few minutes to load, most less than a minute. There are no indexes on the staging tables at this time and, honestly, creating them slipped my mind. Good call. I'll do some experimenting with that and see what the performance looks like.
    David Maxwell
    Data Integrity? Yeah, I've heard of that...

  4. #4
    Join Date
    Jun 2004
    Location
    Long Island
    Posts
    696
    I would use the lowest-level tools possible to move the data, i.e. BCP/BULK INSERT on the SQL side and whatever export tools DB2 offers, and try to eliminate SSIS altogether (assuming everything else can be done via shell scripts and SQL). I had a similar issue with LoanPerformance data and stuck to the basics (BCP/bulk copy, SQL, and shell scripts); DTS was no match for that approach in terms of speed and data movement.
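    As a rough sketch, once the DB2 export lands in a flat file, the SQL side can be as simple as this (path, delimiters and table name are placeholders) - or the equivalent bcp command line:

    BULK INSERT stg.Customer
    FROM '\\etlshare\extracts\customer.dat'
    WITH (FIELDTERMINATOR = '|',
          ROWTERMINATOR   = '\n',
          TABLOCK);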
