  1. #1
    Join Date
    Jul 2009
    Posts
    44

    Differential inserts

    Hi,
    I have possibly a few hundred thousand entries coming in from the user. In the most usual case, only about 10% would be changes/updates to existing entries; the rest would be exact copies of what is already stored. The two ways I could think of to handle this were:

    1) Delete all existing rows and insert everything fresh.
    2) Compare each incoming row against what already exists: update it if it has changed, or create it if it does not exist.

    Both of these options seem far from ideal to me. Can anyone help me out with a really neat way to figure out the delta and then do a bulk update followed by a bulk create?
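    To make option 1 concrete, here is roughly what I had in mind (the table and column names are just placeholders):

        -- option 1: wipe the production table and reload everything from the incoming set
        TRUNCATE TABLE dbo.Entries;

        INSERT INTO dbo.Entries (EntryID, Value)
        SELECT EntryID, Value
        FROM dbo.IncomingEntries;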

    thanks

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Load all your imported data into staging tables.

    Delete records from your staging tables based upon existing natural keys in your production tables.

    Process what is left.
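    A minimal sketch of those two steps, assuming a hypothetical staging table Staging_Entries and a production table Entries that share a natural key EntryID:

        -- drop staging rows whose natural key already exists in production
        DELETE s
        FROM dbo.Staging_Entries s
        WHERE EXISTS (SELECT 1 FROM dbo.Entries e WHERE e.EntryID = s.EntryID);

        -- "process what is left": everything surviving in staging is a new record
        INSERT INTO dbo.Entries (EntryID, Value)
        SELECT EntryID, Value
        FROM dbo.Staging_Entries;

    As written, this only handles brand-new records, which is why the next question matters: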

    By the way, are you only dealing with new records, or do you need to be concerned about modifications to existing records as well?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  3. #3
    Join Date
    Jul 2009
    Posts
    44
    Thanks for the quick reply!

    As for the incoming rows, there would be updates and fresh creates all in the same set. However, the use case we know means that on a normal basis, only 10% would be updates, and only 4-5% would be new entries.

  4. #4
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    Are there performance requirements you need to comply with? You have the usual suspects on the table already:

    1) Truncate and repopulate (easy, reasonably quick, involves some potential downtime).
    2) Party on production in real time while processing records (the only real reason to do this is if some constraint prevents you from using staging tables).
    3) blindman's suggestion: use staging tables to shape your data, then process the stuff you care about.

    The major unknown so far is how expensive it is to compare your incoming data with your existing data. That could possibly be cost-prohibitive, resource-wise, assuming you have tolerance for temporarily blowing out the data with a truncate/restore. If you don't have that tolerance, then you don't get much choice...
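    If the comparison itself is the worry, one cheap, set-based way to pull just the delta (made-up names again; EXCEPT has been available since SQL Server 2005):

        -- staging rows with no identical counterpart in production,
        -- i.e. the changed rows plus the brand-new ones
        SELECT EntryID, Value FROM dbo.Staging_Entries
        EXCEPT
        SELECT EntryID, Value FROM dbo.Entries;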

  5. #5
    Join Date
    Jul 2009
    Posts
    44
    So we do have the concept of staging tables, and we are in fact storing everything that comes in there.
    Wouldn't truncating and reinserting be really expensive? For, say, 100,000 rows, if there were only 20 with updated values and 10 new rows, I would be deleting all 100,000 and then inserting them all again!
    On the other hand, doing a comparison would mean figuring out the delta for each row in staging, then doing an update where a value changed for an existing record, or a create where none exists. Neither of these seems neat for the small percentage of deltas we expect. In case a better solution does not exist, I would love to know which of the two methods is relatively better.

    thanks again!

  6. #6
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Quote Originally Posted by rocker86
    As for the incoming rows, there would be updates and fresh creates all in the same set. However, the use case we know means that on a normal basis, only 10% would be updates, and only 4-5% would be new entries.
    "The use case we know"?
    "only 10% would be updates"?
    I can tell you are new at this, but you seem like a quick learner.
    Rule Of (Database) Application Development #1: Code for 100%. Not 90%. Not 95%.
    100%.
    Rule Of (Database) Application Development #2: Don't design around "Use Cases". Design around your Business Model.
    Use Cases are for developers, who spend more time debating what font style and background color to use than they do thinking about entities and relationships. Design your database around Use Cases in iteration #1, and I guarantee (or your money back) that it will be woefully unable to support the Use Cases in iteration #2.

    But back to your problem...
    1) Load the data into staging tables.
    2) Cleanse the staging data (consistent use of nulls, zero-length strings, default values, etc.).
    3) Verify the staging data. (relational integrity, datatypes and domains, etc.) Flag any records that fail.
    4) Update existing records.
    5) Add new records.
    6) Delete all successfully processed staging data.
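    A rough sketch of steps 4 through 6, with made-up table and column names, and assuming step 3 marks failures in a hypothetical Flagged column (on SQL Server 2008 and later, a single MERGE statement can combine the update and the insert):

        -- 4) update existing records whose values actually changed
        --    (step 2's cleansing keeps the NULL comparisons honest here)
        UPDATE e
        SET e.Value = s.Value
        FROM dbo.Entries e
        JOIN dbo.Staging_Entries s ON s.EntryID = e.EntryID
        WHERE e.Value <> s.Value AND s.Flagged = 0;

        -- 5) add records whose natural key is not yet in production
        INSERT INTO dbo.Entries (EntryID, Value)
        SELECT s.EntryID, s.Value
        FROM dbo.Staging_Entries s
        WHERE s.Flagged = 0
          AND NOT EXISTS (SELECT 1 FROM dbo.Entries e WHERE e.EntryID = s.EntryID);

        -- 6) delete all successfully processed staging data
        DELETE FROM dbo.Staging_Entries WHERE Flagged = 0;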

    I strongly advise against deleting existing data and reimporting it. You will never be able to store additional related data in your database that is not in your source data, and it will play havoc with relational integrity.
