Results 1 to 5 of 5
  1. #1
    Join Date
    Oct 2005
    Posts
    15

    Detecting change records from CSV import

    I'm developing a database that I'm going to routinely import CSV data into. The CSV data doesn't contain a primary key and I have no control over the source of the CSV so I'm stuck with it as it is.

    The data has a first name column, last name column and date column which combined will probably be unique 99.9% of the time. There are then 10 other columns which may be updated over time.

    So my issues are I need to be able to import new records, not import duplicates into my database and detect changes in existing records.

    I could use the first name, last name and date columns as a composite key but I'm thinking maybe I could generate a hash based on the combined data of those columns and then create a hash based on the combined data of the other columns. I'd then be able to recognise existing records and then check for changes to those records using the hashes rather than have to check column by column or update the whole lot 'just in case'.

    I'm pretty sure this would work - does anyone have any comments, suggestions or better ideas?

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    On most database engines, there is little cost/performance difference between a seek on one hash column versus a date and two char columns. You ought to test on your configuration, but I doubt that there is much if any difference in performance. I would only introduce the complexity of two extra hashes if it bought me a material performance improvement.

    If your database engine implements something comparable to the SQL MERGE statement, I'd bet that the efficiency of a MERGE using one or two appropriate indexes would make it both perform better and easier to understand than using the hashes.

    Without access to your data and hardware I can't test this, but I'd only add the complexity of the hashes if they bought me enough process performance to offset the extra people time needed for their and feeding.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Join Date
    Oct 2005
    Posts
    15
    Thanks Pat.

    I was thinking the main benefit I'd get would be by hashing the non key columns I'd be able to see if that record has changed in the imported CSV without having to do a column by column comparison as any of these 10 columns may be updated.

    As you say I might be re-inventing the wheel though - think I'll do some testing to see if I'm actually gaining anything other than complexity

    Steve.

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    If you do a MERGE with pk = pk and not (col1 = col1 AND col2 = col2) that will only update the changed rows (and insert any new rows). I'd be surprised if there was any material difference.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  5. #5
    Join Date
    Oct 2005
    Posts
    15
    Thanks - I'll give that a try!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •