  1. #1
    Join Date
    Dec 2006
    Posts
    1

Question Unanswered: massive data import, how to avoid duplicates?

    Hello,

    I am currently working on a project where I have to import a huge amount of data from CSV files into a database.
    I don't want to have duplicate keys in my table, but my CSV file contains them. When a key appears more than once, the line closer to the end of the file holds the more up-to-date information, and that is the version I have to store.

    I have been trying to fix this problem for several weeks, but my algorithm is very slow and blocks all other processes on the server. At the moment I copy every record that occurs more than once in the CSV file into a temp table. After that I run through this table line by line, check whether the key already exists in the target table, and then either insert or update.
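
    For context, a minimal sketch of that row-by-row approach; the staging table #dupes, the target table Target, and the columns KeyCol/Payload are made-up names:

    DECLARE @key INT, @payload VARCHAR(100)

    -- walk the temp table of duplicated records one row at a time
    DECLARE dup_cursor CURSOR FOR
        SELECT KeyCol, Payload FROM #dupes

    OPEN dup_cursor
    FETCH NEXT FROM dup_cursor INTO @key, @payload

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- one lookup plus one write per row: this is the slow part
        IF EXISTS (SELECT 1 FROM Target WHERE KeyCol = @key)
            UPDATE Target SET Payload = @payload WHERE KeyCol = @key
        ELSE
            INSERT INTO Target (KeyCol, Payload) VALUES (@key, @payload)

        FETCH NEXT FROM dup_cursor INTO @key, @payload
    END

    CLOSE dup_cursor
    DEALLOCATE dup_cursor

    Each row triggers its own lookup and its own write, which is why this pattern scales so badly on a large import.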

    Does somebody know a better process?

    I hope somebody can help me...

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    If by "running through this table line by line" you mean you are using a cursor, then yes, there are faster ways:

    DELETE A
    FROM [yourtable] A
    INNER JOIN [yourtable] B
        ON A.PrimaryKey = B.PrimaryKey
        AND A.[OrdinalColumn] < B.[OrdinalColumn]
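
    A quick worked example of that self-join delete, with made-up table and column names, assuming the identity column reflects file order:

    -- staging table: later rows carry the newer data
    CREATE TABLE #stage (
        OrdinalColumn INT IDENTITY(1,1),
        PrimaryKey    INT,
        Payload       VARCHAR(100)
    )

    INSERT INTO #stage (PrimaryKey, Payload) VALUES (1, 'old')
    INSERT INTO #stage (PrimaryKey, Payload) VALUES (2, 'only')
    INSERT INTO #stage (PrimaryKey, Payload) VALUES (1, 'new')

    -- delete every row that has a later duplicate of the same key
    DELETE A
    FROM #stage A
    INNER JOIN #stage B
        ON A.PrimaryKey = B.PrimaryKey
        AND A.OrdinalColumn < B.OrdinalColumn

    -- leaves (1, 'new') and (2, 'only')
    SELECT PrimaryKey, Payload FROM #stage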
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  3. #3
    Join Date
    Jan 2006
    Location
    USA
    Posts
    115
    Quote Originally Posted by dreamingKiwi
    Hello,

    I am currently working on a project where I have to import a huge amount of data from CSV files into a database.
    I don't want to have duplicate keys in my table, but my CSV file contains them. When a key appears more than once, the line closer to the end of the file holds the more up-to-date information, and that is the version I have to store.

    Does somebody know a better process?

    I hope somebody can help me...
    This should work (a T-SQL sketch of these steps follows below)...

    1. Create a table with the same structure as the one you require, without defining any relationships.
    2. Load all the data from the CSV file into that table.
    3. Identify the column(s) that must be unique.
    4. Delete all the duplicate data, using those unique columns as the base. Check the link below for identifying and deleting duplicate rows.

    http://www.sql-server-performance.co...duplicates.asp

    5. Transfer your data to the actual table using DTS or the Import/Export Wizard.
    6. Delete the previously created table.
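
    A minimal sketch of those steps, keeping the last occurrence of each key; the names csv_stage, Target, KeyCol, Payload, and the file path are all assumptions:

    -- 1. staging table with no constraints, plus an identity to record file order
    CREATE TABLE csv_stage (
        RowId   INT IDENTITY(1,1),
        KeyCol  INT,
        Payload VARCHAR(100)
    )

    -- 2. load the CSV; in practice a format file or a view over the table
    --    is needed so the identity column is skipped during the load
    -- BULK INSERT csv_stage FROM 'C:\import.csv'
    --     WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')

    -- 3./4. delete every duplicate except the last occurrence of each key
    DELETE A
    FROM csv_stage A
    INNER JOIN csv_stage B
        ON A.KeyCol = B.KeyCol
        AND A.RowId < B.RowId

    -- 5. transfer the de-duplicated rows to the actual table
    INSERT INTO Target (KeyCol, Payload)
    SELECT KeyCol, Payload FROM csv_stage

    -- 6. drop the staging table
    DROP TABLE csv_stage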
    Rajesh Patel

    Everybody says - mistake is the first step of success, but it's not true. The correction of the mistake is the first step of success.

  4. #4
    Join Date
    May 2004
    Location
    Seattle
    Posts
    1,313
    Couldn't you also just create a PK or unique constraint on an empty table with IGNORE_DUP_KEY=ON and then bcp it all in? That way dupes will be silently ignored (they should be, anyway).

    After it's bcp'd in, rebuild the index with IGNORE_DUP_KEY=OFF.
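
    A sketch of that approach, with made-up names; one caveat: IGNORE_DUP_KEY keeps the row that arrives first and rejects later duplicates, so on its own it would keep the oldest version of each key rather than the newest:

    -- empty target whose PK silently discards duplicate keys
    CREATE TABLE Target (
        KeyCol  INT NOT NULL,
        Payload VARCHAR(100),
        CONSTRAINT PK_Target PRIMARY KEY (KeyCol)
            WITH (IGNORE_DUP_KEY = ON)
    )

    -- load everything from the command line; duplicate keys raise a
    -- warning instead of an error
    -- bcp MyDb.dbo.Target in C:\import.csv -c -t, -S myserver -T

    -- once loaded, rebuild the index so future duplicates fail loudly
    ALTER INDEX PK_Target ON Target
        REBUILD WITH (IGNORE_DUP_KEY = OFF)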

  5. #5
    Join Date
    Nov 2002
    Location
    Jersey
    Posts
    10,322
    I would still go with a work table with no constraints, but have an identity column... it's not guaranteed that the "older" dup key value will get the lower identity value, but I have never seen it otherwise... too bad the file doesn't have a date-added column...

    In any case, since you want the "last" record and the ordering is pretty arbitrary, it should work for this.

    Depending on what "a lot" of data means, I might:

    SELECT * INTO SecondStageTable
    FROM FirstStageTable o
    WHERE EXISTS (SELECT i.[Key] FROM FirstStageTable i
                  WHERE i.[Key] = o.[Key]
                  GROUP BY i.[Key] HAVING COUNT(*) = 1)

    That gets the non-duplicated rows into a table with minimal logging

    Then do the same for the keys that occur more than once: find the max identity per key and insert those rows into another stage table, or, if the set is tiny, into the existing second stage table (a sketch of that step follows below)
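
    A minimal sketch of that step, assuming FirstStageTable carries an identity column named RowId (an assumed name):

    -- for each duplicated key, keep only the row with the highest identity,
    -- i.e. the one that came last in the file
    INSERT INTO SecondStageTable ([Key], Payload)
    SELECT o.[Key], o.Payload
    FROM FirstStageTable o
    WHERE o.RowId = (SELECT MAX(i.RowId)
                     FROM FirstStageTable i
                     WHERE i.[Key] = o.[Key])
      AND EXISTS (SELECT 1 FROM FirstStageTable d
                  WHERE d.[Key] = o.[Key]
                  GROUP BY d.[Key] HAVING COUNT(*) > 1)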

    And again, depending on the volume, to avoid logging I would bcp the data out and then back in to the base table.

    Otherwise you could just insert into the base table.

    I would then drop the stage tables, either now, or before I have to re-do this process...
    Brett
    8-)

    It's a Great Day for America everybody!

    dbforums Yak Corral
    Radio 'Rita
    dbForums Member List
    I'm as Good Once as I ever was

    The physical order of data in a database has no meaning.

  6. #6
    Join Date
    Dec 2006
    Posts
    3
    Hmmm... I don't know your specific situation, but have you thought of scrubbing the data before you pass it to a stored procedure, using C#, VB.NET, VB6, or whatever?

    Cheers

    Erron
