  1. #1
    Join Date
    Mar 2007
    Posts
    212

    Unanswered: importing and normalising data.

    I have an Excel spreadsheet with 18,000 records. Each record has the following fields:

    First Name, Last Name, Address, DOB, Disabled Permit Number, Issue Date, Exp Date.

    A disabled permit number is unique and lasts for three years. Since the data in the spreadsheet is not held in a normalised fashion, I wish to import it into a database that has two normalised tables: tbl_citizen and tbl_permit. The primary key for tbl_citizen is citizen_id. The primary key for tbl_permit is (permit_num, citizen_id), with citizen_id being a FK referencing tbl_citizen. I have imported the entire spreadsheet into a temporary table called tbl_temp. How would you write the SELECT/INSERT queries to get the citizen data from tbl_temp into tbl_citizen, and then insert the corresponding permit data into tbl_permit? Also, many of the records in tbl_temp have typos, e.g.:

    John, Smith, 101 Test St, 12.12.1954, 123456, 12.12.2001, 12.12.2004
    John, Smit, 101 Test St, 12.12.1954, 234567, 11.11.2004, 11.11.2007

    As you can see from the above, the surname has a typo, but it is the same citizen, and these sorts of typos are all over the show, i.e. they sometimes appear in the address field or the DOB field. Sometimes they are not typos at all, e.g. if the citizen has moved address after the first permit was issued, the address associated with the second permit will legitimately be different.
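    For reference, here is a minimal sketch of the target schema as described above (the column names and types other than the keys are just my working assumptions):

        CREATE TABLE tbl_citizen (
            citizen_id  INT IDENTITY PRIMARY KEY,  -- or AUTO_INCREMENT, depending on the DBMS
            first_name  VARCHAR(50),
            last_name   VARCHAR(50),
            address     VARCHAR(200),
            dob         DATE
        );

        CREATE TABLE tbl_permit (
            permit_num  INT,
            citizen_id  INT REFERENCES tbl_citizen (citizen_id),
            issue_date  DATE,
            exp_date    DATE,
            PRIMARY KEY (permit_num, citizen_id)
        );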

    Any help would be appreciated.

    Cheers.

  2. #2
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    So how would you write down the logic to work out whether it's a duplicate or not?
    What (manual) logic would you apply to identify a typo?

    Assuming you can devise a manual process, it should be easy enough to translate into an automated process... but I wouldn't be holding my breath on that.

    You could use a variant of the Soundex algorithm to help identify similar names. I guess you need some sort of matching criterion, say a 90% match might be a typo.
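    For instance, a rough sketch in SQL Server flavoured SQL (assuming tbl_temp has columns first_name, last_name, dob and permit_num; DIFFERENCE() compares Soundex codes and returns 0 to 4, so 4 is roughly your 90% match):

        -- flag pairs of rows that share a first name and DOB but whose
        -- surnames are spelled differently while sounding alike
        SELECT a.first_name, a.last_name, b.last_name AS possible_typo, a.dob
        FROM   tbl_temp a
        JOIN   tbl_temp b
               ON  a.dob        = b.dob
               AND a.first_name = b.first_name
               AND a.last_name  <> b.last_name
               AND DIFFERENCE(a.last_name, b.last_name) >= 3   -- 4 = strongest match
        WHERE  a.permit_num < b.permit_num;                    -- list each pair only once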

    The alternative is to bung all the data into a single table, process everything, then give your users access to a form/screen that allows them to plough through the data to find similar records.
    I'd rather be riding on the Tiger 800 or the Norton

  3. #3
    Join Date
    Mar 2007
    Posts
    212
    I would use the name and DOB to uniquely identify a citizen - this should cover almost 90% of the records, I am hoping. Using the above, I could move the bulk of the records into the normalised tables. Then whatever records are left in the temp table can be manually sorted out. I am hoping there won't be that many left!

    OK, what would the SELECT/INSERT queries to achieve the above look like?
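    A minimal sketch, assuming citizen_id is an IDENTITY/auto-increment column, that first name + last name + DOB is the matching key as described, and the same tbl_temp column names as in post #2:

        -- 1. one citizen row per distinct name + DOB
        --    (MAX() arbitrarily picks one address where the copies disagree)
        INSERT INTO tbl_citizen (first_name, last_name, address, dob)
        SELECT first_name, last_name, MAX(address), dob
        FROM   tbl_temp
        GROUP  BY first_name, last_name, dob;

        -- 2. permits, joined back to the citizens just created
        INSERT INTO tbl_permit (permit_num, citizen_id, issue_date, exp_date)
        SELECT t.permit_num, c.citizen_id, t.issue_date, t.exp_date
        FROM   tbl_temp t
        JOIN   tbl_citizen c
               ON  c.first_name = t.first_name
               AND c.last_name  = t.last_name
               AND c.dob        = t.dob;

    Note that typo twins like John Smith / John Smit will come across as two separate citizens, so weed those out of tbl_temp first (e.g. with the DIFFERENCE() query in post #2); whatever you pull out can stay in the temp table for the manual pass.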
