If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > importing and normalising data.

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 08-15-09, 12:10
ozzii ozzii is offline
Registered User
 
Join Date: Mar 2007
Posts: 194
importing and normalising data.

I have an excel spread sheet with 18000 records. Each record has the following data:

First Name, Last Name, Address, DOB, Disabled Permit Number, Issue Date, Exp Date.

A disabled permit number is a unique number and last for 3 years. Since data in the spread sheet is not held in a normalized fashion i wish to import it into a database that has two normalised tables: tbl_citizen and tbl_permit. Primary key for tbl_citizen is citizen_id. Primary key for tbl_permit is (permit_num, citizen_id) with citizen_id being FK referencing tbl_citizen. I have imported the entire spreadsheet into a temporary table called tbl_temp. How would you run a select update query to get citizen data from tbl_temp and insert into tbl_citizen and then insert the corresponding permit data into tbl_permit? Also many of the records in tbl_temp have typos e.g.

John, Smith, 101 Test St, 12.12.1954, 123456, 12.12.2001, 12.12.2004
John, Smit, 101 Test St, 12.12.1954, 234567, 11.11.2004, 11.11.2007

As you can see from the above the surname has a typo but it is the same customer and these sort of typos are all over the show i.e they sometimes appear in the address field or DOB field. Sometimes they are not typos at all e.g. if the citizen has moved address after the first permit was issued therefore the address associated with the second permit will be different.

Any help would be appreciated.

Cheers.
Reply With Quote
  #2 (permalink)  
Old 08-15-09, 12:36
healdem healdem is offline
Jaded Developer
 
Join Date: Nov 2004
Location: out on a limb
Posts: 9,258
so how would you write down the logic to work out whether its a duplicate or not
what (manual) logic would you apply to identify a typo?

assuming you can devise a manual process then it should be easy enough to translate into an automated process... but I wouldn't be holding my breath on that.

you could use a variant of the soundex algolrhythm to help identify simlar names. I guess you need some sort of matching criteria say a 90% match might be a typo.

the alternative is to bung all the data into a signle table, process everything then gove you users access to some process a from/screen that allows thm to plough through the data to find similar.
__________________
I'd rather be riding my Versys or my Tiger 800 let alone the Norton
Reply With Quote
  #3 (permalink)  
Old 08-15-09, 13:06
ozzii ozzii is offline
Registered User
 
Join Date: Mar 2007
Posts: 194
I would use the name and DOB to uniquely identify a citizen - this would give me almost 90% am hoping. Using the above I could move the bulk of the the records into the normalised tables. Then what ever records are left in the temp table can be manually sorted out. I am hoping there wont be that many left!

Ok what would be select update/insert queries to achieve the above.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On