Thread: Data cleaning help
08-23-06, 10:31 #1 Registered User
- Join Date
- Sep 2003
- New York, NY
The application I am working on was developed about a year ago. In one of the tables, users enter a new business name (the id is a sequence id), and within the last year it has grown to 6500 records.
Looking at the data manually, I find lots of records that were entered incorrectly, for example:
223 $10.0MM to $20MM of Sen Sec Loan
887 $10MM to $20MM of Sen Sec Loan
332 XXX CORP BRIDGE FIN.
768 XXX CORP BRIDGE FINANCING
432 PARTICLEBOARD L.P.
543 PARTICLEBOARD L.P
Is there a way to find such similar matches within these 6500 records? I would like to identify the similar business names and then delete one of each set of duplicates from the table.
Any suggestions or Google search links would be helpful.
OK, now I know it's called deduplication. Any tips on how to do that without using expensive tools?
Last edited by rohitkumar; 08-23-06 at 10:58.
08-23-06, 11:23 #2 Registered User
- Join Date
- Jun 2003
- West Palm Beach, FL
You could try using the SOUNDEX() function:
SQL> select soundex(name), name from test;

SOUN NAME
---- ----------------------------------------
M351 $10.0MM to $20MM of Sen Sec Loan
M351 $10MM to $20MM of Sen Sec Loan
X261 XXX CORP BRIDGE FIN.
X261 XXX CORP BRIDGE FINANCING
P632 PARTICLEBOARD L.P.
P632 PARTICLEBOARD L.P

6 rows selected.

SQL>
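SOUNDEX only keys on the first few sounds of a name, so it may be worth cross-checking its candidates with a plain string-similarity score. The following is a minimal sketch of that idea outside the database, not part of the SOUNDEX approach itself: it assumes the id/name rows have been exported into a Python list, uses the standard library's difflib.SequenceMatcher, and the function names and the 0.8 threshold are illustrative assumptions you would need to tune on your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample rows copied from the question: (id, business name).
RECORDS = [
    (223, "$10.0MM to $20MM of Sen Sec Loan"),
    (887, "$10MM to $20MM of Sen Sec Loan"),
    (332, "XXX CORP BRIDGE FIN."),
    (768, "XXX CORP BRIDGE FINANCING"),
    (432, "PARTICLEBOARD L.P."),
    (543, "PARTICLEBOARD L.P"),
]


def normalize(name):
    """Case-fold and collapse runs of whitespace so trivial differences do not count."""
    return " ".join(name.casefold().split())


def similar_pairs(records, threshold=0.8):
    """Return (id_a, id_b, score) for every pair scoring at or above the threshold.

    The threshold is an assumed starting point, not a recommended value.
    """
    cleaned = [(rec_id, normalize(name)) for rec_id, name in records]
    pairs = []
    # Compare every record with every later record exactly once.
    for (id_a, name_a), (id_b, name_b) in combinations(cleaned, 2):
        score = SequenceMatcher(None, name_a, name_b).ratio()
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs


if __name__ == "__main__":
    for id_a, id_b, score in similar_pairs(RECORDS):
        print(id_a, id_b, score)
```

Comparing all pairs among 6500 names means roughly 21 million comparisons, so in practice you would probably only compare names that already share a SOUNDEX code. Either way, treat the output as a list of candidates to review by hand before deleting anything.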