  1. #1
    Join Date
    Sep 2003
    Location
    New York, NY
    Posts
    136

    Unanswered: Data cleaning help

    The application I am working on was developed about a year ago. In one of the tables, users enter a new business name (id is a sequence id), and within the last year it has grown to 6500 records.
    Looking at the data manually, I find lots of records that were entered incorrectly, for example:

    id name
    223 $10.0MM to $20MM of Sen Sec Loan
    887 $10MM to $20MM of Sen Sec Loan
    332 XXX CORP BRIDGE FIN.
    768 XXX CORP BRIDGE FINANCING
    432 PARTICLEBOARD L.P.
    543 PARTICLEBOARD L.P

    Is there a way to find such similar matches within these 6500 records? I would like to identify the similar business names and then work on deleting one of the duplicate names from the table.

    Any suggestions or Google search links would be helpful.

    Thanks
    Rohit

    ==========================================================
    OK, now I know it's called deduplication. Any tips on how to do that without using expensive tools?
    Last edited by rohitkumar; 08-23-06 at 11:58.

  2. #2
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    You could try using the SOUNDEX() function:
    Code:
    SQL> select soundex(name), name from test;
    
    SOUN NAME
    ---- ----------------------------------------
    M351 $10.0MM to $20MM of Sen Sec Loan
    M351 $10MM to $20MM of Sen Sec Loan
    X261 XXX CORP BRIDGE FIN.
    X261 XXX CORP BRIDGE FINANCING
    P632 PARTICLEBOARD L.P.
    P632 PARTICLEBOARD L.P
    
    6 rows selected.
    
    SQL>
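
    SOUNDEX() only gives you candidates, so one possible next step is to pair up the rows that share a code and review them by eye before deleting anything. This is a rough sketch that reuses the TEST table and the ID and NAME columns from above; the ID_1/NAME_1/ID_2/NAME_2 aliases are just made up for readability. Keep in mind that the code is only the first letter plus three digits for the consonants that follow, so long names that merely start alike will get paired too:

    ```sql
    -- List every pair of rows whose names share a SOUNDEX code.
    -- a.id < b.id reports each pair once and never pairs a row with itself.
    SELECT a.id   AS id_1,
           a.name AS name_1,
           b.id   AS id_2,
           b.name AS name_2
    FROM   test a
    JOIN   test b
           ON  SOUNDEX(a.name) = SOUNDEX(b.name)
           AND a.id < b.id
    ORDER  BY SOUNDEX(a.name), a.id;
    ```

    Going by the codes shown above, the six sample rows should come back as three pairs: 223/887, 332/768 and 432/543. On the full 6500 rows, expect some false matches that need a manual look.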

