Results 1 to 10 of 10
  1. #1
    Join Date
    Feb 2012
    Posts
    36

    Unanswered: Pl SQL Code for Comparing the data in the Column

    The scenario is like this,


    Column 'COMPANY ' in table ENTITY has different company name, Basically it is a Data Cleansing project.

    Now, Column 'COMPANY' has different company names, but some company names are repeated with different prefix or sufixes.


    ID COMAPNY
    EX: 1 National Commercial Bank
    2 National Commercial Bank, Saudi
    3 National Commercial Bank, Saudi Arabia
    4 National Commercial Banq.

    This all are the possible datas in the column.But here the problem is ID.
    The Corresponding ID For this Values are unique.So We cannot Decide Based on ID.

    So Business is asking to.

    1.Create a PLSQL Functions / Procedures etc,Which will take the first row ( ex : First row data) and compare with whole Column, thus finding the similarities.

    2.They want to compare it by string to string wise.

    EX: In the above data example, if we consider 1st and 2nd row, First 3 strigs are be equal. So, they are asking to assign percentage for comarison (If first string compares then 10%, if second string compares 20% like that).In this case above example data will have 30% value (1st and 2nd row).

    3) After this the values should be inserted into a temp table so that we can see the differences between the diffrent company names.

    Table shoud be of the below format.

    ID1 COMPANY_NAME1 ID2 COMPANY_NAME2 PERCENTAGE OF MATCH

    (Please find the attached sheet for more infor on table with example.)

    Here ,

    ID1 And COMPANY_NAME1 will be the record considered for matching,

    ID2,COMAPNY_NAME2, PERCENTAGE of MATCH will be the record that partially or fully matches,
    and theire ID and Percentage of match. (Percentage is based on the string that matches, if its a 3 word letter, and if first 2 word matches then 20%,
    if 3 word matches then 30% like that)


    Hope you got it Friends,

    Please help me on this,its very important.

    If you have any other method for this, please suggest that.

    Its not that we are looking for same work around, if we get other good ideas we will do that also.
    Attached Thumbnails Attached Thumbnails Datasample.jpg  

  2. #2
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    So, first you might want to separate COMPANY into words.
    Then compare each word by word.

    But, are orders relevant?
    For example:
    Code:
    ID1 COMPANY_NAME1                   ID2 COMPANY_NAME2                     PERCENTAGE OF MATCH
    1   National Commercial Bank        5   The National Commercial Bank      ??
    1   National Commercial Bank        6   Commercial Bank of Saudi National ??
    2   National Commercial Bank, Saudi 6   Commercial Bank of Saudi National ??
    Another minor issue might be what are the separators of words?
    Space(" ") and comma(",") were all?

  3. #3
    Join Date
    Jan 2004
    Location
    Croatia, Europe
    Posts
    4,094
    Provided Answers: 4
    I think that you might benefit from deeper reading on Oracle Text.

  4. #4
    Join Date
    Feb 2012
    Posts
    36
    @Tonkuma : Orders are very important.

    We cannot alter it thus making our problem more stringent.

    Already words are seperated by spaces.

  5. #5
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    (1) You wrote
    Orders are very important.
    So, what PERCENTAGE OF MATCH do you want for these examples?

    Quote Originally Posted by tonkuma View Post
    But, are orders relevant?
    For example:
    Code:
    ID1 COMPANY_NAME1                   ID2 COMPANY_NAME2                     PERCENTAGE OF MATCH
    1   National Commercial Bank        5   The National Commercial Bank      ??
    1   National Commercial Bank        6   Commercial Bank of Saudi National ??
    2   National Commercial Bank, Saudi 6   Commercial Bank of Saudi National ??
    Please clarify that do you want to consider the relative sequence of words or the absolute sequence for them.

    For example,
    (a) the relative sequence of words 'National', 'Commercial' and 'Bank' were same in COMPANY_NAME1 and COMPANY_NAME2 in the first row.
    But, absolute sequence for the words was (1, 2, 3) in COMPANY_NAME1 and was (2, 3, 4) in COMPANY_NAME2.

    (b) the relative sequence of words 'Commercial', 'Bank' and 'Saudi' were same in COMPANY_NAME1 and COMPANY_NAME2 in the third row(if ignored comma).
    But, absolute sequence for the words was (2, 3, 4) in COMPANY_NAME1 and was (1, 2, 4) in COMPANY_NAME2.


    (2) You wrote
    Already words are seperated by spaces.
    (2-1) Do you want to consider "Bank"(like in ID=1) and "Bank,"(like in ID=2, 3) are different words?
    I asked
    Another minor issue might be what are the separators of words?
    Space(" ") and comma(",") were all?
    (2-2) seperated by spaces in a string(i.e. COMAPNY column)
    So, it might be better to extract each word into separate variables or separate columns, or rows in a column,
    to compare the word by word.

  6. #6
    Join Date
    Jun 2004
    Location
    Liverpool, NY USA
    Posts
    2,509
    Maybe soundex


    >select soundex('National Commercial Bank') from dual;

    SOUN
    ----
    N354

    select soundex(' National Commercial Bank, Saudi') from dual

    SOUN
    ----
    N354

    1* select soundex('National Commercial Bank, Saudi Arabia') from dual


    SOUN
    ----
    N354

    1* select soundex('National Commercial Banq.') from dual


    SOUN
    ----
    N354

    1* select soundex('North Commercial Banq.') from dual


    SOUN
    ----
    N632
    Bill
    You do not need a parachute to skydive. You only need a parachute to skydive twice.

  7. #7
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    The result of soundex is based on the first character and trailing digits base on following characters.

    For example
    soundex('The National Commercial Bank')
    may return
    T535

    It means that soundex('The National Commercial Bank') might be completely different from soundex('National Commercial Bank') (i.e. N354).


    So, I want ask again
    Quote Originally Posted by tonkuma View Post
    (1) You wrote
    So, what PERCENTAGE OF MATCH do you want for these examples?


    Please clarify that do you want to consider the relative sequence of words or the absolute sequence for them.

    For example,
    (a) the relative sequence of words 'National', 'Commercial' and 'Bank' were same in COMPANY_NAME1 and COMPANY_NAME2 in the first row.
    But, absolute sequence for the words was (1, 2, 3) in COMPANY_NAME1 and was (2, 3, 4) in COMPANY_NAME2.

    (b) the relative sequence of words 'Commercial', 'Bank' and 'Saudi' were same in COMPANY_NAME1 and COMPANY_NAME2 in the third row(if ignored comma).
    But, absolute sequence for the words was (2, 3, 4) in COMPANY_NAME1 and was (1, 2, 4) in COMPANY_NAME2.

    ...
    ...
    If the values like 'The National Commercial Bank' was not included in the table by some ways(or excluded essentially),
    then my anxious might be unneccesary.

  8. #8
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Here are some examples of SOUNDEX tested on DB2 9.7.5 on Windows.

    Code:
    COMPANY                                SOUNDEX Matched with 'National Commercial Bank' ?
    -------------------------------------- ------- -----------------------------------------
    National Commercial Bank               N354    Matched                                  
    National Commercial Bank, Saudi        N354    Matched                                  
    National Commercial Bank, Saudi Arabia N354    Matched                                  
    National Commercial Banq.              N354    Matched                                  
    North Commercial Banq.                 N632    Not matched                              
    
    /* I doubted the following results of SOUNDEX for OP's requirements. */ 
    Nation Leading Bank.                   N354    Matched                                  
    National (Warranteed by Iran) Bank     N354    Matched                                  
    Nation Wide Commercial Bank            N353    Not matched                              
    The National Commercial Bank           T535    Not matched                              
    
      9 record(s) selected.

    The query was
    Code:
    ------------------------------ Commands Entered ------------------------------
    SELECT company
         , SOUNDEX(company) AS soundex
         , CASE SOUNDEX('National Commercial Bank')
           WHEN SOUNDEX(company) THEN 'Matched'
           ELSE                       'Not matched'
           END  AS "Matched with 'National Commercial Bank' ?"
     FROM  (VALUES
               'National Commercial Bank'
             , 'National Commercial Bank, Saudi'
             , 'National Commercial Bank, Saudi Arabia'
             , 'National Commercial Banq.'
             , 'North Commercial Banq.'
             , 'Nation Leading Bank.'
             , 'National (Warranteed by Iran) Bank'
             , 'Nation Wide Commercial Bank'
             , 'The National Commercial Bank'
           ) AS s(company)
    ;

  9. #9
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    The essential of my issue might be
    (first) how to rank the similarity,
    and
    (second) where to put the automatic limit of matched rows(or to let the user's dicision).

  10. #10
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    Cool

    Are you looking for a result similar to this?:
    Code:
    SQL> WITH tab (id, company)
      2       AS (SELECT 1, 'National Commercial Bank' FROM DUAL UNION ALL
      3           SELECT 2, 'National Commercial Bank, Saudi' FROM DUAL UNION ALL
      4           SELECT 3, 'National Commercial Bank, Saudi Arabia' FROM DUAL UNION ALL
      5           SELECT 4, 'National Commercial Banq.' FROM DUAL UNION ALL
      6           SELECT 5, 'North Commercial Banq.' FROM DUAL UNION ALL
      7           SELECT 6, 'Nation Leading Bank.' FROM DUAL UNION ALL
      8           SELECT 7, 'National (Warranteed by Iran) Bank' FROM DUAL UNION ALL
      9           SELECT 8, 'Nation Wide Commercial Bank' FROM DUAL UNION ALL
     10           SELECT 9, 'The National Commercial Bank' FROM DUAL)
     11  SELECT x.id id1, x.company comp_1, t.id id2, t.company comp_2
     12       , compare_txt ( x.company, t.company) pct
     13    FROM tab t
     14       , (SELECT *
     15            FROM tab
     16           WHERE id = 1) x
     17    WHERE t.ID > 1
     18  /
    
     ID1 COMP_1                          ID2 COMP_2                                      PCT
    ---- ------------------------------ ---- ---------------------------------------- ------
       1 National Commercial Bank          2 National Commercial Bank, Saudi            81.5
       1 National Commercial Bank          3 National Commercial Bank, Saudi Arabia     66.7
       1 National Commercial Bank          4 National Commercial Banq.                  81.8
       1 National Commercial Bank          5 North Commercial Banq.                     45.5
       1 National Commercial Bank          6 Nation Leading Bank.                       45.5
       1 National Commercial Bank          7 National (Warranteed by Iran) Bank         42.9
       1 National Commercial Bank          8 Nation Wide Commercial Bank                83.3
       1 National Commercial Bank          9 The National Commercial Bank               88.0
    
    8 rows selected.
    Where pct is computed as follows: sum of characters from each word that matched * 100 / length (greatest (comp_1, comp_2))
    The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •