Results 1 to 13 of 13
  1. #1
    Join Date
    May 2012
    Posts
    13

    Unanswered: Levenshtein Edit Distance

    Hi ,

    I'm quite new to DB2 - only a few days in fact. I'm having trouble both A) finding a version of the Levenshtein Edit Distance as a function for DB2, or B) finding a function written in TSQL that I cam able to convert without running into issues caused by differences between the two. I just need th fastes Levenshtein function possible to compare two strings but haven't made any progress after a few days of searching.

    I've looked at Soundex, and it is not what I'm looking for.

    Has anyone here created or re-written a Levenshtein distance function for Db2?

    Thanks so much,
    Tor

  2. #2
    Join Date
    Aug 2001
    Location
    UK
    Posts
    4,650
    You can use a Java/C UDF for Levenshtein Edit Distance.

    Do you have TSQL code for this? Then post what you have to get help.
    Visit the new-look IDUG Website , register to gain access to the excellent content.

  3. #3
    Join Date
    May 2012
    Posts
    13
    Thanks - it must be an internal function as far as I know right now. I'm a student employee and have liomited access to certian libraries and for now would just like to be able to see if this function is feasible or fast enough.

    What I had tried to manually convert is actually quite short and apparently very fast. It's at Softwhack: 35x Improved T-SQL LevenShtein Distance Algorithm...at a cost

    However, I don't think I converted it properly, as I was getting very wrong results ('testing','testing one two three') was giving me a difference of only '2'.

    Thanks,
    Tor

  4. #4
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    I thought that the algorithm described in the article Softwhack: 35x Improved T-SQL LevenShtein Distance Algorithm...at a cost
    might not work for strings including more than 2 consecutive inserted/deleted characters.


    I'm sorry!

    The issue was already noted in the article...
    This new algorithm comes at a cost of not checking for insertions/deletions along the entire length of the comparison string, so we can now only handle one insertion/deletion consecutively except at the end of the comparison string. This, in effect, makes it no longer follow the Levenshtein algorithm's end capabilities, but this is an acceptable cost for our needs.
    Last edited by tonkuma; 05-16-12 at 22:25. Reason: Add [QUOTE] tag to quoted paragragh. Add "I'm sorry! The issue was already noted in the article..."

  5. #5
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Quote Originally Posted by trobinson1 View Post
    What I had tried to manually convert is actually quite short and apparently very fast. It's at Softwhack: 35x Improved T-SQL LevenShtein Distance Algorithm...at a cost

    However, I don't think I converted it properly, as I was getting very wrong results ('testing','testing one two three') was giving me a difference of only '2'.

    Thanks,
    Tor
    Although I don't know how did you converted,
    this might be an answer

    Example 1: Implementation of algorithm of Brent Bulla.
    Converted from Brent Bulla's TSQL code.
    Code:
    ------------------------------ Commands Entered ------------------------------
    CREATE OR REPLACE FUNCTION quasi_Levenshtein
    ( left varchar(255) , right varchar(255) )
    RETURNS INT
    DETERMINISTIC
    NO EXTERNAL ACTION
    
    BEGIN
    DECLARE difference , leftIndex , rightIndex INT;
    DECLARE left_char , right_char VARCHAR(2);
    
    CASE
    WHEN LENGTH(left)  = 0 THEN
         RETURN LENGTH(right);
    WHEN LENGTH(right) = 0 THEN
         RETURN LENGTH(left);
    ELSE SET (rightIndex , leftIndex , difference) = (1 , 1 , 0);
    END CASE;
    
    /* equivalent to Brent Bulla's original code.
    WHILE leftIndex <= MAX( LENGTH(left) , LENGTH(right) ) DO
    */
    WHILE MIN( leftIndex    , rightIndex    )
       <= MAX( LENGTH(left) , LENGTH(right) ) DO
       SET left_char  = SUBSTRB(left , leftIndex , 1) || '*'
         , right_char = SUBSTRB(right, rightIndex, 1) || '*';
       IF left_char <> right_char THEN
          IF left_char  = SUBSTRB(right, rightIndex + 1 , 1) || '*' THEN    
             SET rightIndex = rightIndex + 1;
          ELSEIF
             right_char = SUBSTRB(left , leftIndex  + 1 , 1) || '*' THEN
             SET leftIndex  = leftIndex  + 1;
          END IF;
          SET difference = difference + 1;
       END IF;
       SET leftIndex  = leftIndex  + 1
         , rightIndex = rightIndex + 1;
    END WHILE;
    
    RETURN difference;
    END@
    ------------------------------------------------------------------------------
    DB20000I  The SQL command completed successfully.
    Some test results.
    Code:
    ------------------------------ Commands Entered ------------------------------
    VALUES quasi_Levenshtein('testing' , 'testing one two three')@
    ------------------------------------------------------------------------------
    
    1          
    -----------
             14
    
      1 record(s) selected.
    
    
    ------------------------------ Commands Entered ------------------------------
    VALUES quasi_Levenshtein('testing' , 'testxxxing')@
    ------------------------------------------------------------------------------
    
    1          
    -----------
              6
    
      1 record(s) selected.
    
    
    ------------------------------ Commands Entered ------------------------------
    VALUES quasi_Levenshtein('testing' , '')@
    ------------------------------------------------------------------------------
    
    1          
    -----------
              7
    
      1 record(s) selected.
    
    
    ------------------------------ Commands Entered ------------------------------
    VALUES quasi_Levenshtein('' , 'test')@
    ------------------------------------------------------------------------------
    
    1          
    -----------
              4
    
      1 record(s) selected.

    Note: I replaced WHILE condition

    Code:
    /* If used equivalent to Brent Bulla's original code
    WHILE leftIndex <= MAX( LENGTH(left) , LENGTH(right) ) DO
    */
    ------------------------------ Commands Entered ------------------------------
    VALUES quasi_Levenshtein('texsxtxixnxg' , 'testingyz')@
    ------------------------------------------------------------------------------
    
    1          
    -----------
              5
    
      1 record(s) selected.
    Code:
    /* If new condition
    WHILE MIN( leftIndex    , rightIndex    )
       <= MAX( LENGTH(left) , LENGTH(right) ) DO
     was used */
    ------------------------------ Commands Entered ------------------------------
    VALUES quasi_Levenshtein('texsxtxixnxg' , 'testingyz')@
    ------------------------------------------------------------------------------
    
    1          
    -----------
              7
    
      1 record(s) selected.
    Last edited by tonkuma; 05-17-12 at 00:25. Reason: Replace WHILE condition.

  6. #6
    Join Date
    May 2012
    Posts
    13
    Perhaps I should read more about the algorithm, as I didn't know that it would have such an effect on the results. I still think I must have converted to work with db2 wrong. I beleive substring indexes aren't zero-based here? Either way, this is something I'd like to have working and testable in a few days. If there aren't any for db2 already written, I'll have to make some time at work to find another to convert and post back if I have any issues.


    But maybe before going down this route too eagerly, maybe I should ask: is there a better way to imitate some sort of "google auto suggest"? I wanted too see how fast this algorithm worked and then attemp some combinations of it and soundex to possibly get something similar to how google suggests searches as you type. Would you happen to know a Better way around this?

    Thanks

  7. #7
    Join Date
    May 2012
    Posts
    13
    Oh wow, thanks for that code sample! I'll have to try this tomorrow at work. That looks exactly what I was expecting.

    Thanks again

  8. #8
    Join Date
    May 2012
    Posts
    13
    Thanks tonkuma - the function worked great, and I am wondering one additional thing:

    It appears I may be after the Damerau–Levenshtein instead. I don't know where you found the above function you provided me, but was there a Damerau–Levenshtein there by chance? If not I can experiment and try looking elsewhere. Basically I need to look up suggestions from a Db column with 1000's of records as fast as possible and taking the re-ordering of words into account ( aid first - "Did you mean: first aid?", etc).

    Thanks again.

  9. #9
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    The UDF quasi_Levenshtein was converted from Softwhack: 35x Improved T-SQL LevenShtein Distance Algorithm...at a cost
    and improved to make it compact/effective.

    I have started to implement full Levenshtein algorithm by SQL on DB2 for LUW,
    referencing some web sites like...
    Levenshtein distance - Wikipedia, the free encyclopedia
    Efficient Implementation of the Levenshtein-Algorithm, Fault-tolerant Search Technology, Error-tolerant Search Technologies
    Levenshtein Distance
    so on...

    Now, I want to attack Damerau–Levenshtein after implementation of Levenshtein.
    But, I afraid I may loose interest on this subject later.

  10. #10
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Here is a trial Levenshtein function.

    By short test, it looks wrok well.
    But, performance rapidly get worse, when parameter strings get longer.

    Example 2: used an algorithm described in references in my previous post.
    Code:
    ------------------------------ Commands Entered ------------------------------
    CREATE OR REPLACE PROCEDURE Levenshtein
    ( IN  s VARCHAR(255)    , IN  t VARCHAR(255)
    , OUT distance SMALLINT , OUT row_count INTEGER
    )
    MODIFIES SQL DATA
    DETERMINISTIC
    NO EXTERNAL ACTION
    
    BEGIN
    DECLARE m , n , k SMALLINT;
    
    DECLARE CONTINUE HANDLER FOR SQLSTATE '42710'
     BEGIN /* Already exists? Ignore */ END;
    
    DECLARE GLOBAL TEMPORARY TABLE
    d ( d SMALLINT NOT NULL , i SMALLINT NOT NULL , j SMALLINT NOT NULL )
    NOT LOGGED
    ;
    
    DECLARE GLOBAL TEMPORARY TABLE
    natural_zero ( n SMALLINT NOT NULL )
    NOT LOGGED
    ;
    CREATE INDEX SESSION.natural_zero_n
     ON SESSION.natural_zero (n)
    ;
    
    INSERT INTO SESSION.natural_zero
    SELECT n1 + n2 + n3 + n4 + n5
     FROM (VALUES 0 ,   1 ,   2 ,   3 ) n(n1)
         ,(VALUES 0 ,   4 ,   8 ,  12 ) n(n2)
         ,(VALUES 0 ,  16 ,  32 ,  48 ) n(n3)
         ,(VALUES 0 ,  64 , 128 , 192 ) n(n4)
         ,(VALUES 0 , 256             ) n(n5)
     WHERE NOT EXISTS
          (SELECT 0
            FROM  SESSION.natural_zero
            WHERE n = 0
          )
    ;
    
    SET m = LENGTH(s)
      , n = LENGTH(t)
    ;
    
    DELETE FROM SESSION.d;
    INSERT INTO SESSION.d
    SELECT i , i , 0
     FROM  SESSION.natural_zero AS n(i)
     WHERE i <= m
    UNION ALL
    SELECT j , 0 , j
     FROM  SESSION.natural_zero AS n(j)
     WHERE j BETWEEN 1 AND n
    ;
    
    SET k = 2;
    WHILE k <= m + n DO
       INSERT INTO SESSION.d
       SELECT CASE
              WHEN s1 =  t1 THEN
                   MIN(d)
              ELSE MIN(d) + 1
              END
            , f.i , f.j
        FROM  SESSION.d            AS d
        INNER JOIN
              SESSION.natural_zero AS f(i)
         ON   f.i BETWEEN MAX(k - n , 1) AND MIN(k - 1 , m)
        CROSS JOIN
              LATERAL
             (SELECT j
                   , SUBSTR(s , f.i , 1) , SUBSTR(t , f.j , 1)
               FROM  LATERAL
                    (VALUES k - f.i ) AS f(j)
             ) AS f(j , s1 , t1)
        WHERE s1 =  t1
          AND d.i = f.i - 1 AND d.j = f.j - 1
          OR  s1 <> t1
          AND (d.i , d.j) IN (VALUES (f.i - 1 , f.j    )
                                   , (f.i     , f.j - 1)
                                   , (f.i - 1 , f.j - 1)
                             )
        GROUP BY
              f.i , f.j
            , s1  , t1
        ;
    
        SET k = k + 1;
    END WHILE;
    
    SET distance
      =(SELECT d.d
         FROM  SESSION.d AS d
         WHERE i = m AND j = n
       )
      , row_count
      =(SELECT COUNT(*) FROM SESSION.d)
    ;
    
    END@
    ------------------------------------------------------------------------------
    DB20000I  The SQL command completed successfully.
    Code:
    ------------------------------ Commands Entered ------------------------------
    CREATE OR REPLACE FUNCTION Levenshtein
    ( IN  s VARCHAR(255) , IN  t VARCHAR(255) )
    RETURNS TABLE( distance SMALLINT , row_count INTEGER)
    MODIFIES SQL DATA
    DETERMINISTIC
    NO EXTERNAL ACTION
    
    BEGIN ATOMIC
    DECLARE distance  SMALLINT;
    DECLARE row_count INTEGER;
    
    CALL Levenshtein(s , t , distance , row_count);
    
    RETURN
    VALUES (distance , row_count)
    ;
    END@
    ------------------------------------------------------------------------------
    DB20000I  The SQL command completed successfully.
    Note: ROW_COUNT is a matrix size, and must be (m + 1) * (n + 1) where m = LENGTH(first parameter), n = LENGTH(second parameter).
    It was returned mainly for debug purpose.

    Some test results
    Code:
    ------------------------------ Commands Entered ------------------------------
    SELECT *
     FROM (VALUES ('testing' , 'testing one two three') ) AS p(s , t)
         , TABLE( Levenshtein(s , t) ) L@
    ------------------------------------------------------------------------------
    
    S       T                     DISTANCE ROW_COUNT  
    ------- --------------------- -------- -----------
    testing testing one two three       14         176
    
      1 record(s) selected.
    
    
    ------------------------------ Commands Entered ------------------------------
    SELECT *
     FROM (VALUES ( 'testxxxing' , 'testinxg' )
          ) AS p(s , t)
         , TABLE( Levenshtein(s , t) )@
    ------------------------------------------------------------------------------
    
    S          T        DISTANCE ROW_COUNT  
    ---------- -------- -------- -----------
    testxxxing testinxg        4          99
    
      1 record(s) selected.
    
    
    ------------------------------ Commands Entered ------------------------------
    SELECT *
     FROM (VALUES ( 'levenshtein' , 'meilenstein' )
          ) AS p(s , t)
         , TABLE( Levenshtein(s , t) )@
    ------------------------------------------------------------------------------
    
    S           T           DISTANCE ROW_COUNT  
    ----------- ----------- -------- -----------
    levenshtein meilenstein        4         144
    
      1 record(s) selected.
    This case(compare 50 chars and 50 chars) took over 10 seconds on my old laptop.
    Code:
    ------------------------------ Commands Entered ------------------------------
    SELECT *
     FROM (VALUES ( 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx'
                  , 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvzw' )
          ) AS p(s , t)
         , TABLE( Levenshtein(s , t) ) L@
    ------------------------------------------------------------------------------
    
    S                                                  T                                                  DISTANCE ROW_COUNT  
    -------------------------------------------------- -------------------------------------------------- -------- -----------
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvzw        2        2601
    
      1 record(s) selected.

  11. #11
    Join Date
    May 2012
    Posts
    13
    That's interesting - thanks. I wonder if I could speed it up by SoundEx-ing every word and performing the function on the phrase of soundex codes. That is where taking multiple words and their order might become useful, but I would have to play with it.

    Thank you so much for the time you've put into this, it's been very helpful and appreciated!

  12. #12
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    Another approach might be
    (1) Extract each words separated by blanks from strings.
    (2) Evaluate number of similar words.

    For step (1), Example 2: extract_elements UDF in this thread http://www.dbforums.com/db2/1677358-...sible-sql.html
    might be usable.

    For step (2), Some built-in functions(like SOUNDEX, DIFFERENCE) or UDFs(like Example 1, 2 in this thread) might be usable.

    Another my concern for step (2) is which combination of words should be evaluated?
    For example: should lack of words or exchange of words sequence be considered or not?

  13. #13
    Join Date
    Feb 2008
    Location
    Japan
    Posts
    3,483
    An improvement to accelerate the Example 2.

    The idea is to delete unnecesary rows from work table.

    Example 2a:
    Note: row_count returns maximum number of rows kept in the work table.
    Code:
    ------------------------------ Commands Entered ------------------------------
    CREATE OR REPLACE PROCEDURE Levenshtein
    ( IN  s VARCHAR(255)   , IN  t VARCHAR(255)
    , OUT distance SMALLINT , OUT row_count INTEGER
    )
    MODIFIES SQL DATA
    DETERMINISTIC
    NO EXTERNAL ACTION
    
    BEGIN
    DECLARE m , n , k SMALLINT;
    DECLARE max_row_count INTEGER DEFAULT 0;
    
    DECLARE CONTINUE HANDLER FOR SQLSTATE '42710'
     BEGIN /* Already exists? Ignore */ END;
    
    DECLARE GLOBAL TEMPORARY TABLE
    d ( d SMALLINT NOT NULL , i SMALLINT NOT NULL , j SMALLINT NOT NULL )
    NOT LOGGED
    ;
    
    DECLARE GLOBAL TEMPORARY TABLE
    natural_zero ( n SMALLINT NOT NULL )
    NOT LOGGED
    ;
    CREATE INDEX SESSION.natural_zero_n
     ON SESSION.natural_zero (n)
    ;
    
    INSERT INTO SESSION.natural_zero
    SELECT n1 + n2 + n3 + n4 + n5
     FROM (VALUES 0 ,   1 ,   2 ,   3 ) n(n1)
         ,(VALUES 0 ,   4 ,   8 ,  12 ) n(n2)
         ,(VALUES 0 ,  16 ,  32 ,  48 ) n(n3)
         ,(VALUES 0 ,  64 , 128 , 192 ) n(n4)
         ,(VALUES 0 , 256             ) n(n5)
     WHERE NOT EXISTS
          (SELECT 0
            FROM  SESSION.natural_zero
            WHERE n = 0
          )
    ;
    
    SET m = LENGTH(s)
      , n = LENGTH(t)
    ;
    
    DELETE FROM SESSION.d;
    INSERT INTO SESSION.d
    SELECT i , i , 0
     FROM  SESSION.natural_zero AS n(i)
     WHERE i <= m
    UNION ALL
    SELECT j , 0 , j
     FROM  SESSION.natural_zero AS n(j)
     WHERE j BETWEEN 1 AND n
    ;
    
    SET k = 2;
    WHILE k <= m + n DO
       INSERT INTO SESSION.d
       SELECT CASE
              WHEN s1 =  t1 THEN
                   MIN(d)
              ELSE MIN(d) + 1
              END
            , f.i , f.j
        FROM  SESSION.d            AS d
        INNER JOIN
              SESSION.natural_zero AS f(i)
         ON   f.i BETWEEN MAX(k - n , 1) AND MIN(k - 1 , m)
        CROSS JOIN
              LATERAL
             (SELECT j
                   , SUBSTR(s , f.i , 1) , SUBSTR(t , f.j , 1)
               FROM  LATERAL
                    (VALUES k - f.i ) AS f(j)
             ) AS f(j , s1 , t1)
        WHERE s1 =  t1
          AND d.i = f.i - 1 AND d.j = f.j - 1
          OR  s1 <> t1
          AND (d.i , d.j) IN (VALUES (f.i - 1 , f.j    )
                                   , (f.i     , f.j - 1)
                                   , (f.i - 1 , f.j - 1)
                             )
        GROUP BY
              f.i , f.j
            , s1  , t1
        ;
    
        SET max_row_count
          = (SELECT MAX( max_row_count , COUNT(*) )
               FROM SESSION.d
            );
        DELETE FROM SESSION.d
         WHERE i + j <= k - 2
        ;
    
        SET k = k + 1;
    END WHILE;
    
    SET distance
      =(SELECT d.d
         FROM  SESSION.d AS d
         WHERE i = m AND j = n
       )
      , row_count
      = max_row_count
    ;
    
    END@
    ------------------------------------------------------------------------------
    DB20000I  The SQL command completed successfully.
    Function is same.
    Code:
    ------------------------------ Commands Entered ------------------------------
    CREATE OR REPLACE FUNCTION Levenshtein
    ( IN  s VARCHAR(255) , IN  t VARCHAR(255) )
    RETURNS TABLE( distance SMALLINT , row_count INTEGER)
    MODIFIES SQL DATA
    DETERMINISTIC
    NO EXTERNAL ACTION
    
    BEGIN ATOMIC
    DECLARE distance  SMALLINT;
    DECLARE row_count INTEGER;
    
    CALL Levenshtein(s , t , distance , row_count);
    
    RETURN
    VALUES (distance , row_count)
    ;
    END@
    ------------------------------------------------------------------------------
    DB20000I  The SQL command completed successfully.

    Comparison of 50 chars and 50 chars finished about 1 seconds.
    It may be still too long, but more than 10 times faster than before improvements.
    Code:
    VALUES current timestamp
    
    1                         
    --------------------------
    2012-05-19-22.04.53.630000
    
      1 record(s) selected.
    
    
    SELECT * FROM (VALUES ( 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx' , 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvzw' ) ) AS p(s , t) , TABLE( Levenshtein(s , t) )
    
    S                                                  T                                                  DISTANCE ROW_COUNT  
    -------------------------------------------------- -------------------------------------------------- -------- -----------
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvzw        2         151
    
      1 record(s) selected.
    
    
    VALUES current timestamp
    
    1                         
    --------------------------
    2012-05-19-22.04.54.622000
    
      1 record(s) selected.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •