Results 1 to 12 of 12
  1. #1
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1

    Unanswered: How to write SQL to verify if there are duplicates in text in one single field/row?

    Hi,
    to simplify the problem. I have table with only single varchar(2000) column and one single row with the following string:
    Code:
                some text
                ID = 1
                some text
                ID = 2
                some text
    Note: Above are 5 rows each row at the end of row have new-line character, so all this 5 rows are actually saved in one column in one single record in DB2 table.

    QUESTION:
    What I really need to check is if string "ID = " is followed with sequential number with no duplicates. Above sample is OK, because "ID = 1 follows with ID = 2". I need to catch the duplicates, so if "ID = 1 follows with ID = 1" or any other duplication like "ID = 1 follows ID = 2 and then repeats ID = 1" (ID = 1 is duplicate). ID in one single field/record should never duplicate. Is there a way to write such an SQL to check for duplicates? I only need true/false return info. If there are duplicates then report "duplicates" if not then it is "OK".

    WHAT I HAVE TRIED:
    I thought to solve this problem using new DB2 v11 regular expression functionality. For example I can count how many "ID = " rows in above single record appears. Bellow is sample:

    Code:
           SELECT
                REGEXP_COUNT('
                some text
                ID = 1
                some text
                ID = 2
                some text
                ', 'ID = ')
            FROM
                sysibm.sysdummy1
    ;
    The result of above SQL is 2, because "ID = " regular expression appears 2 times in one single field/record.

    I don't know (if the data were written in integer data field and each of the row in its own record) I would do the following:
    select sum(distinct number_field), sum(number_field) from table
    if the result match then there is no duplicate.

    But I don't know how to get number 1 and 2 out of this string.

    I know REGEXP_COUNT is not suitable for my problem. I can't find any function to get the regular expression value...

    Any idea how to get this problem solved if like written above sample all of the text is saved in one single column/row?

    My system: DB2 v11.1 fixpack 1 on Linux/Intel.

    Some DB2 regular expression info from DB2 v11:
    http://www.idug.org/p/bl/et/blogid=278&blogaid=605
    https://www.ibm.com/support/knowledg.../r0061491.html
    https://www.ibm.com/support/knowledg.../r0061494.html
    https://www.ibm.com/support/knowledg.../r0061533.html
    Last edited by grofaty; 04-12-17 at 10:34.

  2. #2
    Join Date
    Jul 2016
    Location
    Moscow
    Posts
    294
    Provided Answers: 45
    Hi,

    try this:

    Code:
    select *
    from table(values 'some text
    ID = 1
    some text
    ID = 2
    some text'
    , 'some text
    ID = 1
    some text
    ID = 2
    some text
    ID = 1
    '
    ) t(s)
    where
    xmlcast(
    xmlquery('fn:matches($s, $p, "s")' passing 
      t.s||' ' as "s"
    , '(ID = \d+).*\1\W' as "p"
    )
    as int)=1
    Regards,
    Mark.

  3. #3
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1
    Mark,
    your SQL returns:
    Code:
    S
    ---------------------------------------------------
    some text ID = 1 some text ID = 2 some text ID = 1
    I don't understand what are you trying to accomplish. I need value "duplicates" (or some value like 1) IDs are duplicates and "OK" if there are no duplicates (or some value like "0").

  4. #4
    Join Date
    Jul 2016
    Location
    Moscow
    Posts
    294
    Provided Answers: 45
    I tried to show that if the input string has duplicates then the predicate used there returns true and the corresponding string is returned.
    If you want to get for each string something like 'Duplicates' or 'OK', then:

    Code:
    select s, 
    case xmlcast(
    xmlquery('fn:matches($s, $p, "s")' passing 
      t.s||' ' as "s"
    , '(ID = \d+).*\1\W' as "p"
    )
    as int)
    when 1 then 'Duplicates'
    else 'OK'
    end status
    from table(values 'some text
    ID = 1
    some text
    ID = 2
    some text'
    , 'some text
    ID = 1
    some text
    ID = 2
    some text
    ID = 1
    '
    ) t(s)
    Regards,
    Mark.

  5. #5
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1
    mark.bb, now I see in your sample data you have changed sample from my one row one column to yours two rows one column to reproduce both cases. I have just overlook the "comma" separator in your SQL (to get two rows), so was totally confused what you are tying to do.

    What I see now you have solved the problem (thank you very much for this) using XML functions, this is excellent is working fine also in DB2 pre-v11. But now I see in DB2v11 (the version I am using) there is simpler solution using regexp_count function. To your last SQL I added my solution (in red color), that replaces your part of SQL (in blue). I just copied exactly the same regular expression and the same flags as you did with XML and bellow is the result, both your part of SQL and mine producing the same result:

    Code:
    select s,
    case xmlcast(
    xmlquery('fn:matches($s, $p, "s")' passing
      t.s||' ' as "s"
    , '(ID = \d+).*\1\W' as "p"
    )
    as int)
    when 1 then 'Duplicates'
    else 'OK'
    end status,
    case regexp_count(t.s, '(ID = \d+).*\1\W', 's')
    when 1 then 'Duplicate'
    else 'OK'
    end as status_new
    from table(values 'some text
    ID = 1
    some text
    ID = 2
    some text'
    , 'some text
    ID = 1
    some text
    ID = 2
    some text
    ID = 1
    '
    ) t(s)
    ;
    The result of above SQL:
    Code:
    S                                                   STATUS     STATUS_NEW
    --------------------------------------------------- ---------- ----------
    some text ID = 1 some text ID = 2 some text         OK         OK
    some text ID = 1 some text ID = 2 some text ID = 1  Duplicates Duplicate
    So in my view the only relevant discussion from now on is discussion about regular expression:
    Code:
    (ID = \d+).*\1\W
    and flags:
    Code:
    's'
    What I am trying to do now, is stripping everything that is not absolutely needed to simplify the SQL to be more understandable (in future when I will be in "repairing mode").
    OK, I removed \W from regular expression and removing flag 's' and the result is the same. I did this in your part of SQL and in my part and result is exactly the same before and after the change.

    From: https://www.ibm.com/support/knowledg.../r0061533.html
    I see \W is "Match a non-word character."
    and from https://www.ibm.com/support/knowledg.../r0061491.html
    I see flag 's': "Specifies that the '.' character in a pattern matches a line terminator in the input string."

    What is the purpose of \W regular expression and what is purpose of 's' flag in SQL? Are this two solving some special corner case scenario or are they redundant?

    Thanks a lot.
    Last edited by grofaty; 04-19-17 at 05:06.

  6. #6
    Join Date
    Jul 2016
    Location
    Moscow
    Posts
    294
    Provided Answers: 45
    Your input string contains new line characters, so, you must specify "s". We use '.' character,m and it must match any character including the 'new line'.
    \W needed to deal with 'some text ID = 1 some text ID = 2 some text ID = 11'. You get 'Duplicates' without it on this string.
    Regards,
    Mark.

  7. #7
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1
    @mark.bb, I did several tests and I see both "s" flag and "\W" switch are required. Thanks.

    During testing a new problem pop-ed up. If there is a comment in text I would like to ignore comment before checking duplicates.

    I copied your sample from #4 and added a comment in text (my new text is in blue color).
    Code:
    select s,
    case xmlcast(
    xmlquery('fn:matches($s, $p, "s")' passing
      t.s||' ' as "s"
    , '(ID = \d+).*\1\W' as "p"
    )
    as int)
    when 1 then 'Duplicates'
    else 'OK'
    end status
    from table(values 'some text
    ID = 1
    some text
    ID = 2
    some text'
    , 'some text
    ID = 1
    some text
    ID = 2
    some text
    ID = 1
    ','ID = 1
    -- comment ID = 1
    ID = 2'
    ) t(s)
    ;
    The output is:
    Code:
    S                                                   STATUS
    --------------------------------------------------- ----------
    some text ID = 1 some text ID = 2 some text         OK
    some text ID = 1 some text ID = 2 some text ID = 1  Duplicates
    ID = 1 -- comment ID = 1 ID = 2                     Duplicates
    Notice last line is returned as "Duplicate" because of "-- comment ID = 1"

    Comments can start anywhere in the line and are valid to end of line. Comments start with "--" characters.

    When regular expression is written all comments should be removed before checking the duplicates.

    I have checked web page for "flags": https://www.ibm.com/support/knowledg.../r0061491.html and it looks like "m" flag should be added to make it possible to control each of the line of text (still inside of one DB2 record) with ^ start of line and $ end of line. More about ^, $ characters: https://www.ibm.com/support/knowledg.../r0061533.html

    Regular expression to remove comment would be:
    Code:
    --.*$
    Where:
    -- two dash characters
    . any character
    * zero or more times
    $ to the end of line

    Any idea how to write such regular expression to remove comments before checking duplicates?
    Thanks
    Last edited by grofaty; 04-20-17 at 05:31.

  8. #8
    Join Date
    Jul 2016
    Location
    Moscow
    Posts
    294
    Provided Answers: 45
    Remove comments first with the following:
    Code:
    xmlcast(
    xmlquery('fn:replace($s, $p, $r, "s")' passing 
      t.s||chr(10) as "s"
    , '--.*?\n' as "p"
    , '' as "r"
    )
    as varchar(4000))
    Regards,
    Mark.

  9. #9
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1
    mark.bb, it doesn't work correctly.

    Code:
    select s,
    xmlcast(
    xmlquery('fn:replace($s, $p, $r, "s")' passing
      t.s||chr(10) as "s"
    , '--.*?\n' as "p"
    , '' as "r"
    )
    as varchar(400))
    from table(values 'ID = 1
    -- comment ID = 1
    ID = 2'
    ) t(s)
    ;
    result:
    Code:
    S                               2
    ------------------------------- ------------
    ID = 1 -- comment ID = 1 ID = 2 ID = 1
    The result is "ID = 1" (in column 2). The required result is delete only the text from "--" to the end of particular line (red color).
    Last edited by grofaty; 04-21-17 at 02:53.

  10. #10
    Join Date
    Jul 2016
    Location
    Moscow
    Posts
    294
    Provided Answers: 45
    Strange enough indeed.
    This works from my Java client, but doesn't work from db2 CLP.
    Regards,
    Mark.

  11. #11
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1
    I have investigated this problem in detail. It looks like db2 CLP eats new line character. If we carefully look at the results in all of above posts (e.g. #9) it is clear text is displayed in one row.

    Lets look at the hexadecimal value of string:
    Code:
    select s,
    hex('1-- ') as hex_value,
    hex(s) as hex_value
    from table(values 'ID = 1
    -- comment ID = 1
    ID = 2'
    ) t(s)
    ;
    result:
    Code:
    S                               HEX_VALUE HEX_VALUE
    ------------------------------- --------- --------------------------------------------------------------
    ID = 1 -- comment ID = 1 ID = 2 312D2D20  4944203D2031202D2D20636F6D6D656E74204944203D2031204944203D2032
    I looked what is between character "1" and characters "-- ". It should be new line character (hex value 13), but it is obvious it is hex value 20 which is space character. It looks to me db2 CLP replaces new line character with space. Interesting...

  12. #12
    Join Date
    Jan 2003
    Posts
    1,636
    Provided Answers: 1
    Actually the above sample with temporally table is just simplified sample, what I need is selecting data from table where I am sure there is new line character stored in table. Lets look at the sample with creating stored procedures (I tried to not change your part of SQL) and reading from syscat.routines table. When creating stored procedure db2 CLP does not eat the new line character. Interesting...
    Code:
    create or replace procedure test ()
    language sql
    specific test
    begin
        declare id int;
        declare sometext char(1);
        set id = 1;
        -- comment id = 1
        set sometext = 'some text';
        set id = 2;
    end@
    
    select
        s
    from
    (select text as s
    from syscat.routines
    where routinename = 'TEST'
    ) as t
    @
    result (command executed by: db2 -td@ -f file.sql):
    Code:
    create or replace procedure test ()
    language sql
    specific test
    begin
        declare id int;
        declare sometext char(1);
        set id = 1;
        -- comment id = 1
        set sometext = 'some text';
        set id = 2;
    end
    in above result it is obvious new line characters were NOT eaten by db2 CLP.

    Your SQL suggestion:
    Code:
    select
    xmlcast(
    xmlquery('fn:replace($s, $p, $r, "s")' passing
      t.s||chr(10) as "s"
    , '--.*?\n' as "p"
    , '' as "r"
    )
    as varchar(400))
    from
    (select text as s
    from syscat.routines
    where routinename = 'TEST'
    ) as t
    @
    result:
    Code:
    create or replace procedure test ()
    language sql
    specific test
    begin
        declare id int;
        declare sometext char(1);
        set id = 1;
            set sometext = 'some text';
        set id = 2;
    end
    it looks fine, but it also looks above SQL removes little bit more then needed. It removes WHOLE line where the comment is located. In this case it is not a big problem, but if comment would be at the end of other command it probably would not work (I haven't tested this, because comments in my case are always at the beginning of line infront only spaces).

    Above is your DB2 pre-11 version, now I found DB2 v11 simplified version using the same regular expressions as in your case:
    Code:
    select
    REGEXP_REPLACE
    (
    t.s
    ,
    '--.*?\n'
    )
    from
    (select text as s
    from syscat.routines
    where routinename = 'TEST'
    ) as t
    @
    and result is the same as in previous sample.

    Now little bit tweaking regular expression (like I suggested yesterday) I got the following:
    Code:
    select
    REGEXP_REPLACE
    (
    t.s
    ,
    '--.+$'
    , '', 1, 0, 'm'
    )
    from
    (select text as s
    from syscat.routines
    where routinename = 'TEST'
    ) as t
    @
    Notice the 'm' flag which treats each of the line (in one single DB2 record) as separate line, so ^ start of line and $ end of line characters can be used.
    Result:
    Code:
    create or replace procedure test ()
    language sql
    specific test
    begin
        declare id int;
        declare sometext char(1);
        set id = 1;
    
        set sometext = 'some text';
        set id = 2;
    end
    in lines between "set id = 1;" and "set sometext = 'some text';" is a new line with four spaces. So this regular expression ONLY removes exactly the comments.

    This is so far I have investigated...

    Thanks for your very valuable help. If I need something more I will dare to ask...
    Last edited by grofaty; 04-21-17 at 04:18.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •