If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > DB2 > Find a sentence with words separated by multiple spaces

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 07-02-08, 11:17
bstjean bstjean is offline
Registered User
 
Join Date: Sep 2002
Location: Montreal, Canada
Posts: 219
Find a sentence with words separated by multiple spaces

Hi everyone,

I am trying to find an efficient way to perform a special query. Let me explain what I want.

Let's say we are looking for all description that match "this is the target". In fact, I want to find records that match those 4 words in this sequence disregarding the number of spaces (I mean spaces, tabs, Cr, Lf, etc) between them.

This has to be done without REGEX (would be too easy!). Besides throwing a bunch of REPLACE(REPLACE(REPLACE())) to strip separators, anyone has a better idea on how to do that in SQL, only in SQL with no UDF, just plain DB2 SQL ???
Reply With Quote
  #2 (permalink)  
Old 07-02-08, 11:39
Cougar8000 Cougar8000 is offline
Registered User
 
Join Date: Nov 2005
Location: IL
Posts: 554
like "this%is%the%target"

I think this should work
__________________
--
IBM Certified DBA on DB2 for Linux, UNIX, and Windows

DB2 v9.1.0.2 os 5.3.0.0
Reply With Quote
  #3 (permalink)  
Old 07-02-08, 11:47
bstjean bstjean is offline
Registered User
 
Join Date: Sep 2002
Location: Montreal, Canada
Posts: 219
Quote:
Originally Posted by Cougar8000
like "this%is%the%target"

I think this should work
No it won't... You'd get a match for something like "thisXisXthe55555target"
Reply With Quote
  #4 (permalink)  
Old 07-02-08, 12:13
Cougar8000 Cougar8000 is offline
Registered User
 
Join Date: Nov 2005
Location: IL
Posts: 554
isn't it what you were looking for?
__________________
--
IBM Certified DBA on DB2 for Linux, UNIX, and Windows

DB2 v9.1.0.2 os 5.3.0.0
Reply With Quote
  #5 (permalink)  
Old 07-02-08, 12:15
n_i n_i is offline
:-)
 
Join Date: Jun 2003
Location: Toronto, Canada
Posts: 4,449
.........
........

Last edited by n_i; 07-02-08 at 12:20.
Reply With Quote
  #6 (permalink)  
Old 07-02-08, 12:26
bstjean bstjean is offline
Registered User
 
Join Date: Sep 2002
Location: Montreal, Canada
Posts: 219
More detailed example

In my example, if the searched string is "this is the target", I'd like to have results like :

"this is the target"
"this is the target"
"this is the target"
etc...

Basically, I'm just ignoring the number of spaces between the words. it could be 1, 2, 3, 1000000 spaces between each word, it doesn't matter. Also, by space I mean separators (carriage return, tabs, line feed, etc).
Reply With Quote
  #7 (permalink)  
Old 07-02-08, 12:43
bstjean bstjean is offline
Registered User
 
Join Date: Sep 2002
Location: Montreal, Canada
Posts: 219
More detailed example

Sorry for the repost... The editor stripped the spaces between words! LOL

To make it more clear, here's a more detailed example.

If the searched string is "this is the target", I would expect results like :


"this[9 spaces]is[1 tab]the[1 carriage return][1 line feed]target"
"this[3 tabs][1 space]is[2 carrage returns]the[2 spaces]target"
"this[1 space]is[16 spaces]the[2 tabs][11 spaces][1 carriage return][1 tab]target"
"this[80 spaces]is[4 line feeds]the[5 spaces]target"
"this[1 space]is[1 space]the[1 space]target"

Well, you get the picture. For practical reasons, we can assume that no word is going to be

separated by more than 80 separator characters. Separators are : space, tab, line feed,

carriage return. The number of separators bewteen each word can be anything <= 80.

As I said, all this has to be done in "plain" SQL, i.e. no UDF, no REGEX, just plain basic

SQL.
Reply With Quote
  #8 (permalink)  
Old 07-02-08, 14:27
stolze stolze is offline
Registered User
 
Join Date: Jan 2007
Location: Jena, Germany
Posts: 2,662
Have a look at this article: http://www.ibm.com/developerworks/db...03stolze1.html

It describes a bit how you can parse strings with pure and simple SQL. You can use the same technique to analyze your strings and tolerate the different amount of spaces.

However, what's wrong with a UDF? Unless this is an exercise, in which case you should try to solve it yourself, I would like to understand the requirement for this restriction.
__________________
Knut Stolze
IBM DB2 Analytics Accelerator
IBM Germany Research & Development
Reply With Quote
  #9 (permalink)  
Old 07-02-08, 16:10
bstjean bstjean is offline
Registered User
 
Join Date: Sep 2002
Location: Montreal, Canada
Posts: 219
Quote:
Originally Posted by stolze
Have a look at this article: http://www.ibm.com/developerworks/db...03stolze1.html

It describes a bit how you can parse strings with pure and simple SQL. You can use the same technique to analyze your strings and tolerate the different amount of spaces.

However, what's wrong with a UDF? Unless this is an exercise, in which case you should try to solve it yourself, I would like to understand the requirement for this restriction.
I know, makes no sense but they (the client), ahem, "don't like" UDF...
Reply With Quote
  #10 (permalink)  
Old 07-03-08, 03:41
stolze stolze is offline
Registered User
 
Join Date: Jan 2007
Location: Jena, Germany
Posts: 2,662
Without UDFs you are lost. Trying to do such parsing in the main SQL statement only results in a convoluted mess. Thus, separating functionality out into UDFs (either SQL PL or external C/C++, Java, ... code) is really helping to make things easier to understand and keep it maintainable.
__________________
Knut Stolze
IBM DB2 Analytics Accelerator
IBM Germany Research & Development
Reply With Quote
  #11 (permalink)  
Old 07-03-08, 08:50
umayer umayer is offline
Registered User
 
Join Date: Dec 2005
Posts: 273
consider a logic like this:

1) Replace any valid separating character with blanc
2) remove all blancs
3) search the result for "thisisthetarget"

this can be done by:

... WHERE REPLACE ( TRANSLATE ( searched-string , X'40404040', X'????????' ) , ' ' ) = 'thisisthetarget'

Where ?? is the hexadecimal representation of the valid separators

Last edited by umayer; 07-03-08 at 09:00.
Reply With Quote
  #12 (permalink)  
Old 07-03-08, 10:52
bstjean bstjean is offline
Registered User
 
Join Date: Sep 2002
Location: Montreal, Canada
Posts: 219
Quote:
Originally Posted by umayer
consider a logic like this:

1) Replace any valid separating character with blanc
2) remove all blancs
3) search the result for "thisisthetarget"

this can be done by:

... WHERE REPLACE ( TRANSLATE ( searched-string , X'40404040', X'????????' ) , ' ' ) = 'thisisthetarget'

Where ?? is the hexadecimal representation of the valid separators
That's what I came up with... A bunch of REPLACE to convert Tabs, CR and LF to spaces and other REPLACE to trim multiple spaces down to a single space... Really really ugly! Looks like Lisp code with all those parentheses!
Reply With Quote
  #13 (permalink)  
Old 07-03-08, 19:40
tonkuma tonkuma is offline
Registered User
 
Join Date: Feb 2008
Location: Japan
Posts: 2,193
Quote:
That's what I came up with... A bunch of REPLACE to convert Tabs, CR and LF to spaces and other REPLACE to trim multiple spaces down to a single space... Really really ugly! Looks like Lisp code with all those parentheses!
I think this strategy is not too ugry(somebody may not agree).
1) TRANSLATE to convert Tab, CR and LF to space.
2) Five repetitive REPLACEs to replace multiple blanks to one blank.
REPLACE n5 blanks to one blank, REPLACE n4 blanks to one blank, ..., REPLACE n2 blanks to one blank, REPLACE two blanks to one blank.
Some combinations of n5,n4,n3,n2 are
Code:
N2          N3          N4          N5          Maximum Length
----------- ----------- ----------- ----------- --------------
          2           3           6          21            460
          2           3           6          22            460
          2           3           7          21            460
          2           3           7          22            460
          2           4           6          21            460
          2           4           6          22            460
          2           4           7          21            460
          2           4           7          22            460
          3           3           6          21            460
          3           3           6          22            460
          3           3           7          21            460
          3           3           7          22            460
          3           4           6          21            460
          3           4           6          22            460
          3           4           7          21            460
          3           4           7          22            460

  16 record(s) selected.
I couldn't find an answer to replace 80 blanks to one blank with 4 REPLACEs.
From two to 40 blanks could be replaced by one blank with 4 REPLACEs.
Code:
N2          N3          N4          M          
----------- ----------- ----------- -----------
          2           3           6          40
          2           4           7          40
          2           4           6          40
          2           3           7          40
          3           3           6          40
          3           3           7          40
          3           4           6          40
          3           4           7          40
          2           3           8          38
          3           3           8          38

  10 record(s) selected.

Last edited by tonkuma; 07-03-08 at 20:59.
Reply With Quote
  #14 (permalink)  
Old 07-03-08, 22:14
tonkuma tonkuma is offline
Registered User
 
Join Date: Feb 2008
Location: Japan
Posts: 2,193
Code:
------------------------------ Commands Entered ------------------------------
WITH Test_data(id, source) AS (
VALUES 
 (1, 'this         is'||x'09'||'the'||x'0d0a'||'target')
,(2, 'this'||x'090909'||' '||'is'||x'0d0d'||'the  target')
,(3, 'this'||x'090909'||' '||'is'||x'0d0d'||'thetarget')
,(4, 'this is                the'||x'0909'||'           '||x'0d09'||'target')
/*        1___5___10____5___20____5___30____5___40____5___50____5___60____5___70____5___80  */
,(5, 'this                                                                                is'||x'0a0a0a0a'||'the     target')
,(6, 'this                                     s                                          is'||x'0a0a0a0a'||'the     target')
,(7, 'this is the target')
)
/* End of Test data */
SELECT id, source
  FROM Test_data
 WHERE REPLACE(
       REPLACE(
       REPLACE(
       REPLACE(
       REPLACE( TRANSLATE(source, '   ', x'090a0d')
              , '                     ', ' ') /* 21 blanks -> one blank */
              , '      ', ' ') /* 6 blanks -> one blank */
              , '   ', ' ') /* 3 blanks -> one blank */
              , '  ', ' ') /* 2 blanks -> one blank */
              , '  ', ' ') /* 2 blanks -> one blank */
       = 'this is the target'
;
------------------------------------------------------------------------------

ID          SOURCE                                                                                                    
----------- ----------------------------------------------------------------------------------------------------------
          1 this         is	the
target                                                                               
          2 this			 isthe  target                                                                                   
          4 this is                the		           	target                                                           
          5 this                                                                                is



the     target  
          7 this is the target                                                                                        

  5 record(s) selected.
Reply With Quote
  #15 (permalink)  
Old 07-04-08, 04:52
umayer umayer is offline
Registered User
 
Join Date: Dec 2005
Posts: 273
No need to replace 5 blancs by 1, and 4 blancs by 1, and 3 blancs by 1 ...

simply replace any blanc by an empty string ( by omitting the third parameter of the REPLACE function )

Then you only need ONE TRANSLATE() and ONE REPLACE()
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On