Results 1 to 11 of 11
  1. #1
    Join Date
    Feb 2005
    Location
    Leesburg, VA
    Posts
    42

    Unanswered: Preferred Method for ANDing Relational Data?

    Hello all,

    I'm curious to know how you guys would handle what I'm trying to do here.

    I don't like Oracle Text, so I've decided to write my own parser/search engine. I have two tables, AD and KEYWORD_SEARCH. Here are the basics of the two tables:

    Code:
    AD
    ---------
    ROWID
    AD_ID NUMBER NOT NULL PRIMARY KEY
    AD_DESC
    
    KEYWORD_SEARCH
    --------------------
    AD_REC ROWID
    TOKEN VARCHAR2(150)
    FIRST_LETTER VARCHAR2(1)
    The KEYWORD_SEARCH table is an IOT formed of a primary key between AD_REC and TOKEN, then partitioned by FIRST_LETTER as a list partition.

    Every word from AD_DESC is parsed out via a java app I wrote and put into the KEYWORD_SEARCH table with an associated AD ROWID as AD_REC. The parsing and writing is fairly quick; I can make multiple threads and get it done very easily.

    The querying is becoming a problem because I now have 100,000 ads parsed with over 5 million tokens total.

    If I'm querying for a single keyword, it does fine; comes back very quickly. However, my manager wants us to be able to do multiple keywords with AND or OR between them. Doing it this way, the only options I can see are to use INTERSECT for AND and UNION for OR. Here are two ways I've been doing this:

    Code:
    select ad_id 
    from ad 
    where rowid in 
    (select ad_rec from keyword_search where token = 'blue' and first_letter = 'B' 
    INTERSECT 
    select ad_rec from keyword_search where token = 'power' and first_letter = 'P');
    This would have been an AND query because it would look for ad records with both criteria. To make it an OR query, I'd replace INTERSECT with UNION.

    The other way was like this:

    Code:
    select a.ad_id 
    from ad a, keyword_search ks 
    where a.rowid = ks.ad_rec and ks.token = 'blue' and ks.first_letter = 'B'
    INTERSECT
    select a.ad_id 
    from ad a, keyword_search ks 
    where a.rowid = ks.ad_rec and ks.token = 'power' and ks.first_letter = 'P'
    Anyways, as you can imagine, doing an intersect with this many records has been really killing our query times; millions of records take a lot of processing to do. Anyone have any other methods they use for querying relational data in this fashion, where you have to do multi-row ANDs or ORs?
    Steven Karam
    Oracle 10g Certified Master
    Web: OrclDBA.com
    Email: steve@orcldba.com

  2. #2
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Why not the obvious

    Code:
    select a.ad_id 
    from ad a, keyword_search ks 
    where a.rowid = ks.ad_rec and (ks.token = 'blue' and ks.first_letter = 'B') and (ks.token = 'power' and ks.first_letter = 'P')
    ?

  3. #3
    Join Date
    Jun 2004
    Location
    Liverpool, NY USA
    Posts
    2,509
    Yes, Oracle Text. The database engine knows about it and optimizes the query to use it. You can also do things like NEAR. What do you not like about it?
    Bill
    You do not need a parachute to skydive. You only need a parachute to skydive twice.

  4. #4
    Join Date
    Feb 2005
    Location
    Leesburg, VA
    Posts
    42
    Quote Originally Posted by n_i
    Why not the obvious

    Code:
    select a.ad_id 
    from ad a, keyword_search ks 
    where a.rowid = ks.ad_rec and (ks.token = 'blue' and ks.first_letter = 'B') and (ks.token = 'power' and ks.first_letter = 'P')
    ?
    Because that does not work; each token is a word; the token can never equal power and boat at the same time for the same ROWID.
    Steven Karam
    Oracle 10g Certified Master
    Web: OrclDBA.com
    Email: steve@orcldba.com

  5. #5
    Join Date
    Feb 2005
    Location
    Leesburg, VA
    Posts
    42
    Quote Originally Posted by beilstwh
    Yes, Oracle Text. The database engine knows about it and optimizes the query to use it. You can also do things like NEAR. What do you not like about it?
    Because Oracle Text takes so long to parse data. Even with INDEX_THEMES turned off, it takes about an hour to do 200,000 records; 3,000,000 records with FAR more tokens will take forever, as far as my company is concerned.

    One of the big problems is that the parsing is done ON the database itself; you can't push the parsing onto other machines like most search engines do. My Java app can use any machine with Oracle client to be a parser, and so you can easily get 12 cheap servers to do parsing at the same time. With multiple threads, I can do 3 million records in about 25 minutes.

    Also, we're trying to find a faster query; Oracle Text queries are horribly slow.
    Steven Karam
    Oracle 10g Certified Master
    Web: OrclDBA.com
    Email: steve@orcldba.com

  6. #6
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by SteveKaram
    Because that does not work; each token is a word; the token can never equal power and boat at the same time for the same ROWID.
    Doh! Must have been the Friday Afternoon Syndrome...

    How about
    Code:
    select a.ad_id 
    from ad a, (
      select s1.rowid id from keyword_search s1, keyword_search s2
      where s1.rowid=s2.rowid 
      and s1.token = 'blue' and s1.first_letter = 'B'  
      and s2.token = 'power' and s2.first_letter = 'P'
    ) ks 
    where a.rowid = ks.id
    I think this should still be faster than INTERSECT.

  7. #7
    Join Date
    Mar 2002
    Location
    Reading, UK
    Posts
    1,137
    should it be
    ...
    select s1.ad_rec id from keyword_search s1, keyword_search s2
    where s1.ad_rec = s2.ad_rec
    ...

    Alan

  8. #8
    Join Date
    Sep 2004
    Location
    London, UK
    Posts
    565
    I might be missing something, but it looks as though you are querying the 5 million-row KEYWORD_SEARCH table on the TOKEN column, but you don't have an index on it.

    AD would seem a better candidate for an IOT, although perhaps the size of the AD_DESC column takes it over one block per row, where things get more complicated.

  9. #9
    Join Date
    Mar 2002
    Location
    Reading, UK
    Posts
    1,137
    Ad is accessed through rowid so making it an IOT would make that lookup slower.

    As for keyword_search I presume the IOT has the primary key on (token,rowid) in that order as that should be the most efficient.

    Alan

  10. #10
    Join Date
    Sep 2004
    Location
    London, UK
    Posts
    565
    Quote Originally Posted by AlanP
    Ad is accessed through rowid so making it an IOT would make that lookup slower.
    I was trying to suggest NOT accessing AD through rowid, but instead using ID. If AD were an IOT with ID as the key, the performance should be equivalent.
    Quote Originally Posted by AlanP
    As for keyword_search I presume the IOT has the primary key on (token,rowid) in that order as that should be the most efficient.
    I agree that would seem a better approach than:
    Quote Originally Posted by SteveKaram
    The KEYWORD_SEARCH table is an IOT formed of a primary key between AD_REC and TOKEN, then partitioned by FIRST_LETTER as a list partition.
    Also for once this might be a good place to look at clustering options.

  11. #11
    Join Date
    Mar 2002
    Location
    Reading, UK
    Posts
    1,137
    What I meant was that access to a table through ROWID is faster than a key search on an IOT.

    If you have a large tokens you could make the keyword search quicker by including length of string in your keyword_search table and searching on the length of token aswell as the first letter.

    Alan

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •