Unanswered: Preferred Method for ANDing Relational Data?
I'm curious to know how you guys would handle what I'm trying to do here.
I don't like Oracle Text, so I've decided to write my own parser/search engine. I have two tables, AD and KEYWORD_SEARCH. Here are the basics of the two tables:
AD_ID NUMBER NOT NULL PRIMARY KEY
The KEYWORD_SEARCH table is an IOT formed of a primary key between AD_REC and TOKEN, then partitioned by FIRST_LETTER as a list partition.
Every word from AD_DESC is parsed out via a java app I wrote and put into the KEYWORD_SEARCH table with an associated AD ROWID as AD_REC. The parsing and writing is fairly quick; I can make multiple threads and get it done very easily.
The querying is becoming a problem because I now have 100,000 ads parsed with over 5 million tokens total.
If I'm querying for a single keyword, it does fine; comes back very quickly. However, my manager wants us to be able to do multiple keywords with AND or OR between them. Doing it this way, the only options I can see are to use INTERSECT for AND and UNION for OR. Here are two ways I've been doing this:
where rowid in
(select ad_rec from keyword_search where token = 'blue' and first_letter = 'B'
select ad_rec from keyword_search where token = 'power' and first_letter = 'P');
This would have been an AND query because it would look for ad records with both criteria. To make it an OR query, I'd replace INTERSECT with UNION.
The other way was like this:
from ad a, keyword_search ks
where a.rowid = ks.ad_rec and ks.token = 'blue' and ks.first_letter = 'B'
from ad a, keyword_search ks
where a.rowid = ks.ad_rec and ks.token = 'power' and ks.first_letter = 'P'
Anyways, as you can imagine, doing an intersect with this many records has been really killing our query times; millions of records take a lot of processing to do. Anyone have any other methods they use for querying relational data in this fashion, where you have to do multi-row ANDs or ORs?
Yes, Oracle Text. The database engine knows about it and optimizes the query to use it. You can also do things like NEAR. What do you not like about it?
Because Oracle Text takes so long to parse data. Even with INDEX_THEMES turned off, it takes about an hour to do 200,000 records; 3,000,000 records with FAR more tokens will take forever, as far as my company is concerned.
One of the big problems is that the parsing is done ON the database itself; you can't push the parsing onto other machines like most search engines do. My Java app can use any machine with Oracle client to be a parser, and so you can easily get 12 cheap servers to do parsing at the same time. With multiple threads, I can do 3 million records in about 25 minutes.
Also, we're trying to find a faster query; Oracle Text queries are horribly slow.
Because that does not work; each token is a word; the token can never equal power and boat at the same time for the same ROWID.
Doh! Must have been the Friday Afternoon Syndrome...
from ad a, (
select s1.rowid id from keyword_search s1, keyword_search s2
and s1.token = 'blue' and s1.first_letter = 'B'
and s2.token = 'power' and s2.first_letter = 'P'
where a.rowid = ks.id
I think this should still be faster than INTERSECT.