Hello everyone,
First time poster. I'm not a complete noob... but I have only coded for myself. I would say on a "hobby" level, usually finding my own solutions to problems and spending many hours "trying stuff out." I'm not sure if anyone can relate but I'm getting to the point in my life where I would like to reduce the time trying things and increase the time producing things.
I guess only time and work can change that but if anyone can take the time to help it would be greatly appreciated.
I have been working on a site on and off for some time. Essentially it's a large affiliate marketing site with ~1 million products. I came across two problems: how do I write a script to categorize 1million products O_O!!! and how do I allow the user to search through them.
Still working on problem #1 lol but problem #2 is where I have the question. Which, I think, essentially boils down to making a good query. Because I'm going to present part of my schema just wondering if anyone had ideas for improvements... and why. (So I can learn.

)
The tables that are relevant are the product, search token and the lookup table relating the tokens to products:
Code:
CREATE TABLE `products` (
`pid` int(10) unsigned NOT NULL default '0', //Product ID
`vid` int(10) unsigned NOT NULL default '0', //Vendor ID
`title` varchar(255) NOT NULL default '', //Product Title
`ref_fmt` tinyint(4) NOT NULL default '0', //Flags
`thm_url` varchar(255) NOT NULL default '', //Thumbnail URL
`img_url` varchar(255) NOT NULL default '', //Full Image URL
`descript` text NOT NULL, //Product Description
PRIMARY KEY (`pid`,`vid`),
KEY `vid` (`vid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `lu_tokbin` (
`tid` int(10) unsigned NOT NULL default '0', //Token ID
`data` blob NOT NULL, //Product IDs related to `tid`
PRIMARY KEY (`tid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
CREATE TABLE `search_tokens` (
`tid` int(10) unsigned NOT NULL auto_increment, //Token ID
`word` varchar(32) NOT NULL default '', //Character string
PRIMARY KEY (`tid`),
UNIQUE KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
The lu_tokbin table is a modification of the inverter barrels presented in the Sergey/Lawrence paper, "Anatomy of a Search Engine." I don't need to nest multiple levels of indirection in the BLOB just one. The first 4 bytes in the BLOB stores the number of pid's. The remaining bytes store the PID's themselves.
I decided to use the BLOB type because the data would take up less space. As opposed to something like "372486578|32847956|2342|87676|8978".
The queries I'm using are:
Code:
$srch = str_replace(' ',"' OR word LIKE BINARY '",ereg_replace(' {2,}',' ',strtolower(trim(mysql_real_escape_string($_GET['qry'])))));
$qry = "SELECT tid FROM search_tokens WHERE word LIKE BINARY '$srch'";
This takes the user string, converts any multiple ' ' to one space then converts the one space to ' OR word....' to find the token IDs from the token table. Then...
Code:
$srch = implode(' OR tid=',$srch);
$qry = "SELECT data FROM lu_tokbin WHERE tid=$srch";
So far so good (I think). Now I have the PID's for each token. Now I didn't combine the steps (I think I could've) because I wanted to include stemming (i.e. usr puts 'car' in string, script finds 'car' and 'cars').
Okay so now my script creates another query....
Code:
$qry = "SELECT title,thm_url FROM products WHERE pid=";
foreach($arr as $v) {
$n = unpack('L',$v);
$srch[] = implode(' OR pid=',unpack('L/L'.$n[1],$v));
}
$qry .= implode(' OR pid=',$srch);
I think I was testing with 'already' 'sunglasses' and 'about.' This returned over 5,400 results. This is where I have my problem. If you can't follow the PHP basically a 'select' query is made where the conditions are 'pid=n1 or pid=n2 or pid=n3 ... or pid=nx' .... where in this case x is the 5,4xx-th pid. It takes a LONG time for this query to execute.
I just added some code to determine some timing:
constructing the query, ~.08 sec
executing the query, ~3.95 sec
parsing usr input, executing and constructing the first two queries, ~ .005 sec
I have a bunch of work to do with this in many other aspects to fine tune how to combine terms (OR/AND/NOT etc) but the query execution time is abysmal. Google returns the top 10 of 11.7 million documents in 0.13 seconds... surely I have room for improvement.
I did tinker with "LIMIT"... helps with the beginning but when I get to "LIMIT 5000,100" there is no speed gain. Also for chuckles I changed the query to grab the same number of products using 'pid>x AND pid <=y.' Definitely improved the time by 10x.

But the ideal and actual case won't be the same.
Any advice is appreciated on how to construct this last query to speed it up. Do I need to change my schema for the best improvement or is constructing '(pid>X && pid <=y) OR (pid>X1 && pid <=y1)... OR (pid>Xn && pid <=Yn)' the only way?
Thanks a million.