Results 1 to 6 of 6
  1. #1
    Join Date
    Jul 2011
    Posts
    4

    (NOOB) Making Good Queries / Criticize my Scheme

    Hello everyone,

    First time poster. I'm not a complete noob... but I have only coded for myself. I would say on a "hobby" level, usually finding my own solutions to problems and spending many hours "trying stuff out." I'm not sure if anyone can relate but I'm getting to the point in my life where I would like to reduce the time trying things and increase the time producing things.

    I guess only time and work can change that but if anyone can take the time to help it would be greatly appreciated.

    I have been working on a site on and off for some time. Essentially it's a large affiliate marketing site with ~1 million products. I came across two problems: how do I write a script to categorize 1million products O_O!!! and how do I allow the user to search through them.

    Still working on problem #1 lol but problem #2 is where I have the question. Which, I think, essentially boils down to making a good query. Because I'm going to present part of my schema just wondering if anyone had ideas for improvements... and why. (So I can learn. )

    The tables that are relevant are the product, search token and the lookup table relating the tokens to products:

    Code:
    CREATE TABLE `products` (
      `pid` int(10) unsigned NOT NULL default '0',		//Product ID
      `vid` int(10) unsigned NOT NULL default '0',		//Vendor ID
      `title` varchar(255) NOT NULL default '',		//Product Title
      `ref_fmt` tinyint(4) NOT NULL default '0',		//Flags
      `thm_url` varchar(255) NOT NULL default '',		//Thumbnail URL
      `img_url` varchar(255) NOT NULL default '',		//Full Image URL
      `descript` text NOT NULL,				//Product Description
      PRIMARY KEY  (`pid`,`vid`),
      KEY `vid` (`vid`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
    
    CREATE TABLE `lu_tokbin` (
      `tid` int(10) unsigned NOT NULL default '0',		//Token ID
      `data` blob NOT NULL,					//Product IDs related to `tid`
      PRIMARY KEY  (`tid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    
    CREATE TABLE `search_tokens` (
      `tid` int(10) unsigned NOT NULL auto_increment,	//Token ID
      `word` varchar(32) NOT NULL default '',		//Character string
      PRIMARY KEY  (`tid`),
      UNIQUE KEY `word` (`word`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    The lu_tokbin table is a modification of the inverter barrels presented in the Sergey/Lawrence paper, "Anatomy of a Search Engine." I don't need to nest multiple levels of indirection in the BLOB just one. The first 4 bytes in the BLOB stores the number of pid's. The remaining bytes store the PID's themselves.

    I decided to use the BLOB type because the data would take up less space. As opposed to something like "372486578|32847956|2342|87676|8978".

    The queries I'm using are:

    Code:
    	$srch = str_replace(' ',"' OR word LIKE BINARY '",ereg_replace(' {2,}',' ',strtolower(trim(mysql_real_escape_string($_GET['qry'])))));
    	
    	$qry = "SELECT tid FROM search_tokens WHERE word LIKE BINARY '$srch'";
    This takes the user string, converts any multiple ' ' to one space then converts the one space to ' OR word....' to find the token IDs from the token table. Then...

    Code:
    	$srch = implode(' OR tid=',$srch);
    	
    	$qry = "SELECT data FROM lu_tokbin WHERE tid=$srch";
    So far so good (I think). Now I have the PID's for each token. Now I didn't combine the steps (I think I could've) because I wanted to include stemming (i.e. usr puts 'car' in string, script finds 'car' and 'cars').

    Okay so now my script creates another query....
    Code:
    	$qry = "SELECT title,thm_url FROM products WHERE pid=";
    	foreach($arr as $v) {
    		$n = unpack('L',$v);
     		$srch[] = implode(' OR pid=',unpack('L/L'.$n[1],$v));
    	}
    	$qry .= implode(' OR pid=',$srch);
    I think I was testing with 'already' 'sunglasses' and 'about.' This returned over 5,400 results. This is where I have my problem. If you can't follow the PHP basically a 'select' query is made where the conditions are 'pid=n1 or pid=n2 or pid=n3 ... or pid=nx' .... where in this case x is the 5,4xx-th pid. It takes a LONG time for this query to execute.

    I just added some code to determine some timing:
    constructing the query, ~.08 sec
    executing the query, ~3.95 sec
    parsing usr input, executing and constructing the first two queries, ~ .005 sec

    I have a bunch of work to do with this in many other aspects to fine tune how to combine terms (OR/AND/NOT etc) but the query execution time is abysmal. Google returns the top 10 of 11.7 million documents in 0.13 seconds... surely I have room for improvement.

    I did tinker with "LIMIT"... helps with the beginning but when I get to "LIMIT 5000,100" there is no speed gain. Also for chuckles I changed the query to grab the same number of products using 'pid>x AND pid <=y.' Definitely improved the time by 10x. But the ideal and actual case won't be the same.

    Any advice is appreciated on how to construct this last query to speed it up. Do I need to change my schema for the best improvement or is constructing '(pid>X && pid <=y) OR (pid>X1 && pid <=y1)... OR (pid>Xn && pid <=Yn)' the only way?

    Thanks a million.

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    "Affiliate Marketing". Is that the latest euphemism for MLMs these days?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  3. #3
    Join Date
    Jul 2011
    Posts
    4
    No it isn't the same. Though there are similar elements. I might miss some but essentially the way I see it. Affiliate marketing is usually one tier... meaning you get paid on your results, no one is below you. I get a cut of the sale.... most of the agreements with retailers are for a certain time period. So if the customer buys after that and not before I get $0. The periods range from 30 days to infinity.. most being 90 days.

    I don';t think I need to explain MLMs .... but back to my query...

  4. #4
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    You do if you want my help. I don't supply advice to MLMs.

    From a database best practices viewpoint, I would discourage storing relational data in a BLOB. That is going to be extremely slow to search.

    You could instead use a semi-EAV (Entity, Attribute, Value) schema, or I would recommend storing non-standard product related data in XML. SQL Server 2005+ has built in support for storing and searching XML data.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  5. #5
    Join Date
    Jul 2011
    Posts
    4
    Having difficulties getting online. Good connections something I took for granted in the lower 48. I'm posting 2x... 1st will try and differentiate MLMs and affiliate marketing, 2nd is the response to your post.

    Sorry blindman, I just thought that everyone knew what MLM's were. I'll give a couple examples:

    Affiliate: Bob runs an internet book store. He has found a niche in silver antiquities and but has a limited selection at this time. He has developed a small loyal following and wants to provide them with a better selection of books. So he becomes an Amazon.com affiliate. He mixes both the books from Amazon and his own stock on his site. As his profits increase he gradually weans out Amazon and expands his stock to fill his clients' needs.

    Affiliate: Joanne writes a blog about her Persian kitties. She has gathered several hundred readers over two years and decides to start monetizing her blog. She finds a pet supply business online that has an affiliate program and incorporates some of their products into her site and now earns a few hundred a month in commissions for their purchases.

    Affiliate: Me... I have no niche. I don't have time to blog but I want to run an online store eventually. I've had a few ideas but not the time to implement them (probably because I keep getting more and more of them) or the money to hire people out. So I came up with a "simple idea" at the time (ROFLMAO) to allow users to find products they might be looking for.... kinda like a product-Google. They can search/browse the products on my site all day long and when ready to purchase go to the vendor's page to buy. One aspect I like about my site is that it involves, for the most part, smaller-sized merchants with product brands that may not be typically found... some offering more custom-style products. The downside of course is there is no brand reputation built in. But I looked over the merchants' sites that I had intended on incorporating and if I like their site and product/service they stayed.

    MLM: (sarcasm on)Jon comes over and really hypes up the earning potential of this company selling juice for $100 a bottle. If I sign up just two people and they sign up two people.. and so on. At the tenth level I'll be making >$100k every month.
    1. Make a list of 100 people you know.
    2. Call/email/spam/harass them.
    3. Buy 1000 leads of people looking to start a home based business.
    4. Call/email/spam/harass them. Convince them that, even though you haven't made a dime, they should join your team. Your juice can't be found anywhere else and is 110% worth $100 every month!!!
    5. Repeat #3 & #4 until you make money.
    (sarcasm off)

    There are affiliate programs that resemble MLM payplans because some allow affiliates to recruit other affiliates. This project doesn't involve those plans. I'm more product focused. I haven't developed the interpersonal skills to recruit an army of salespeople. I don't think I'd want the headaches either.

    MLM definitely has the potential for more predatory marketing practices... not saying affiliates don't. It just depends on what the person wants to focus on: quality, value, service... or feeding off the weak.

    Of course once an affiliate program goes to multiple levels .... one can argue whether or not it is an MLM at that point. I'm part of both... making money at neither LOL. But that's my own doing, by sitting on my tootie for extended periods of time. But even in the MLM I'm in it's about the product... not gathering an army of recruits as you can see from my unlaunched unofficial site here.

    (Need to move it to its official URL, pay my $100 to reactivate my membership and write a few scripts to construct 500 PPC ads before it is officially launched.)

  6. #6
    Join Date
    Jul 2011
    Posts
    4
    I'm somewhat platform limited. My development setup (my netbook) is WAMP (Windows, Apache, MySQL, PHP), my production is LAMP (Linux...) They are still using PHP v4-something and MySQL v4-something.

    I am limited to that unfortunately. I'm reading up about EAV now. My mind is much slower than it use to be. I'll hopefully see the connection and repost what I think you mean.

    I could just change the lu_tokbin table from:

    Code:
    CREATE TABLE `lu_tokbin` (
      `tid` int(10) unsigned NOT NULL default '0',		//Token ID
      `data` blob NOT NULL,					//Product IDs related to `tid`
      PRIMARY KEY  (`tid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    to:

    Code:
    CREATE TABLE `lu_tokbin` (
      `tid` int(10) unsigned NOT NULL default '0',		//Token ID
      `pid` int(10) unsigned NOT NULL default '0',		//Product ID continaing token
      UNIQUE KEY  (`tid`,`pid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    That change still would only remove the unpack() calls in my code. The query contruction time is rather insignificant compared to it's execution. The reason for the BLOB was to remove the potentially many duplicate TID entries and conserve space.

    My brain kicked in. Actually, that would greatly decrease the query size and maybe the number of conditions to test. The later I'm assuming is where most of the time is spent. But, and please tell me if I'm wrong, it eliminates the ability to prioritize the products by number of hits out of the total tokens. Example, "aviator sunglasses." Products with both in the title should be displayed before those with only "sunglasses" or "aviator."

    I'm going to test it out and see if any improvements are made as well as continue my reading so I can make a more informed response. Only reason I didn't initially chose that route was the space it would take up. It would essentially double the size of the the table.

    (My continued reading has me grasping the concept but I think I only have two values to store... token id (entity) and product id (value))

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •