If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > how to create efficient MySQL query from a pseudo query

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 10-25-06, 11:49
phlype phlype is offline
Registered User
 
Join Date: Oct 2006
Posts: 4
how to create efficient MySQL query from a pseudo query

I'm trying to build a webapplication where users can search for a person having a particular preference for color and material. To store this information I use the following structure (a MySQL dump can be found at the end of this post):
*table person with fields:
-persid: autoincrement id
-name: name of the person
*table material with fields:
-materialid: autoincrement id
-material: name of the material eg "wood"
*table color with fields:
-colorid: autoincrement id
-color: name of the color eg "green"
*table persmaterial with fields:
-persmatid: autoincrement id
-persid: link to table person
-materialid: link to table material
*table perscolor with fields:
-perscolorid: autoincrement id
-persid: link to table person
-colorid: link to table color

In the webapplication the search can be entered by the users as a kind of pseudo query:
(color=red OR color=blue) AND color=green AND material=iron

My question is: how can I automatically transform this pseudo query into an efficient MySQL query?
I have tried out some different options:


Option 1:
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=2) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1)) UNION
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=2 OR pc.colorid=3) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
Remarks:
*I do not see how to turn a general pseudo query into a query like the one in option 1, except for turning the pseudo query into a sum of products form where the sulms would correspond to the UNIONs. IS there a clever way to obtain such a sum of products form from an arbitrary pseudo query?


Option 2:
SELECT persid FROM person p WHERE
(EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=1 AND p.persid=pc.persid)
OR
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=3 AND p.persid=pc.persid))
AND
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=2 AND p.persid=pc.persid)
AND
EXISTS(SELECT * FROM persmaterial pm WHERE pm.materialid=2 AND p.persid=pm.persid)
Remarks:
*very easy to get from pseudo query to MySQL query but what about performance?

Option 3:
SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid
AND
(pc.colorid=1 OR pc.colorid=2 OR pc.colorid=3)
AND p.persid=pm.persid
AND pm.materialid=2
GROUP BY p.persid HAVING
sum(case when pc.colorid in ('1','3') then 1 else 0 end) >= 1
AND
sum(case when pc.colorid='2' then 1 else 0 end)>=1
AND
sum(case when pm.materialid='2' then 1 else 0 end)>=1
Remarks:
*this option requires the pseudo query to be turned into a product of sums form; again is their a clever way to obtain such a form;




Option 4
SELECT DISTINCT pc1.persid FROM perscolor pc1
INNER JOIN perscolor pc2
ON pc1.persid=pc2.persid AND pc2.colorid=2
INNER JOIN persmaterial pm1
ON pc1.persid=pm1.persid AND pm1.materialid=2
LEFT OUTER JOIN perscolor pc3
ON pc1.persid=pc3.persid AND pc3.colorid=1
LEFT OUTER JOIN perscolor pc4
ON pc1.persid=pc4.persid AND pc4.colorid=3
WHERE COALESCE(pc3.persid,pc4.persid) IS NOT NULL
Remarks:
*this option requires the pseudo query to be turned into a product of sums form

Option 5:
SELECT p.persid FROM person p, persmaterial pm,perscolor pc1,perscolor pc2,perscolor pc3 WHERE p.persid=pm.persid AND p.persid=pc1.persid AND p.persid=pc2.persid AND p.persid=pc3.persid AND (pc1.colorid=1 OR pc2.colorid=3) AND pc3.colorid=2 AND pm.materialid=2 GROUP BY p.persid
Remarks:
*very easy to get from pseudo query to MySQL query but what about performance?



-- phpMyAdmin SQL Dump
-- version 2.6.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Oct 19, 2006 at 01:13 PM
-- Server version: 4.1.9
-- PHP Version: 4.3.10
--
-- Database: `aston`
--

-- --------------------------------------------------------

--
-- Table structure for table `color`
--

CREATE TABLE `color` (
`colorid` int(11) NOT NULL auto_increment,
`color` varchar(30) NOT NULL default '',
PRIMARY KEY (`colorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;

--
-- Dumping data for table `color`
--

INSERT INTO `color` VALUES (1, 'red');
INSERT INTO `color` VALUES (2, 'green');
INSERT INTO `color` VALUES (3, 'blue');
INSERT INTO `color` VALUES (4, 'yellow');

-- --------------------------------------------------------

--
-- Table structure for table `material`
--

CREATE TABLE `material` (
`materialid` int(11) NOT NULL auto_increment,
`material` varchar(30) NOT NULL default '',
PRIMARY KEY (`materialid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;

--
-- Dumping data for table `material`
--

INSERT INTO `material` VALUES (1, 'wood');
INSERT INTO `material` VALUES (2, 'iron');

-- --------------------------------------------------------

--
-- Table structure for table `perscolor`
--

CREATE TABLE `perscolor` (
`perscolorid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`colorid` int(11) NOT NULL default '0',
PRIMARY KEY (`perscolorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=7 ;

--
-- Dumping data for table `perscolor`
--

INSERT INTO `perscolor` VALUES (1, 1, 1);
INSERT INTO `perscolor` VALUES (2, 1, 2);
INSERT INTO `perscolor` VALUES (3, 2, 1);
INSERT INTO `perscolor` VALUES (5, 3, 3);
INSERT INTO `perscolor` VALUES (6, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `persmaterial`
--

CREATE TABLE `persmaterial` (
`persmatid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`materialid` int(11) NOT NULL default '0',
PRIMARY KEY (`persmatid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

--
-- Dumping data for table `persmaterial`
--

INSERT INTO `persmaterial` VALUES (1, 1, 1);
INSERT INTO `persmaterial` VALUES (2, 1, 2);
INSERT INTO `persmaterial` VALUES (3, 2, 1);
INSERT INTO `persmaterial` VALUES (5, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `person`
--

CREATE TABLE `person` (
`persid` int(11) NOT NULL auto_increment,
`name` varchar(30) NOT NULL default '',
PRIMARY KEY (`persid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;

--
-- Dumping data for table `person`
--

INSERT INTO `person` VALUES (1, 'john');
INSERT INTO `person` VALUES (2, 'emily');
INSERT INTO `person` VALUES (3, 'liz');
Reply With Quote
  #2 (permalink)  
Old 10-25-06, 12:20
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
wow, what a lot of research

would you, by any chance, have done any volume tests for performance of your 5 options? which one is the fastest?

my guess would be that option 3 gives you -- by far -- the most flexible and yet efficient query

p.s. the persmaterial and perscolor tables should not have an auto_increment primary key

instead, make the primary key a composite key consisting of the two foreign key ids, and then add another index on the two ids in the other order
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #3 (permalink)  
Old 10-25-06, 14:48
phlype phlype is offline
Registered User
 
Join Date: Oct 2006
Posts: 4
performance testing

I added some more persons and preferences to the different tables so that there where about 10000 persons with random color and material preferences in the database; the example query resulted in 1900 persons; here are the timing results in seconds:
option 1: 0.2
option 2: 0.005
option 3: 0.003
option 4: 0.007
option 5: 0.006
option 6: 0.002

where option 6 is a newly added option:
SELECT pm.persid FROM persmaterial pm WHERE pm.materialid=2 AND pm.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid=2 AND pc.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid=1 OR pc.colorid=3))

I was a bit surprised by the good timing results for option 2; is there any one who can explain the way MySQL tackles the query in option 2 (I suppose MySQL is not going over all records of table "person" launching a subquery for each record of table "person"?);

I tend to conclude that option 2 is the way to go since it is so easy to translate the pseudo query into the MySQL query and on top of that the resulting MySQL query is rather fast!
Reply With Quote
  #4 (permalink)  
Old 10-25-06, 14:58
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
okay, those are interesting results, but the sample size is probably still too small

and of course if you ran them in that sequence without clearing the mysql cache in between, that explains why option 1 had some overhead (the others benefitted from rows sitting in buffers)

to add a wrinkle...

perhaps your app was not planning to offer negation as a selectable, but consider how your various query options behave for a query like "color either red or blue but not green, with material fabric but not vinyl"

the NOTs toss all the options except option 3 for a big loop
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #5 (permalink)  
Old 10-25-06, 17:47
phlype phlype is offline
Registered User
 
Join Date: Oct 2006
Posts: 4
testing results with NOT

I do not understand why option 3 would perform better with a NOT? Can you explain the rationale behind that?

Furthermore I tried to turn of the MySQL cache by adding the following line to the my.cnf file: query_cache_size=0
Does this turn off the cache completely?

I tried the pseudo query "color either red or blue but not green, with material iron" (see below for the 6 options yielding 3798 records in my particular example)
and here are the timing results in seconds (I don't see from these results that option 3 is particularly better when compared to the situation without a NOT);
option 1: 0.3
option 2: 0.009
option 3: 0.003
option 4: 0.004
option 5: 0.005
option 6: 0.004

Option 1:
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=3) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1)) UNION
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=1) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=1 AND count(DISTINCT pm.materialid)=1)) UNION
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=3) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=1 AND count(DISTINCT pm.materialid)=1))

Option 2:
SELECT persid FROM person p WHERE
(EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=1 AND p.persid=pc.persid)
OR
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=3 AND p.persid=pc.persid))
AND
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid<>2 AND p.persid=pc.persid)
AND
EXISTS(SELECT * FROM persmaterial pm WHERE pm.materialid=2 AND p.persid=pm.persid)


Option 3:
SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid
AND
(pc.colorid=1 OR pc.colorid<>2 OR pc.colorid=3)
AND p.persid=pm.persid
AND pm.materialid=2
GROUP BY p.persid HAVING
sum(case when pc.colorid in ('1','3') then 1 else 0 end) >= 1
AND
sum(case when pc.colorid<>'2' then 1 else 0 end)>=1
AND
sum(case when pm.materialid='2' then 1 else 0 end)>=1

SELECT DISTINCT pc1.persid FROM perscolor pc1
INNER JOIN perscolor pc2
ON pc1.persid=pc2.persid AND pc2.colorid<>2
INNER JOIN persmaterial pm1
ON pc1.persid=pm1.persid AND pm1.materialid=2
LEFT OUTER JOIN perscolor pc3
ON pc1.persid=pc3.persid AND pc3.colorid=1
LEFT OUTER JOIN perscolor pc4
ON pc1.persid=pc4.persid AND pc4.colorid=3
WHERE COALESCE(pc3.persid,pc4.persid) IS NOT NULL

Option 5:
SELECT p.persid FROM person p, persmaterial pm,perscolor pc1,perscolor pc2,perscolor pc3 WHERE p.persid=pm.persid AND p.persid=pc1.persid AND p.persid=pc2.persid AND p.persid=pc3.persid AND (pc1.colorid=1 OR pc2.colorid=3) AND pc3.colorid<>2 AND pm.materialid=2 GROUP BY p.persid

Option 6:
SELECT pm.persid FROM persmaterial pm WHERE pm.materialid=2 AND pm.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid<>2 AND pc.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid=1 OR pc.colorid=3))
Reply With Quote
  #6 (permalink)  
Old 10-25-06, 18:03
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
NOTs certainly are tricky, aren't they?

first of all, i suspect -- please confirm -- that we really aren't talking about colours and materials, right? i'm guessing job applicants and their skills? it's okay if you don't want to say, because colours and materials will do



okay, now to the problem

the search was to find people who have this colour or that colour, but not some other colour

now, you chose in several cases to implement this as

EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid<>2 AND p.persid=pc.persid)

this isn't the same thing at all, because it tests to see if any other colour besides 2 exists

what i was talking about was to make sure that colour 2 does not exist

the partial solution for this, for option 3, is
Code:
select p.persid
  from person p
inner
  join perscolor pc
    on pc.persid = p.persid
group
    by p.persid
having sum(case when pc.colorid in (1,3)
                then 1 else 0 end)       >= 1
   and sum(case when pc.colorid = 2
                then 1 else 0 end)        = 0
all the other solutions would involve a NOT EXISTS subquery

one more remark: since person-colours is one-to-many, and person-materials is one-to-many, it's not a good idea to have the person table joined to both of them in the same query, because of cross join effects, even though the extra multiplicity of intermediate result rows are collapsed using GROUP BY
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #7 (permalink)  
Old 10-25-06, 18:38
phlype phlype is offline
Registered User
 
Join Date: Oct 2006
Posts: 4
Your right about the application

I understand about the NOT now but probably my table is not large enough but these are the timings for option 2 and 3 of the query you proposed:
option 2: 0.0038
option 3: 0.0018

Concerning your last remark, if we go back to a query which has preferences on color AND on material, do you propose to have a query for color and one for material? Can the instersection of the results be done in MySQL?

Thx,

Phlype
Reply With Quote
  #8 (permalink)  
Old 10-25-06, 20:47
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
Quote:
Originally Posted by phlype
Your right about the application
well in that case you probably won't need to worry about the NOTs

i mean, any sort of skill/experience is a good skill/experience, right?

regarding intersection of results, you can always filter the results of two queries like this --
Code:
select persid
  from (
       select p.persid
         from person p
       inner
         join perscolor pc
           on pc.persid = p.persid
       group
           by p.persid
       having sum(case when pc.colorid in (1,3)
                       then 1 else 0 end)       >= 1
          and sum(case when pc.colorid = 2
                       then 1 else 0 end)        = 0
       UNION ALL
       select p.persid
         from person p
       inner
         join persmaterial pm
           on pm.persid = p.persid
       group
           by p.persid
       having similar conditions
       ) as combined
group
    by persid
having count(*) = 2
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On