Results 1 to 8 of 8
  1. #1
    Join Date
    Oct 2006
    Posts
    4

    Unanswered: how to create efficient MySQL query from a pseudo query

    I'm trying to build a webapplication where users can search for a person having a particular preference for color and material. To store this information I use the following structure (a MySQL dump can be found at the end of this post):
    *table person with fields:
    -persid: autoincrement id
    -name: name of the person
    *table material with fields:
    -materialid: autoincrement id
    -material: name of the material eg "wood"
    *table color with fields:
    -colorid: autoincrement id
    -color: name of the color eg "green"
    *table persmaterial with fields:
    -persmatid: autoincrement id
    -persid: link to table person
    -materialid: link to table material
    *table perscolor with fields:
    -perscolorid: autoincrement id
    -persid: link to table person
    -colorid: link to table color

    In the webapplication the search can be entered by the users as a kind of pseudo query:
    (color=red OR color=blue) AND color=green AND material=iron

    My question is: how can I automatically transform this pseudo query into an efficient MySQL query?
    I have tried out some different options:


    Option 1:
    (SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=2) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1)) UNION
    (SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=2 OR pc.colorid=3) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
    Remarks:
    *I do not see how to turn a general pseudo query into a query like the one in option 1, except for turning the pseudo query into a sum of products form where the sulms would correspond to the UNIONs. IS there a clever way to obtain such a sum of products form from an arbitrary pseudo query?


    Option 2:
    SELECT persid FROM person p WHERE
    (EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=1 AND p.persid=pc.persid)
    OR
    EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=3 AND p.persid=pc.persid))
    AND
    EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=2 AND p.persid=pc.persid)
    AND
    EXISTS(SELECT * FROM persmaterial pm WHERE pm.materialid=2 AND p.persid=pm.persid)
    Remarks:
    *very easy to get from pseudo query to MySQL query but what about performance?

    Option 3:
    SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
    p.persid=pc.persid
    AND
    (pc.colorid=1 OR pc.colorid=2 OR pc.colorid=3)
    AND p.persid=pm.persid
    AND pm.materialid=2
    GROUP BY p.persid HAVING
    sum(case when pc.colorid in ('1','3') then 1 else 0 end) >= 1
    AND
    sum(case when pc.colorid='2' then 1 else 0 end)>=1
    AND
    sum(case when pm.materialid='2' then 1 else 0 end)>=1
    Remarks:
    *this option requires the pseudo query to be turned into a product of sums form; again is their a clever way to obtain such a form;




    Option 4
    SELECT DISTINCT pc1.persid FROM perscolor pc1
    INNER JOIN perscolor pc2
    ON pc1.persid=pc2.persid AND pc2.colorid=2
    INNER JOIN persmaterial pm1
    ON pc1.persid=pm1.persid AND pm1.materialid=2
    LEFT OUTER JOIN perscolor pc3
    ON pc1.persid=pc3.persid AND pc3.colorid=1
    LEFT OUTER JOIN perscolor pc4
    ON pc1.persid=pc4.persid AND pc4.colorid=3
    WHERE COALESCE(pc3.persid,pc4.persid) IS NOT NULL
    Remarks:
    *this option requires the pseudo query to be turned into a product of sums form

    Option 5:
    SELECT p.persid FROM person p, persmaterial pm,perscolor pc1,perscolor pc2,perscolor pc3 WHERE p.persid=pm.persid AND p.persid=pc1.persid AND p.persid=pc2.persid AND p.persid=pc3.persid AND (pc1.colorid=1 OR pc2.colorid=3) AND pc3.colorid=2 AND pm.materialid=2 GROUP BY p.persid
    Remarks:
    *very easy to get from pseudo query to MySQL query but what about performance?



    -- phpMyAdmin SQL Dump
    -- version 2.6.1
    -- http://www.phpmyadmin.net
    --
    -- Host: localhost
    -- Generation Time: Oct 19, 2006 at 01:13 PM
    -- Server version: 4.1.9
    -- PHP Version: 4.3.10
    --
    -- Database: `aston`
    --

    -- --------------------------------------------------------

    --
    -- Table structure for table `color`
    --

    CREATE TABLE `color` (
    `colorid` int(11) NOT NULL auto_increment,
    `color` varchar(30) NOT NULL default '',
    PRIMARY KEY (`colorid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;

    --
    -- Dumping data for table `color`
    --

    INSERT INTO `color` VALUES (1, 'red');
    INSERT INTO `color` VALUES (2, 'green');
    INSERT INTO `color` VALUES (3, 'blue');
    INSERT INTO `color` VALUES (4, 'yellow');

    -- --------------------------------------------------------

    --
    -- Table structure for table `material`
    --

    CREATE TABLE `material` (
    `materialid` int(11) NOT NULL auto_increment,
    `material` varchar(30) NOT NULL default '',
    PRIMARY KEY (`materialid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;

    --
    -- Dumping data for table `material`
    --

    INSERT INTO `material` VALUES (1, 'wood');
    INSERT INTO `material` VALUES (2, 'iron');

    -- --------------------------------------------------------

    --
    -- Table structure for table `perscolor`
    --

    CREATE TABLE `perscolor` (
    `perscolorid` int(11) NOT NULL auto_increment,
    `persid` int(11) NOT NULL default '0',
    `colorid` int(11) NOT NULL default '0',
    PRIMARY KEY (`perscolorid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=7 ;

    --
    -- Dumping data for table `perscolor`
    --

    INSERT INTO `perscolor` VALUES (1, 1, 1);
    INSERT INTO `perscolor` VALUES (2, 1, 2);
    INSERT INTO `perscolor` VALUES (3, 2, 1);
    INSERT INTO `perscolor` VALUES (5, 3, 3);
    INSERT INTO `perscolor` VALUES (6, 3, 2);

    -- --------------------------------------------------------

    --
    -- Table structure for table `persmaterial`
    --

    CREATE TABLE `persmaterial` (
    `persmatid` int(11) NOT NULL auto_increment,
    `persid` int(11) NOT NULL default '0',
    `materialid` int(11) NOT NULL default '0',
    PRIMARY KEY (`persmatid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

    --
    -- Dumping data for table `persmaterial`
    --

    INSERT INTO `persmaterial` VALUES (1, 1, 1);
    INSERT INTO `persmaterial` VALUES (2, 1, 2);
    INSERT INTO `persmaterial` VALUES (3, 2, 1);
    INSERT INTO `persmaterial` VALUES (5, 3, 2);

    -- --------------------------------------------------------

    --
    -- Table structure for table `person`
    --

    CREATE TABLE `person` (
    `persid` int(11) NOT NULL auto_increment,
    `name` varchar(30) NOT NULL default '',
    PRIMARY KEY (`persid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;

    --
    -- Dumping data for table `person`
    --

    INSERT INTO `person` VALUES (1, 'john');
    INSERT INTO `person` VALUES (2, 'emily');
    INSERT INTO `person` VALUES (3, 'liz');

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    wow, what a lot of research

    would you, by any chance, have done any volume tests for performance of your 5 options? which one is the fastest?

    my guess would be that option 3 gives you -- by far -- the most flexible and yet efficient query

    p.s. the persmaterial and perscolor tables should not have an auto_increment primary key

    instead, make the primary key a composite key consisting of the two foreign key ids, and then add another index on the two ids in the other order
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Oct 2006
    Posts
    4

    performance testing

    I added some more persons and preferences to the different tables so that there where about 10000 persons with random color and material preferences in the database; the example query resulted in 1900 persons; here are the timing results in seconds:
    option 1: 0.2
    option 2: 0.005
    option 3: 0.003
    option 4: 0.007
    option 5: 0.006
    option 6: 0.002

    where option 6 is a newly added option:
    SELECT pm.persid FROM persmaterial pm WHERE pm.materialid=2 AND pm.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid=2 AND pc.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid=1 OR pc.colorid=3))

    I was a bit surprised by the good timing results for option 2; is there any one who can explain the way MySQL tackles the query in option 2 (I suppose MySQL is not going over all records of table "person" launching a subquery for each record of table "person"?);

    I tend to conclude that option 2 is the way to go since it is so easy to translate the pseudo query into the MySQL query and on top of that the resulting MySQL query is rather fast!

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    okay, those are interesting results, but the sample size is probably still too small

    and of course if you ran them in that sequence without clearing the mysql cache in between, that explains why option 1 had some overhead (the others benefitted from rows sitting in buffers)

    to add a wrinkle...

    perhaps your app was not planning to offer negation as a selectable, but consider how your various query options behave for a query like "color either red or blue but not green, with material fabric but not vinyl"

    the NOTs toss all the options except option 3 for a big loop
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    Oct 2006
    Posts
    4

    testing results with NOT

    I do not understand why option 3 would perform better with a NOT? Can you explain the rationale behind that?

    Furthermore I tried to turn of the MySQL cache by adding the following line to the my.cnf file: query_cache_size=0
    Does this turn off the cache completely?

    I tried the pseudo query "color either red or blue but not green, with material iron" (see below for the 6 options yielding 3798 records in my particular example)
    and here are the timing results in seconds (I don't see from these results that option 3 is particularly better when compared to the situation without a NOT);
    option 1: 0.3
    option 2: 0.009
    option 3: 0.003
    option 4: 0.004
    option 5: 0.005
    option 6: 0.004

    Option 1:
    (SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=3) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1)) UNION
    (SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=1) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=1 AND count(DISTINCT pm.materialid)=1)) UNION
    (SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE p.persid=pc.persid AND (pc.colorid=3) AND p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING (count(DISTINCT pc.colorid)=1 AND count(DISTINCT pm.materialid)=1))

    Option 2:
    SELECT persid FROM person p WHERE
    (EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=1 AND p.persid=pc.persid)
    OR
    EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=3 AND p.persid=pc.persid))
    AND
    EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid<>2 AND p.persid=pc.persid)
    AND
    EXISTS(SELECT * FROM persmaterial pm WHERE pm.materialid=2 AND p.persid=pm.persid)


    Option 3:
    SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
    p.persid=pc.persid
    AND
    (pc.colorid=1 OR pc.colorid<>2 OR pc.colorid=3)
    AND p.persid=pm.persid
    AND pm.materialid=2
    GROUP BY p.persid HAVING
    sum(case when pc.colorid in ('1','3') then 1 else 0 end) >= 1
    AND
    sum(case when pc.colorid<>'2' then 1 else 0 end)>=1
    AND
    sum(case when pm.materialid='2' then 1 else 0 end)>=1

    SELECT DISTINCT pc1.persid FROM perscolor pc1
    INNER JOIN perscolor pc2
    ON pc1.persid=pc2.persid AND pc2.colorid<>2
    INNER JOIN persmaterial pm1
    ON pc1.persid=pm1.persid AND pm1.materialid=2
    LEFT OUTER JOIN perscolor pc3
    ON pc1.persid=pc3.persid AND pc3.colorid=1
    LEFT OUTER JOIN perscolor pc4
    ON pc1.persid=pc4.persid AND pc4.colorid=3
    WHERE COALESCE(pc3.persid,pc4.persid) IS NOT NULL

    Option 5:
    SELECT p.persid FROM person p, persmaterial pm,perscolor pc1,perscolor pc2,perscolor pc3 WHERE p.persid=pm.persid AND p.persid=pc1.persid AND p.persid=pc2.persid AND p.persid=pc3.persid AND (pc1.colorid=1 OR pc2.colorid=3) AND pc3.colorid<>2 AND pm.materialid=2 GROUP BY p.persid

    Option 6:
    SELECT pm.persid FROM persmaterial pm WHERE pm.materialid=2 AND pm.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid<>2 AND pc.persid IN (SELECT pc.persid FROM perscolor pc WHERE pc.colorid=1 OR pc.colorid=3))

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    NOTs certainly are tricky, aren't they?

    first of all, i suspect -- please confirm -- that we really aren't talking about colours and materials, right? i'm guessing job applicants and their skills? it's okay if you don't want to say, because colours and materials will do



    okay, now to the problem

    the search was to find people who have this colour or that colour, but not some other colour

    now, you chose in several cases to implement this as

    EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid<>2 AND p.persid=pc.persid)

    this isn't the same thing at all, because it tests to see if any other colour besides 2 exists

    what i was talking about was to make sure that colour 2 does not exist

    the partial solution for this, for option 3, is
    Code:
    select p.persid
      from person p
    inner
      join perscolor pc
        on pc.persid = p.persid
    group
        by p.persid
    having sum(case when pc.colorid in (1,3)
                    then 1 else 0 end)       >= 1
       and sum(case when pc.colorid = 2
                    then 1 else 0 end)        = 0
    all the other solutions would involve a NOT EXISTS subquery

    one more remark: since person-colours is one-to-many, and person-materials is one-to-many, it's not a good idea to have the person table joined to both of them in the same query, because of cross join effects, even though the extra multiplicity of intermediate result rows are collapsed using GROUP BY
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  7. #7
    Join Date
    Oct 2006
    Posts
    4
    Your right about the application

    I understand about the NOT now but probably my table is not large enough but these are the timings for option 2 and 3 of the query you proposed:
    option 2: 0.0038
    option 3: 0.0018

    Concerning your last remark, if we go back to a query which has preferences on color AND on material, do you propose to have a query for color and one for material? Can the instersection of the results be done in MySQL?

    Thx,

    Phlype

  8. #8
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by phlype
    Your right about the application
    well in that case you probably won't need to worry about the NOTs

    i mean, any sort of skill/experience is a good skill/experience, right?

    regarding intersection of results, you can always filter the results of two queries like this --
    Code:
    select persid
      from (
           select p.persid
             from person p
           inner
             join perscolor pc
               on pc.persid = p.persid
           group
               by p.persid
           having sum(case when pc.colorid in (1,3)
                           then 1 else 0 end)       >= 1
              and sum(case when pc.colorid = 2
                           then 1 else 0 end)        = 0
           UNION ALL
           select p.persid
             from person p
           inner
             join persmaterial pm
               on pm.persid = p.persid
           group
               by p.persid
           having similar conditions
           ) as combined
    group
        by persid
    having count(*) = 2
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •