Results 1 to 10 of 10
  1. #1
    Join Date
    Mar 2004
    Posts
    3

    Unanswered: Problems with a rewritten subselect query

    Question about a complex query (rewritten subselect)

    A bit of background: I have a PHP/MySQL site that allows users to search an academic journal's database of related scholarship. Users can search by title, author, categories, and so on. The table structure is setup so that each title can have 0 to inifinte matching records in corresponding tables (authors, categories, etc.).

    My problem is building a query in MySQL that will find items by up to 6 categories using 'AND'. So, the results would be all the items in category 1 and in category 2 and in category 3 and so on. In SQL Server, I would just run a subselect query. But in MySQL 4.0.16, I have to use joins exclusively, so I get a query something like this:

    -------------------

    SELECT t.*

    FROM titlesearch AS t, titlecat AS cat1, titlecat AS cat2, titlecat as cat3, titlecat as cat4, titlecat AS cat5, titlecat AS cat6

    WHERE (t.search LIKE '%searchterms%') AND cat1.catid = 'cat1' AND cat2.catid = 'cat2' AND cat3.catid = 'cat3' AND cat4.catid = 'cat4' AND cat5.catid = 'cat5' AND cat6.catid = 'cat6' AND t.tid = cat1.tid AND t.tid = cat2.tid AND t.tid = cat3.tid AND t.tid = cat4.tid AND t.tid = cat5.tid AND t.tid = cat6.tid

    -------------------

    Naturally, MySQL freaks out when I try to run this query, ultimately just trying to run the query until I kill it, as there are too many joins. It seems like 4 joins is the limit before things begin to go sour.

    So my question is does anyone have any suggestions about how to perform this sort of query? Is the best solution to flatten the table and have a query something like the following:

    -------------------
    SELECT t.*
    FROM titlesearch AS t
    WHERE (t.search LIKE '%searchterms%') AND (t.cat LIKE '%cat1%') AND (t.cat LIKE '%cat2%') AND ...
    -------------------

    The source data for the MySQL data is actually a SQL Server database. SQL server is used for all data entry and storage and then exported to MySQL, therefore denormalizing the MySQL structure isn't a big deal in terms of data integrity.

    I've googled the death out of this issue and have been told to rewrite my subselects as joins (which isn't working), to switch to PostgreSQL (unfortunately not an option at this time), to upgrade to the alpha 4.1 mysql (also not an option for a production system).

    Many thanks in advance for any advice you can offer!

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I'd like to reiterate the suggestions about PostgreSQL, or better yet using MS-SQL. If you must use MySQL, there is still a solution.

    Assuming that every title has at least one category, and that you can make unnecessary category parameters equal to the Cat1 parameter, then I'd use:
    PHP Code:
    SELECT t.*
       
    FROM titlesearch t
       JOIN titlecat 
    AS t1 ON (t1.tid t.tid)
       
    JOIN titlecat AS t2 ON (t2.tid t.tid)
       
    JOIN titlecat AS t3 ON (t3.tid t.tid)
       
    JOIN titlecat AS t4 ON (t4.tid t.tid)
       
    JOIN titlecat AS t5 ON (t5.tid t.tid)
       
    JOIN titlecat AS t6 ON (t6.tid t.tid)
       
    WHERE  t1.catid 'Cat1'
          
    AND t2.catid 'Cat2'
          
    AND t3.catid 'Cat3'
          
    AND t4.catid 'Cat4'
          
    AND t5.catid 'Cat5'
          
    AND t6.catid 'Cat6' 
    I don't have your test data to test it, but I think it should run pretty nicely.

    -PatP

  3. #3
    Join Date
    Mar 2004
    Posts
    3
    Thanks, Pat! That works great!

    I'm too dependent on subselects for this sort of thing, so I was a bit baffled by how to get the joins to do exactly what I needed in this situation. But you've cleared it up for me--many thanks!

    I was also wondering if you or anyone else knew if too many joins will slow a query down? I vaguely remember being told by a db admin several years ago to keep joins to a minimum of 4 or less.

    <edit>
    About PostgreSQL: I may go ahead and install PostgreSQL on our server this weekend and move over to it or at least give it a go, as it seems to be a better solution in the long run. My only concerns were those that I've heard from others: Postgre is slower and more likely to corrupt data--but perhaps this only applies to older versions? (MySQL 4.1 and 5 seem to be promising developments, if they ever make it into production releases).
    </edit>
    Last edited by cookiemonster; 03-19-04 at 16:16.

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    The number of joins that will slow down a query is very installation specific. In most cases somewhere between 20 and 30 tables is safe if you are running current releases of software on reasonable hardware.

    Obviously that is load dependant too. If you get 5000 people running a 20 table query, the box is likely to start to smoke!

    -PatP

  5. #5
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    number of joins: 1

    PHP Code:
    select t.tid
         
    t.search
      from titlesearch 
    as t
    inner
      join titlecat 
    as c
        on t
    .tid c.tid 
     where t
    .search like '%searchterms%'
       
    and c.catid in ('cat1','cat2','cat3'
                      
    ,'cat4','cat5','cat6')
    group  
        by t
    .tid
         
    t.search 
    having count
    (*) = 
    this solution also lets you easily implement a change in the search criteria such as "must be in at least 4 of the 6 selected categories"
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  6. #6
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Rudy's solution is elegant, and I'd actually thought of proposing it. The problem that I've run into is that sometimes you'll have "dirty" data, where you might have six connectoids between a title and cat1, but no other applicable connectoids. In that case, the simpler code that I proposed works as requested, while the more complex code returns a false positive.

    This is a design choice. The more complex code allows many neat features to be added easily, but it requires you to be much more careful to keep either your data or your logic clean!

    -PatP

  7. #7
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    design choice?

    there is no dirty data if you follow some very simple guidelines, such as declaring a primary key for the "connectoid" so that you cannot connect the same title to the same category more than once

    create table titlecat
    ( tid integer not null
    , foreign key (tid) references titles (tid)
    , catid integer not null
    , foreign key (catid) references categories (catid)
    , primary key (tid,catid)
    )

    of course, many people miss this obvious solution because they've gone and blindly declared the primary key as an autonumber (a totally useless design choice)
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  8. #8
    Join Date
    Mar 2004
    Posts
    3
    Thank you Rudy and Pat for your suggestions and help with this! I really appreciate it.

    I've tested both of the proposed solutions, and they both work just fine. However, it does seem that single join group query is much quicker in retrieving results, as well as being easy to code and maintain.

    The design of titlecat table is exactly as Rudy outlines, so there is no possibility of a title being assigned to the same category multiple times as the primary key is a compound key of the tid and catid.

    So, I think the less joins, the better, as there are still a few more tables that will need to be searched. All of the possible search parameters are as follows:

    words in title,
    author,
    year of publication (2: from and to),
    categories (6: 4 of subject, 2 chronological),
    isbn (1),
    language (1), and
    document type (1).

    (I am having a bit of trouble allowing the user to select a boolean operator between the categories, so titles in cat 1 and cat 2 or cat3. I just can't seem to get my PHP script to put the parenthesis in the right places...)

    I'll play around with it a bit more to see if any thing strange crops up in the results. And if anyone has any other suggestions, I'd be delighted to get them.

    Thanks again for your help!

  9. #9
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Originally posted by cookiemonster
    (I am having a bit of trouble allowing the user to select a boolean operator between the categories, so titles in cat 1 and cat 2 or cat3. I just can't seem to get my PHP script to put the parenthesis in the right places...)
    that would indeed complicated matters

    remember, you can test the existence of certain categories in the WHERE clause, but test the combinations only in the HAVING
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  10. #10
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Originally posted by r937
    design choice?

    there is no dirty data if you follow some very simple guidelines, such as declaring a primary key for the "connectoid" so that you cannot connect the same title to the same category more than once

    create table titlecat
    ( tid integer not null
    , foreign key (tid) references titles (tid)
    , catid integer not null
    , foreign key (catid) references categories (catid)
    , primary key (tid,catid)
    )

    of course, many people miss this obvious solution because they've gone and blindly declared the primary key as an autonumber (a totally useless design choice)
    I agree wholeheartedly that there shouldn't be any reason to worry about bad data, but I have to live in the real world and bad data is often a problem for systems that I'm asked to clean up after the fact.

    Mr. Celko and I have beat the issue of surrogate keys to death, more than once. We finally agreed to call a truce, because the way the I use them meets or exceeds his design criteria in every way, and he couldn't provide a natural key solution that was less than twice as expensive. While surrogate keys aren't for everybody, if used correctly they are sometime the only practical solution.

    As in almost every case where you and I disagree, we are both preaching to the choir from different pulpits. We have the same basic intentions, but we disagree on a few fine points that usually only crop up in very large, complex situations.

    -PatP

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •