Results 1 to 9 of 9
  1. #1
    Join Date
    Mar 2007
    Posts
    212

    Unanswered: Full text search vs normal LIKE operator search

    Hi,

    I have table with the following fields;

    chemical_id int
    formula varchar
    name varchar

    The table contians a million chemical compounds. I need to create a text search for the database. Which method would be the fastest. Using a LIKE operator e.g.

    select chemical_id from chemtable where formula LIKE '%CN1C=NC2=1C(=O)N(C)C(=O)N2C%';

    or doing a full text search e.g.

    select chemical_id from chemtable where match(formula) against('CN1C=NC2=1C(=O)N(C)C(=O)N2C');

    one of the problems with a full text search is that hypenated words are excluded from the full text index which doesnt help much since most of the chemical names in the table are hypenated!

    Any suggestions on which would be the most efficient method?

  2. #2
    Join Date
    Mar 2004
    Posts
    480
    a full text search would be faster than a like search. your like search begins with a % and thus would never use an index. so either way I don't believe an index would be used.

  3. #3
    Join Date
    Mar 2007
    Posts
    212
    Ok but there is just one problem with that. The collation of the formula field is set so that it is case sensitive (latin1_general_cs ) and the collation of the name field isnt. According to the mysql documention the use of multiple character sets within a single table is supported however all columns in a FULLTEXT index must use the same character set and collation.

    So how would you get around the above if i want to set up a fulltext index on both the formula and the name fields?

  4. #4
    Join Date
    Mar 2007
    Location
    636f6d7075746572
    Posts
    770
    Set them both to the same collation...

  5. #5
    Join Date
    Mar 2007
    Posts
    212
    I would but I require only the formula field to be case sensistive becuase uppercase and lower case of the same formula denotes different chemical compounds. I dont need the name field to be case sensistive because it is irrelevant.

    Also is it true that mysql will not use and index on a like operator using the % ?

  6. #6
    Join Date
    Mar 2004
    Posts
    480
    Quote Originally Posted by ozzii
    Also is it true that mysql will not use and index on a like operator using the % ?
    See post #2 above.

  7. #7
    Join Date
    Mar 2007
    Posts
    212
    Quote Originally Posted by guelphdad
    See post #2 above.
    I have seen post #2. But I seek further clarification as to where you got this information from? There is nothing in the MySql documention which mentions using a % with a LIKE operator search negates the index - or at least I cant finding anything to this effect.

    would you mind forwarding a link to where you got this information from?

  8. #8
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Ozzii

    Full text indexes are optimised to work around words and sentences and may have difficulty making any sense out of your formula field containing "CN1C=NC2=1C(=O)N(C)C(=O)N2C". You may find it quicker to search this field using a normal index. If the chemical name field is made up of discrete words then this might be a good candidate for the full text search (assuming you want to search on this).

    However as guelphdad pointed out, searching for "%string%" won't work well with normal indexes either. Indexes are usually organised with B-trees that allow the database to quickly navigate to the right key item - so if you searched for "CN1C%" the database could jump to those key items beginning with C then jump to those starting with CN and quickly get to your answer. If your search string starts with % then the database wouldn't know where to jump to and would instead have to scan each item in the index (or just do a table scan) and this takes far more time. I found this link which covers some of this.

    It really depends on how many chemicals you have in your database - if there aren't that many then you might find it doesn't really matter. Could you try both types of index (and perhaps no index at all) and see which is faster. It would be interesting to see the comparisons.

    If things are still slow then could you create your own caching table where you might want to look for common substrings in the formula field and then extract all the chemical_ids that contain it. Then if a user searches for one of these substrings you can just get the chemical_ids directly from your cache table without any searching. Another method might be to simply cache all the user searches and the ids returned then use this cache if the search string has been used before - you may want to repopulate the matches automatically if there is a lot of change in your database.

    Just my 2cents.

    Mike

    Edited to try and make more sense.
    Last edited by mike_bike_kite; 07-29-07 at 06:22.

  9. #9
    Join Date
    Mar 2007
    Location
    636f6d7075746572
    Posts
    770
    ozzii, even if you can't find any information on what guelphdad was talking about in regards to indexes being searched, just think about it logically. From a searching algorithm point of view, in a tree structure.
    If you say LIKE 'ABC%' and LIKE '%ABC%', which is a) going to be found faster & b) indexed.
    Lets take ABC% as our first example: in alphanum sorting, the character 'A' is going to be near the top of our resultset (and thus probably the left hand-side our tree), thus the index can go :
    IS root_node > A (yes)
    GO LEFT
    IS subnode of root_node > A (yes)
    Keep going left...
    and so on...
    In the second example our first character is in fact %, which is ANYTHING under the sun. So when we start searching we are doing:
    IS root_node > UNKNOWN (it could be)
    GO LEFT AND RIGHT
    Is the second character of our string part an A
    It could be, but it might not be.
    If is is, then check for B and so on...
    If it isn't , then carry onto the third character in that string, is it an A?

    As you can see a LIKE ON '%ABC%' is going to be horribly inefficient, and have to search practically the whole tree to find what it's after. Potentially EVERY node could contain the answer we're looking for.

    Please note: this is a simplistic example of what i'm trying to demonstrate and does not in fact represent what happens in the MySQL algorithms that perform the search.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •