Results 1 to 4 of 4
  1. #1
    Join Date
    Jul 2005
    Posts
    3

    Unanswered: Full text search with html entities

    Hello,

    I have a table with some fields indexed using fulltext but I'm having some issues on matching keywords that include HTML entities. MySQL is matching rows that are doesn't have my keyword but include the html entity:

    HTML Code:
    SELECT *, MATCH(cv.title,cv.intro,cv.text) AGAINST ('colaço') AS score FROM cms_contents ct LEFT JOIN cms_content_versions cv ON cv.id_content=ct.id_content WHERE NOW() BETWEEN ct.date_start AND ct.date_expire AND MATCH(cv.title,cv.intro,cv.text) AGAINST ('colaço') AND ct.id_lang = 1 AND ct.active='1' GROUP BY ct.id_content
    On this query, I'm trying to find matches for the keyword "colaço" which includes an HTML entity.
    When I run this query on the DB, not only it returns results with this keyword but also other keywords that include the entity 'ç'.
    I don't have a clue why is this happening.

    Does mySQL Fulltext index have issues with html entities?

    I'm looking for a solution for this.
    Thanks.

  2. #2
    Join Date
    Jul 2005
    Posts
    3
    Can anyone help me out?
    Thanks

  3. #3
    Join Date
    Apr 2005
    Location
    Baltimore, MD
    Posts
    297
    I feel like somebody posted something like this a few months back. It was also an issue with the '&' character. It may be treated as a stopword (I'm pretty sure the '-' character does this). There are some issues when using full text searches. Try checking the mysql site and googling some more and if I come up with a bright idea, I'll let you know.

  4. #4
    Join Date
    Apr 2005
    Location
    Baltimore, MD
    Posts
    297
    After revisiting this, I think this is why your query wasn't working:

    The parser used the '&' and the ';' characters as delimiters, so you were actually searching against 'cola', 'ccedil', and 'o'. Since 'o' is too short, it was dropped. Now you are just searching for 'cola' and 'ccedil' (not as one word anymore). This would explain why entries with 'ç' were returned. They should have had a lower score as well, right? Similarly, you should have also gotten back results with just 'cola' and not 'ccedil'. I think the trick is to encapsulate the match against in double quotes using boolean mode.

    Code:
    SELECT 
         *
         , MATCH(cv.title,cv.intro,cv.text) AGAINST ('"colaço"' IN BOOLEAN MODE) AS score 
    FROM 
         cms_contents ct 
              LEFT JOIN cms_content_versions cv 
              ON cv.id_content=ct.id_content 
    WHERE 
         NOW() BETWEEN ct.date_start AND ct.date_expire 
         AND MATCH(cv.title,cv.intro,cv.text) AGAINST ('"colaço"' IN BOOLEAN MODE) 
         AND ct.id_lang = 1 AND ct.active='1' 
    GROUP BY ct.id_content
    Let me know if that works. Good luck.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •