Results 1 to 10 of 10
  1. #1
    Join Date
    Jan 2002
    Posts
    189

    Unanswered: matching keyword relevance by count?

    I'm putting a site together that would load snippets of info (ie: quotes, a sentence or two) onto content pages. Snippets would be tagged with keywords so we could match them by relevance: a page sends a list of keywords to the database, the database returns a list of top matching snippets based on the number of those keywords that they match. I think I've got the table schema right, but I'm a little stuck on the SQL required to do this:
    Code:
    tblSnippets:
      id, content, url_to_link_to
    tblKeywords:
      id, keyword
    tblSnippetKeywords:
      snippet_id, keyword_id
    AFAIK, it's not horrendously complicated, right? This is just a bit of inner joining, some count and maybe a bit of grouping by.

    Code:
    SELECT TOP 5 (id) FROM tblSnippets
    INNER JOIN tblSnippetKeywords ON tblSnippetKeywords.snippet_id = tblSnippets.id
    INNER JOIN tblKeywords ON tblSnippetKeywords.keyword_id = tblKeywords .id
    WHERE tblKeywords.keyword = "fish" OR tblKeywords.keyword = "penguins"
    GROUP BY tblSnippets.id
    ORDER BY COUNT(tblKeywords.id)
    Is that getting close? Does it at least indicate what I'm aiming for?

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    TOP doesn't work in mysql

    who assigns the keywords? are they a controlled list or will you accept any new keyword? see http://www.dbforums.com/showthread.php?t=1624898 and replace "hobbies" with "keywords"
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Could you use a score in the keyword table to help order your output, give a higher score to those keywords that don't appear often in the snippets table and a lower score to those that appear everywhere. Update this score once a day. You could calculate it on the fly but it would make the SQL more complex and slower and there's no real advantage.

    I think you could loose the keyword_id field and just use keyword instead. I'd also loose the tbl prefix on all your tables unless you get confused as to what's a table and what isn't. Also the top 5 syntax comes from SQL Server and MySQL uses limit at the bottom.

    Code:
    select s.url_to_link_to, sum( k.score )
    from	SnippetKeywords sk,
    	Keywords k,
    	Snippets s
    where	sk.keyword in ( 'fish','penguins' )
    	and k.keyword = sk.keyword
    	and s.id = sk.snippet_id
    group by s.url_to_link_to
    order by sum( k.score ) desc
    limit 5
    Mike

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    if you lose the keyword_id, then you don't really need the keywords table at all (unless you wanted to restrict the allowed keywords through relational integrity)

    in particular you would not need to reference the keyword table in the query

    but your "score" concept, mike, does not really address the original question

    suppose "fish" has a score of 5, "penguins" has a score of 2, and "antarctica" has a score of 1

    then a snippet which has only "fish" gets a SUM which is greater than a snippet which has "penguins" and "antarctica" -- but the original poster clearly stated (with emphasis, i should point out) that it should be "based on the number of those keywords that they match"
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    Jan 2002
    Posts
    189
    Damn. I knew this was going to end up tricker than I'd thought.

    The keywords will be assigned by me. It will not be updated by any user or process other than me going in directly and adding some new ones every few months.

    I don't get how that two-table approach works, though. Doesn't that create a schema where one keyword attached to a particular snippet is intrinsically different from the same keyword attached to a different snippet? No, of course it doesn't. God, that sound stupid even as I type it.

    In fact, it's probably just what I want:

    Code:
    create table snippets
    ( id   integer     not null primary key
    , content text not null
    );
    
    create table snippetkeywords
    ( snippet_id integer     not null
    , keyword     varchar(99) not null
    , primary key ( snippet_id, keyword )
    , index (keyword , snippet_id)
    );
    Ok, so.... I've never seen anything like this before. Is there a name for it? Something I can bash into Google and find out what the index bit is for and how you make a primary key out of two fields? Like, a "database design for complete planks" website or something?

    Then all I need to do is figure out how to query it

    (ps. the TOP thing - is the answer simply to ORDER BY the descending keyword count and then LIMIT)?


    ETA: Erk. You all replied while I wasn't looking...

  6. #6
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    r397 - You're quite right - [joking]I only read the subject line and speed read the rest[/joking].

    I still think the OP should consider the idea of using a score though. If you searched for snippets that contained "computer", "linux" and "EEEPC" then the score would allow the search to highlight snippets that include the least used word (EEEPC) first rather than just pulling loads of spurious links on computers and linux.

  7. #7
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by Spudhead
    ... find out what the index bit is for
    the PK is an index, and because it's an index it will give optimum performance for queries which retrieve all the keywords for a given snippet

    the INDEX, which contains the two columns in the other order, will give optimum performance for queries which retrieve all the snippets for a given keyword

    each of them is actually a covering index (search that), so the query will not even access the table rows at all!

    Quote Originally Posted by Spudhead
    ... and how you make a primary key out of two fields?
    this is just basic syntax, and you do it just like the example shows
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  8. #8
    Join Date
    Jan 2002
    Posts
    189
    Ok, cool. I will Google, thankyou. But in the meantime... how do I query it? How do I get a list of snippets that match a given set of keywords, ordered by the number of keywords that each snippet matched?

    Code:
    SELECT s.id, s.content, COUNT(k.keyword)
    FROM snippets s, keywords k
    WHERE k.snippet_id = s.id
    AND k.keyword IN ('buckets', 'sealions')
    GROUP BY s.id, k.keyword
    ORDER BY COUNT(k.keyword) DESC
    It's nothing like that, is it?

  9. #9
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    actually, it is something like that
    Code:
    SELECT s.id
         , s.content
         , COUNT(*)
      FROM keywords k
    INNER
      JOIN snippets s
        ON s.id =  k.snippet_id
     WHERE k.keyword IN ('buckets', 'sealions')
     GROUP 
        BY s.id
    ORDER 
        BY COUNT(*) DESC
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  10. #10
    Join Date
    Jan 2002
    Posts
    189
    You guys rock Thank you very much.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •