I'm putting a site together that would load snippets of info (ie: quotes, a sentence or two) onto content pages. Snippets would be tagged with keywords so we could match them by relevance: a page sends a list of keywords to the database, the database returns a list of top matching snippets based on the number of those keywords that they match. I think I've got the table schema right, but I'm a little stuck on the SQL required to do this:
AFAIK, it's not horrendously complicated, right? This is just a bit of inner joining, some count and maybe a bit of grouping by.
SELECT TOP 5 (id) FROM tblSnippets
INNER JOIN tblSnippetKeywords ON tblSnippetKeywords.snippet_id = tblSnippets.id
INNER JOIN tblKeywords ON tblSnippetKeywords.keyword_id = tblKeywords .id
WHERE tblKeywords.keyword = "fish" OR tblKeywords.keyword = "penguins"
GROUP BY tblSnippets.id
ORDER BY COUNT(tblKeywords.id)
Is that getting close? Does it at least indicate what I'm aiming for?
Could you use a score in the keyword table to help order your output, give a higher score to those keywords that don't appear often in the snippets table and a lower score to those that appear everywhere. Update this score once a day. You could calculate it on the fly but it would make the SQL more complex and slower and there's no real advantage.
I think you could loose the keyword_id field and just use keyword instead. I'd also loose the tbl prefix on all your tables unless you get confused as to what's a table and what isn't. Also the top 5 syntax comes from SQL Server and MySQL uses limit at the bottom.
select s.url_to_link_to, sum( k.score )
from SnippetKeywords sk,
where sk.keyword in ( 'fish','penguins' )
and k.keyword = sk.keyword
and s.id = sk.snippet_id
group by s.url_to_link_to
order by sum( k.score ) desc
if you lose the keyword_id, then you don't really need the keywords table at all (unless you wanted to restrict the allowed keywords through relational integrity)
in particular you would not need to reference the keyword table in the query
but your "score" concept, mike, does not really address the original question
suppose "fish" has a score of 5, "penguins" has a score of 2, and "antarctica" has a score of 1
then a snippet which has only "fish" gets a SUM which is greater than a snippet which has "penguins" and "antarctica" -- but the original poster clearly stated (with emphasis, i should point out) that it should be "based on the number of those keywords that they match"
Damn. I knew this was going to end up tricker than I'd thought.
The keywords will be assigned by me. It will not be updated by any user or process other than me going in directly and adding some new ones every few months.
I don't get how that two-table approach works, though. Doesn't that create a schema where one keyword attached to a particular snippet is intrinsically different from the same keyword attached to a different snippet? No, of course it doesn't. God, that sound stupid even as I type it.
In fact, it's probably just what I want:
create table snippets
( id integer not null primary key
, content text not null
create table snippetkeywords
( snippet_id integer not null
, keyword varchar(99) not null
, primary key ( snippet_id, keyword )
, index (keyword , snippet_id)
Ok, so.... I've never seen anything like this before. Is there a name for it? Something I can bash into Google and find out what the index bit is for and how you make a primary key out of two fields? Like, a "database design for complete planks" website or something?
Then all I need to do is figure out how to query it
(ps. the TOP thing - is the answer simply to ORDER BY the descending keyword count and then LIMIT)?
ETA: Erk. You all replied while I wasn't looking...
r397 - You're quite right - [joking]I only read the subject line and speed read the rest[/joking].
I still think the OP should consider the idea of using a score though. If you searched for snippets that contained "computer", "linux" and "EEEPC" then the score would allow the search to highlight snippets that include the least used word (EEEPC) first rather than just pulling loads of spurious links on computers and linux.
Ok, cool. I will Google, thankyou. But in the meantime... how do I query it? How do I get a list of snippets that match a given set of keywords, ordered by the number of keywords that each snippet matched?
SELECT s.id, s.content, COUNT(k.keyword)
FROM snippets s, keywords k
WHERE k.snippet_id = s.id
AND k.keyword IN ('buckets', 'sealions')
GROUP BY s.id, k.keyword
ORDER BY COUNT(k.keyword) DESC