If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > matching keyword relevance by count?

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 11-29-07, 07:50
Spudhead Spudhead is offline
Registered User
 
Join Date: Jan 2002
Posts: 189
matching keyword relevance by count?

I'm putting a site together that would load snippets of info (ie: quotes, a sentence or two) onto content pages. Snippets would be tagged with keywords so we could match them by relevance: a page sends a list of keywords to the database, the database returns a list of top matching snippets based on the number of those keywords that they match. I think I've got the table schema right, but I'm a little stuck on the SQL required to do this:
Code:
tblSnippets:
  id, content, url_to_link_to
tblKeywords:
  id, keyword
tblSnippetKeywords:
  snippet_id, keyword_id
AFAIK, it's not horrendously complicated, right? This is just a bit of inner joining, some count and maybe a bit of grouping by.

Code:
SELECT TOP 5 (id) FROM tblSnippets
INNER JOIN tblSnippetKeywords ON tblSnippetKeywords.snippet_id = tblSnippets.id
INNER JOIN tblKeywords ON tblSnippetKeywords.keyword_id = tblKeywords .id
WHERE tblKeywords.keyword = "fish" OR tblKeywords.keyword = "penguins"
GROUP BY tblSnippets.id
ORDER BY COUNT(tblKeywords.id)
Is that getting close? Does it at least indicate what I'm aiming for?
Reply With Quote
  #2 (permalink)  
Old 11-29-07, 08:31
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
TOP doesn't work in mysql

who assigns the keywords? are they a controlled list or will you accept any new keyword? see Noob question - I keep getting conflicting answers and replace "hobbies" with "keywords"
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #3 (permalink)  
Old 11-29-07, 09:06
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Could you use a score in the keyword table to help order your output, give a higher score to those keywords that don't appear often in the snippets table and a lower score to those that appear everywhere. Update this score once a day. You could calculate it on the fly but it would make the SQL more complex and slower and there's no real advantage.

I think you could loose the keyword_id field and just use keyword instead. I'd also loose the tbl prefix on all your tables unless you get confused as to what's a table and what isn't. Also the top 5 syntax comes from SQL Server and MySQL uses limit at the bottom.

Code:
select s.url_to_link_to, sum( k.score )
from	SnippetKeywords sk,
	Keywords k,
	Snippets s
where	sk.keyword in ( 'fish','penguins' )
	and k.keyword = sk.keyword
	and s.id = sk.snippet_id
group by s.url_to_link_to
order by sum( k.score ) desc
limit 5
Mike
Reply With Quote
  #4 (permalink)  
Old 11-29-07, 09:19
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
if you lose the keyword_id, then you don't really need the keywords table at all (unless you wanted to restrict the allowed keywords through relational integrity)

in particular you would not need to reference the keyword table in the query

but your "score" concept, mike, does not really address the original question

suppose "fish" has a score of 5, "penguins" has a score of 2, and "antarctica" has a score of 1

then a snippet which has only "fish" gets a SUM which is greater than a snippet which has "penguins" and "antarctica" -- but the original poster clearly stated (with emphasis, i should point out) that it should be "based on the number of those keywords that they match"
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #5 (permalink)  
Old 11-29-07, 09:42
Spudhead Spudhead is offline
Registered User
 
Join Date: Jan 2002
Posts: 189
Damn. I knew this was going to end up tricker than I'd thought.

The keywords will be assigned by me. It will not be updated by any user or process other than me going in directly and adding some new ones every few months.

I don't get how that two-table approach works, though. Doesn't that create a schema where one keyword attached to a particular snippet is intrinsically different from the same keyword attached to a different snippet? No, of course it doesn't. God, that sound stupid even as I type it.

In fact, it's probably just what I want:

Code:
create table snippets
( id   integer     not null primary key
, content text not null
);

create table snippetkeywords
( snippet_id integer     not null
, keyword     varchar(99) not null
, primary key ( snippet_id, keyword )
, index (keyword , snippet_id)
);
Ok, so.... I've never seen anything like this before. Is there a name for it? Something I can bash into Google and find out what the index bit is for and how you make a primary key out of two fields? Like, a "database design for complete planks" website or something?

Then all I need to do is figure out how to query it

(ps. the TOP thing - is the answer simply to ORDER BY the descending keyword count and then LIMIT)?


ETA: Erk. You all replied while I wasn't looking...
Reply With Quote
  #6 (permalink)  
Old 11-29-07, 09:48
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
r397 - You're quite right - [joking]I only read the subject line and speed read the rest[/joking].

I still think the OP should consider the idea of using a score though. If you searched for snippets that contained "computer", "linux" and "EEEPC" then the score would allow the search to highlight snippets that include the least used word (EEEPC) first rather than just pulling loads of spurious links on computers and linux.
Reply With Quote
  #7 (permalink)  
Old 11-29-07, 09:53
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
Quote:
Originally Posted by Spudhead
... find out what the index bit is for
the PK is an index, and because it's an index it will give optimum performance for queries which retrieve all the keywords for a given snippet

the INDEX, which contains the two columns in the other order, will give optimum performance for queries which retrieve all the snippets for a given keyword

each of them is actually a covering index (search that), so the query will not even access the table rows at all!

Quote:
Originally Posted by Spudhead
... and how you make a primary key out of two fields?
this is just basic syntax, and you do it just like the example shows
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #8 (permalink)  
Old 11-29-07, 10:52
Spudhead Spudhead is offline
Registered User
 
Join Date: Jan 2002
Posts: 189
Ok, cool. I will Google, thankyou. But in the meantime... how do I query it? How do I get a list of snippets that match a given set of keywords, ordered by the number of keywords that each snippet matched?

Code:
SELECT s.id, s.content, COUNT(k.keyword)
FROM snippets s, keywords k
WHERE k.snippet_id = s.id
AND k.keyword IN ('buckets', 'sealions')
GROUP BY s.id, k.keyword
ORDER BY COUNT(k.keyword) DESC
It's nothing like that, is it?
Reply With Quote
  #9 (permalink)  
Old 11-29-07, 11:00
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
actually, it is something like that
Code:
SELECT s.id
     , s.content
     , COUNT(*)
  FROM keywords k
INNER
  JOIN snippets s
    ON s.id =  k.snippet_id
 WHERE k.keyword IN ('buckets', 'sealions')
 GROUP 
    BY s.id
ORDER 
    BY COUNT(*) DESC
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #10 (permalink)  
Old 11-29-07, 12:15
Spudhead Spudhead is offline
Registered User
 
Join Date: Jan 2002
Posts: 189
You guys rock Thank you very much.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On