I currently have this working by passing bucket-loads of data to PHP and then doing all the calculations there. I want (and most likely need) to perform all the calculations using MySQL only. I know that it is possible to write functions and procedures in MySQL but I've never done it before and every time I sit down to get this written I confuse myself into a crying mess of self-disappointment. So....
Explanation of database:
Let's start with a database diagram:
http://i.imgur.com/Th3NL.jpg
You can immediately disregard the
users table. We'll not bother with that in this question.
So this database hold information drawn from web pages. You can see the
links table holds the URL and title, as well as the maximum term frequency for any term in that document.
The
object table is just to allow both users and links to be foreign primary keys in the
occurences table. BTW, I am aware of the constant mis-spelling of occurrences as occurences! =)
The
occurences table holds the majority of data. The table has a primary foreign key of
linkoruser (though just ignore user for this). It has
word_id which refers to the
dictionary table,
tag_id which refers to the
tags table and finally
score which is the number of occurrences of that word, of that tag type, in that document.
The
tags table has a tag id, the name of a tag (eg. title, or H1) and its
weight which is normally set between 1 and 10.
The
dictionary table has a word id, the actual word, and the inverse document frequency of that word. If you care what an IDF is then it's the number of web pages, divided by the number of web pages containing that specific word somewhere.
Explanation of the system's function:
Given a link ID (we'll call it BaseID), all other links are compared for similarity (cosine similarity) and the IDs, url and titles of all links are shown to the user, ordered by most to least similar.
Scores for each word are:
((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
ie the word's number of occurrences in the document multiplied by the tag weighting for this particular word's tag, divided my the maximum term frequency for that document. Then that multiplied by the word's inverse document frequency to give you the final score for that word, in that context.
The similarity is done via cosine similarity, best show in the image below.
http://i.imgur.com/n3Bp8.jpg
So each page has it's similarity to the BaseID, and the system shows the user all the page IDs, URLs and Titles, ordered by similarity from most to least similar.
Explanation of the problem:
Currently massive amounts of data is dragged out of MySQL and handed over to PHP which then does all the processing.
This is because I only know my way around basic MySQL. I know it is possible to use functions and procedures, but I am getting mighty tangled.
What I would like to have is for me to pass a BaseID to MySQL and it return the page IDs, URLs and Titles, ordered by similarity from most to least similar.
I know this is a massive question, and nobody's getting paid to sit here and churn out solutions. So I really appreciate that you've even read this far down!
Thanks in advance!
P.S. If you want a download of the database so you can have a fiddle, it's available here:
http://dl.dropbox.com/u/22289145/linksitev2.sql