If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > Perform complex calculations entirely within MySQL

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 01-23-12, 09:11
PikaD PikaD is offline
Registered User
 
Join Date: Jan 2012
Posts: 2
Perform complex calculations entirely within MySQL

I currently have this working by passing bucket-loads of data to PHP and then doing all the calculations there. I want (and most likely need) to perform all the calculations using MySQL only. I know that it is possible to write functions and procedures in MySQL but I've never done it before and every time I sit down to get this written I confuse myself into a crying mess of self-disappointment. So....

Explanation of database:
Let's start with a database diagram:
http://i.imgur.com/Th3NL.jpg

You can immediately disregard the users table. We'll not bother with that in this question.

So this database hold information drawn from web pages. You can see the links table holds the URL and title, as well as the maximum term frequency for any term in that document.

The object table is just to allow both users and links to be foreign primary keys in the occurences table. BTW, I am aware of the constant mis-spelling of occurrences as occurences! =)

The occurences table holds the majority of data. The table has a primary foreign key of linkoruser (though just ignore user for this). It has word_id which refers to the dictionary table, tag_id which refers to the tags table and finally score which is the number of occurrences of that word, of that tag type, in that document.

The tags table has a tag id, the name of a tag (eg. title, or H1) and its weight which is normally set between 1 and 10.

The dictionary table has a word id, the actual word, and the inverse document frequency of that word. If you care what an IDF is then it's the number of web pages, divided by the number of web pages containing that specific word somewhere.

Explanation of the system's function:
Given a link ID (we'll call it BaseID), all other links are compared for similarity (cosine similarity) and the IDs, url and titles of all links are shown to the user, ordered by most to least similar.

Scores for each word are:
((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
ie the word's number of occurrences in the document multiplied by the tag weighting for this particular word's tag, divided my the maximum term frequency for that document. Then that multiplied by the word's inverse document frequency to give you the final score for that word, in that context.

The similarity is done via cosine similarity, best show in the image below.
http://i.imgur.com/n3Bp8.jpg

So each page has it's similarity to the BaseID, and the system shows the user all the page IDs, URLs and Titles, ordered by similarity from most to least similar.


Explanation of the problem:
Currently massive amounts of data is dragged out of MySQL and handed over to PHP which then does all the processing.

This is because I only know my way around basic MySQL. I know it is possible to use functions and procedures, but I am getting mighty tangled.

What I would like to have is for me to pass a BaseID to MySQL and it return the page IDs, URLs and Titles, ordered by similarity from most to least similar.

I know this is a massive question, and nobody's getting paid to sit here and churn out solutions. So I really appreciate that you've even read this far down!

Thanks in advance!

P.S. If you want a download of the database so you can have a fiddle, it's available here:
http://dl.dropbox.com/u/22289145/linksitev2.sql
Reply With Quote
  #2 (permalink)  
Old 01-23-12, 13:36
it-iss.com it-iss.com is offline
Registered User
 
Join Date: Sep 2009
Location: San Sebastian, Spain
Posts: 620
Just to be clear you are using cosine formula to measure similarity. How is A and B defined in the equation versus the data that you have? Can any of this be done and stored at INSERT or UPDATE? For example, the score for each word? Would this help reduce the recalculation part?

MySQL functions and stored procedures can be powerful and you can also create UDF's (user defined functions) which allows you to write complex processing in C.
__________________
Ronan Cashell
Senior Oracle/MySQL DBA
http://www.it-iss.com
Reply With Quote
  #3 (permalink)  
Old 01-23-12, 13:55
PikaD PikaD is offline
Registered User
 
Join Date: Jan 2012
Posts: 2
Quote:
Originally Posted by it-iss.com View Post
Just to be clear you are using cosine formula to measure similarity. How is A and B defined in the equation versus the data that you have? Can any of this be done and stored at INSERT or UPDATE? For example, the score for each word? Would this help reduce the recalculation part?

MySQL functions and stored procedures can be powerful and you can also create UDF's (user defined functions) which allows you to write complex processing in C.
Using procedures and functions is definitely what I want to do, but it's is entirely new territory for me and I can't even get it clear in pseudocode.

Similarity is as follows:

Words = All words that occur in either DOC1 or DOC 2
For each word as N
DOC1's score for word N = ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
DOC2's score for word N = ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
-this is repeated for all words and those scores for each word are what are used in the cosine similarity.
In the example I showed, the words that are in either doc are abandon, plane, kill, survive and people.
http://i.imgur.com/n3Bp8.jpg
The scores for DOC1 on those words are the set running 0.11 -> 0.65
The scores for DOC2 on those words are the set running 0.00 -> 0.09

I have done all I can do during insert and update. No other pre-calculation can be done because of the nature of the uncalculated variables.

I hope this answers what you were asking and that you can be of further help to me!

Thank you,

Drew
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On