Results 1 to 3 of 3
  1. #1
    Join Date
    Jan 2012
    Posts
    2

    Unanswered: Perform complex calculations entirely within MySQL

    I currently have this working by passing bucket-loads of data to PHP and then doing all the calculations there. I want (and most likely need) to perform all the calculations using MySQL only. I know that it is possible to write functions and procedures in MySQL but I've never done it before and every time I sit down to get this written I confuse myself into a crying mess of self-disappointment. So....

    Explanation of database:
    Let's start with a database diagram:
    http://i.imgur.com/Th3NL.jpg

    You can immediately disregard the users table. We'll not bother with that in this question.

    So this database hold information drawn from web pages. You can see the links table holds the URL and title, as well as the maximum term frequency for any term in that document.

    The object table is just to allow both users and links to be foreign primary keys in the occurences table. BTW, I am aware of the constant mis-spelling of occurrences as occurences! =)

    The occurences table holds the majority of data. The table has a primary foreign key of linkoruser (though just ignore user for this). It has word_id which refers to the dictionary table, tag_id which refers to the tags table and finally score which is the number of occurrences of that word, of that tag type, in that document.

    The tags table has a tag id, the name of a tag (eg. title, or H1) and its weight which is normally set between 1 and 10.

    The dictionary table has a word id, the actual word, and the inverse document frequency of that word. If you care what an IDF is then it's the number of web pages, divided by the number of web pages containing that specific word somewhere.

    Explanation of the system's function:
    Given a link ID (we'll call it BaseID), all other links are compared for similarity (cosine similarity) and the IDs, url and titles of all links are shown to the user, ordered by most to least similar.

    Scores for each word are:
    ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
    ie the word's number of occurrences in the document multiplied by the tag weighting for this particular word's tag, divided my the maximum term frequency for that document. Then that multiplied by the word's inverse document frequency to give you the final score for that word, in that context.

    The similarity is done via cosine similarity, best show in the image below.
    http://i.imgur.com/n3Bp8.jpg

    So each page has it's similarity to the BaseID, and the system shows the user all the page IDs, URLs and Titles, ordered by similarity from most to least similar.


    Explanation of the problem:
    Currently massive amounts of data is dragged out of MySQL and handed over to PHP which then does all the processing.

    This is because I only know my way around basic MySQL. I know it is possible to use functions and procedures, but I am getting mighty tangled.

    What I would like to have is for me to pass a BaseID to MySQL and it return the page IDs, URLs and Titles, ordered by similarity from most to least similar.

    I know this is a massive question, and nobody's getting paid to sit here and churn out solutions. So I really appreciate that you've even read this far down!

    Thanks in advance!

    P.S. If you want a download of the database so you can have a fiddle, it's available here:
    http://dl.dropbox.com/u/22289145/linksitev2.sql

  2. #2
    Join Date
    Sep 2009
    Location
    San Sebastian, Spain
    Posts
    880
    Just to be clear you are using cosine formula to measure similarity. How is A and B defined in the equation versus the data that you have? Can any of this be done and stored at INSERT or UPDATE? For example, the score for each word? Would this help reduce the recalculation part?

    MySQL functions and stored procedures can be powerful and you can also create UDF's (user defined functions) which allows you to write complex processing in C.
    Ronan Cashell
    Certified Oracle DBA/Certified MySQL Expert (DBA & Cluster DBA)
    http://www.it-iss.com
    Follow me on Twitter

  3. #3
    Join Date
    Jan 2012
    Posts
    2
    Quote Originally Posted by it-iss.com View Post
    Just to be clear you are using cosine formula to measure similarity. How is A and B defined in the equation versus the data that you have? Can any of this be done and stored at INSERT or UPDATE? For example, the score for each word? Would this help reduce the recalculation part?

    MySQL functions and stored procedures can be powerful and you can also create UDF's (user defined functions) which allows you to write complex processing in C.
    Using procedures and functions is definitely what I want to do, but it's is entirely new territory for me and I can't even get it clear in pseudocode.

    Similarity is as follows:

    Words = All words that occur in either DOC1 or DOC 2
    For each word as N
    DOC1's score for word N = ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
    DOC2's score for word N = ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
    -this is repeated for all words and those scores for each word are what are used in the cosine similarity.
    In the example I showed, the words that are in either doc are abandon, plane, kill, survive and people.
    http://i.imgur.com/n3Bp8.jpg
    The scores for DOC1 on those words are the set running 0.11 -> 0.65
    The scores for DOC2 on those words are the set running 0.00 -> 0.09

    I have done all I can do during insert and update. No other pre-calculation can be done because of the nature of the uncalculated variables.

    I hope this answers what you were asking and that you can be of further help to me!

    Thank you,

    Drew

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •