Page 1 of 2 12 LastLast
Results 1 to 15 of 25
  1. #1
    Join Date
    Apr 2007
    Posts
    130

    Unanswered: Which 'select' statement for querying is more efficient?

    Hello ALL,


    There are two tables in my database. Table DocDb has three fields, Doc_ID, representing the number of a Document, Word, representing words appearing in the document, Occ_Freq, representing the occurrence count of a word.

    Another table ExampleDb has two fields, Word and Occ_Freq, having the same meaning as in the table DocDb. Word represents the word appearing in an example document; Occ_Freq represents the count of a word.

    I would like calculate the similarity between the example doc and every doc in the DocDb by summing up occurrence count of matched word, and sort the documents by their similarity scores.

    I have figure out two 'select' statements, but I am not sure which one is more efficient.

    (1)
    char ExampleDb_Word[3][20];
    strcpy(ExampleDb_Word[0], "Water");
    strcpy(ExampleDb_Word[0], "Plant");
    strcpy(ExampleDb_Word[0], "Sun");

    for(i= 0; i < 10000; i++)
    {
    for(j=0; j<3; j++)
    {

    select Occ_Freq from DocDb WHERE Word= ExampleDb_Word[j];

    //do summing up here.

    }
    }
    I suspect this method would be time consuming as there are two for loops.

    (2)
    select DocDb.Occ_Freq, ExampleDb.Occ_Freq from DocDb,ExampleDb WHERE DocDb.Word=ExampleDb.Word


    Would anyone can tell me which method is better and Why?

    any help and suggestion will be appreciated very much.


    Table DocDb

    Doc_ID Word Occ_Freq
    ------------------------------
    1 Water 9
    1 Bottle 6
    1 Milk 3
    2 Lamp 7
    2 Desk 8
    2 Food 5
    3 Water 4
    3 Plant 3
    3 Green 2
    4 House 8
    4 Biulding 2
    4 Renovation 5
    ....... ......... ..
    10000 Sea 7
    10000 Temperatur 6
    10000 Wave 3



    Table ExampleDb
    Word Occ_Freq
    -------------------
    Water 3
    Plant 2
    Sun 2

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    any time you execute thousands of queries to get an answer is going to be ~way~ slower than one query

    i ~think~ you can do this in one query, but i don't know for sure what "summing up occurrence count of matched word, and sort the documents by their similarity scores" means

    perhaps you could illustrate this by continuing your water-plant-sun example and showing us how you actually calculate the similarity of each of the first 4 documents
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Apr 2007
    Posts
    130
    Quote Originally Posted by r937
    any time you execute thousands of queries to get an answer is going to be ~way~ slower than one query

    i ~think~ you can do this in one query, but i don't know for sure what "summing up occurrence count of matched word, and sort the documents by their similarity scores" means

    perhaps you could illustrate this by continuing your water-plant-sun example and showing us how you actually calculate the similarity of each of the first 4 documents

    Doc 1 share one word 'Water' with the example doc, so the similarity score is (9 + 3) = 12. Doc 3 share two words, 'Water' and 'Plant', so the simi score is (4+3) +(3+2) = 12. Other docs have a 0 score as they do not share any word with the example doc. This evaluation method for document similiarity is just for illustration purpose. The actual method would be a complicated one.


    So do you have any further specific suggestions.
    Last edited by cy163; 06-06-08 at 19:51.

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Code:
    SELECT d.Doc_ID
         , SUM(d.Occ_Freq + e.Occ_Freq) AS score
      FROM DocDb AS d
    LEFT OUTER
      JOIN ExampleDb AS e
        ON e.Word = d.Word
    GROUP
        BY d.Doc_ID
    ORDER
        BY score DESC
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    Apr 2007
    Posts
    130
    Quote Originally Posted by r937
    Code:
    SELECT d.Doc_ID
         , SUM(d.Occ_Freq + e.Occ_Freq) AS score
      FROM DocDb AS d
    LEFT OUTER
      JOIN ExampleDb AS e
        ON e.Word = d.Word
    GROUP
        BY d.Doc_ID
    ORDER
        BY score DESC

    Thanks again r937.

    the real picture is a bit complicated. the are another field for the two tables,'TitleWord_YesNo', indicating whether the word is a word in the document title. Matching on words in the document title will be put more emphasis (usually original score will be multiplyed by a coefficient, say 2.0) than matching on non title words. Can I use 'if' statement in a SQL statement.

    Doc 1 share one word 'Water' with the example doc, so the similarity score is (9 + 3) *2.0 = 24, if 'Water' is a title word both in Doc1 and the example doc

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by cy163
    Can I use 'if' statement in a SQL statement.
    yes

    it's called CASE

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  7. #7
    Join Date
    Apr 2007
    Posts
    130
    Quote Originally Posted by r937
    yes

    it's called CASE

    I really learn a lot by communicating with you.

    have a nice weekend

  8. #8
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    ta very much

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  9. #9
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    If you are interested, I have an Access database I set up to do this, for tracking error messages and solutions to other technical issues.
    I found that simply looking for word matches was a poor algorithm, and that I had to weight each word according to the frequency with which it had successfully been matched to prior solutions.
    The math was a fun challenge, which you could probably dig out of the formulas I came up with.
    It actually works pretty dang well.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  10. #10
    Join Date
    Jul 2003
    Location
    San Antonio, TX
    Posts
    3,662
    And the solution is...where is it? Don't be stingy, I post my code (even if it is ugly )
    "The data in a record depends on the Key to the record, the Whole Key, and
    nothing but the Key, so help me Codd."

  11. #11
    Join Date
    Apr 2007
    Posts
    130
    Quote Originally Posted by blindman
    If you are interested, I have an Access database I set up to do this, for tracking error messages and solutions to other technical issues.
    I found that simply looking for word matches was a poor algorithm, and that I had to weight each word according to the frequency with which it had successfully been matched to prior solutions.
    The math was a fun challenge, which you could probably dig out of the formulas I came up with.
    It actually works pretty dang well.

    Could you please present the formulas and more detailed information

  12. #12
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    When matching articles, I parse each word of the articles into a WordList table.
    I calculate the "Selectivity" value (from 0 to 1) of each word based upon how frequently it is used vs how frequently it has successfully matched articles in the past.
    Then I calculate the match between two articles as the sum product of the matching word selectivities. To calculate the sum product, I use LOG functions:

    Match = 1-Exp(Sum(Log(WordSelectivity)))
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  13. #13
    Join Date
    Apr 2007
    Posts
    130
    Quote Originally Posted by r937
    yes

    it's called CASE

    Code:
    SELECT d.Doc_ID
         , SUM(CASE d.TitleWord='1' AND e.TitleWord='1'  THEN 2*(d.Occ_Freq + e.Occ_Freq), 
                  CASE d.TitleWord='0' OR e.TitleWord='0'  THEN d.Occ_Freq + e.Occ_Freq) AS score
      FROM DocDb AS d
    LEFT OUTER
      JOIN ExampleDb AS e
        ON e.Word = d.Word
    GROUP
        BY d.Doc_ID
    ORDER
        BY score DESC
    It seems that the above statement is wrong and does not work. What is the correct way to differentiate cases in the SUM clause based on table fields.

    Actually, what i want to implement is sum up weight * (d.Occ_Freq + e.Occ_Freq), where the coefficient 'weight' depends on a condition expression consisting of some table fields.
    if condition_expression = 1 then weight =2;
    if condition_expression = 2 then weight =3;
    if condition_expression = 3 then weight =4;
    ......
    Last edited by cy163; 08-29-08 at 20:57. Reason: correction

  14. #14
    Join Date
    Apr 2007
    Posts
    130

    How use CASE in a special situation

    Hello,

    I made this thread to learn an efficient way to query a database using SELECT statement. Friend r937 suggested using SUM and CASE. However, I do not exactly know how to use them in my case. So, please take a look and help me.

    Felix.
    Last edited by r937; 08-31-08 at 21:54.

  15. #15
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Code:
    SUM(
    CASE WHEN condition_expression = 1 THEN 2
         WHEN condition_expression = 2 THEN 3
         WHEN condition_expression = 3 THEN 4
         ELSE NULL END
     * (d.Occ_Freq + e.Occ_Freq)  
       )
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •