The project we're working on is a document retrieval system
I have 3 tables as follows. Table 1 contains words from a example doc. Table 2 contains words from documents in a database. Table3 contain only one column DocID.
I would like to search the database to find those docs similar to the example doc by calculating similarity score between the example doc and each doc in the database. Specifically, the simi score is calculated by adding up the sum of word frequencies of all matched word. Additionally, I am only interested in the those documents whose DocID appear in table3.
DocID Word Freq
1 Book 3
1 English 2
1 Math 1
2 Desk 2
2 Machine 5
2 Power 2
3 Desk 3
3 Teacher 3
3 Class 2
4 Building 1
4 Tower 2
I am not sure the following statement is correct or not
DocID, table1.Freq+ table2.Freq
table1, table2, table3
table1.Word = table2.Word AND table2.DocID = table3.DocID;
Ok, I will try my statement and report the result here tomorrow.
But, as you know, in some cases, a statement is not necessarily CORRECT even though it works well with simple man-made testing data. On the other hand, for real life large databases,it is often difficult for one to have a correct answer which can be used as criterion to evaluate a query result. If this happens, it is definately a nightmare for a project. Hence, I often visit this forum to post my own statements to seek help from experts like you.
I judged that the statement may be correct by testing it with the small dataset. I am still not sure whether it is REALLY correct for real data ( I cannot know the correct answer for real data in advance). Also, I am not sure it is the best one in terms of efficiency. SO, I would like have your opinion.
Why don't you just expand your test data to include some of the issues you're worried about. If your current data tests nothing then it's hardly test data. Then run your code on this new test data to see if it works - if it doesn't then you know you have something to fix. The good thing about having better test data is that you can use it going forwards to test any future SQL code you might write.