  #1 (permalink)  
Old 02-19-11, 13:23
arahant7 arahant7 is offline
Registered User
 
Join Date: Feb 2011
Posts: 3
Word frequency database

Hi,

I have to build a database of words and their frequencies from a bunch of documents (about a million of them right now).

After reading each document, I'll send its words (and their frequencies in that document) to the database. If a word already exists (which it mostly will), I update its frequency; otherwise I insert the new word.
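
A minimal sketch of the per-document counting step described above, in Python. The tokenizer (lowercasing plus a simple regex) and the function name `count_words` are assumptions for illustration only; the post does not say how the words are extracted.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")

def count_words(text):
    """Return a Counter mapping each word in one document to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))

doc = "The cat sat on the mat. The end."
freqs = count_words(doc)

# One (word, count) pair per distinct word is what would be sent to the database.
rows = sorted(freqs.items())
```

Counting per document first means the database sees one row per distinct word rather than one per occurrence, which should cut the number of statements considerably for long documents.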

Right now, I just have a 2-column table (word varchar(255) pk, count int). Should I use an integer as pk instead? Or is there some other nice way?
Since I am a DB novice, I would like to know the best way to structure the database for this task.

Basically I want to optimize 2 things:
1) The check for whether the word exists
2) The insert or update of the word

Since the English vocabulary is about 1 million words, I think I'll take that as a safe upper limit for the number of records.

Any leads would be appreciated.
Thanks
  #2 (permalink)  
Old 02-19-11, 13:57
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
Quote:
Originally Posted by arahant7
Right now, I just have a 2-column table (word varchar(255) pk, count int).
Should I use an integer as pk instead?
no !!!

which database system are you using?
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
  #3 (permalink)  
Old 02-19-11, 14:29
arahant7 arahant7 is offline
Registered User
 
Join Date: Feb 2011
Posts: 3
MySQL 5.1, though I can use any database.
  #4 (permalink)  
Old 02-19-11, 14:50
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
mysql is fine

there is a special sql option available in mysql which you can/should take advantage of...

INSERT... ON DUPLICATE KEY UPDATE ...

this lets you simply insert the new word (with a count of 1); if it already exists, the UPDATE portion of the statement increments the existing count by 1

neat, eh?
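
A rough sketch of that insert-or-update pattern, for anyone who wants to try it without a MySQL server. It uses Python's built-in `sqlite3` as a stand-in, which is an assumption and not the setup discussed in this thread: SQLite (3.24 or later) spells the same idea `ON CONFLICT ... DO UPDATE`, and the MySQL form is shown in the comment. The table name `word_freq` and the function `add_document_counts` are hypothetical. Since the original question sends per-document frequencies, the sketch adds the incoming count rather than a fixed 1.

```python
import sqlite3

# MySQL form of the statement discussed in this thread:
#   INSERT INTO word_freq (word, count) VALUES ('cat', 1)
#   ON DUPLICATE KEY UPDATE count = count + VALUES(count);
# SQLite (3.24+) equivalent, used here only so the sketch runs standalone:
UPSERT = """
INSERT INTO word_freq (word, count) VALUES (?, ?)
ON CONFLICT(word) DO UPDATE SET count = count + excluded.count
"""

def add_document_counts(conn, counts):
    """Insert new words and add to existing counts, one batch per document."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, list(counts.items()))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_freq (word VARCHAR(255) PRIMARY KEY, count INT)")

# Two documents; "the" appears in both, so its counts are summed.
add_document_counts(conn, {"the": 3, "cat": 1})
add_document_counts(conn, {"the": 2, "dog": 1})

totals = dict(conn.execute("SELECT word, count FROM word_freq"))
```

Either way, the existence check and the write happen in a single statement, so there is no separate SELECT before each INSERT or UPDATE.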

  #5 (permalink)  
Old 02-19-11, 17:43
arahant7 arahant7 is offline
Registered User
 
Join Date: Feb 2011
Posts: 3
Actually, I was wondering whether a 4-byte integer index column might be faster than a variable-length character index.
I have no reason to believe so; it's just a doubt.

P.S. thanks for the INSERT... ON DUPLICATE KEY UPDATE ... tip
  #6 (permalink)  
Old 02-19-11, 22:21
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
Quote:
Originally Posted by arahant7
Actually, I was wondering whether a 4-byte integer index column might be faster than a variable-length character index.
sure, it might be, but (1) not by a measurable amount, and (2) under exactly what circumstances would you know the integer of a new word before you looked it up? i mean, how would you look up a new word? when would you actually use the integer?
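
To illustrate the second point, here is a small sketch of what a surrogate-key variant might look like, again using Python's `sqlite3` purely as a runnable stand-in (an assumption, not the MySQL setup in this thread). The table name `word_freq_by_id` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical surrogate-key variant: an integer id as the primary key.
# A UNIQUE constraint on word is still needed, because every incoming
# row is identified by its word, not by an id.
conn.execute(
    "CREATE TABLE word_freq_by_id ("
    " id INTEGER PRIMARY KEY,"
    " word VARCHAR(255) NOT NULL UNIQUE,"
    " count INT NOT NULL)"
)
conn.execute("INSERT INTO word_freq_by_id (word, count) VALUES ('cat', 1)")

# Finding the id of 'cat' has to go through the index on word anyway.
row = conn.execute(
    "SELECT id, count FROM word_freq_by_id WHERE word = ?", ("cat",)
).fetchone()

# A second insert of the same word is rejected by the index on word,
# not by the integer key.
try:
    conn.execute("INSERT INTO word_freq_by_id (word, count) VALUES ('cat', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

In this layout the integer column adds a second index without removing the string lookup, which is the lookup the workload actually performs.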