Hi,
I have to build a database of words and their frequencies from a bunch of documents (about a million of them right now).
After each document read, I'll send the words (and their frequency in that document) to the database and if the word already exists (which it mostly will), I update the frequency, else I insert the new word.
Right now, I just have 2 column table (word varchar(255) pk, count int). Should I use the an integer as pk instead? Or is there some other nice way?
Since I am a DB novice, I would like to know what is the best way to structure the database for this task.
Basically I want to optimize 2 things:
1) The check if the word exists
2) The insert or update of word
Since the English vocabulary is about 1 million, I think I'll take that as the safe upper limit for the number of records.
Any leads would be appreciated.
Thanks