I have to build a database of words and their frequencies from a bunch of documents (about a million of them right now).
After each document read, I'll send the words (and their frequency in that document) to the database and if the word already exists (which it mostly will), I update the frequency, else I insert the new word.
Right now, I just have 2 column table (word varchar(255) pk, count int). Should I use the an integer as pk instead? Or is there some other nice way?
Since I am a DB novice, I would like to know what is the best way to structure the database for this task.
Basically I want to optimize 2 things:
1) The check if the word exists
2) The insert or update of word
Since the English vocabulary is about 1 million, I think I'll take that as the safe upper limit for the number of records.
Actually I was wondering if a 4 byte integer index column might be faster than a variable length character index.
sure, it might be, but (1) not by a measurable amount, and (2) under exactly what circumstances would you know the integer of a new word before you looked it up? i mean, how would you look up a new word? when would you actually use the integer?