  1. #1
    Join Date
    Feb 2011
    Posts
    3

    Word frequency database

    Hi,

    I have to build a database of words and their frequencies from a bunch of documents (about a million of them right now).

    After reading each document, I'll send its words (and their frequencies in that document) to the database. If a word already exists (which it mostly will), I update its frequency; otherwise I insert the new word.

    Right now, I just have a two-column table (word varchar(255) pk, count int). Should I use an integer as the pk instead? Or is there some other nice way?
    Since I am a DB novice, I would like to know the best way to structure the database for this task.

    Basically I want to optimize 2 things:
    1) Checking whether the word exists
    2) Inserting or updating the word

    Since the English vocabulary is about 1 million words, I think I'll take that as a safe upper limit for the number of records.

    Any leads would be appreciated.
    Thanks

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by arahant7 View Post
    Right now, I just have a two-column table (word varchar(255) pk, count int).
    Should I use an integer as the pk instead?
    no !!!

    which database system are you using?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Feb 2011
    Posts
    3
    MySQL 5.1, though I can use any database.

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    mysql is fine

    there is a special sql option available in mysql which you can/should take advantage of...

    INSERT ... ON DUPLICATE KEY UPDATE ...

    this lets you simply insert the new word (with a count of 1, or with its frequency in the current document), and if the word already exists, the UPDATE portion of the statement adds that amount to the existing count instead

    neat, eh?
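
A rough sketch of that insert-or-update pattern, for anyone who wants to try it without a MySQL server. It uses Python's standard-library sqlite3 module, so the clause is SQLite's `ON CONFLICT(word) DO UPDATE` (SQLite 3.24 or later) rather than MySQL's `ON DUPLICATE KEY UPDATE`; the MySQL form is shown in a comment. The table name `word_counts`, the helper `add_document`, and the sample sentences are made up for illustration, and the whitespace split is only a placeholder for real tokenizing.

```python
import sqlite3
from collections import Counter

# SQLite stand-in for the MySQL statement discussed above. In MySQL the same
# idea would read roughly:
#   INSERT INTO word_counts (word, count) VALUES (%s, %s)
#   ON DUPLICATE KEY UPDATE count = count + VALUES(count)
UPSERT = """
INSERT INTO word_counts (word, count) VALUES (?, ?)
ON CONFLICT(word) DO UPDATE SET count = count + excluded.count
"""

def add_document(conn, text):
    """Count the words of one document, then upsert every (word, count) pair."""
    freqs = Counter(text.lower().split())  # naive tokenizer, assumption only
    with conn:  # one transaction per document
        conn.executemany(UPSERT, freqs.items())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_counts (word VARCHAR(255) PRIMARY KEY, count INT NOT NULL)"
)

add_document(conn, "the cat saw the dog")
add_document(conn, "the dog barked")

# Collect the table into a dict of word -> total frequency.
totals = dict(conn.execute("SELECT word, count FROM word_counts"))
print(totals)
```

Counting inside the program first means each distinct word costs one statement per document, and there is no separate "does it exist?" query at all.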


  5. #5
    Join Date
    Feb 2011
    Posts
    3
    Actually I was wondering if a 4-byte integer index column might be faster than a variable-length character index.
    I have no reason to believe so, just a doubt.

    P.S. thanks for the INSERT ... ON DUPLICATE KEY UPDATE ... tip

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by arahant7 View Post
    Actually I was wondering if a 4-byte integer index column might be faster than a variable-length character index.
    sure, it might be, but (1) not by a measurable amount, and (2) under exactly what circumstances would you know the integer of a new word before you looked it up? i mean, how would you look up a new word? when would you actually use the integer?
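
To make point (2) concrete, here is a hedged sketch of what the integer-key variant might look like, again in Python with sqlite3 standing in for MySQL. The table `word_counts_by_id` and the helper `bump` are hypothetical. Even with an integer primary key, the word column still needs its own unique index, and every insert-or-update still locates the row by the word text, so the integer never takes part in the lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word_counts_by_id (
        id    INTEGER PRIMARY KEY,           -- surrogate integer key
        word  VARCHAR(255) NOT NULL UNIQUE,  -- still needs a unique index
        count INT NOT NULL
    )
""")

def bump(conn, word, n=1):
    """Add n to the count for word; the conflict target is still the word."""
    with conn:
        conn.execute(
            "INSERT INTO word_counts_by_id (word, count) VALUES (?, ?) "
            "ON CONFLICT(word) DO UPDATE SET count = count + excluded.count",
            (word, n),
        )

bump(conn, "apple")
bump(conn, "pear", 2)
bump(conn, "apple", 4)

# The id column is filled in by the database but never used to find a row.
rows = conn.execute(
    "SELECT id, word, count FROM word_counts_by_id ORDER BY id"
).fetchall()
print(rows)
```

So the table ends up carrying two indexes instead of one, and the word index does all the work.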
