Results 1 to 13 of 13
  1. #1
    Join Date
    Sep 2012
    Posts
    9

    SQL Data Redundancy vs Query Performance

    Imagine a site like Stackoverflow, where there are millions of users. Each user has his own data in the server, like questions he asked, answers he gave, reputation, edits, bounties, etc. Now the server should have a master database which contains, say, all questions asked in all tags combined, where each row might have the associated tag-id to identify which tags it relates to. What I want to know is how to store the individual info for a user, say for the record of all questions asked by him as seen on his user-page, do I query the master-table for all questions asked by that particular user, or do I create a separate table for each user, create a trigger in the master table that whenever a new record is added, a copy of the record (or perhaps, only a portion, like the question-id and the tag-id) should be updated in the individual user questions table. Then that way, whenever the records for that user have to be retrieved, I can join that table with the master table to fetch the relevant records.

    So which method would be better? The first one is direct, there is no data redundancy, but the number of questions might exceed billions, it may be be time and resource consuming to query the entire table for 50 records. On the other hand, the 2nd method will perhaps reduce the time because the records can be fetched directly (though the join may take some time, I think it won't be as much as the time taken to query the master table), but will involve substantial data redundancy (even when only one or two keys are stored) when the number of users become huge.

  2. #2
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    separate user table for each user is a no no, a stupid idea.
    I'd rather be riding on the Tiger 800 or the Norton

  3. #3
    Join Date
    Sep 2012
    Posts
    9
    Then what to do? Creating just a master table for each forum will take heavy time to query, because it will have gargantuan amounts of data.

  4. #4
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    why do you think it will take a heavy time to query?
    use of an appropriate index will require additional processing but hardly gargatuam...
    I'd rather be riding on the Tiger 800 or the Norton

  5. #5
    Join Date
    Sep 2012
    Posts
    9
    Well, a forum will have a large number of topics, say 200. Each forum will have lots of questions by lots of users. So won't that data set become huge in no time? Creating a separate table for each user will reduce that time drastically. Add to it the fact that if the page wants to deliver auto-updation (perhaps whether any question asked by me has just been answered, or somebody has posted a question in a topic I am following), the best bet would be long-polling, say at an interval of 1 seconds, upto a mzximum of 30 seconds before sending the data (if any) to the client. If the query itself takes substantial time, won't this updation also suffer? What is the issue with having tables for each user?

  6. #6
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    have you got any evidence to back up your assertion
    Creating a separate table for each user will reduce that time drastically
    if not get some metrics. create some table(s) populate with data, run soem tests that either prove or disprove your assertion.

    Bear in mind this forum has nearly 300,000 registered users and over 2000 "active" users.
    What happens if you, say, want to search existing topics to find if someone has already started a similar topic.

    Get some evidence to prove your assumptions.
    I'd rather be riding on the Tiger 800 or the Norton

  7. #7
    Join Date
    Sep 2012
    Posts
    9
    Obviously. If there are a total of 300,000 topics, only a small number, say, 300-1000 questions may have been posted by a user on average. The tables are populated automatically, since when the question is submitted by the user, it gets to this table first, and then through a succession of triggers, gets populated in the higher tables also until it reflects in the master-table. So won't querying this table to display a list of the questions asked by him be much faster than querying the entire table?

  8. #8
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    No. Not much faster.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  9. #9
    Join Date
    Sep 2012
    Posts
    9
    Why not? Stackoverflow has say 2 billion posts (it's possible. At least 20 questions are posted every second!), won't it be significantly faster to query the 500 posts for a user instead of all the 2 billion rows?

  10. #10
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    do you understand the reason for indexes
    do you understand how use of indexes affects how the query is processed
    I'd rather be riding on the Tiger 800 or the Norton

  11. #11
    Join Date
    Sep 2012
    Posts
    9
    Yes. I do.

  12. #12
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    ..show why persist in the one table per user 'design' paradigm
    I'd rather be riding on the Tiger 800 or the Norton

  13. #13
    Join Date
    Sep 2012
    Posts
    9
    Ok. Thanks.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •