  1. #1 UnoriginalGuy (Join Date: Sep 2010; Posts: 3)

    High Capacity Database (overload?)

    I've been tasked with designing what is, on the surface, a fairly simple database which essentially stores two things:
    - A blob of data (between 10 KB and 100 MB each)
    - A set of information about each blob (e.g. Name, Date, Context)

    So far we have two tables: one table of raw data, and a second table that references the first and stores the additional information.
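
    A rough sketch of what those two tables might look like (SQL; the table and column names here are just placeholders):

    Code:
    -- raw data, one row per blob
    CREATE TABLE blob_data (
        id    serial PRIMARY KEY,
        data  bytea NOT NULL                  -- the 10 KB to 100 MB payload
    );

    -- descriptive information, one row per blob, referencing the raw data
    CREATE TABLE blob_meta (
        id       serial PRIMARY KEY,
        blob_id  integer NOT NULL REFERENCES blob_data (id),
        name     text NOT NULL,
        created  timestamp NOT NULL DEFAULT now(),
        context  text NOT NULL
    );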

    Where it gets "complex" is that it needs to store 500,000 to 1,000,000 of these, and frankly I am worried that two tables might collapse under the load.

    Additionally, the total raw data size is over 10 GB, and it might grow larger in the future.

    Now I could split this data by context (I have contextual information) but I would only opt to do that if I felt that it would help. I can either split it four different ways or 70,000 different ways (yes, 70K!).
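
    (For reference, the kind of split I have in mind is simply one copy of the raw-data table per context, along these lines; the context names are made up, and this is exactly the duplication I would rather avoid:)

    Code:
    -- hypothetical split: one raw-data table per context
    CREATE TABLE blob_data_context_a (LIKE blob_data INCLUDING DEFAULTS);
    CREATE TABLE blob_data_context_b (LIKE blob_data INCLUDING DEFAULTS);
    -- ...repeated either 4 times or ~70,000 times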

    Has anyone ever run across a situation like this? Does splitting the data across several tables or even databases help at all?

  2. #2 healdem (Join Date: Nov 2004; Location: out on a limb; Posts: 13,692)
    do you need the blob of data to be stored inside the db?
    in terms of data size, 10 gig isn't excessive by modern standards; there are a good few terabyte DBs out there, it "just" comes down to cost/budget constraints

    as to whether splitting the db makes sense, that comes down to you and your design. if the blobs of data are not similar or related or used by the same people, then you could make a case to split the db into relevant sections. personally I think you should resist that temptation.

  3. #3 pootle flump (Join Date: Feb 2004; Location: One Flump in One Place; Posts: 14,912)
    500,000 blobs of 10 KB is orders of magnitude different from 1,000,000 blobs of 100 MB. To plan capacity you will need to zero in on more accurate numbers.

    I'm afraid I don't know what a collapsing table is.

    What is your RDBMS? Do you have hardware? What is the purpose of the database? Is it pretty static and will only a blob or two at a time be retrieved, or is it a highly transactional system that will have many thousands of blobs affected by each transaction?
    Also, are these blobs currently on a filesystem and, if so, what is the driver to move them into the database? (The reason I ask is that moving the blobs into the database is a design decision that is not explained by your post.)

  4. #4 UnoriginalGuy (Join Date: Sep 2010; Posts: 3)
    Quote Originally Posted by healdem View Post
    do you need the blob of data to be stored inside the db?
    Right now each blob is stored as 1(!) file on the filesystem, and having that many files is causing masses of issues. Doing anything with them carries a huge amount of overhead. Plus backup is almost impossible (it takes three days per cycle).

    We're hoping to reduce the overhead by moving away from files entirely and moving into nice neat tables - or at least that is the theory.
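
    Roughly speaking, each file's bytes would just become a row, something like this (a sketch only; the $n placeholders would be bound by whichever client driver we end up using, and the table names follow the sketch in my first post):

    Code:
    -- application-side load: the raw file bytes go into the bytea column
    INSERT INTO blob_data (data) VALUES ($1);                      -- $1 = file contents
    INSERT INTO blob_meta (blob_id, name, context)
    VALUES (currval('blob_data_id_seq'), $2, $3);                  -- $2, $3 = name, context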

    Quote Originally Posted by healdem View Post
    as to whether splitting the db makes sense, that comes down to you and your design. if the blobs of data are not similar or related or used by the same people, then you could make a case to split the db into relevant sections. personally I think you should resist that temptation.
    In terms of database design, it doesn't make sense to split them. They're all generic. Splitting them would be artificial and would only be done in order to maintain the stability of the database.

  5. #5 UnoriginalGuy (Join Date: Sep 2010; Posts: 3)
    Quote Originally Posted by pootle flump View Post
    500,000 blobs of 10 KB is orders of magnitude different from 1,000,000 blobs of 100 MB. To plan capacity you will need to zero in on more accurate numbers.
    You're absolutely right. But unfortunately it really does vary just that much. Right now we're consuming 10 GB, but in the future I'd expect that to rise. It is very hard to pin down, simply because of the kind of data that comes in.

    Quote Originally Posted by pootle flump View Post
    I'm afraid I don't know what a collapsing table is.
    I guess that is worded wrong. What I want to avoid is having the database throw an exception because the table has exceeded its capacity.

    Quote Originally Posted by pootle flump View Post
    What is your RDBMS?
    Likely Postgres.

    Quote Originally Posted by pootle flump View Post
    Do you have hardware?
    A mid-range dedicated 2008 server.

    Quote Originally Posted by pootle flump View Post
    What is the purpose of the database?
    To expose the data (and metadata) to a web-service which in turn exposes it internally. The web-service would relay the raw data and metadata depending on the request.

    Quote Originally Posted by pootle flump View Post
    Is it pretty static and will only a blob or two at a time be retrieved, or is it a highly transactional system that will have many thousands of blobs affected by each transaction?
    Fairly static. I'd say 90% of them won't change on any one day. Most requests will be for only one blob of data, but for multiple pieces of metadata.

    So, for example, you might have a request that downloads one hundred metadata rows but then only downloads one blob of data as the end result.
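
    In SQL terms that pattern would be roughly this (a sketch, reusing the placeholder table names from my first post):

    Code:
    -- browse: metadata only, the blob column is never touched here
    SELECT m.id, m.name, m.created, m.context
    FROM blob_meta m
    WHERE m.context = 'some_context'
    ORDER BY m.created DESC
    LIMIT 100;

    -- fetch: exactly one blob once the caller has picked a row
    SELECT d.data
    FROM blob_data d
    JOIN blob_meta m ON m.blob_id = d.id
    WHERE m.id = 42;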

    Quote Originally Posted by pootle flump View Post
    Also, are these blobs currently on a filesystem and, if so, what is the driver to move them into the database? (The reason I ask is that moving the blobs into the database is a design decision that is not explained by your post.)
    We have hundreds of thousands of little files that are almost impossible to back up (due to the overhead of moving each file individually). We also want to expose this data as a web-service which, granted, we could do with the file system, but it would be easier to implement if we could write queries against a dataset rather than having to work with the file system's structure.

  6. #6 (Join Date: Nov 2003; Posts: 2,933)
    Quote Originally Posted by UnoriginalGuy View Post
    Likely Postgres.
    I'm pretty confident that Postgres is able to cope with that size, provided you tune it properly and have a decent RAID system (e.g. RAID 10 with a battery-backed controller).

    Quote Originally Posted by UnoriginalGuy View Post
    We have hundreds of thousands of little files that are almost impossible to back up (due to the overhead of moving each file individually). We also want to expose this data as a web-service which, granted, we could do with the file system, but it would be easier to implement if we could write queries against a dataset rather than having to work with the file system's structure.
    While I usually prefer to store BLOBs in the database, don't forget that backing up the database will be quite time-consuming due to its size. Doing incremental backups with Postgres will require you to set up WAL archiving (which is probably a good idea anyway...)
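
    A minimal sketch of what switching on WAL archiving looks like in postgresql.conf (the archive path is only a placeholder, and from 9.0 on you also set wal_level):

    Code:
    # postgresql.conf
    # wal_level = archive                            # needed on 9.0 and later
    archive_mode    = on
    archive_command = 'cp %p /mnt/backup/wal/%f'     # copy each completed WAL segment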

    The normal pattern for dealing with such a large number of files in the file system is to distribute them over several directories, e.g. by hashing the filename or some other attribute.
