  1. #1
    Join Date
    Dec 2005
    Posts
    74

Best database design

    Introduction
Hello, this is my first post on this forum. I am not new to databases, but I have never dealt with complex designs or used a powerful RDBMS. I currently have a task that isn't very complex at all, but it will be part of a very important software package, and I don't want to get it wrong. This is also the first time I will be programming all the data access operations into the software code instead of using RDBMS tools - which also happens to be my first encounter with .Net technologies.

    Background
I am trying to expand a legacy software package to support database access. Namely, I want to replace its current data-logging capability - saving a single ASCII text file containing one table of data with some free-style headers - with a more powerful, versatile, and user-friendly database format, hopefully without having to implement code for each native provider. Our customers have various restrictions and I cannot limit them to a specific DBMS - some will want an Excel spreadsheet, others an Access *.mdb, while there certainly will be those wanting a full-blown SQL server. Not to mention the few who will expect it to connect to their corporate Oracle server... Thus, I plan to use ODBC to provide access to multiple RDBMSs.

The existing product is a nasty mixture of 3-generations-old LabVIEW code, plain C with Windows API calls, and C++ with MFC. Now I'm about to add some .Net stuff to complete the mess. There are currently three separate processes communicating with each other through various inter-process communication methods, and I'll be adding a fourth. It looks ugly, but once you consider the alternatives, it's not nearly as bad.

    My question
The challenge is in structuring the new database, and that is where I would like your input. Briefly, there is a conflict between having one large, sparse table (i.e., one with a lot of NULL values in it) and having a large number of smaller tables in the database. Here are the details. The application periodically generates data that can be represented as one string plus one table with many rows (say, from hundreds to millions). The good news is that all the columns in the table can be represented with the same numeric type (the first two are strings, but they can be converted). The bad news is that the number of columns and their heading names will vary. Column sets are likely to recur, i.e., a specific schema will occur multiple times, but the number of distinct schemas is not limited.

With such a task I appear to have a choice: either create one large data table, with enough columns to fit the largest single data set it would ever need to store, or create an index table and a new data table every time a new schema is encountered. The former method has two problems that I can see. First, it will take a lot of space (plenty of NULL values). Second, it is very difficult to pick a good number of columns, considering you never know how many columns will be generated at run time. The latter solution has the drawbacks of creating a large number of tables in the database, maintaining complex relationships, and searching multiple tables when writing new data to identify the correct spot for it.

    What do you guys think is the best approach: one large table, or many smaller tables? Or do you see another solution? Thank you in advance.

    Kamen

  2. #2
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
Yikes! There is no "best" solution given only that information. There are a good many variables to consider before choosing a schema. I guess the first question is: what does this information represent, in general? Is it something that lends itself well to a relational design at all? Would you be better off storing it as an XML stream (which lends itself well to .NET transport)? What information do the clients need? How do they need to get it?
    oh yeah... documentation... I have heard of that.

    *** What Do You Want In The MS Access Forum? ***

  3. #3
    Join Date
    Dec 2005
    Posts
    74

    More details

    All right, here are the details. The product is a control and data acquisition system. Each time a test is run, a number of system "channels" are stored to a log. In the expanded version - the one with the database usage that I'll be working on now - different channels may be stored at different sampling rates. As it is now, there are a few lines of text ("comments") stored in the ASCII log as a header (at the top), then there is a line with the channel names, then - a line with the channel units (all strings), and finally - the channel data: all single-precision numbers, with the exception of the first two (DATE and TIME), which are represented as strings. Since the sampling rate is identical for all channels, a single table can hold all of the data, with no duplicates. However, each different log file may be represented by a table with a different number of channels, and/or with different names.
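For illustration, here is a made-up example of what a log currently looks like (the channel names and values are just examples):

    Test stand 3; operator KG
    Inertia correction enabled
    DATE        TIME          speed    torque    oil pressure
    -           -             rpm      Nm        kPa
    12/07/05    14:03:22.1    1500.0   243.7     412.0
    12/07/05    14:03:23.1    1502.5   244.1     411.8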

    What I want to do is store all of this in a database instead of a bunch of text files. I could create a massive table with the first column being the session ID, then "comments", then a big string containing the "schema", e.g., a string of concatenated "channel name - channel unit - sample rate" fields. Then, I could have the channel data - enough columns to fit the maximum expected number of channels, thus having a ton of empty entries. Another problem is that the first few columns will repeat the same data many times.
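To make that concrete, it would be a single table along these lines (the names and the 200-channel cap are made up for illustration):

    CREATE TABLE tblLog (
        session_id INTEGER,       -- repeats on every row of a session
        comments   VARCHAR(255),  -- repeats on every row as well
        schema_str VARCHAR(255),  -- "name-unit-rate;name-unit-rate;..."
        log_date   VARCHAR(10),
        log_time   VARCHAR(12),
        ch001      REAL,
        ch002      REAL,
        -- ... up to whatever maximum number of channels I guess at
        ch200      REAL
    );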

Alternatively, I could create one fixed "index" table, which contains the "session ID" and the "schema ID". Another table would contain the actual schemas and link to the previous table by the schema ID. Then, with every session, an entry would be made in the schema table, unless a matching one already exists. Similarly, a whole new "data" table would be created every time a new schema is generated, and the channel data would be stored there. The link would be by session ID, but you'd have to find the right table first. The trouble here is that a complex search spanning multiple tables would have to be performed with each session, and especially when querying the database for data. Then there is the issue of potentially having thousands of tables in the database.
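Concretely, the fixed part of this design might look like this (all names made up):

    CREATE TABLE tblSessionIndex (
        session_id INTEGER PRIMARY KEY,
        schema_id  INTEGER,        -- which schema (and data table) this session uses
        comments   VARCHAR(255)
    );

    CREATE TABLE tblSchema (
        schema_id  INTEGER PRIMARY KEY,
        signature  VARCHAR(255),   -- e.g. "speed-rpm-1;torque-Nm-1;..."
        table_name VARCHAR(64)     -- the dynamically created data table
    );

    -- created on the fly whenever a new schema shows up:
    CREATE TABLE tblData0042 (
        session_id   INTEGER,
        log_date     VARCHAR(10),
        log_time     VARCHAR(12),
        speed        REAL,
        torque       REAL,
        oil_pressure REAL
    );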

    The expected client searches would at least need to yield data similar to what the original log files held. Of course, now that the data is in a nice database, they could do more. What exactly - nobody knows, whatever they please!

    I hope this makes the task clearer and helps in deciding which approach is best.

    Kamen
    Last edited by KamenG; 12-07-05 at 22:09.

  4. #4
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    Quote Originally Posted by KamenG
Similarly, a whole new "data" table would be created every time a new schema is generated,
You had me right up to this point. Why would a whole new table need to be generated for each sample? Does the format of the data being stored change?
    oh yeah... documentation... I have heard of that.

    *** What Do You Want In The MS Access Forum? ***

  5. #5
    Join Date
    Dec 2005
    Posts
    74
    Quote Originally Posted by Teddy
You had me right up to this point. Why would a whole new table need to be generated for each sample? Does the format of the data being stored change?
Yes. For instance, the number of channels could easily change (not to mention their names and units, which are just headings/schema, but still). In addition, the sample rates could change. For example, one session could have (on top of the standard date/time) the following channels: "speed", "torque", "horsepower", "oil pressure". Then another could drop "torque" and replace it with "raw torque" and "inertia-corrected torque". To make things more complicated, the sampling rates may differ: in another test, "drive current" and "drive voltage" channels may be added and sampled at 100 times the frequency of the other channels. You wouldn't want those in the same table, so as you see, there may even be incentives to have multiple data tables within the same test/session.
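Concretely, a single session's data could end up split by rate into something like this (table and channel names made up):

    CREATE TABLE tblData_1Hz (
        session_id INTEGER,
        log_date   VARCHAR(10),
        log_time   VARCHAR(12),
        speed      REAL,
        torque     REAL
    );

    CREATE TABLE tblData_100Hz (
        session_id    INTEGER,
        log_date      VARCHAR(10),
        log_time      VARCHAR(12),
        drive_current REAL,
        drive_voltage REAL
    );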
    Kamen

  6. #6
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    I see...

How about a table to define meta-data? I've designed systems somewhat similar to what you're talking about, wherein the actual information being tracked varied from project to project, but the client wanted to use the same database structure for ALL information.

    I created a set of tables to define the information being tracked. For instance:

    tblMeta
    ---------
    meta_id
    project_id (this could be channel id for you)
    description
    unit_of_measure

    tblAnalysis
    ----------
    analysis_id
    meta_id
    analysis_value

That is an overly simplified version of the actual implementation, but it is enough to convey the basic idea. Entries are made in tblMeta to define the attributes of a given value being tracked; the actual values then get pushed into tblAnalysis. Naturally, you would probably need a few more tables to group your meta definitions into a single band, but hopefully you get the general idea...
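In rough DDL terms, that would be something like this (column types are guesses, adjust to taste):

    CREATE TABLE tblMeta (
        meta_id         INTEGER PRIMARY KEY,
        project_id      INTEGER,       -- this could be channel id for you
        description     VARCHAR(100),  -- e.g. 'torque'
        unit_of_measure VARCHAR(20)    -- e.g. 'Nm'
    );

    CREATE TABLE tblAnalysis (
        analysis_id    INTEGER PRIMARY KEY,
        meta_id        INTEGER,        -- which definition this value belongs to
        analysis_value REAL
    );

Every sample becomes one row in tblAnalysis, so there are no NULL-padded columns and no dynamically created tables; a new channel is just a new row in tblMeta.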
    oh yeah... documentation... I have heard of that.

    *** What Do You Want In The MS Access Forum? ***

  7. #7
    Join Date
    Dec 2005
    Posts
    74

    One more question

All right, I guess you are saying that multiple tables are better than one huge, sparse table - which is what I thought, too. My only concern - and you could actually be very helpful here, being an Access guru and everything - is how the Access MDB format (and the MS Jet ODBC driver, I guess) scales as the number of tables grows relative to the total data in the file. For example, if you will be writing (and then querying) 1 GB of data in the MDB, is it better to have, say, 1 table with 1,000 columns, where each field is 10 bytes and the table has 100,000 rows (that does add up to 1 GB, right?), or 10,000 tables with an average of 50 columns and 200 rows each?

My concern isn't so much about querying but mostly about writing, since that will be a real-time process. Actually, my writing would involve querying, since every time a session begins, the system would need to find out whether the "data schema" matches an existing one, so that one of the "dynamic" tables can be reused (appended to) - unless I decide to create a new table for every session, in which case I would easily exceed the 10,000 tables in the example above. Even if 10,000 tables turns out to be a bit too much for Access, I assume a desktop edition of a database server (e.g., MS SQL Express) could handle it. I just don't want to spend a lot of effort designing a fancy structure only to find out that as soon as it starts filling up, it slows down to a crawl or takes the system down. Any insights will be appreciated.
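For what it's worth, the lookup I have in mind at session start would be a single query against the (made-up) tblSchema from my earlier post:

    -- does a matching schema already exist?
    SELECT schema_id, table_name
    FROM tblSchema
    WHERE signature = 'speed-rpm-1;torque-Nm-1;oil_pressure-kPa-1';
    -- no row back => create a new data table and register it in tblSchema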
    Kamen
