If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > General > Database Concepts & Design > More columns vs more tables?

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 01-14-07, 13:42
Narwe Narwe is offline
Registered User
 
Join Date: Jan 2007
Posts: 24
More columns vs more tables?

I'm building a large chemical database that will store molecular structures and all descriptive data (descriptors) for each structure. There are a theoretically infinate number of descriptors but I will likely be limited to no more than 1 or 2 thousand (undetermined as of yet). My database will eventually grow to include millions of molecular structures. As you can see, I'm talking about A LOT of data and I'm really not sure how to handle it. Obviously, creating a "descriptors" table with hundreds or thousands of columns isn't going to cut it. And creating a separate table for each descriptor leaves me with some impressively clunky JOIN statements when I need all of the descriptors for a particular selection of molecules. I could organize the descriptors into arbitrary collections and create a table for each collection, but that creates its own issues.

Any tips or advice? Are there any cheminformatics people around that might be able to help me out?

Last edited by Narwe; 01-14-07 at 14:09. Reason: grammar
Reply With Quote
  #2 (permalink)  
Old 01-14-07, 23:09
blindman blindman is offline
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 11,726
The majority of us are not chemists, (though Pat seems to know a little about everything), so you will need to be more descriptive of the entities and relationships involved.
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
Reply With Quote
  #3 (permalink)  
Old 01-15-07, 00:00
Narwe Narwe is offline
Registered User
 
Join Date: Jan 2007
Posts: 24
For each chemical structure (which is indexed and stored in its own table as a oracle blob) there are hundreds of derived properties. Each of these properties is derived from the specific structure of the compound. For example, the weight of the molecule, its number of atoms, its number of hydrogens, its number of bonds, all are properties derived from the structure of a particular compound. For every compound in my database these properties need to be determined and stored. I'll be dealing with hundreds of thousands of molecules to start, and that number will only continue to grow.

It's worth emphasizing that for each compound there is obviously only one weight, one number of atoms, one number of hydrogens, one number of bonds, etc, so there is a 1 to 1 relationship between each compound and each of its properties. The difficulty is in designing a database whereby retrieval of these properties can be made as efficient as possible. The database actually contains many more tables, but it's the these chemical properties/descriptors that are giving me a hard time.

Last edited by Narwe; 01-15-07 at 00:02. Reason: wording
Reply With Quote
  #4 (permalink)  
Old 01-15-07, 04:38
healdem healdem is offline
Jaded Developer
 
Join Date: Nov 2004
Location: out on a limb
Posts: 9,246
I would of thought that standard properties, ie those common to all compounds could be stored in the main table. Im thinking of things like weight etc.

The rest will probably best best modelled as a generalised property table, with foreign keys in both the compound AND a new entitiy say "PropertyType". That design is getting almost close to the favoured object property odel advocated by OOB enthusiasts.. wonder if there is a true OOdB whihc may work for this application

However It may be smarter if you are going to set up a child table to place any properties there. afterall not all compounds will have Hyrdogen in

Not sure of your application but presumably you may also have problems with molecules that exhibit chiral variations

I dont think you should have a significant join problem. there may well be an issue int he presentation layer of how you flatten the properties, so youmay need some form of sort sequence or property grouping to be include in PropertyType
__________________
I'd rather be riding my Versys or my Tiger 800 let alone the Norton
Reply With Quote
  #5 (permalink)  
Old 01-15-07, 08:49
Narwe Narwe is offline
Registered User
 
Join Date: Jan 2007
Posts: 24
Thanks for the reply.

All of the properties we are using are present for all of the structures in our database. In fact we only use properties that apply to all of our compounds. We have hundreds of these properties, and more are possible. So there is not really any non arbitrary way that I can divide certain of these properties into the main table, and others be stored seperately.

My first approach was to create a generalized property table. I was scared off by the sheer number of rows created in this table based only on the first 256,000 structures I initially processed, which led to the table being 230400000 rows long with 900 properties per structure. I will eventually have millions of structures, so this approach is a little unnerving. However, I'm not an experienced database designer, so I'm really not even entrely sure if my fear is justified.

At the end of the day I'm looking for the most efficient structure that can be realized for such a large amount of data. If it's a damned if you do, damned if you don't kind of scenario, should I incorporate all of my descriptors into a single table, or should I break them up into some reasonable number of tables?

So, to reiterate, here are my fears:

Creating a table with too many columns (hundreds of properties or more).
Creating a table with too many rows (generalized property table)
Creating too many tables (creating a table for each property, or a creating table that holds several properties in an arbitrary collection)

Last edited by Narwe; 01-15-07 at 08:56.
Reply With Quote
  #6 (permalink)  
Old 01-15-07, 12:48
loquin loquin is offline
Super Moderator
 
Join Date: Jun 2004
Location: Arizona, USA
Posts: 1,797
I was involved in a similar discussion at a sister site.

http://www.xtremevbtalk.com/showthread.php?t=217019


Now, personally, I would think that you would want to set up a table of derived properties, which would have a many-to-many relationship with the chemical_structure table.

The many-to-many relationship is created in a relational database by using an intersection table. The intersection table would hold the foreign key to the chemical structure, the foreign key to the derived_properties table, as well as the specific quantity associated with this particular structure/properties link.

For instance, lets address good old Water.

The chemical structure entry might be
Code:
id   Description
172  Water
(note - there would child tables holding the actual chemical structure - ref the link above)

The derived_properties entries could include
Code:
id    derived property
17    Hydrogen Quantity
24    Number of Bonds
25    Number of Atoms
27    Atomic Weight
The intersection table, (call it derived_chemical_properties) could have

Code:
id_cs  id_dp   Qty
172    17      2         ( water - hydrogen quantity)
172    24      2         ( water - Number of bonds quantity)
172    25      3         ( water - Number of Atoms quantity)
172    27      18.02     ( ??? I THINK that's close - it's been a long while since I cracked open a chemistry reference!)
Note that the primary key of the intersection table is a compound key, consisting of BOTH of the foreign keys. This means that you can have as many references to water as needed, you can have as many references to derived properties as needed, but the combination of chemical and property are unique. As it should be. Note that the description shown to the right of the quantity field in the intersection table is there for our explanation only. In the real system, this description can be retrieved from the chemical table's description and the derived quantity tables' description fields as a part of any join when you're retrieving data, if needed

The beauty of a many-to-many relationship is that a single entry in the derived properties table ("Hydrogen Quantity", for instance) can be referenced an infinite number of times by an infinite number of chemicals, but it can only be referenced once by any given chemical.

Note that you will want to keep the information to a minimum in the intersection table, as the number of rows in this table WILL be large - one for each distinct chemical-derivedproperty combination. But, you can keep the database structure clean, and the size to a minimum, as there will only be one distinct entry for each chemical in the chemical structure table, and one entry for each distinct derived property in the derived property table.
__________________
Lou
使大吃一惊
"Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
"I have my standards. They may be low, but I have them!" - Bette Middler
"It's a book about a Spanish guy named Manual. You should read it." - Dilbert


Last edited by loquin; 01-15-07 at 13:09.
Reply With Quote
  #7 (permalink)  
Old 01-15-07, 12:58
Narwe Narwe is offline
Registered User
 
Join Date: Jan 2007
Posts: 24
Thank you for the reply.

The only thing that concerns me is the sheer number of rows that the intersection table would grow to, given that there are hundreds of properties for each chemical structure. For example, if there are 250,000 compounds in my database, and 500 properties for each compound, my intersection table is already 125 million rows long. I suppose though, that I'm better off with millions of rows than hundreds of tables or hundreds of columns within a single table.

Anyway, thanks again to everybody who has replied.
Reply With Quote
  #8 (permalink)  
Old 01-15-07, 13:19
loquin loquin is offline
Super Moderator
 
Join Date: Jun 2004
Location: Arizona, USA
Posts: 1,797
Quote:
Originally Posted by Narwe
Thank you for the reply.

The only thing that concerns me is the sheer number of rows that the intersection table would grow to, given that there are hundreds of properties for each chemical structure. For example, if there are 250,000 compounds in my database, and 500 properties for each compound, my intersection table is already 125 million rows long. I suppose though, that I'm better off with millions of rows than hundreds of tables or hundreds of columns within a single table.

Anyway, thanks again to everybody who has replied.
That's fine. Relational databases and SQL work well with "vertical" data. They don't work so well when the data is spread "horizontally". A table with a small number of rows that has one or more appropriate indexes will retrieve data quickly.

also note that with this approach, you ONLY have the data (and thus the storage requirements) actually assigned for chemical-property matches. And, since the properties are stored separately, one record per potential property, the storage requirements will be minimized. Assuming a long integer chemical and property key, and a double precision quantity, you're looking at the order of 16 bytes of data per chemical-ppoperty assignment. 125 million records would thus require on the order of two gigabytes disk storage. (plus the 250K records for chemical records, and the several thousand distinct property records.
__________________
Lou
使大吃一惊
"Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
"I have my standards. They may be low, but I have them!" - Bette Middler
"It's a book about a Spanish guy named Manual. You should read it." - Dilbert


Last edited by loquin; 01-15-07 at 17:10.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On