Results 1 to 8 of 8
  1. #1
    Join Date
    Jan 2007
    Posts
    24

    More columns vs more tables?

    I'm building a large chemical database that will store molecular structures and all descriptive data (descriptors) for each structure. There are a theoretically infinate number of descriptors but I will likely be limited to no more than 1 or 2 thousand (undetermined as of yet). My database will eventually grow to include millions of molecular structures. As you can see, I'm talking about A LOT of data and I'm really not sure how to handle it. Obviously, creating a "descriptors" table with hundreds or thousands of columns isn't going to cut it. And creating a separate table for each descriptor leaves me with some impressively clunky JOIN statements when I need all of the descriptors for a particular selection of molecules. I could organize the descriptors into arbitrary collections and create a table for each collection, but that creates its own issues.

    Any tips or advice? Are there any cheminformatics people around that might be able to help me out?
    Last edited by Narwe; 01-14-07 at 15:09. Reason: grammar

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    The majority of us are not chemists, (though Pat seems to know a little about everything), so you will need to be more descriptive of the entities and relationships involved.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  3. #3
    Join Date
    Jan 2007
    Posts
    24
    For each chemical structure (which is indexed and stored in its own table as a oracle blob) there are hundreds of derived properties. Each of these properties is derived from the specific structure of the compound. For example, the weight of the molecule, its number of atoms, its number of hydrogens, its number of bonds, all are properties derived from the structure of a particular compound. For every compound in my database these properties need to be determined and stored. I'll be dealing with hundreds of thousands of molecules to start, and that number will only continue to grow.

    It's worth emphasizing that for each compound there is obviously only one weight, one number of atoms, one number of hydrogens, one number of bonds, etc, so there is a 1 to 1 relationship between each compound and each of its properties. The difficulty is in designing a database whereby retrieval of these properties can be made as efficient as possible. The database actually contains many more tables, but it's the these chemical properties/descriptors that are giving me a hard time.
    Last edited by Narwe; 01-15-07 at 01:02. Reason: wording

  4. #4
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    I would of thought that standard properties, ie those common to all compounds could be stored in the main table. Im thinking of things like weight etc.

    The rest will probably best best modelled as a generalised property table, with foreign keys in both the compound AND a new entitiy say "PropertyType". That design is getting almost close to the favoured object property odel advocated by OOB enthusiasts.. wonder if there is a true OOdB whihc may work for this application

    However It may be smarter if you are going to set up a child table to place any properties there. afterall not all compounds will have Hyrdogen in

    Not sure of your application but presumably you may also have problems with molecules that exhibit chiral variations

    I dont think you should have a significant join problem. there may well be an issue int he presentation layer of how you flatten the properties, so youmay need some form of sort sequence or property grouping to be include in PropertyType
    I'd rather be riding on the Tiger 800 or the Norton

  5. #5
    Join Date
    Jan 2007
    Posts
    24
    Thanks for the reply.

    All of the properties we are using are present for all of the structures in our database. In fact we only use properties that apply to all of our compounds. We have hundreds of these properties, and more are possible. So there is not really any non arbitrary way that I can divide certain of these properties into the main table, and others be stored seperately.

    My first approach was to create a generalized property table. I was scared off by the sheer number of rows created in this table based only on the first 256,000 structures I initially processed, which led to the table being 230400000 rows long with 900 properties per structure. I will eventually have millions of structures, so this approach is a little unnerving. However, I'm not an experienced database designer, so I'm really not even entrely sure if my fear is justified.

    At the end of the day I'm looking for the most efficient structure that can be realized for such a large amount of data. If it's a damned if you do, damned if you don't kind of scenario, should I incorporate all of my descriptors into a single table, or should I break them up into some reasonable number of tables?

    So, to reiterate, here are my fears:

    Creating a table with too many columns (hundreds of properties or more).
    Creating a table with too many rows (generalized property table)
    Creating too many tables (creating a table for each property, or a creating table that holds several properties in an arbitrary collection)
    Last edited by Narwe; 01-15-07 at 09:56.

  6. #6
    Join Date
    Jun 2004
    Location
    Arizona, USA
    Posts
    1,848
    I was involved in a similar discussion at a sister site.

    http://www.xtremevbtalk.com/showthread.php?t=217019


    Now, personally, I would think that you would want to set up a table of derived properties, which would have a many-to-many relationship with the chemical_structure table.

    The many-to-many relationship is created in a relational database by using an intersection table. The intersection table would hold the foreign key to the chemical structure, the foreign key to the derived_properties table, as well as the specific quantity associated with this particular structure/properties link.

    For instance, lets address good old Water.

    The chemical structure entry might be
    Code:
    id   Description
    172  Water
    (note - there would child tables holding the actual chemical structure - ref the link above)

    The derived_properties entries could include
    Code:
    id    derived property
    17    Hydrogen Quantity
    24    Number of Bonds
    25    Number of Atoms
    27    Atomic Weight
    The intersection table, (call it derived_chemical_properties) could have

    Code:
    id_cs  id_dp   Qty
    172    17      2         ( water - hydrogen quantity)
    172    24      2         ( water - Number of bonds quantity)
    172    25      3         ( water - Number of Atoms quantity)
    172    27      18.02     ( ??? I THINK that's close - it's been a long while since I cracked open a chemistry reference!)
    Note that the primary key of the intersection table is a compound key, consisting of BOTH of the foreign keys. This means that you can have as many references to water as needed, you can have as many references to derived properties as needed, but the combination of chemical and property are unique. As it should be. Note that the description shown to the right of the quantity field in the intersection table is there for our explanation only. In the real system, this description can be retrieved from the chemical table's description and the derived quantity tables' description fields as a part of any join when you're retrieving data, if needed

    The beauty of a many-to-many relationship is that a single entry in the derived properties table ("Hydrogen Quantity", for instance) can be referenced an infinite number of times by an infinite number of chemicals, but it can only be referenced once by any given chemical.

    Note that you will want to keep the information to a minimum in the intersection table, as the number of rows in this table WILL be large - one for each distinct chemical-derivedproperty combination. But, you can keep the database structure clean, and the size to a minimum, as there will only be one distinct entry for each chemical in the chemical structure table, and one entry for each distinct derived property in the derived property table.
    Last edited by loquin; 01-15-07 at 14:09.
    Lou
    使大吃一惊
    "Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
    "I have my standards. They may be low, but I have them!" - Bette Middler
    "It's a book about a Spanish guy named Manual. You should read it." - Dilbert


  7. #7
    Join Date
    Jan 2007
    Posts
    24
    Thank you for the reply.

    The only thing that concerns me is the sheer number of rows that the intersection table would grow to, given that there are hundreds of properties for each chemical structure. For example, if there are 250,000 compounds in my database, and 500 properties for each compound, my intersection table is already 125 million rows long. I suppose though, that I'm better off with millions of rows than hundreds of tables or hundreds of columns within a single table.

    Anyway, thanks again to everybody who has replied.

  8. #8
    Join Date
    Jun 2004
    Location
    Arizona, USA
    Posts
    1,848
    Quote Originally Posted by Narwe
    Thank you for the reply.

    The only thing that concerns me is the sheer number of rows that the intersection table would grow to, given that there are hundreds of properties for each chemical structure. For example, if there are 250,000 compounds in my database, and 500 properties for each compound, my intersection table is already 125 million rows long. I suppose though, that I'm better off with millions of rows than hundreds of tables or hundreds of columns within a single table.

    Anyway, thanks again to everybody who has replied.
    That's fine. Relational databases and SQL work well with "vertical" data. They don't work so well when the data is spread "horizontally". A table with a small number of rows that has one or more appropriate indexes will retrieve data quickly.

    also note that with this approach, you ONLY have the data (and thus the storage requirements) actually assigned for chemical-property matches. And, since the properties are stored separately, one record per potential property, the storage requirements will be minimized. Assuming a long integer chemical and property key, and a double precision quantity, you're looking at the order of 16 bytes of data per chemical-ppoperty assignment. 125 million records would thus require on the order of two gigabytes disk storage. (plus the 250K records for chemical records, and the several thousand distinct property records.
    Last edited by loquin; 01-15-07 at 18:10.
    Lou
    使大吃一惊
    "Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
    "I have my standards. They may be low, but I have them!" - Bette Middler
    "It's a book about a Spanish guy named Manual. You should read it." - Dilbert


Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •