im rather new to database design and im building a sequence and lab management web database for a conservation genetics company.
the main design headache im encountering is modelling the data recorded for the automatic dna sequencer and pcr machine, these have a large number of lanes or positions, usually about 96, that each can store an individual sample that has a number of different fields associated with it e.g. tissue type, elution volume etc etc.
i need to record the data foreach particular run of these machines; so i have a table containing the id number of the run with other general information not specific to a lane/position, and then have alot of further one to one related tables that hold one piece of data foreach of the lanes/positions. consequently my table count is growing huge (to my eyes anyway).
im considering combining data into single fields but from what i understand of db design this is a big no no - but might be the only solution. if anyone has the inclination to discuss this with me, offer comments or point me to some info sources on designing dbs that arent the usual business models or ecom sites, id be hugely appreciative.
From what you have written I would say you have to start a couple of hierarchy levels higher. You create a table for the entire sample, then a table for each of the positions, then a table for each of the tests for each of the positions and then a table for the results for each of the tests. It may be more complicated at the lower end but it seems close.
I think im basically following you, however i don't fully understand how you're envisaging the Lane and Position tables: currently, i have a table for each of the properties, at each position or lane, in a run that need to be recorded, e.g.
But are you suggesting a seperate table for the lanes where each lane is recorded as a row? is that preferable to having a maximum of 96 collums? also i think i misled you, lanes and positions are the same thing - it just depends on whether its the Sequencer or PCR machine.
Again, cheers for running through this stuff with me.
Please excuse my ignorance, but in terms of implementing such a table a set of lanes (rows) would be identified by their RunID? Would there still need to be a seperate LaneID for each row and would that still be the primary key? i would also then simply have all the properties of each lane as collums right? i know this must seem bleedin obvious, but i just need to make sure.
the lane table would begin to grow to a quite large size i would guess, particularly if we use alot of 96 lane runs, at what stage does size become an issue? i read in a previous thread 100,000,000 is big - so i suppose its going to be a while before they reach 1 000 000 sequencing runs, would that be correct (considering each row would have about 5 - 10 attributes of small value)?
First we have to get some of the concepts straight so I will share with you my understanding to save time.
A machine has 96 lanes. period.
Each processing sequence is called a run.
For a run each lane contains a sample with a set of properties.
What I am saying is that you create a lane table containing 96 rows that never gets additional rows. The lanes never change physically neither should the table. It's only purpose is to tell you which lane a sample is going into through a join between lane and sample.
Maybe this would make more sense.
Lane(LaneID, LaneName) -- 96 rows
You have to create an entity relationship diagram. Or barring that you have to write out the rules of you model as I attempted in brief. Tell me the story of a run. What are the properties of a lane. What are the properties of a sample. What are the properties of a run. What else is involved.
Note that it is clear to me that a run, a lane and a sample are separate entities don't mix them up.
So the simplest machine process to describe is the DNA Sequencing run of the Automated Sequencer(AS).
Within the AS is a tray that has 96 lanes.
Each lane of the AS can be loaded with finished PCR product (simply a piece of DNA that has been repeatedly copied to give a stronger signal).
PCRs are produced on a special PCR machine in large runs aswell.
A PCR machine has a tray also with 96 positions. Each position can produce PCR product from one Sample. By Sample I mean the DNA of an individual animal.
For the Sequencing Run we also need to know, for each lane, the PCRRunID, the position of this particular product on the PCR tray and the ID of the Sample at that position aswell.
Additionaly, for the Sequencing Run as a whole, we need a RunID, the Date, StaffID of who carried out the process, ID of which member of the management staff approved it, the location of the AS files (on a serperate computer) and 10 values detailing the various volumes of solutions added to each lane e.g. how much PCR product, buffer, dye and primers.
So thats the information im trying to model for a straight forward sequencing run. Other types of runs are more complicated, for instance, performing Genotyping analysis. This is essentially the same as the sequencing run except that instead of a single pcr (with its associated data) up to 30 can be included (with similar data aswell). Thats my biggest headache.
Each Sample has the following: an ID, an Original code (of those who sent it to us), the ProjectID thats its associated with, Species name, Location, GPS, SampleDate, a description of the sample (eg roadkill), sex of the animal, contact (ie who sent it), its position number in the freezer, DNAExtract storage number and notes about the data inputer and approval.
I have got an ERD but it changes frequently as i've been struggling with how to represent the data. Hence my call for aid.
A further question; im not fully understanding the concept of the Lanes table. If you use it as a join for the sample table where is the data for each sequencing run actually stored?
I cant thankyou enough for your willingness to help - if you ever need a place to stay in Malaysia (where alot of our work is based) then give me a yell.
It's the former, you mix 30 different pcr products (positions) into one lane for DNA fingerprinting, each pcr represents a different section of the genome from one animal. You then place different animals in the other lanes and you can quickly infer relatedness (just incase you where interested).
Above you have the PCRPosition and GenotypeLane tables, are they still static, as in not acquiring anymore rows, like you were suggesting earlier? Couldnt you just have a PCRPositionNumber field in the PCRRunPosition table instead?