Design for documenting data contained within the DB
Thank you once again, to those of you who have helped me with some of my prior questions. This aerospace engineer has learned a lot about databases the last month.
One of the key requirements in the database I'm designing is to be able to capture the source of a given piece of data. For each data source, I want to capture the document name, author, date, etc. Clearly this is a simple table to implement.
The question is how to link a data source to the data it has provided - which is after all the main content of the database. My requirements are as follows:
I know multiple data items might come from a single data source.
The data items will be spread across multiple tables.
The data in one row of any given table could potentially have multiple sources
Sometimes I might have multiple data sources for a single data item (though redundancy is good, I suppose if push came to shove this requirement could be eliminated)
There is no requirement to "easily" see all items that a given data source documents. The requirement is only to see what sources document a given piece of data.
I've toyed around with some ideas, but would love some third-party insights. Some of the concepts I've toyed around with:
Keep all data sources in one single table. I really don't see any other useful concept for storing the data source information.
Each table that has data whose source needs to be documented has a corresponding junction table that joins the data table with the source table in a many-to-may relationship. This has two disadvantages.
By definition, I'll have as many junction tables as real data tables.
All of the contents of a given row will point to the same set of data sources. It is up to the reader to determine that A came from Doc1 and B from Doc2. However, since presumably the columns of each table are closely related, this is probably not awful.
Another alternative is to have a single junction table. As before the junction table would have a FK to the data source table. But instead of having just a FK to the data table, I would have to have some sort of column that serves as a data table identifier. This is in order to retain uniqueness in all of the rows of the junction table. This has the advantage of reducing the number of tables, but it does make the design somewhat more obscure
A third alternative is to basically have a FK column for each and every row in any table that requires documenting the source of that column. This would allow me to document the source on a per column basis,not just a per row basis. On the other hand, I would loose the ability to have multiple documentation sources for a single piece of data. This is probably OK, if it significantly simplifies the design. Having a data source is a requirement, redundant data sources are nice-to-have.
If I went this route, I could see each table having almost twice the number of columns (i.e. one for data, one for its source), or having a "mirror" table in a 1-1 relationship to the data table, and whose contents are just the FKs to the data sources.
At this point I'm leaning toward Options 1 or 3. From a content perspective Option 1 has the advantage of multiple sources, at the expense of all the data in a row having the same set of sources. Option 2 functionally is the same, but seems "ugly". Option 3 has the advantage of sourcing every column, at the expense of having a single data source for each data bit. I'm willing to trade some of these options against ease of front-end implementation and/or performance.
Comments, concerns, or suggestions are appreciated!
What kind of "data" are we talking about here? Is this all data that can be represented by a single table? What kind of real world entity is this data supposed to represent? Also, how would it come to pass that you have multiple sources for a single data item?
It's easy enough to spool up a table of sources with a valid source id. Deciding where, when and how to use that source id is entirely dependent on the business rules/workflow involved with your data. "Piece of data" can mean ANYTHING the way you've described it so far. It will be very difficult to lend any guidance on possible approaches without understanding the core of your model (the data item itself). On the plus side, tracking source of origin is a pretty common requirement in all kinds of random business processes so there's a pretty good chance of having some existing models to work from once you've better defined what it is you're tracking and why...
What kind of "data" are we talking about here? Is this all data that can be represented by a single table? What kind of real world entity is this data supposed to represent?
The database is a consolidate point to store information regarding our products. We make one-of-a-kind products, and we want to capture a broad range of information. We are capturing technical parameters and performance (i.e. mass, power), contractual information (program cost, customer, etc), historical schedule information (i.e. how milestones shifted during the development), events related to each product (start of contract, technical reviews, product delivery, decommissioning), etc. The data is most definitely not in one table. For instance, although each product is unique they might share common attributes: customer, power supply design, delivery information (if they were part of a single acquisition).
Also, how would it come to pass that you have multiple sources for a single data item?
Some of this data is historical, and we have contradicting pieces of information. Other times, the data captured in a "single" data point is derived from different sources. This is largely a result that various programs historically might have reported different flavors of the same data, and for capturing it in the database we want to convert all the information to the same reference frame. A (trivial) example I recently ran into, for instance date for the start of operations. For a number of the products, I noticed that there was a difference of one day between two data sources I had. Turns out one source had UTC, the other local time.
The contradicting information and the different ways of looking at the data is actually one of the primary reasons for creating this database.
It's easy enough to spool up a table of sources with a valid source id. Deciding where, when and how to use that source id is entirely dependent on the business rules/workflow involved with your data. "Piece of data" can mean ANYTHING the way you've described it so far.
As, I mentioned, "piece of data" does mean a number of different things. It is all historical, however. The goal is that when someone pulls up a page of information, they can press a button and be told where all that information came from.
I think a fully normalised, lossless data model (option 4) would be horrific to implement and would contain thousands of tables.
Option 2 is funny - it is like a One True Association Table (if you have come across OTLTs before then you'll understand what I mean).
I reckon option 3 is your best bet. You could make the source columns bitmasks (if you have few enough sources) and can stand to lose referential integrity. This would allow you to store the multiple sources.