I am looking to implement row lineage into a new enterprise database, so I am able to track which process, on which date, etc. was responsible for creating that row... Obviously I have a set of audit tables which store the processes which have run, and I was looking to have each row of data, in each operation table, tagged with an ID from that audit table...
First question : I assume the logical place to store this row lineage is as an extra column on each actual table, therefore I can link to the audit information directly from there when querying, if need be... The only other solution I could think of would be to have a seperate 'Data Registry' table, which stores the Audit ID, the table im talking about, and the primary key structure... This seems unnecessary, as it is much easier to store on the actual table, and there are many variations of key structures across all of the tables anyway, so that would be difficult to store... The only advantage of the 2nd solution would be that I do not have metadata stored on each operational table, which I know some data modellers may believe to be the correct way to go... Im definately leaning towards the first option, but was looking for some opinions please ?
Second Question : If I do end up storing the Audit ID on each operational table, obviously I can only store one at a time against each row... But I could have inserted the data from one process, then updated it from another... I am talking about a normalised data warehouse here, so data is rarely overwritten anyway, as history is stored, but in the case that it is, what do you see as being the correct method... I can see the advantages in both, ie: only store the Audit ID that created the data, any updates do not alter the Audit ID, or every time you perform an update to a row, update with the current Audit ID... Any opinions on that please ?
every operation should get its own current audit id, as well as a datetime stamp
typically in this situation, both the audit id and a datetime stamp become part of the primary key, because they "extend" the primary key so as to allow multiple occurrences of a row for the original primary key
it's hard to see this as a separate table, so i would go for your first solution
thanks.... the current design has the date-time stamp is recorded on the actual audit table ( applicable to the job which was responsible ), therefore is not required on each operational data table... i see what you are saying about including the 'audit_id' as part of the primary key, but that interferes with the normal key structure of the table, which states that say a customer may only exist once ... If i start letting my metadata dictate the table / PK structure, it will compromise the reporting from the table, which is of course the primary objective.... as i said before, this is a data warehouse therefore is history enabled, meaning most tables are not updated, but there are a few instances where this is not the case, say the tables holding the important entities within my organisation ( customers, accounts, products, etc. )
right, well, from your original few paragraphs of explanation i though i understood it, but obviously not well enough, so if you do not want to extend a primary key in a data warehouse then of course you can only store one row with each value so i don't see how you can build up an audit history if you can only store one version of each record
you would need to include the table/schema the change applies to in that Audit table also... but the problem is that the primary key structures across all the different tables is different, some are single attribute, some are multi-part, etc... how would i get around that apart from storing them in a single CHAR column ( with each seperated by commas, for example ), and if i did that, i wouldnt be able to join to the original table any more ?