Results 1 to 8 of 8
  1. #1
    Join Date
    May 2008
    Posts
    33

    Storing Actual history and database history

    Hello,
    I'm creating a table to track real changes of an object and also our knowledge of an object against time. For example, it could be some statistics about a town.

    The important thing is that our version of real history of the object at any point in time is just as valuable as a version of history at any other point in time. So for example the current version of history is just as important as what we thought the history of the object was a year ago.

    The kind of design I've come up with so far is:


    Code:
    --These will be constants that will never change since the town is created in the database
    CREATE TABLE TownSeeds(
    	Id int Identity(1,1) CONSTRAINT PK_TownSeeds Primary Key,
    	EntryDate DateTime NOT NULL DEFAULT Current_TimeStamp
    )
    
    
    
    --stores changing attributes of the town
    CREATE TABLE Towns(
    	Id int CONSTRAINT FK_TownsChanges_TownSeeds FOREIGN KEY REFERENCES dbo.TownSeeds(Id),
    	TownName varchar(255) NOT NULL,
    	Population float,
    	
    	--other attributes...
    	
    	--actual history of the town
    	ChangeDate DateTime NOT NULL,
    	NextChangeDate DateTime NOT NULL,	--created to make queries easier
    	
    	--stores our history of knowledge of the town
    	UpdateDate datetime NOT NULL,
    	NextUpdateDate datetime NOT NULL,
    	CONSTRAINT PK_Towns Primary Key(Id,UpdateDate)
    
    
    )
    Here is some example data being inserted:

    Code:
    --Create town with two points in history where population changes
    INSERT INTO TownSeeds VALUES(DEFAULT,DEFAULT)
    
    DECLARE @Town1Id int
    SET @Town1Id=Scope_Identity()
    DECLARE @EntryDate datetime
    SET @EntryDate=(SELECT EntryDate FROM TownSeeds WHERE Id=@Town1Id)
    
    INSERT INTO Towns
    SELECT
    	Id=@Town1Id,
    	TownName='test town',
    	Population=5000,
    	ChangeDate='19050101',
    	NextChangeDate='19100101',
    	
    	UpdateDate=@EntryDate,
    	NextUpdateDate='99991231'
    		
    
    INSERT INTO Towns
    SELECT
    	Id=@Town1Id,
    	TownName='test town',
    	Population=7000,
    	ChangeDate='19100101',
    	NextChangeDate='99991231',
    	
    	UpdateDate=@EntryDate,
    	NextUpdateDate='99991231'
    	
    
    --Better research shows that the actual population was 5500 in 1905
    
    DECLARE @UpdateDate datetime
    SET @UpdateDate=CURRENT_TIMESTAMP
    
    UPDATE Towns
    SET NextUpdateDate=@UpdateDate
    WHERE Id=@Town1Id AND UpdateDate=@EntryDate
    
    
    INSERT INTO Towns
    SELECT
    	Id=@Town1Id,
    	TownName='test town',
    	Population=5500,
    	ChangeDate='19050101',
    	NextChangeDate='19100101',
    	
    	UpdateDate=@UpdateDate,
    	NextUpdateDate='99991231'
    I'm a little concerned it might be hard to apply constraints to this data with this added complexity.

    I'd really appreciate some feedback on this design. I'm guessing this is a fairly common requirement so I'm hoping someone here will have some helpful feedback from their experience implementing something similar.

    I've actually asked this question a while back here, but would appreciate a little more discussion.

    Thanks,
    Michael

  2. #2
    Join Date
    May 2008
    Posts
    33
    Please let me know if I haven't been clear on the question or you'd like more information. I actually asked the question in a different forum before so was hoping you guys might have more to say. I'd appreciate any response!

    Michael

  3. #3
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    There were some errors in your code - corrected.
    Code:
    IF EXISTS (SELECT NULL FROM sys.tables WHERE object_id = OBJECT_ID('dbo.towns')) 
    BEGIN
        DROP TABLE dbo.towns
    END
    
    IF EXISTS (SELECT NULL FROM sys.tables WHERE object_id = OBJECT_ID('dbo.TownSeeds')) 
    BEGIN
        DROP TABLE dbo.TownSeeds
    END
    
    --These will be constants that will never change since the town is created in the database
    CREATE TABLE TownSeeds(
        Id int Identity(1,1) CONSTRAINT PK_TownSeeds Primary Key,
        EntryDate DateTime NOT NULL DEFAULT Current_TimeStamp
    )
    
    --stores changing attributes of the town
    CREATE TABLE Towns(
        Id int CONSTRAINT FK_TownsChanges_TownSeeds FOREIGN KEY REFERENCES dbo.TownSeeds(Id),
        TownName varchar(255) NOT NULL,
        Population float,
        
        --other attributes...
        
        --actual history of the town
        ChangeDate DateTime NOT NULL,
        NextChangeDate DateTime NOT NULL,    --created to make queries easier
        
        --stores our history of knowledge of the town
        UpdateDate datetime NOT NULL,
        NextUpdateDate datetime NOT NULL,
        CONSTRAINT PK_Towns Primary Key(Id,UpdateDate)
    )
    
    DECLARE @Town1Id INT
    
    --Create town with two points in history where population changes
    INSERT INTO TownSeeds VALUES(DEFAULT)
    
    SET @Town1Id=Scope_Identity()
    DECLARE @EntryDate datetime
    SET @EntryDate=(SELECT EntryDate FROM TownSeeds WHERE Id=@Town1Id)
    
    INSERT INTO Towns
    SELECT
        Id=@Town1Id,
        TownName='test town',
        Population=5000,
        ChangeDate='19050101',
        NextChangeDate='19100101',
        
        UpdateDate=@EntryDate,
        NextUpdateDate='99991231'
            
    
    INSERT INTO Towns
    SELECT
        Id=@Town1Id,
        TownName='test town',
        Population=7000,
        ChangeDate='19100101',
        NextChangeDate='99991231',
        
        UpdateDate=DATEADD(d, 1, @EntryDate),
        NextUpdateDate='99991231'
        
    
    --Better research shows that the actual population was 5500 in 1905
    
    DECLARE @UpdateDate datetime
    SET @UpdateDate=CURRENT_TIMESTAMP
    
    UPDATE Towns
    SET NextUpdateDate=@UpdateDate
    WHERE Id=@Town1Id AND UpdateDate=@EntryDate
    
    
    INSERT INTO Towns
    SELECT
        Id=@Town1Id,
        TownName='test town',
        Population=5500,
        ChangeDate='19050101',
        NextChangeDate='19100101',
        
        UpdateDate=@UpdateDate,
        NextUpdateDate='99991231'

  4. #4
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Actually, I only corrected the compile errors - I think there are logical errors in your data.

  5. #5
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    I don't see any reason to store the surrogate key (TownSeed) in a separate table. Also, I doubt Next...Date will help to speed up queries, while having this columns definitely introduces maintenance problems. Otherwise, it looks like a normal slowly changing dimension to me. Google for "slowly changing dimension" and you'll find plenty of information.
    ---
    "It does not work" is not a valid problem statement.

  6. #6
    Join Date
    May 2008
    Posts
    277
    Your design is pretty much there, but I'd suggest changing your table names to clarify some of the concepts.

    Your 'TownSeeds' table is your time-invariant data, and is the "anchor" for your time-variant attributes (it's also what other tables will probably hold a foreign key to if they need a reference to a "town"). In fact, this table represents your Town, and I'd rename the table as such:
    Town(town id, founded date, coordinates, ...)

    Now we need a new name for time-variant table. I don't know exactly what you're recording here, but since it's census time here in the states, I'll take some inspiration from that:
    TownCensus(town id, row valid date, row superceded date, from date, to date, town name, population, ...)

    Your natural key for this table is the town and the time period during which the row is valid. You'll most likely need some sort of trigger to enforce it -- a simple primary key will be able to enforce unique dates, but can't prevent overlapping time periods, which you most likely don't want (you should only have one valid row at any one time). You'll probably also need to enforce that there's no overlap in your 'from' and 'to' dates for currently-valid rows, and possibly no gaps either.

    One potential drawback to this design is that everytime you enter a change, you must provide values for all columns. If you don't expect changes to occur in tandem (only one attribute changes out of, say, a dozen), then you're going to be double-entering a lot of data to record the one change. Although this might still be desirable if you're entering in some sort of official record -- like census results -- in which case you may be interested in recording that there's been no change.

    The solution to this, as proposed in the other forum, is to move each attribute into its own table, each with their own timestamp columns. That way you can record a population change without having to re-enter everything else that didn't change.

    You can mix-and-match as well. Since I wouldn't expect town names to change very much (if at all), I might do something like:
    Town(town_id, ...)
    TownName(town_id, from_date, to_date, town_name, ...)
    TownCensus(town_id, from_date, to_date, population, ...)

    Unfortunately, there's no way around implementing a bunch of complex triggers to enforce all your various time constraints. Your joins also have the potential to become very complex. You may want to follow up in the appropriate DB-specific forum if you have implementation problems. There's also a very good book (though slightly outdated) that you may want to look at called 'Developing Time-Oriented Database Applications in SQL' by Richard Snodgrass. Handily enough, it's available for free download.

  7. #7
    Join Date
    May 2008
    Posts
    33
    Oh dear, I can't believe I hadn't checked this code. That's probably why I didn't receive as much discussion when I posted this issue last in the other forum. Here is the corrected version:

    Code:
    --These will be constants that will never change since the town is created in the database
    CREATE TABLE TownSeeds(
    	Id int Identity(1,1) CONSTRAINT PK_TownSeeds Primary Key,
    	EntryDate DateTime NOT NULL DEFAULT Current_TimeStamp
    )
    
    
    
    --stores changing attributes of the town
    CREATE TABLE Towns(
    	Id int CONSTRAINT FK_TownsChanges_TownSeeds FOREIGN KEY REFERENCES dbo.TownSeeds(Id),
    	TownName varchar(255) NOT NULL,
    	Population float,
    	
    	--other attributes...
    	
    	--actual history of the town
    	ChangeDate DateTime NOT NULL,
    	NextChangeDate DateTime NOT NULL,	--created to make queries easier
    	
    	--stores our history of knowledge of the town
    	UpdateDate datetime NOT NULL,
    	NextUpdateDate datetime NOT NULL,
    	CONSTRAINT PK_Towns Primary Key(Id,UpdateDate,ChangeDate)
    
    
    )
    
    
    
    --Create town with two points in history where population changes inserted a year ago
    INSERT INTO TownSeeds VALUES(dateadd(year,-1,CURRENT_TIMESTAMP))
    
    DECLARE @Town1Id int
    SET @Town1Id=Scope_Identity()
    DECLARE @EntryDate datetime
    SET @EntryDate=(SELECT EntryDate FROM TownSeeds WHERE Id=@Town1Id)
    
    INSERT INTO Towns
    SELECT
    	Id=@Town1Id,
    	TownName='test town',
    	Population=5000,
    	ChangeDate='19050101',
    	NextChangeDate='19100101',
    	
    	UpdateDate=@EntryDate,
    	NextUpdateDate='99991231'
    		
    
    INSERT INTO Towns
    SELECT
    	Id=@Town1Id,
    	TownName='test town',
    	Population=7000,
    	ChangeDate='19100101',
    	NextChangeDate='99991231',
    	
    	UpdateDate=@EntryDate,
    	NextUpdateDate='99991231'
    	
    
    --Better research now shows that the actual population was 5500 in 1905
    
    DECLARE @UpdateDate datetime
    SET @UpdateDate=CURRENT_TIMESTAMP
    
    UPDATE Towns
    SET NextUpdateDate=@UpdateDate
    WHERE Id=@Town1Id AND UpdateDate=@EntryDate
    
    
    INSERT INTO Towns
    SELECT
    	Id=@Town1Id,
    	TownName='test town',
    	Population=5500,
    	ChangeDate='19050101',
    	NextChangeDate='19100101',
    	
    	UpdateDate=@UpdateDate,
    	NextUpdateDate='99991231'
    The nextupdate and nextchangedate will help speed up queries as a common requirement is to pull up 1 row per town showing it's attributes on a particular date as we believed they were on particular certain date. For instance, we might want the population and the name of all towns for 10 years ago as we had them recorded a year ago. Having the Next... columns makes this easy:

    Code:
    DECLARE @RealHistoryDate datetime
    SET @RealHistoryDate=DATEADD(Year,10,CURRENT_TIMESTAMP)
    
    DECLARE @UpdateHistoryDate datetime
    SET @UpdateHistoryDate=DATEADD(Year,-1,CURRENT_TIMESTAMP)
    
    
    
    SELECT * 
    FROM Towns T
    WHERE 
    	T.ChangeDate<=@RealHistoryDate
    	AND
    	T.NextChangeDate>@RealHistoryDate
    	AND
    	T.UpdateDate<=@UpdateHistoryDate
    	AND
    	T.NextUpdateDate>@UpdateHistoryDate
    Without the next... columns it would be difficult to pull up only one row for each town like in the example.

    The "Seed" table is used for foreign key reference and for new id generation. I know it's not ideal but I thought this would be the least exotic method of doing both these things and also a place to store non-changing attributes. I haven't stored the founding date there as this might of been researched incorrectly and revised at a later date. Other things that might be considered storing in the "Seeds" table are columns that state a relationship that defines an object, altough I can't think of any examples at the moment.

    Almost everything will be a changing attribute, and there are a lot more than the example shows. This is why I have chosen to call the history table "towns", as the seeds table will rarely be used in queries. This is also why it would be inpratical to have a table for each changing attribute, as this would means a table for each one, which will typically be around 20 different attributes and there are other objects that have relationships to towns, such as counties, that also have changing attributes.

    Thanks for the link on the book, I've just downloaded it and will have a good look through.

    Another question; the primary key in this changing attribute table is the Id,UpdateDate and ChangeDate. This is also a clustered index. Any column that relates to another object type with a high number of objects I.e. counties, has a non-clustered index on it. Is this a good way to index this table?

    A few other notes; I am not actually storing information about towns, rather objects in the offshore oil and gas industry, but wanted to use an example that everyone is familiar with. Also, I am using SQL Server 2008, although this shouldn't really be an issue, as I expect the design should be the same on any database engine.

    Thanks very much for your replies!

  8. #8
    Join Date
    May 2008
    Posts
    33
    And on the issue of checking data integrity I had thought of triggers, although they really aren't my favourite things. They just don't seem quite flexible enough and don't always produce expected results.

    Another solution I had thought of is running a scheduled stored procedure that flags data not following certain rules. This means that transaction speed isn't as compromised and rules can be more complex with the scheduled sproc run at times of low database activity. I guess the issue is here that "flagged" data will have to be dealt with in some way. Perhaps the original researcher can be asked to correct it before it is marked as usable.

    Let me know if you think this is a good approach.

    Thank you

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •