  1. #1
    Join Date
    Jan 2008
    Posts
    19

    Database Concept - How to set up a database with which I can track changes to an item

    Hey peeps,

    Found you guys/gals through xtremevbtalk, a forum that recommended this site for the more specific database-related questions. Hope I can fall back on you guys when I am in a tight spot concerning databases, and of course I hope to contribute back... Thus far the introduction. Now on-topic:

    I've been working with MSSQL for a while now, and I'm starting to understand how databases work (both in the general sense and the T-SQL part of it). I have no experience with cursor programming. I do have experience creating databases in MySQL for websites (nothing fancy).

    Now I need to create a program that is able to track "items" and the changes to those "items" - so that means some sort of version control is needed here. And for this, we require a database.

    Let's keep things simple and say that each item has 5 properties:

    - An ID [primary key]
    - A Color [Allowed to be NULL]
    - A Shape [Allowed to be NULL]
    - A Name [Allowed to be NULL]
    - A Weight [Allowed to be NULL]

    A group of people are meant to keep the items up to date and will regularly change any of the four properties (ID is not changeable). In some cases not all properties are known yet, so each property, with the exception of ID, can be NULL. My database is connected to the database where the items are stored, and when the users alter any number of properties and press submit, I can see this. I also have full access to all of the information regarding the old and new property values of the item. This basically means I can go any way I want with the data.

    What I want to do is create a database that I can query to retrieve as much information as possible about changes of items and the status of an item. Some examples of things I'd like to know:

    The property values of an item x at moment y
    "The color of item 10 at 2007-12-23 14:33:13 was black"

    The number of times people will change property x for item y / all items
    "I have noticed that the color of item 10 has changed 45 times...Make up your mind!"
    "It seems that we are having trouble defining the shape of an item: In total we changed the shape of all our items 6000 times... That's 90% of all the changes we made!"

    The time interval between a change of item x
    "It took us two days to figure out that item 4 should be square instead of circular"

    My question is:
    What is the most efficient way of storing the information?

    In my opinion it is a case of an overload of redundant information versus a database structure that is very hard to maintain/query. To give an idea about what amounts of data we are talking about:

    We have about 20 databases (projects) containing an average of 10,000 items each. Each item can have approx. 40-60 properties (columns). Each column holds very limited data: no more than 50 characters on average. We want to put all of the data in one database, since we want to query the data cross-project. It is possible to cut out some of the columns, but for now I want to go for the full monty.

    At first I was thinking about simply copying all of the data from an item into my own database, and give it a timestamp. This would make it easy to obtain the status of an item at time x. It would be harder though to see what the changes are, and also I'd be dealing with a lot of redundant information. (90% of the information of the item is left unchanged).

    Example - changing the shape of item 1:
    Code:
    ID  |  DateTime                 |Color  |Shape  | Name  | Weight   |
    1   |  2007-12-23 16:46:43      | Red   | sqr   | Pete  | 5 kg     |
    1   |  2007-12-23 16:51:48      | Red   | circ  | Pete  | 5 kg     |
    1   |  2007-12-23 16:53:43      | Red   | sqr   | Pete  | 5 kg     |
    • Advantages: easy query to retrieve the status of an item at time x.
    • Disadvantages: No idea how I can query for just the changes?
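For illustration, here is that "status at time x" query run against the example rows. This is a sketch in Python with SQLite standing in for MSSQL; the table name item_history is made up, but the same correlated subquery works on SQL Server 2000:

```python
import sqlite3

# Hypothetical snapshot table: every change re-inserts the full row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item_history "
            "(ID INT, DateTime TEXT, Color TEXT, Shape TEXT, Name TEXT, Weight TEXT)")
con.executemany("INSERT INTO item_history VALUES (?,?,?,?,?,?)", [
    (1, "2007-12-23 16:46:43", "Red", "sqr",  "Pete", "5 kg"),
    (1, "2007-12-23 16:51:48", "Red", "circ", "Pete", "5 kg"),
    (1, "2007-12-23 16:53:43", "Red", "sqr",  "Pete", "5 kg"),
])

# Status of item 1 at moment x = the latest snapshot taken at or before x.
status = con.execute("""
    SELECT Color, Shape, Name, Weight
      FROM item_history AS h
     WHERE h.ID = 1
       AND h.DateTime = ( SELECT MAX(DateTime)
                            FROM item_history
                           WHERE ID = h.ID
                             AND DateTime <= '2007-12-23 16:52:00' )
""").fetchone()
print(status)   # ('Red', 'circ', 'Pete', '5 kg')
```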


    Another method was inserting only the changed properties and leaving the other properties NULL. This would cut down on the redundant information and also make it easier to count changes.

    Example - changing the shape of item 1:

    Code:
    ID  |  DateTime                 |Color   |Shape  | Name  | Weight   |
    1   |  2007-12-23 16:46:43      | Red    | sqr   | Pete  | 5 kg     |
    1   |  2007-12-23 16:51:48      | NULL   | circ  | NULL  | NULL     |
    1   |  2007-12-23 16:53:43      | NULL   | sqr   | NULL  | NULL     |
    • Advantages: easy query to count the number of changes
    • Disadvantages: No idea how to query for the status of an item at time x
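The status at time x is still recoverable from this sparse layout: per column, take the most recent non-NULL value recorded at or before x. A sketch (Python with SQLite; on SQL Server 2000 you would write SELECT TOP 1 ... ORDER BY instead of LIMIT; the table name item_delta is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item_delta "
            "(ID INT, DateTime TEXT, Color TEXT, Shape TEXT, Name TEXT, Weight TEXT)")
con.executemany("INSERT INTO item_delta VALUES (?,?,?,?,?,?)", [
    (1, "2007-12-23 16:46:43", "Red", "sqr",  "Pete", "5 kg"),
    (1, "2007-12-23 16:51:48", None,  "circ", None,   None),
    (1, "2007-12-23 16:53:43", None,  "sqr",  None,   None),
])

# One small query per property: the latest non-NULL value at or before x.
q = """SELECT {0} FROM item_delta
        WHERE ID = ? AND {0} IS NOT NULL AND DateTime <= ?
        ORDER BY DateTime DESC LIMIT 1"""
x = "2007-12-23 16:52:00"
status = {col: con.execute(q.format(col), (1, x)).fetchone()[0]
          for col in ("Color", "Shape", "Name", "Weight")}
print(status)   # {'Color': 'Red', 'Shape': 'circ', 'Name': 'Pete', 'Weight': '5 kg'}
```

The cost is visible here: reconstructing one status takes one lookup per column, versus a single row fetch in the snapshot layout.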


    The last option I could think of was creating a separate table for each property of an item, and adding data to the table only when changes occur.

    Code:
    tbl_color
    ID  |  DateTime                 |Color   |
    1   |  2007-12-23 16:46:43      | Red    |
    
    tbl_shape
    ID  |  DateTime                 |Shape  | 
    1   |  2007-12-23 16:46:43      | sqr   | 
    1   |  2007-12-23 16:51:48      | circ  | 
    1   |  2007-12-23 16:53:43      | sqr   | 
    
    tbl_name
    ID  |  DateTime                 |Name  | 
    1   |  2007-12-23 16:46:43      | Pete  |
    
    tbl_weight
    ID  |  DateTime                 |Weight   |
    1   |  2007-12-23 16:46:43      | 5 kg    |
    • Advantages: All queries are a piece of cake
    • Disadvantages: 60 tables?!??! And what if we add a new property? or Null values?


    These were the only methods of doing this I could come up with, but I can imagine that there are more ways of storing the data, or smarter ways to retrieve it. I also have no feeling for which would be the best way to go (although I guess option 1 should be the worst).

    It's a lot of information I'm throwing at you all, but I hope my problem and questions are clear and that you can help me out...

    Thanks in advance!

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    that has to be one of the biggest first posts i've ever seen

    nice job giving all the background

    off the top of my head (without knowing your actual columns, or your actual frequency of changes) i like the first approach -- just (re)insert the changed information along with the timestamp

    disk space is the cheapest commodity in this entire scenario

    most apps care about performance speed, so complex queries are ruled out (as they tend to be slower)

    most stakeholders care about time to delivery, so taking a couple months to program the complex queries is also ruled out

    thus, the simplest queries win, and (although i could be overlooking some clever scheme here), this means just the one table, multiple rows for each id, the "current" one being the one with the highest timestamp for that id

    disk space is cheap
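that "highest timestamp per id" lookup can also be phrased as an anti-join (keep the rows for which no later row with the same id exists), which runs fine on servers without window functions - a sketch in python with sqlite, illustrative names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE archive (ID INT, DateTime TEXT, Shape TEXT)")
con.executemany("INSERT INTO archive VALUES (?,?,?)", [
    (1, "2007-12-23 16:46:43", "sqr"),
    (1, "2007-12-23 16:53:43", "circ"),
    (2, "2007-12-23 16:51:48", "rect"),
])

# the "current" row per ID is the one with no later row for the same ID
current = con.execute("""
    SELECT this.ID, this.Shape
      FROM archive AS this
      LEFT OUTER
      JOIN archive AS later
        ON later.ID = this.ID
       AND later.DateTime > this.DateTime
     WHERE later.ID IS NULL
     ORDER BY this.ID
""").fetchall()
print(current)   # [(1, 'circ'), (2, 'rect')]
```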
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Jan 2008
    Posts
    19
    Thanks for the compliment: I like a good prep before actually asking anything.

    I believe you are right concerning the amount of data, but as I stated in the starting post: I have no idea how I can query for just the changes. Taking a look at the database structure, and adding a second example:

    Code:
    ID  |  DateTime                 |Color  |Shape  | Name  | Weight   |
    1   |  2007-12-23 16:46:43      | Red   | sqr   | Pete  | 5 kg     |
    1   |  2007-12-23 16:51:48      | Red   | circ  | Pete  | 5 kg     |
    1   |  2007-12-23 16:53:43      | Red   | sqr   | Pete  | 5 kg     |
    2   |  2007-12-23 16:46:43      | Red   | rect  | Pete  | 5 kg     |
    2   |  2007-12-23 16:51:48      | Red   | rect  | Pete  | 6 kg     |
    2   |  2007-12-23 16:53:43      | Red   | rect  | Hank  | 6 kg     |
    Now I want to create a query with which I can trace all changes for each ID, so whatever my SQL query looks like, I want it to return the following:

    Code:
    ID  |  Color  | Shape  | Name  | Weight  |
    1   |    0    |   2    |   0   |    0    |
    2   |    0    |   0    |   1   |    1    |
    That's the ideal situation. Now, using the other method (inserting NULL in unchanged properties) would make this easy: just use a COUNT(column_name) combined with a GROUP BY ID and subtract 1 from the result to get the number of changes. But with the method of dumping all the info into a row, I have no idea how I should go at it. I guess I'd be looking for an algorithm of some sort that is capable of comparing two rows and NULL-ing all the columns that hold the same value in both rows?
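To check that COUNT idea against the example rows, a sketch using Python with SQLite (illustrative table name; COUNT(column) skips NULLs, and subtracting 1 discounts the initial full row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item_delta "
            "(ID INT, DateTime TEXT, Color TEXT, Shape TEXT, Name TEXT, Weight TEXT)")
con.executemany("INSERT INTO item_delta VALUES (?,?,?,?,?,?)", [
    (1, "2007-12-23 16:46:43", "Red", "sqr",  "Pete", "5 kg"),
    (1, "2007-12-23 16:51:48", None,  "circ", None,   None),
    (1, "2007-12-23 16:53:43", None,  "sqr",  None,   None),
    (2, "2007-12-23 16:46:43", "Red", "rect", "Pete", "5 kg"),
    (2, "2007-12-23 16:51:48", None,  None,   None,   "6 kg"),
    (2, "2007-12-23 16:53:43", None,  None,   "Hank", None),
])

# COUNT(col) counts only non-NULL values; -1 removes the initial insert.
changes = con.execute("""
    SELECT ID
         , COUNT(Color)  - 1
         , COUNT(Shape)  - 1
         , COUNT(Name)   - 1
         , COUNT(Weight) - 1
      FROM item_delta
     GROUP BY ID
     ORDER BY ID
""").fetchall()
print(changes)   # [(1, 0, 2, 0, 0), (2, 0, 0, 1, 1)]
```

which is exactly the matrix above.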

    Of course, I could go lame and just create both tables: one that contains all information in each row, and one that contains only the changes. That way I can just choose whichever table fits the purpose better.

    on the subject of data and performance....
    That really depends on what I want to be logged, so I can play around with that quite a lot. When it comes to changes I guess that a safe average would be 15. So that totals to:


    20 databases, 10,000 items, 15 changes per item, all stuffed in one table: that would give me 3,000,000 rows. I have to add that this is the absolute maximum number of rows we will be needing. Each column contains very little data: a date, a name, a term, a number - but no large amounts of text or attachments. As I said before: worst case, I'd have a column with a string of 50 characters.

    I have no idea if these are numbers that I should be scared of or laugh at - what is your view?

  4. #4
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    3 mill is a medium-small sized table for an enterprise rdbms like SS. What flavour (2k or 2005 or 7 or 6.5....)?

  5. #5
    Join Date
    Jan 2008
    Posts
    19
    I am using MSSQL 8.00.760 if I am correct.

  6. #6
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Heh - that's SQL Server 2000.

    Shame - it's easier in 2k5. Unfortunately comparing one row to the "previous" row is one of the harder things to do in SQL, especially with natural composite keys. I don't have time to whip something up now. Maybe Rudy or another contributor will be around shortly....

  7. #7
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Code:
    SELECT this.ID
         , this.datetimecol
      FROM archive AS this
    INNER
      JOIN archive AS prev
        ON prev.ID = this.ID
       AND prev.datetimecol =
           ( SELECT MAX(datetimecol)
               FROM archive
              WHERE ID = this.ID
                AND datetimecol < this.datetimecol )
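a runnable version of that previous-row self-join (sqlite via python here; the table, sample data, and the change-counting select list are illustrative additions, not from the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE archive (ID INT, datetimecol TEXT, "
            "Color TEXT, Shape TEXT, Name TEXT, Weight TEXT)")
con.executemany("INSERT INTO archive VALUES (?,?,?,?,?,?)", [
    (1, "2007-12-23 16:46:43", "Red", "sqr",  "Pete", "5 kg"),
    (1, "2007-12-23 16:51:48", "Red", "circ", "Pete", "5 kg"),
    (1, "2007-12-23 16:53:43", "Red", "sqr",  "Pete", "5 kg"),
    (2, "2007-12-23 16:46:43", "Red", "rect", "Pete", "5 kg"),
    (2, "2007-12-23 16:51:48", "Red", "rect", "Pete", "6 kg"),
    (2, "2007-12-23 16:53:43", "Red", "rect", "Hank", "6 kg"),
])

# join each row to its immediate predecessor, then count the columns
# whose value differs (NULLable columns would need extra CASE handling)
changes = con.execute("""
    SELECT this.ID
         , SUM(CASE WHEN this.Color  <> prev.Color  THEN 1 ELSE 0 END)
         , SUM(CASE WHEN this.Shape  <> prev.Shape  THEN 1 ELSE 0 END)
         , SUM(CASE WHEN this.Name   <> prev.Name   THEN 1 ELSE 0 END)
         , SUM(CASE WHEN this.Weight <> prev.Weight THEN 1 ELSE 0 END)
      FROM archive AS this
     INNER
      JOIN archive AS prev
        ON prev.ID = this.ID
       AND prev.datetimecol =
           ( SELECT MAX(datetimecol)
               FROM archive
              WHERE ID = this.ID
                AND datetimecol < this.datetimecol )
     GROUP BY this.ID
     ORDER BY this.ID
""").fetchall()
print(changes)   # [(1, 0, 2, 0, 0), (2, 0, 0, 1, 1)]
```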

  8. #8
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    by the way, 20 separate databases (one per project) is questionable

    i don't suppose you're willing to risk your job on no one ever, ever, ever coming back to you to ask for stats across projects...

  9. #9
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    You haven't stated it explicitly but it is implicit that you have a composite pk of id and timestamp - is this correct? If it is clustered too then Rudy's query should fly along.

  10. #10
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Timestamps
    It's more efficient (ie faster queries) if you have 2 timestamps rather than 1. The end time for the latest record should always be set to some time far in the future (ie '1 jan 9999') and not just null.

    With a single timestamp the server has to pull all the recs for id=1234 then find which rec has the latest timestamp and then pull the complete record for that timestamp. With 2 timestamps it just needs to pull the rec where id=1234 and the end timestamp='1 jan 9999'.
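A minimal sketch of the difference (Python with SQLite; table and data are illustrative):

```python
import sqlite3

HIGH_DATE = "9999-01-01 00:00:00"   # stands in for "no end date yet"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item_hist (ID INT, fromDate TEXT, toDate TEXT, Shape TEXT)")
con.executemany("INSERT INTO item_hist VALUES (?,?,?,?)", [
    (1234, "2007-12-23 16:46:43", "2007-12-23 16:51:48", "sqr"),
    (1234, "2007-12-23 16:51:48", HIGH_DATE,             "circ"),
])

# current record: a plain equality test on toDate, no MAX() scan needed
current = con.execute(
    "SELECT Shape FROM item_hist WHERE ID = ? AND toDate = ?",
    (1234, HIGH_DATE)).fetchone()[0]

# as-of query: the record whose validity interval covers moment x
x = "2007-12-23 16:50:00"
as_of = con.execute(
    "SELECT Shape FROM item_hist WHERE ID = ? AND fromDate <= ? AND toDate > ?",
    (1234, x, x)).fetchone()[0]
print(current, as_of)   # circ sqr
```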

    Number of changes
    You can see how many changes there have been to the colour of an item by using the following SQL (I don't have an MSSQL server to hand but it works in Sybase).
    Code:
    select count(distinct colour) from MyTab where id=1234
    The only problem with this is that if the colour swaps back and forth between red and green, this will only count as one change. It's doable, but much trickier, to cope with this issue.

    Efficiency
    It's been mentioned, but if you store the whole record each time a single field changes then you do have to be careful if one field changes quite frequently. The whole record gets stored over and over again in the database, i.e. if just one field changes 10 times a day for 1,000 items then you're storing 3,650,000 new records each year.

    EAV
    I can't resist mentioning this approach, not least because Rudy hates it, but also because it can apply in these situations. You'd just have one main table called Params:
    Code:
    create table Params (
       id        int,
       param     varchar(20),
       value     varchar(50),
       fromDate  datetime,
       toDate    datetime
    )
    
    insert Params values ( 1234, 'Colour', 'Red', getdate(), '1 jan 9999' )
    It's slightly more complex than this, as you'd need to end-date the old "current" record if there is one. You'd want tables to hold valid ids, valid params (i.e. Colour) and valid values for params where applicable (Colour = Red, Green ...). You'd do best having a procedure to validate and add a param, and a function to get values from the database:
    Code:
    add_param( 1234, 'Colour', 'Blue' )
    select get_param( 1234,'Colour' )
    Its advantage in this type of app is that you could add new params whenever you want without having to alter the database (i.e. add new fields or tables). You could add as many params as you want - what would happen in the current design if you add more than 255 fields to your table? Does MSSQL cope with that many fields in a table these days (I know there used to be problems)? It's also more efficient on storage space when you're storing lots of changes to records, as only the change is stored.
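To make the "end-date the old current record" step concrete, here is a sketch of what add_param and get_param could look like, run as one transaction (Python with SQLite; all names and bodies are illustrative, not a real implementation):

```python
import sqlite3

HIGH_DATE = "9999-01-01"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Params "
            "(id INT, param TEXT, value TEXT, fromDate TEXT, toDate TEXT)")

def add_param(con, item_id, param, value, now):
    # close the old "current" record (if any), then open the new one,
    # inside a single transaction so readers never see a gap
    with con:
        con.execute("UPDATE Params SET toDate = ? "
                    "WHERE id = ? AND param = ? AND toDate = ?",
                    (now, item_id, param, HIGH_DATE))
        con.execute("INSERT INTO Params VALUES (?,?,?,?,?)",
                    (item_id, param, value, now, HIGH_DATE))

def get_param(con, item_id, param):
    row = con.execute("SELECT value FROM Params "
                      "WHERE id = ? AND param = ? AND toDate = ?",
                      (item_id, param, HIGH_DATE)).fetchone()
    return row and row[0]

add_param(con, 1234, "Colour", "Red",  "2007-12-23 16:46:43")
add_param(con, 1234, "Colour", "Blue", "2007-12-23 16:51:48")
print(get_param(con, 1234, "Colour"))   # Blue
```

Counting changes then really is just a count(*) over Params filtered by id and param.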

    I'm now climbing down into my bunker and putting on my flame-retardant suit.

    Mike

  11. #11
    Join Date
    Jan 2008
    Posts
    19
    Wow, thanks for the quick response guys!

    r937: thanks for the query, it seems like the right way to go... Just have to let it sink in for a sec. Regarding your remark on 20 separate databases: I most likely lied there. Problem is that the original database containing the items I want to keep track of is structured in an unknown way. We on the outside see 20 projects that are not linkable to each other, so the term "project" would have been more appropriate in this context. The whole idea of this new database is to combine all of the data to obtain stats across projects - so I don't have to put my job on the line.

    Pootle flump: took a while to understand you meant primary key by "pk" (hey, I'm still a bit new to the db slang), but you are sort of right. In the database I am planning to build I will have the following columns (and more):

    UID, ITEM_ID, DateTimeStamp

    The UID is the real primary key of the table. ITEM_ID is the ID of the item (named ID in the examples) and DateTimeStamp is the DateTime stamp at which the change occurred. So:

    No, ITEM_ID and DateTimeStamp cannot be pk because technically it is possible for two persons to change an item at exactly the same time (although very, VERY improbable)
    Yes, because I will use the combination of those two to identify which version of the item I am talking about.

    ++edit
    missed your answer mike, reading it now ...

  12. #12
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    actually, it is not possible for two changes to happen at the exact same time -- the database will still perform them one after the other

    mike, stop with the EAV already, okay? it isn't just me who thinks that's a bad, bad, bad idea


    and by the way, if i change red to blue to green to blue to red to green to blue, that's 6 changes, but your COUNT(DISTINCT) returns only three

  13. #13
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Rudy's query could have problems if there can be dupes for ID and timestamp.

    This isn't a table I see benefitting from a surrogate key (UID).

    Mike - there's a downside to your two timestamps: each property change requires that a row is inserted and then the "previous" row is updated, so what you gain on one hand you lose on the other (i.e. selects could be quicker, but updates/inserts are slower). The OP also needs to code up some transaction management in that case.

    EAV - didn't see that one coming

  14. #14
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by r937
    actually, it is not possible for two changes to happen at the exact same time -- the database will still perform them one after the other
    I've had pk violations on a natural date column key inserting GETDATE()

  15. #15
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Originally Posted by r937
    mike, stop with the EAV already, okay?
    Sorry, can't do that - I said before that EAV only suits a (very) small range of apps but I'm afraid this is one of those apps.
    Originally Posted by r937
    and by the way, if i change red to blue to green to blue to red to green to blue, that's 6 changes, but your COUNT(DISTINCT) returns only three
    Totally agree with you here - I even said as much in my post. The EAV SQL to do the count correctly (i.e. return 6) would just be:
    Code:
    select count(*)
      from Params
     where id = 1234
       and param = 'Colour'
       and fromDate between @startTime and @endTime
    This took me about a minute to write and is five lines long - my guess is it would take someone at least an hour to write the same SQL to work with the proposed TNF table, and I'll bet it'll be 10 times more complicated. As you said - the simplest queries win.

    Originally Posted by pootle flump
    Mike - there's a downside to your two timestamps - each property change requires that a row is inserted and then the row "previous" is updated, so what you gain on one hand you lose on another (i.e. selects could be quicker, updates\ inserts are slower).
    True - I guess it would double the time to insert a given record into the database - it might even go from 1/100 sec to 5/100 sec but I doubt the user would notice.

    The time to select recs will reduce by roughly a factor of the average number of changes - so if you had 100 changes per rec then it would take 100 reads (ignoring caching and blocks etc.) to get the rec using one timestamp, and just 1 read using 2 timestamps. The user will definitely notice a performance increase here, especially when doing any reporting.

    Mike
