If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > General > Database Concepts & Design > Database Concept - How to set up a database with which I can track changes to an item

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 01-24-08, 15:33
jf0rce jf0rce is offline
Registered User
 
Join Date: Jan 2008
Posts: 19
Database Concept - How to set up a database with which I can track changes to an item

Hey peeps,

Found you guys/gals thru xtremevbtalk, a forum that recommended this site for the more specific database related questions. Hope I can fall back on you guys when I am in a tight spot concerning databases, and ofcourse hope to contribute back ... Thus far the introduction...Now on-topic:

I've been working with MSSQL for a while now, and I'm starting to understand how databases work (in a general sense of the word, as well as the T-SQL part of it). I have no experience with programming with cursors. I have had experience with creating databases in MySQL for websites (nothing fancy)

Now I need to create a program that is able to track "items" and the changes to those "items" - so that means some sort of version control is needed here. And for this, we require a database.

Let's keep things simple and say that each item has 5 properties:

- An ID [primary key]
- A Color [Allowed to be NULL]
- A Shape [Allowed to be NULL]
- A Name [Allowed to be NULL]
- A Weight [Allowed to be NULL]

A group of people are ment to keep the items up-to-date and will regularly change any of the 4 properties (ID is not changeable). In some cases, not all properties are yet known, so each property, with the exception of ID, could be NULL. My database is connected to the database where the items are stored on, and when the users alter any number of properties and press submit, I can see this. I also have full access to all of the information regarding the old and new property values of the item. This basically means I can go any way I want with the data.

What I want to do is create a database that I can query to retrieve as much information as possible about changes of items and the status of an item. Some examples of things I'd like to know:

The property values of an item x at moment y
"The color of item 10 at 2007-12-23 14:33:13 was black"

The number of times people will change property x for item y / all items
"I have noticed that the color of item 10 has changed 45 times...Make up your mind!"
"It seems that we are having trouble defining the shape of an item: In total we changed the shape of all our items 6000 times... That's 90% of all the changes we made!"

The time interval between a change of item x
"It took us two days to figure out that item 4 should be square instead of circlular"

My question is:
What is the most efficient way of storing the information?

In my opinion it is a case of an overload of redundant information versus a database structure that is very hard to maintain/query. To give an idea about what amounts of data we are talking about:

We have about 20 databases (projects) containing an average of 10.000 items. Each item can have approx. 40-60 properties (columns). Each column has very limited data: no more then 50 characters average. We want to put all of the data in one database, since we want to query te data cross-project. It is possible to cut out some of the columns, but for now I want to go for the full monty

At first I was thinking about simply copying all of the data from an item into my own database, and give it a timestamp. This would make it easy to obtain the status of an item at time x. It would be harder though to see what the changes are, and also I'd be dealing with a lot of redundant information. (90% of the information of the item is left unchanged).

Example - changing the shape of item 1:
Code:
ID  |  DateTime                 |Color  |Shape  | Name  | Weight   |
1   |  2007-12-23 16:46:43      | Red   | sqr   | Pete  | 5 kg     |
1   |  2007-12-23 16:51:48      | Red   | circ  | Pete  | 5 kg     |
1   |  2007-12-23 16:53:43      | Red   | sqr   | Pete  | 5 kg     |
  • Advantages: easy query to retrieve the status of an item at time x.
  • Disadvantages: No idea how I can query for just the changes?

Another method was inserting only the changed properties, and leaving the other properties NULL. This would save up on the redundant information and also make it easier to count changes

Example - changing the shape of item 1:

Code:
ID  |  DateTime                 |Color   |Shape  | Name  | Weight   |
1   |  2007-12-23 16:46:43      | Red    | sqr   | Pete  | 5 kg     |
1   |  2007-12-23 16:51:48      | NULL   | circ  | NULL  | NULL     |
1   |  2007-12-23 16:53:43      | NULL   | sqr   | NULL  | NULL     |
  • Advantages: easy query to count the number of changes
  • Disadvantages: No idea how to query for the status of an item at time x

Last option I could think of was creating a seperate table for each property of an item, and enter data to the table only when changes occur

Code:
tbl_color
ID  |  DateTime                 |Color   |
1   |  2007-12-23 16:46:43      | Red    |

tbl_shape
ID  |  DateTime                 |Shape  | 
1   |  2007-12-23 16:46:43      | sqr   | 
1   |  2007-12-23 16:51:48      | circ  | 
1   |  2007-12-23 16:53:43      | sqr   | 

tbl_name
ID  |  DateTime                 |Name  | 
1   |  2007-12-23 16:46:43      | Pete  |

tbl_weight
ID  |  DateTime                 |Weight   |
1   |  2007-12-23 16:46:43      | 5 kg    |
  • Advantages: All queries are a piece of cake
  • Disadvantages: 60 tables?!??! And what if we add a new property? or Null values?

These were the only to methods of doing this which I could come up with, but I can imagine that there are more methods of storing the data, or smarter ways to retrieve the data. I also have no feeling about what would be the best way to work (although I guess option 1 should be the worst option)

It's a lot of information I'm throwing at you all, but I hope my problem and question is clear and that you can help me out...

Thanks in advance!
Reply With Quote
  #2 (permalink)  
Old 01-24-08, 19:28
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
that has to be one of the biggest first posts i've ever seen

nice job giving all the background

off the top of my head (without knowing your actual columns, or your actual frequency of changes) i like the first approach -- just (re)insert the changed information along with the timestamp

disk space is the cheapest commodity in this entire scenario

most apps care about performance speed, so complex queries are ruled out (as they tend to be slower)

most stakeholders care about time to delivery, so taking a couple months to program the complex queries is also ruled out

thus, the simplest queries win, and (although i could be overlooking some clever scheme here), this means just the one table, multiple rows for each id, the "current" one being the one with the highest timestamp for that id

disk space is cheap
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #3 (permalink)  
Old 01-25-08, 05:00
jf0rce jf0rce is offline
Registered User
 
Join Date: Jan 2008
Posts: 19
Thanks for the compliment: I like a good prep before actually asking anything.

I believe that you are right concerning the amount of data, but as I stated at the starting post: I have no idea how I can query for just the changes. Taking a look at the database structure, and adding a second example:

Code:
ID  |  DateTime                 |Color  |Shape  | Name  | Weight   |
1   |  2007-12-23 16:46:43      | Red   | sqr   | Pete  | 5 kg     |
1   |  2007-12-23 16:51:48      | Red   | circ  | Pete  | 5 kg     |
1   |  2007-12-23 16:53:43      | Red   | sqr   | Pete  | 5 kg     |
2   |  2007-12-23 16:46:43      | Red   | rect  | Pete  | 5 kg     |
2   |  2007-12-23 16:51:48      | Red   | rect  | Pete  | 6 kg     |
2   |  2007-12-23 16:53:43      | Red   | rect  | Hank  | 6 kg     |
Now I want to create a query with which I can trace all changes for each ID, so whatever my SQL query looks like, I want it to return the following:

Code:
ID  |  Color  |Shape  | Name  | Weight   |
1   |    0    |  2    |  0   |   0      |
2   |    0    |  0   |   1   |   1      |
That's the ideal situation. Now, when using the other method (inserting NULL in unchanged properties) would make this easy. Just use a COUNT(column_name) combined with a GROUP BY ID and substract 1 from the result to get just the number of changes. But with the method of dumping all the info into a row, I have no idea how I should go at it. I guess i'd be looking for an algorithm of some sort that is capable of comparing two rows and NULL-ing all the columns that hold the same value for both rows?

Ofcourse, I could go lame and just create both tables: one that contains all information in each row, one that only contains the changes. That way I can just choose which table fits the purpose better.

on the subject of data and performance....
That really depends on what I want to be logged, so I can play around with that quite a lot. When it comes to changes I guess that a safe average would be 15. So that totals to:


20 databases, 10.000 items, 15 changes per item, all stuffed in one table: that would give me 3.000.000 rows. I have to add that this is the absolute maximum number of rows we will be needing. Each column contains very little data: a date, a name, a term, a number - but not large amounts of text or attachments. As I said before: worst case, I'd have a column with a string containing 50 characters.

I have no idea if these are number that I should be scared of or laugh at, what is your view?
Reply With Quote
  #4 (permalink)  
Old 01-25-08, 05:42
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
3 mill is a medium-small sized table for an enterprise rdbms like SS. What flavour (2k or 2005 or 7 or 6.5....)?
Reply With Quote
  #5 (permalink)  
Old 01-25-08, 06:01
jf0rce jf0rce is offline
Registered User
 
Join Date: Jan 2008
Posts: 19
I am using MSSQL 8.00.760 if I am correct.
Reply With Quote
  #6 (permalink)  
Old 01-25-08, 06:16
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Heh - that's SQL Server 2000.

Shame - easier in 2k5. Unfortunately comparing one row to the "previous" row is one of the harder things to do in SQL, especially with natural composite keys. I don't have time to whip something up now. Maybe Rudy or another contributer will be around shortly....
Reply With Quote
  #7 (permalink)  
Old 01-25-08, 06:34
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
Code:
  FROM archive AS this
INNER
  JOIN archive AS prev
    ON prev.ID = this.ID
   AND prev.datetimecol =
       ( SELECT MAX(datetimecol)
           FROM archive
          WHERE ID = this.ID 
            AND datetimecol < this.datetimecol )
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #8 (permalink)  
Old 01-25-08, 06:35
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
by the way, 20 separate database (one per project) is questionable

i don't suppose you're willing to risk your job on no one ever, ever, ever coming back to you to ask for stats across projects...
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #9 (permalink)  
Old 01-25-08, 06:43
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
You haven't stated it explicitly but it is implicit that you have a composite pk of id and timestamp - is this correct? If it is clustered too then Rudy's query should fly along.
Reply With Quote
  #10 (permalink)  
Old 01-25-08, 07:06
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Timestamps
It's more efficient (ie faster queries) if you have 2 timestamps rather than 1. The end time for the latest record should always be set to some time far in the future (ie '1 jan 9999') and not just null.

With a single timestamp the server has to pull all the recs for id=1234 then find which rec has the latest timestamp and then pull the complete record for that timestamp. With 2 timestamps it just needs to pull the rec where id=1234 and the end timestamp='1 jan 9999'.

Number of changes
You can see how many changes there has been to the colour of an item by using the following SQL (I don't have a MSSQL server to hand but it works in Sybase).
Code:
select count(distinct colour) from MyTab where id=1234
The only problem with this is if the colour swaps back and forth between red and green then this will only count as one change. It's do-able but much more tricky to cope with this issue.

Efficiency
It's been mentioned but, if you store the whole record each time a single field changes, then you do have to be carefull if one field changes quite frequently. This means the whole record gets stored over and over again in the database ie if just one field changes 10 times a day for 1000 items then you're storing 3,650,000 new records each year.

EAV
I can't resist mentioning this approach, not least cause Rudy hates it, but also cause it can apply in these situations. You'd just have one main table called params:
Code:
create table Params(
   id             int,
   param       varchar(20),
   value        varchar(50),
   fromDate   datetime,
   toDate      datetime
)

insert Params values ( 1234, 'Colour', 'Red', getdate(), '1 jan 9999' )
It's slightly more complex than this as you'd need to end date the old "current" record if there is one. You'd want tables to hold valid id's, valid params (ie Colour) and valid values for params if applicable (Colour=Red, Green ...). You'd do best having a procedure to validate and add a param and a function to get values from the database :
Code:
add_param( 1234, 'Colour', 'Blue' )
select get_param( 1234,'Colour' )
It's advantages in this type of app is that you could add new params when ever you want without having to alter the database (ie add new fields or tables). You could add as many params as you want - what would happen in the current design if you add more than 255 fields to your table - does MSSQL cope with that many fields in a table these days (I know there used to be problems)? It's also more efficient on storage space when you're storing lot's of changes to records as only the change will be stored.

I'm now climbing down into my bunker and putting on my flame retardent suit.

Mike
Reply With Quote
  #11 (permalink)  
Old 01-25-08, 07:17
jf0rce jf0rce is offline
Registered User
 
Join Date: Jan 2008
Posts: 19
Wow, thanks for the quick response guys!

r937: thanks for the query, it seems like the right way to go... Just have to let it sink in for a sec. Regarding your remark on 20 seperate databases: I most likely lied there. problem is that the original database that contains the items I want to keep track of is structured in an unknown way. We on the outside see 20 projects that are not linkable to eachother. Just the term project would have been more appropriate in this context. The whole idea of this new database is to combine all of the data to obtain stats across projects - so I don't have to put my job on the line

Pootle flump: took a while to understand you ment primary key with pk (hey I'm still a bite new to the db slang), but you are sort of right. In the database I am planning to build I will have the following columns (and more):

UID, ITEM_ID, DateTimeStamp

the UID is the real primary key of the table. The ITEM_ID is the ID of the item (in the examples named ID) and DateTimeStamp is the DateTime stamp on which the change has occured. So:

No, ITEM_ID and DateTimeStamp cannot be pk because technically it is possible for two persons to change an item at exactly the same time (although very, VERY improbable)
Yes, because I will use the combination of those two to identify which version of the item I am talking about.

++edit
missed your answer mike, reading it now ...
Reply With Quote
  #12 (permalink)  
Old 01-25-08, 07:22
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
actually, it is not possible for two changes to happen at the exact same time -- the database will still perform them one after the other

mike, stop with the EAV already, okay? it isn't just me who thinks that's a bad, bad, bad idea
and by the way, if i change red to blue to green to blue to red to green to blue, that's 6 changes, but your COUNT(DISTINCT) returns only three
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #13 (permalink)  
Old 01-25-08, 07:27
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Rudy's query could have problems if there can be dupes for ID and timestamp.

This isn't a table I see benefitting from a surrogate key (UID).

Mike - there's a downside to your two timestamps - each property change requires that a row is inserted and then the row "previous" is updated, so what you gain on one hand you lose on another (i.e. selects could be quicker, updates\ inserts are slower). The OP also needs to code up some transaction management in that case too.

EAV - didn't see that one coming
Reply With Quote
  #14 (permalink)  
Old 01-25-08, 07:28
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Quote:
Originally Posted by r937
actually, it is not possible for two changes to happen at the exact same time -- the database will still perform them one after the other
I've had pk violations on a natural date column key inserting GETDATE()
Reply With Quote
  #15 (permalink)  
Old 01-25-08, 09:08
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by r937
mike, stop with the EAV already, okay?
Sorry, can't do that - I said before that EAV only suits a (very) small range of apps but I'm afraid this is one of those apps.
Quote:
Originally Posted by r937
and by the way, if i change red to blue to green to blue to red to green to blue, that's 6 changes, but your COUNT(DISTINCT) returns only three
Totaly agree with you here - I even said as much in my post. The EAV SQL to do the count correctly (ie return 6) would just be:
Code:
select    count(*)
from      Param
where    id = 1234
            and Param = 'Colour'
            and fromDate between @startTime and @endTime
this took me about a minute to write and is 5 lines long - my guess it would take someone at least an hour to write the same SQL to work with the proposed TNF table and I'll bet it'll be 10 times more complicated. As you said - the simplest queries win.
Quote:
Originally Posted by pootle flump
Mike - there's a downside to your two timestamps - each property change requires that a row is inserted and then the row "previous" is updated, so what you gain on one hand you lose on another (i.e. selects could be quicker, updates\ inserts are slower).
True - I guess it would double the time to insert a given record into the database - it even might go from 1/100 sec to 5/100 sec but I doubt the user would notice.

The time to select recs will reduce by roughly the average number of changes - so if you had 100 changes for each rec then it would take 100 reads (ignoring caching and blocks etc) to get the rec using one timestamp and just 1 read to get the rec using 2 timestamps. The user will definitely notice a performance increase here, especially if doing any reporting.
Mike
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On