I am working with a sharp but relatively inexperienced programmer on a PHP/MySQL database project. I'm no database expert but have some knowledge of database design and structure, although much of my knowledge is probably 20 years out of date.
The project we're working on is content management and display of images and text on a website. Integrity and accuracy of the data are crucial. A bunch of tables have to be accessed to display the data on a given page, particularly when the user has requested a multi-tiered sort.
I was taught (20 years ago) that data should be fully normalized, i.e., the same piece of data (a birth date or legal name, for example) should not be stored in multiple tables. That eliminates the possibility of synchronization problems (data being updated in one table but not in another, due to a scripting error, a failed write, or whatever).
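For concreteness, here's a minimal sketch of what that normalized design looks like. All table and column names are made up for illustration, and SQLite (via Python) stands in for MySQL so the example is self-contained: the legal name lives in exactly one table, and other tables reference it by key.

```python
import sqlite3

# Hypothetical schema: the member's name and birth date are stored once,
# in `members`; `images` only holds a foreign key back to that row.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE members (
    member_id  INTEGER PRIMARY KEY,
    legal_name TEXT NOT NULL,
    birth_date TEXT NOT NULL)""")
conn.execute("""CREATE TABLE images (
    image_id  INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL REFERENCES members(member_id),
    caption   TEXT)""")

conn.execute("INSERT INTO members VALUES (1, 'Jane Doe', '1980-04-02')")
conn.execute("INSERT INTO images VALUES (10, 1, 'portrait')")

# One UPDATE fixes the name everywhere it is displayed.
conn.execute("UPDATE members SET legal_name = 'Jane Smith' WHERE member_id = 1")
row = conn.execute("""SELECT m.legal_name, i.caption
                      FROM images i JOIN members m USING (member_id)""").fetchone()
print(row)  # ('Jane Smith', 'portrait')
```

Because the name exists in only one place, there is nothing to fall out of sync.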
The programmer is arguing that most of the arguments for normalization are dated (storage and processing speed is no longer a big issue) and that data integrity is not an issue either.
He says the application is easier to code, and the data is obviously faster to retrieve, if he can copy data from the master record into the other records. I can see that it would be less load on the server to open one table, retrieve the data, and output it than to open 5 or 10 tables to assemble the data to be output.
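It's worth noting that "opening 5 or 10 tables" is still a single query to the server, not ten round trips. A sketch of what that page read looks like (hypothetical table names; SQLite via Python stands in for MySQL so it's runnable as-is):

```python
import sqlite3

# The multi-table page read, including the multi-tiered sort, is one
# SQL statement; the server resolves the joins internally.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members   (member_id INTEGER PRIMARY KEY, legal_name TEXT);
CREATE TABLE categories(category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE images    (image_id INTEGER PRIMARY KEY,
                        member_id INTEGER, category_id INTEGER, caption TEXT);
INSERT INTO members    VALUES (1, 'Jane Doe');
INSERT INTO categories VALUES (7, 'landscape');
INSERT INTO images     VALUES (10, 1, 7, 'sunset');
""")
page = conn.execute("""
    SELECT m.legal_name, c.name, i.caption
    FROM images i
    JOIN members    m ON m.member_id   = i.member_id
    JOIN categories c ON c.category_id = i.category_id
    ORDER BY c.name, m.legal_name   -- the multi-tiered sort
""").fetchall()
print(page)  # [('Jane Doe', 'landscape', 'sunset')]
```

With indexes on the join columns, a query like this over a few thousand users is exactly the workload a relational database is built for.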
Speed and processor load are, I think, relatively minor issues. The server is a fast dedicated box and will have maybe a couple thousand users at a time, absolute tops. We are currently running a fully normalized database with several hundred simultaneous users and are not having any speed issues.
My issue is in terms of proper database design. From a data integrity perspective, is it still standard procedure to fully normalize critical data, reducing the likelihood of coding errors that could cause sync problems with the same data in multiple tables? Are a lot of programmers now duplicating data between tables to speed access?
One clarification: he wasn't saying that data integrity was unimportant; his claim was that there isn't any decrease in data integrity from writing the same info (and potentially having to update said info) in multiple tables.
While I can agree with that in theory, I also know that writing the same data in multiple places introduces more potential points of failure and sync problems, and more likelihood of scripting bugs, since every place the data resides has to be updated, particularly given that multiple scripts will likely be updating the tables.
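The failure mode described above is easy to demonstrate. In this sketch (hypothetical denormalized schema, SQLite via Python standing in for MySQL), `legal_name` has been copied into `images`, and one of the update scripts forgets to touch the copy:

```python
import sqlite3

# Denormalized design: `legal_name` is duplicated into `images`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (member_id INTEGER PRIMARY KEY, legal_name TEXT);
CREATE TABLE images  (image_id INTEGER PRIMARY KEY,
                      member_id INTEGER, legal_name TEXT);  -- duplicated column
INSERT INTO members VALUES (1, 'Jane Doe');
INSERT INTO images  VALUES (10, 1, 'Jane Doe');
""")

# One script updates the master record but forgets the copy in `images`...
conn.execute("UPDATE members SET legal_name = 'Jane Smith' WHERE member_id = 1")

# ...and the two tables now silently disagree.
master = conn.execute("SELECT legal_name FROM members").fetchone()[0]
copy   = conn.execute("SELECT legal_name FROM images").fetchone()[0]
print(master, copy)  # Jane Smith Jane Doe
```

Nothing errors out; the data is simply wrong from then on, which is exactly why this counts as a loss of integrity rather than a coding inconvenience.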
I was looking to see if my viewpoint on that issue is "outdated" since I haven't really looked at database theory since the course I took 20+ years ago.
theoretically, you can obtain the number of cows in a herd by counting their legs and dividing by four, but i wouldn't do it that way
sounds like you're not going to win any arguments with this guy -- of course he's right: if you have redundant data all over the place but always update every bit of it carefully, there won't be any problem
It is possible to keep information in many places updated. It is possible to never miss a single update in every script you write. Given enough time and sufficient thrust, you can make a pig fly, but that is still the exception rather than the rule.
I insist that databases be normalized first, and only denormalized under a few very unusual circumstances. The programming staff still rails at this; they'd rather have things "easier", with columns duplicated as they see fit and placed wherever they need fewer joins. The business analysts struggled at first, but now that they've seen the difference between properly normalized and non-normalized designs, and have come to realize that there are a lot fewer "fire drills" with the normalized designs, the business side is actually insisting on normalization even when it increases the up-front coding costs!
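When one of those unusual circumstances does justify a duplicated column, the sane approach is to make the database, not every script, responsible for keeping the copy in sync. A sketch of that idea using a trigger (hypothetical names; SQLite via Python standing in for MySQL, which has its own trigger syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (member_id INTEGER PRIMARY KEY, legal_name TEXT);
CREATE TABLE images  (image_id INTEGER PRIMARY KEY,
                      member_id INTEGER, legal_name TEXT);  -- deliberate copy
-- The trigger propagates the duplicated column, so no application
-- script can forget to do it.
CREATE TRIGGER sync_name AFTER UPDATE OF legal_name ON members
BEGIN
    UPDATE images SET legal_name = NEW.legal_name
    WHERE member_id = NEW.member_id;
END;
INSERT INTO members VALUES (1, 'Jane Doe');
INSERT INTO images  VALUES (10, 1, 'Jane Doe');
""")
conn.execute("UPDATE members SET legal_name = 'Jane Smith' WHERE member_id = 1")
name = conn.execute("SELECT legal_name FROM images").fetchone()[0]
print(name)  # Jane Smith
```

Even then, the duplication should be documented and deliberate, not scattered wherever a join seemed inconvenient.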
Anyone with a modicum of database experience knows that data integrity is of utmost importance, and that normalization is the most effective means of maintaining it.
Since your colleague apparently does not know this, he obviously lacks even moderate database experience.
Since he lacks database experience, he has no business dictating how the database will be designed.
If it's not practically useful, then it's practically useless.