Here's a question for those wiser in the database design than I...is there ever justification in bending the rules with regards to First Normal Form?
Here's the story that led me to wondering about this...
I'm designing PHP/MySQL application to help those who look after people with cancer.
The aim is to collect information about what they needed care for. The hope is to build up information over time to inform decisions to improve their care.
I used one of those "code generators" to build the a PHP form to enter the information required via listboxes & checkboxes. This is the (truncated) database it spat out:
So you can see... aside from having a primary key field inthe 1st field, fields 3-4 collects some other basic data.
From Field 6 onwards, we're collecting information on "Where the cancer was" (via the site_* fields). Specifically, at the "PHP front end", these fields are checkboxes. If there was a cancer in say the appendix, it would be stored as "1" in the database. If not, the field will be NULL.
You therefore can have multiple fields with "1" if you have multiple sites with cancer. If you have no sites at all with cancer, all site_* fields will have NULL in them.
The problem I can see in the database design is the site_* recurring fields - this is in voilation of The First Normal Form right? Thing is, if I try to separate this out into a separate "Site" table, it breaks the nice single entry form you get with the schema described above.
There's only so many bits to the human body. So whilst the site_* fields do recur, there won't be any more (or not much) additional site_* columns required in the database design. That being the case, whilst it isn't strictly in the first normal form, would this database still be acceptable in terms of efficiency of operation ie: search & retrieval? Or if I stick with this, will I be asking for trouble when MySQL trolls through thousands of those records querying people who had site_appendix = 1 and site_caecum IS NULL etc?
Looking forward to hearing your insight into this.
Thanks for the reply pootle flump, I appreciate you reading through the post!
By "Normalise to First Normal Form" I interpret that as meaning you think I should have a "Site" table and move all the site_* fields there? My problem with that is that I'm finding it a touch hard to reverse engineer the code to "import" that into the form and summary table (I'm still starting with regards to learning PHP & MySQL)..but that's a problem outside of the scope of this discussion.
What is in scope however is your valid point about - Imagine the query to "Show all the people with cancer". In that case, I would run the following:
site_colon_ascending_colon, site_colon_hepatic_flexure, site_colon_transverse_colon,
site_colon_splenic_flexure, site_colon_descending_colon, site_colon_sigmoid_colon,
site_colon_ascending_colon = 1 OR site_colon_hepatic_flexure = 1 OR
site_colon_transverse_colon = 1 OR site_colon_splenic_flexure = 1 OR
site_colon_descending_colon = 1 OR site_colon_sigmoid_colon = 1 OR
site_colon_rectosigmoid = 1 OR site_rectum = 1
I've tested it, it works..but would that end up choking MySQL server with thousands of records? Typically however, queries wouldn't run into that sort of length as most users will limit it to records within a certain period.
This isn't a performance issue.
You have to write that query. You also have to alter the schema of the table, and any queries, and any forms, whenever a new cancer site is added.
This is also a single query - there will be many more.
The problems of repeating groups are well documented. Database design and presentation are unrelated - your form should be thoroughly decoupled from the database, and certainly should never influence the database design.
I can't help with the PHP.
Just FYI, the way I do this in other FEs is simply a "sub form" (whatever that might be) which is a list of all the sites (drawn from data in a table, not the schema), allowing the user to tick the relevant sites. Adding a new site is trivial (just an entry in a table). Determining who has cancer is trivial. Establishing who has cancer in three or more sites is trivial. etc.
Ok..point taken. Stick to First Normal Form - is the problems with repeating columns isolated to:
1) Having to add a new column when new "repeating fields" are needed and the HDD storage space associated with that?
2) Having to recode the queries to allow for these new columns (though you still have some recoding to do in the case of the new field being a record in a second table instead of a column in the original first table)
Basically I want to know to problems associated with repeating groups vs the time it'll back me to retool the program (this also affects the csv output of the program so it goes deeper than just data entry & table presentation).
BTW..do you mean "Form Entries" in "..other FEs is simply a "sub form".."
The problem I can see in the database design is the site_* recurring fields - this is in voilation of The First Normal Form right?
Not really true. Any SQL table which does not permit nulls and has a key on it can be regarded as being in 1NF (ie it accurately represents some relation).
1NF doesn't prohibit any particular attributes from a table but there are plenty of other reasons why your proposed design probably isn't a good one. The more attributes you use to represent similar information in one table the greater the risk that you'll need redundant code in queries and applications to deal with those attributes. Redundancy of data and resulting anomalies may be a problem too.
Thanks for your input dportas. The more experience on the issue, the better.
Originally Posted by dportas
The more attributes you use to represent similar information in one table the greater the risk that you'll need redundant code in queries and applications to deal with those attributes. Redundancy of data and resulting anomalies may be a problem too.
That's what I was talking about when I wrote about "site_* recurring fields" in the DB design I quoted as being in violation of 1NF. HOWEVER.....
..thanks for the reference. Interesting article, especially the section "The ambiguity of Repeating Groups". Whilst I still hold to the idea that the "site_* recurring fields" is a 1NF violation as I believe you and pootle flump do..that section got me wondering. Anith's example where you have date attributes looking like repeating groups when actually, they were logically applying to different concepts, that's what got me.
In my case I wonder if you could argue: sure, logically the "site_appendix" & "site_caecum" fields are referring conceptually to "sites" for cancer and therefore should be placed in their own "sites" table. However, you could also argue that they are also in some ways independent of each other conceptually since the fields are there to record the presence (value=1) or abscence (value=NULL) of something in an area and you don't necessarily need to think of "sites" as being the thing that those fields share in common?
I think perhaps that the above is just extending Anith's example too far? I think I still hold that I need to have a separate "sites" table.
Thing is though, each record in the admission_details table does need to detail if there was (value=1) or wasn't (value=NULL) a cancer for each & every site. Therefore, if I were to have a "sites" table as well as a "admission_details" table, I will need a "site_positive" table to link the 1st 2 in terms of recording if cancer was present at each & every site?
That is to say that the table "sites" looks like this...
And the "site_positive" table (marks off if a site was noted to have a cancer in a particular admission) will look like this:
idadmission_details (foreign key from "admission_details" table)
idsites (foreign key from "sites" table)
The abscence of any records with idadmission_details=1 in the "site_positive" table indicates that no cancers were present for the admission identified as idadmission_details=1. Similarly, a record in the "site_positive" table with idadmission_details=2 & idsites=1 means that for the admission identified as idadmission_details=2, a cancer in idsite=1 was present.
Whilst I still hold to the idea that the "site_* recurring fields" is a 1NF violation as I believe you and pootle flump do..
My point (and Aniths) is that a set of columns is NOT a repeating group. A repeating group means more than one value occuring in a single attribute position, which is impossible in SQL. Columns in a table therefore do not violate 1NF just because they have similar content or names.
1NF is violated only if the table permits nulls or duplicate rows.
My point (and Aniths) is that a set of columns is NOT a repeating group. A repeating group means more than one value occuring in a single attribute position, which is impossible in SQL.
Right. Your last sentence is the important once since I couldn't see how you could actually have more than one value in a single attribute position - it's impossible No problems there then...
Originally Posted by dportas
1NF is violated only if the table permits nulls or duplicate rows.
I was still going to interpret a NULL value as informative ie: "not present" as opposed to 1 which is "present". Still bad form? If so, what do you think of the 3 table design I propose in my last post?
I was still going to interpret a NULL value as informative ie: "not present" as opposed to 1 which is "present
I don't think that's a good idea at all. Why not represent the information with a value instead of a null? Nulls aren't values. Nulls are generally much more complex to deal with, which is why their use should preferably be minimised or avoided altogether.
Your second, three-table design seems much more sensible to me.
technically, the prod_categories value in this example is a single value, so it does not violate 1NF
in practical terms, this design is b0rked
Likely you are right: with the purpose that I expect you intended for the prod_categories column that's probably not a good design. As I already said, the fact that a table is in 1NF doesn't necessarily mean the design is "good enough" nor does it mean you should abandon common sense in your design.
It's important to be clear about what 1NF means and does not mean. 1NF sets out a most basic underlying principle for relational database design. There's nothing slavish about applying it. In a relational system we wouldn't even have to worry about it but in a SQL system it's important and useful in order to avoid certain serious pitfalls. That's why it doesn't do any good to confuse the issue with irrelevant diversions.
In any case your example isn't much of an explanation. Can you explain concisely just what rule you are applying that makes it undesirable to have the table in your example and others like it? I suspect you'll find a reliable definition very difficult to come up with or explain - genuinely arcane in fact. It may well turn out that you have a useful general principle in mind, but it has nothing to do with Normalization.