I am fairly new to database design and the normalization process, so please forgive me if my question is somewhat trivial. I am keen to adhere to the 'standards of normalization', but I also want to make sure that I do not sacrifice database performance in the process.
My question concerns the potential overhead of having one or more mostly empty columns in a table. Is that okay? Or should these columns be separated out into a different table? Let me illustrate with an example.
Let's say I am creating a 'contacts' table with the following fields:
Now, let us also assume that I will be populating my database with records from another database application. I know that only about 2% of these records have entries for the 'middle_names' field. I also know that the majority of people filling in a Web form to insert records into this 'contacts' table will leave the 'middle_names' field blank. This suggests that I will have a column in the table that will mostly consist of empty values.
In an ideal world, I would like every column in every row in my table to have a value, so I am not saving empty space. But obviously this cannot happen in the real world. So, should I be making another 'middle_names' table, like so?
I could use a join in my queries whenever I needed to return the middle name of a specific contact. Is this the 'structurally right' thing to do? Or is it okay to have a few columns in my table that are mostly empty? If it is, should my values be empty strings or NULLs? What are the advantages and disadvantages of using NULL entries in columns?
My question also applies to the address fields in this table. For example, some people require more than two lines for their address in addition to the 'city', 'state', and 'zip' fields. I am weary of creating an 'address_3' field when only a fraction of the rows in the table will have entries for this column. So, again, should I be splitting out the address fields into an 'address' table?
Thanks for the reply. So, you are saying that I should not worry about using sparsely filled columns. I realize that my queries could get very complicated if I separated out all those columns into distinct tables!
But at what point does one draw the line? At what point do you say, this information should be stored in a separate table? Does it depend on what type of information you are separating? You might say, for example, that 'middle_name' should not be separated into a different table because it is dependent on the 'contact_id'. I realize that the answer to this question might be subjective, but what do people use as a general rule of thumb when designing databases? When, if ever, should a mostly empty column be separated out into a different table?
I am eager to hear what people think!
Thank you in advance, once again, for any feeback.
Separating out data is reserved for the elimination of redundancy not for the absence of values. NULLs take care of absent values.
It is better to ask yourself this: Is my database for processing transactions or is it processing queries?
If you are processing transactions, normalize to get your performance gains.
If you are processing queries, denormalize to get your performance gains. When you're processing queries redundancy is king by having as few joins as possible. However, performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged. A dimensional database is being considered too inflexible.
As I understand it, I should normalize if my database primarily processes transactions. I think my database will mostly, though not entirely, be used for processing queries so I will be weary of separating out my columns into too many tables. I realize that I should aim to keep the number of joins in my queries to a minimum. Your advice seems sound.
However, what do you mean by a "dimensional database"? Forgive my ignorance if this is a stupid question.
Also, you said that "performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged." I assume you are referring to Oracle and DB2 as instances of better RDBMSs? I am using MySQL for my project. I don't know if that qualifies as one of the better ones. Still, I will take your advice and try and keep the number of joins down even if that means having columns that are mainly empty.
When I am designing a database, I go mostly by instinct. I realize this may not be helpful when one is looking for hard and fast rules for database design, but I beleive one has to have a good feel for what will be helpful to extract out into a new table, and what will not.
For instance, in the examples you gave, middle name should not be extracted out into a new table. For starters, it's only one field. When you want to normalize information, try to keep like information together. You will never need to ask for middle name without asking for the first and last name, right? So, keep it with them.
Addresses are entirely different. Depending on how you are using the database, is there any chance you will store a person's information without an address? If there is, you have an argument for extracting it into a new table. In many cases, also, you may be asking for address information independently of the peron's personal information - also, a reason to keep addresses in a different table. A third reason to do this is if your end application will potentially need to store more than one (maybe a history of) address for a person.
In general, keep the end use in mind, but make sure you are designing with enough flexibility in case your end needs change. Even during development, project specifications can radically shift, and if your DB won't comply with the needed changes, you will find yourself sinking into a quagmire very quickly.