| |
Welcome to the dBforums forums.
You are currently viewing our boards as a guest which gives you limited access to view most discussions, articles and access our other FREE features. By joining our free community you will have access to post topics, communicate privately with other members (PM), respond to polls, upload your own photos and access many other special features. Registration is fast, simple and absolutely free so please, join our community today!
If you have any problems with the registration process or your account login, please contact contact support.
If you prefer not to see double-underlined words and corresponding ads, place your cursor here for ContentLink opt out.
|
 |

06-03-04, 20:55
|
|
Registered User
|
|
Join Date: Jun 2004
Posts: 3
|
|
|
Potential Overhead of Mostly Empty Columns
|
I am fairly new to database design and the normalization process, so please forgive me if my question is somewhat trivial. I am keen to adhere to the 'standards of normalization', but I also want to make sure that I do not sacrifice database performance in the process.
My question concerns the potential overhead of having one or more mostly empty columns in a table. Is that okay? Or should these columns be separated out into a different table? Let me illustrate with an example.
Let's say I am creating a 'contacts' table with the following fields:
contact_id
first_name
middle_names
last_name
address_1
address_2
city
state
zip
...
etc.
Now, let us also assume that I will be populating my database with records from another database application. I know that only about 2% of these records have entries for the 'middle_names' field. I also know that the majority of people filling in a Web form to insert records into this 'contacts' table will leave the 'middle_names' field blank. This suggests that I will have a column in the table that will mostly consist of empty values.
In an ideal world, I would like every column in every row in my table to have a value, so I am not saving empty space. But obviously this cannot happen in the real world. So, should I be making another 'middle_names' table, like so?
middle_name_id
contact_id_fk
middle_name
I could use a join in my queries whenever I needed to return the middle name of a specific contact. Is this the 'structurally right' thing to do? Or is it okay to have a few columns in my table that are mostly empty? If it is, should my values be empty strings or NULLs? What are the advantages and disadvantages of using NULL entries in columns?
My question also applies to the address fields in this table. For example, some people require more than two lines for their address in addition to the 'city', 'state', and 'zip' fields. I am weary of creating an 'address_3' field when only a fraction of the rows in the table will have entries for this column. So, again, should I be splitting out the address fields into an 'address' table?
address_id
contact_id_fk
address_line
address_line_number
Perhaps these are trivial concerns but I want to get off on the right foot. I would appreciate any feedback and/or advice.
Thank you in advance for any help!
|
|

06-03-04, 22:10
|
|
SQL Consultant
|
|
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 13,554
|
|
in the specific instance of addresses, it is far, far better to have columns that can be NULL (note: NULL is not the same thing as empty)
separate tables for those infrequently filled fields is just asking for toruble, both in performance and in query complexity
disk space is cheap
and anyway NULL doesn't take up a whole lot of room
|
|

06-04-04, 05:34
|
|
Registered User
|
|
Join Date: Jun 2004
Posts: 3
|
|
|
Hey Rudy,
Thanks for the reply. So, you are saying that I should not worry about using sparsely filled columns. I realize that my queries could get very complicated if I separated out all those columns into distinct tables!
But at what point does one draw the line? At what point do you say, this information should be stored in a separate table? Does it depend on what type of information you are separating? You might say, for example, that 'middle_name' should not be separated into a different table because it is dependent on the 'contact_id'. I realize that the answer to this question might be subjective, but what do people use as a general rule of thumb when designing databases? When, if ever, should a mostly empty column be separated out into a different table?
I am eager to hear what people think!
Thank you in advance, once again, for any feeback.
|
|

06-04-04, 18:08
|
|
Registered User
|
|
Join Date: Dec 2003
Location: Canada
Posts: 669
|
|
Separating out data is reserved for the elimination of redundancy not for the absence of values. NULLs take care of absent values.
It is better to ask yourself this: Is my database for processing transactions or is it processing queries?
If you are processing transactions, normalize to get your performance gains.
If you are processing queries, denormalize to get your performance gains. When you're processing queries redundancy is king by having as few joins as possible. However, performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged. A dimensional database is being considered too inflexible.
|
|

06-04-04, 18:52
|
|
Registered User
|
|
Join Date: Jun 2004
Posts: 3
|
|
Thank you for the response, Certus!
As I understand it, I should normalize if my database primarily processes transactions. I think my database will mostly, though not entirely, be used for processing queries so I will be weary of separating out my columns into too many tables. I realize that I should aim to keep the number of joins in my queries to a minimum. Your advice seems sound.
However, what do you mean by a "dimensional database"? Forgive my ignorance if this is a stupid question.
Also, you said that "performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged." I assume you are referring to Oracle and DB2 as instances of better RDBMSs? I am using MySQL for my project. I don't know if that qualifies as one of the better ones. Still, I will take your advice and try and keep the number of joins down even if that means having columns that are mainly empty.
Thank you, once again, for your help!
|
|

06-04-04, 21:15
|
|
Registered User
|
|
Join Date: Dec 2003
Location: Canada
Posts: 669
|
|
|
|

06-04-04, 21:22
|
|
Registered User
|
|
Join Date: Dec 2003
Location: Canada
Posts: 669
|
|
|
Last edited by certus : 06-04-04 at 21:29.
|

06-11-04, 13:07
|
|
Registered User
|
|
Join Date: Dec 2003
Location: Inland Empire
Posts: 15
|
|
|
Rules of Thumb
When I am designing a database, I go mostly by instinct. I realize this may not be helpful when one is looking for hard and fast rules for database design, but I beleive one has to have a good feel for what will be helpful to extract out into a new table, and what will not.
For instance, in the examples you gave, middle name should not be extracted out into a new table. For starters, it's only one field. When you want to normalize information, try to keep like information together. You will never need to ask for middle name without asking for the first and last name, right? So, keep it with them.
Addresses are entirely different. Depending on how you are using the database, is there any chance you will store a person's information without an address? If there is, you have an argument for extracting it into a new table. In many cases, also, you may be asking for address information independently of the peron's personal information - also, a reason to keep addresses in a different table. A third reason to do this is if your end application will potentially need to store more than one (maybe a history of) address for a person.
In general, keep the end use in mind, but make sure you are designing with enough flexibility in case your end needs change. Even during development, project specifications can radically shift, and if your DB won't comply with the needed changes, you will find yourself sinking into a quagmire very quickly.
|
|
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Linear Mode
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|