Welcome to the dBforums forums.

You are currently viewing our boards as a guest which gives you limited access to view most discussions, articles and access our other FREE features. By joining our free community you will have access to post topics, communicate privately with other members (PM), respond to polls, upload your own photos and access many other special features. Registration is fast, simple and absolutely free so please, join our community today!

If you have any problems with the registration process or your account login, please contact contact support.

If you prefer not to see double-underlined words and corresponding ads, place your cursor
here for ContentLink opt out.

Go Back  dBforums > General > Database Concepts & Design > Potential Overhead of Mostly Empty Columns

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 06-03-04, 20:55
incomplete1931 incomplete1931 is offline
Registered User
 
Join Date: Jun 2004
Posts: 3
Potential Overhead of Mostly Empty Columns

I am fairly new to database design and the normalization process, so please forgive me if my question is somewhat trivial. I am keen to adhere to the 'standards of normalization', but I also want to make sure that I do not sacrifice database performance in the process.

My question concerns the potential overhead of having one or more mostly empty columns in a table. Is that okay? Or should these columns be separated out into a different table? Let me illustrate with an example.

Let's say I am creating a 'contacts' table with the following fields:

contact_id
first_name
middle_names
last_name
address_1
address_2
city
state
zip
...
etc.

Now, let us also assume that I will be populating my database with records from another database application. I know that only about 2% of these records have entries for the 'middle_names' field. I also know that the majority of people filling in a Web form to insert records into this 'contacts' table will leave the 'middle_names' field blank. This suggests that I will have a column in the table that will mostly consist of empty values.

In an ideal world, I would like every column in every row in my table to have a value, so I am not saving empty space. But obviously this cannot happen in the real world. So, should I be making another 'middle_names' table, like so?

middle_name_id
contact_id_fk
middle_name

I could use a join in my queries whenever I needed to return the middle name of a specific contact. Is this the 'structurally right' thing to do? Or is it okay to have a few columns in my table that are mostly empty? If it is, should my values be empty strings or NULLs? What are the advantages and disadvantages of using NULL entries in columns?

My question also applies to the address fields in this table. For example, some people require more than two lines for their address in addition to the 'city', 'state', and 'zip' fields. I am weary of creating an 'address_3' field when only a fraction of the rows in the table will have entries for this column. So, again, should I be splitting out the address fields into an 'address' table?

address_id
contact_id_fk
address_line
address_line_number

Perhaps these are trivial concerns but I want to get off on the right foot. I would appreciate any feedback and/or advice.

Thank you in advance for any help!
Reply With Quote
  #2 (permalink)  
Old 06-03-04, 22:10
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 13,554
in the specific instance of addresses, it is far, far better to have columns that can be NULL (note: NULL is not the same thing as empty)

separate tables for those infrequently filled fields is just asking for toruble, both in performance and in query complexity

disk space is cheap

and anyway NULL doesn't take up a whole lot of room
__________________
r937.com | rudy.ca

pre-order my book Simply SQL from Amazon
Reply With Quote
  #3 (permalink)  
Old 06-04-04, 05:34
incomplete1931 incomplete1931 is offline
Registered User
 
Join Date: Jun 2004
Posts: 3
Hey Rudy,

Thanks for the reply. So, you are saying that I should not worry about using sparsely filled columns. I realize that my queries could get very complicated if I separated out all those columns into distinct tables!

But at what point does one draw the line? At what point do you say, this information should be stored in a separate table? Does it depend on what type of information you are separating? You might say, for example, that 'middle_name' should not be separated into a different table because it is dependent on the 'contact_id'. I realize that the answer to this question might be subjective, but what do people use as a general rule of thumb when designing databases? When, if ever, should a mostly empty column be separated out into a different table?

I am eager to hear what people think!

Thank you in advance, once again, for any feeback.
Reply With Quote
  #4 (permalink)  
Old 06-04-04, 18:08
certus certus is offline
Registered User
 
Join Date: Dec 2003
Location: Canada
Posts: 669
Separating out data is reserved for the elimination of redundancy not for the absence of values. NULLs take care of absent values.

It is better to ask yourself this: Is my database for processing transactions or is it processing queries?

If you are processing transactions, normalize to get your performance gains.

If you are processing queries, denormalize to get your performance gains. When you're processing queries redundancy is king by having as few joins as possible. However, performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged. A dimensional database is being considered too inflexible.
__________________
visit: relationary
Reply With Quote
  #5 (permalink)  
Old 06-04-04, 18:52
incomplete1931 incomplete1931 is offline
Registered User
 
Join Date: Jun 2004
Posts: 3
Thank you for the response, Certus!

As I understand it, I should normalize if my database primarily processes transactions. I think my database will mostly, though not entirely, be used for processing queries so I will be weary of separating out my columns into too many tables. I realize that I should aim to keep the number of joins in my queries to a minimum. Your advice seems sound.

However, what do you mean by a "dimensional database"? Forgive my ignorance if this is a stupid question.

Also, you said that "performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged." I assume you are referring to Oracle and DB2 as instances of better RDBMSs? I am using MySQL for my project. I don't know if that qualifies as one of the better ones. Still, I will take your advice and try and keep the number of joins down even if that means having columns that are mainly empty.

Thank you, once again, for your help!
Reply With Quote
  #6 (permalink)  
Old 06-04-04, 21:15
certus certus is offline
Registered User
 
Join Date: Dec 2003
Location: Canada
Posts: 669
__________________
visit: relationary
Reply With Quote
  #7 (permalink)  
Old 06-04-04, 21:22
certus certus is offline
Registered User
 
Join Date: Dec 2003
Location: Canada
Posts: 669
Oh yes, Teradata, a major player in data warehousing is beginning to argue against dimensional data warehouses.

I don't have the white paper, but here is their site:

http://www.teradata.com/t/

Ah, found the white paper:

http://www.teradatalibrary.com/pdf/eb2406.pdf
__________________
visit: relationary

Last edited by certus : 06-04-04 at 21:29.
Reply With Quote
  #8 (permalink)  
Old 06-11-04, 13:07
Cipherlad Cipherlad is offline
Registered User
 
Join Date: Dec 2003
Location: Inland Empire
Posts: 15
Rules of Thumb

When I am designing a database, I go mostly by instinct. I realize this may not be helpful when one is looking for hard and fast rules for database design, but I beleive one has to have a good feel for what will be helpful to extract out into a new table, and what will not.

For instance, in the examples you gave, middle name should not be extracted out into a new table. For starters, it's only one field. When you want to normalize information, try to keep like information together. You will never need to ask for middle name without asking for the first and last name, right? So, keep it with them.

Addresses are entirely different. Depending on how you are using the database, is there any chance you will store a person's information without an address? If there is, you have an argument for extracting it into a new table. In many cases, also, you may be asking for address information independently of the peron's personal information - also, a reason to keep addresses in a different table. A third reason to do this is if your end application will potentially need to store more than one (maybe a history of) address for a person.

In general, keep the end use in mind, but make sure you are designing with enough flexibility in case your end needs change. Even during development, project specifications can radically shift, and if your DB won't comply with the needed changes, you will find yourself sinking into a quagmire very quickly.
Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On