Results 1 to 8 of 8
  1. #1
    Join Date
    Jun 2004
    Posts
    3

    Potential Overhead of Mostly Empty Columns

    I am fairly new to database design and the normalization process, so please forgive me if my question is somewhat trivial. I am keen to adhere to the 'standards of normalization', but I also want to make sure that I do not sacrifice database performance in the process.

    My question concerns the potential overhead of having one or more mostly empty columns in a table. Is that okay? Or should these columns be separated out into a different table? Let me illustrate with an example.

    Let's say I am creating a 'contacts' table with the following fields:

    contact_id
    first_name
    middle_names
    last_name
    address_1
    address_2
    city
    state
    zip
    ...
    etc.

    Now, let us also assume that I will be populating my database with records from another database application. I know that only about 2% of these records have entries for the 'middle_names' field. I also know that the majority of people filling in a Web form to insert records into this 'contacts' table will leave the 'middle_names' field blank. This suggests that I will have a column in the table that will mostly consist of empty values.

    In an ideal world, I would like every column in every row in my table to have a value, so I am not saving empty space. But obviously this cannot happen in the real world. So, should I be making another 'middle_names' table, like so?

    middle_name_id
    contact_id_fk
    middle_name

    I could use a join in my queries whenever I needed to return the middle name of a specific contact. Is this the 'structurally right' thing to do? Or is it okay to have a few columns in my table that are mostly empty? If it is, should my values be empty strings or NULLs? What are the advantages and disadvantages of using NULL entries in columns?

    My question also applies to the address fields in this table. For example, some people require more than two lines for their address in addition to the 'city', 'state', and 'zip' fields. I am weary of creating an 'address_3' field when only a fraction of the rows in the table will have entries for this column. So, again, should I be splitting out the address fields into an 'address' table?

    address_id
    contact_id_fk
    address_line
    address_line_number

    Perhaps these are trivial concerns but I want to get off on the right foot. I would appreciate any feedback and/or advice.

    Thank you in advance for any help!

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    in the specific instance of addresses, it is far, far better to have columns that can be NULL (note: NULL is not the same thing as empty)

    separate tables for those infrequently filled fields is just asking for toruble, both in performance and in query complexity

    disk space is cheap

    and anyway NULL doesn't take up a whole lot of room
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Jun 2004
    Posts
    3
    Hey Rudy,

    Thanks for the reply. So, you are saying that I should not worry about using sparsely filled columns. I realize that my queries could get very complicated if I separated out all those columns into distinct tables!

    But at what point does one draw the line? At what point do you say, this information should be stored in a separate table? Does it depend on what type of information you are separating? You might say, for example, that 'middle_name' should not be separated into a different table because it is dependent on the 'contact_id'. I realize that the answer to this question might be subjective, but what do people use as a general rule of thumb when designing databases? When, if ever, should a mostly empty column be separated out into a different table?

    I am eager to hear what people think!

    Thank you in advance, once again, for any feeback.

  4. #4
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    Separating out data is reserved for the elimination of redundancy not for the absence of values. NULLs take care of absent values.

    It is better to ask yourself this: Is my database for processing transactions or is it processing queries?

    If you are processing transactions, normalize to get your performance gains.

    If you are processing queries, denormalize to get your performance gains. When you're processing queries redundancy is king by having as few joins as possible. However, performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged. A dimensional database is being considered too inflexible.

  5. #5
    Join Date
    Jun 2004
    Posts
    3
    Thank you for the response, Certus!

    As I understand it, I should normalize if my database primarily processes transactions. I think my database will mostly, though not entirely, be used for processing queries so I will be weary of separating out my columns into too many tables. I realize that I should aim to keep the number of joins in my queries to a minimum. Your advice seems sound.

    However, what do you mean by a "dimensional database"? Forgive my ignorance if this is a stupid question.

    Also, you said that "performance in the better RDBMSs is getting to the point where denormalized databases are beginning to be discouraged." I assume you are referring to Oracle and DB2 as instances of better RDBMSs? I am using MySQL for my project. I don't know if that qualifies as one of the better ones. Still, I will take your advice and try and keep the number of joins down even if that means having columns that are mainly empty.

    Thank you, once again, for your help!

  6. #6
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710

  7. #7
    Join Date
    Dec 2003
    Location
    Canada
    Posts
    710
    Oh yes, Teradata, a major player in data warehousing is beginning to argue against dimensional data warehouses.

    I don't have the white paper, but here is their site:

    http://www.teradata.com/t/

    Ah, found the white paper:

    http://www.teradatalibrary.com/pdf/eb2406.pdf
    Last edited by certus; 06-04-04 at 21:29.

  8. #8
    Join Date
    Dec 2003
    Location
    Inland Empire
    Posts
    18

    Rules of Thumb

    When I am designing a database, I go mostly by instinct. I realize this may not be helpful when one is looking for hard and fast rules for database design, but I beleive one has to have a good feel for what will be helpful to extract out into a new table, and what will not.

    For instance, in the examples you gave, middle name should not be extracted out into a new table. For starters, it's only one field. When you want to normalize information, try to keep like information together. You will never need to ask for middle name without asking for the first and last name, right? So, keep it with them.

    Addresses are entirely different. Depending on how you are using the database, is there any chance you will store a person's information without an address? If there is, you have an argument for extracting it into a new table. In many cases, also, you may be asking for address information independently of the peron's personal information - also, a reason to keep addresses in a different table. A third reason to do this is if your end application will potentially need to store more than one (maybe a history of) address for a person.

    In general, keep the end use in mind, but make sure you are designing with enough flexibility in case your end needs change. Even during development, project specifications can radically shift, and if your DB won't comply with the needed changes, you will find yourself sinking into a quagmire very quickly.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •