Page 1 of 5 123 ... LastLast
Results 1 to 15 of 66
  1. #1
    Join Date
    Sep 2001
    Location
    Chicago, Illinois, USA
    Posts
    601

    Unanswered: Basic database design theory question . . .

    I come from a more traditional, flat file mindset such that, when I am designing tables, I almost always construct a primary key from an existing column or columns, resulting in a unique key for each record--and that unique key, therefore, has "meaning." The resulting data is usually physically-ordered closer to what I will be looking-for in actual usage, but I am having second-thoughts about this methodology.

    The alternative is, of course, using an identity as the primary key on every table. I've never been a fan of this because I've felt that the physical ordering is more important than the cleanliness of the identity design. I also like the fact that my primary key has "meaning"--it is not just some non-descript number. And finally, for maintenance purposes, if I want to pull down a single table from a customer database and graft it into my development system, there is always a danger that any foreign keys related to that table might be out of sync.

    Can you guys point me to some literature that discusses the advantages and disadvantages of both design methodologies?

    I'm thinking that my methods are old and outdated and I'd like to improve my designs, if it makes sense.

    Thanks. Ken
    Ken

    Maverick Software Design

    (847) 864-3600 x2

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    using natural keys is not "old and outdated"

    using identity keys is not an improvement

    physical arrangement of rows matters not, because sequence is guaranteed only when you use ORDER BY

    go find some other dragon to slay

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Sep 2001
    Location
    Chicago, Illinois, USA
    Posts
    601
    Thanks for making me feel that my ideas are "not old and outdated."

    Concerning your statement "physical arrangement of rows matters not, because sequence is guaranteed only when you use ORDER BY,"--this statement appears to focus on the ordering of the result set. I am less concerned about that than being concerned that the process of selecting the data performs much faster with an identity key than with a natural key.

    In other words, if I have a natural key that is the catenation of, say, six fields, nearing 25 total characters, every index associated with that table carries-along that 25-character overhead (I'm assuming). Even with an index, scanning through a list of 25-characters (multiple bytes per character???), versus scanning through a list of a integers--I assume that the integer scan is much more economical, and performs better, than the character scan.

    Therefore, when you are dealing with large tables, with large natural keys, is there a discernable performace difference between the naturally-keyed table and one with an identity primary key?

    I am reassured by your answer and would prefer to exist in the natural key world, but if there is a different way of doing things that results in a general performance improvement, I'm not afraid to change, even at my elevated age.
    Ken

    Maverick Software Design

    (847) 864-3600 x2

  4. #4
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    Provided Answers: 1
    You'll find raging debates on natural vs. surrogate keys all over the place, you just have to search specifically for those terms.

    You already hit on one of the big arguments. Another argument for surrogate keys where the natural key would have been unwieldy is sheer convenience/ease of use. It's easier to pass around a single integer than to maintain 4-6 fields of various datatypes.

    I'm a fan of both approaches for different reasons. I tend to err on the side of surrogates because I'm lazy.

    Anyway, this is one of those "holy war" debates that you should have no problems digging up...
    oh yeah... documentation... I have heard of that.

    *** What Do You Want In The MS Access Forum? ***

  5. #5
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Rudy's ideas are old and outdated.
    (Though as a Canadian, he remains huggable and cuddly.)
    There are a lot of reasons to favor surrogate keys over natural keys (You should still define natural keys as unique constrainst on your tables, though).
    You pegged one of the biggest reasons when you pointed this out:
    Quote Originally Posted by PracticalProgram View Post
    if I have a natural key that is the catenation of, say, six fields, nearing 25 total characters, every index associated with that table carries-along that 25-character overhead
    Additionally, your code will need to include all these compound key joins every time you link the table, plus you'll find that in certain schemas you simply cannot enforce cascading updates using compound natural keys.

    Surrogate keys provide consistency and efficiency.

    Now excuse me for five minutes while I go don my asbestos suit in preparation for the upcoming flame war.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  6. #6
    Join Date
    Sep 2001
    Location
    Chicago, Illinois, USA
    Posts
    601
    Apologies for possibly opening-up a sore subject.

    You've all given me some good insight though.

    And I have been able to search on "natural vs. surrogate" and there is quite a lot out there.

    Thanks.

    Ken
    Ken

    Maverick Software Design

    (847) 864-3600 x2

  7. #7
    Join Date
    Sep 2001
    Location
    Chicago, Illinois, USA
    Posts
    601
    I thought I was going to bow-out, but I decided to expand on one of my points . . . sorry . . .

    In reading about natural versus surrogate keys, I understand that there are some very fundamental advantages to using surrogate keys, which is why I am attracted to them. I would prefer to incorporate efficiency into my systems, early-on, and at every step along the way, so that I am never having to come back and optimize the system. Good habits and best practices, implemented on a daily basis, makes for good programs. So I like the idea of surrogate keys.

    However, I know from my experience, that for software maintenance purposes, I cannot rely on surrogate keys. I frequently deal with issues where I want to grab data from a customer site and place it in my local system to reproduce a problem.

    When you have a proliferation of surrogate keys, if you copy an individual table from one system to the next, you have to insure that every surrogate in the data you are copying is "satisfied", which is frequently very difficult to do (virtually impossible).

    So you end up having to copy over the entire database, which means that you have to have the time and the bandwidth and the capacity to accomplish that. If my customer has a 50GB database, there is no way that I am going to be able to copy that entire database over to my location in a timely fashion. At best, we are talking about a FedEx day away. And I have to have the time to deal with reloading that database on my system, and making sure everything is valid.

    Using natural keys most everywhere, I don't have to worry about any of that. Yeah, my system is not operating at optimum, but I can always work on issues on a moment's notice--copy over a small portion of data, fix the problem, and have the customer up and running quickly.

    Again, I really like the idea of surrogate primary keys. The situation I have described above is my greatest obstacle to switching-over to that methodology.
    Ken

    Maverick Software Design

    (847) 864-3600 x2

  8. #8
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I am of course the "digital pervert" that goes both ways... I define and use both a natural key (if one exists) and always a surrogate key for every table.

    My rationale is that the Natural Key must be declared (and protected by DRI) when it exists. This is the key that makes sense to the "real world" and it is logically the cleanest key available. Unfortunately, it isn't always convenient to use a Natural Key while programming!

    I see the surrogate key as a "programming artifact" that makes it easier for me to create computerized systems to solve or automate the real world problems. In my way of thinking the surrogate key needs to be used by the programs, but never by the users of those programs. My code expectes for table "widget" that there will be a "widgetId" column that uniquely identifies every row in the table. When two pieces of code need to converse about a widget, they do so by passing the widgetId instead of the whole fam damily. My code never explicitly exposes the widgetId, and goes to some length to never expose it indirectly either if I can avoid it.

    The battle of Surrogate Key versus Natural Key has gone on about three minutes less than the distinction between them has existed. The faithful in both the Surrogate and the Natural Key camps will engage in data jihad without notice and without end. It is great fun to watch when this erupts in an otherwise peaceful environment, but I recommend having seats well back from the fray... The participants often get overly enthused and it is easy for innocent bystanders like me to get drug into the mayhem!

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  9. #9
    Join Date
    Mar 2007
    Location
    Holmestrand, Norway
    Posts
    332
    Personally, I tend to be a fan of surrogate keys, and I'll use a few seconds to explain why:
    * A surrogate key has no business value, so all fields except the surrogate key can be altered without having to care about updating foreign keys as well. In quite a few countries for instance, SSN is information bearing data, so if that information changes (like gender) you'll have to change SSN as well.
    * A surrogate key is typically 4 bytes (int). In the case that you on average have one reference to your primary key, you don't waste any space using surrogate keys. Norwegian SSN for instance is 11 chars, so with an average on 1 foreign key reference, we have 3 bytes saved. Saving space gives smaller databases and often better utilization of memory and I/O.
    * It is way faster to compare (and thus sort) integers (binary) than text (which natural keys tends to be).
    * Identity ints are good candiates for clustered indexes, whereas natural keys tends not to be (not ever incrasing nor small).

    So, there are quite a few advantages with ints. Of course there are drawbacks as well, the most common I come across is related to foreign keys, where a natural key in some cases may be the only value that I would like to retrieve with a join. With a natural primary key and a foreign key to this primary key my query would have required one less join. If you require more columns than present in the natural key, you would not gain anything though, it would most likely perform worse than a surrogate key.
    Ole Kristian Velstadbråten Bangås - Virinco - MSSQL.no - Facebook - Twitter

  10. #10
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by blindman View Post
    (You should still define natural keys as unique constrainst on your tables, though).
    this is most important!!!

    if you declare a surrogate key, and then, like many noobs, overlook declaring UNIQUE constraint(s) on your natural keys, havoc will ensue

    so basically, surrogate keys are extra data

    then whether to have a surrogate key simply devolves to a question of how many of your queries will be retrieving rows based on the surrogate key alone, which is the only justification for its existence -- this consideration, of course, includes joins

    joining a child row to a parent row is simpler, and usually more efficient, if the parent key is a surrogate, as then the child foreign key will be small (and the sql will be cleaner)

    note that the oft-repeated "advantage" of being able to change a (part of a) natural key without affecting table relationships is a red herring -- if you have a multi-part natural key, but the child table relies only on a surrogate foreign key, and you then update (part of) the natural key, you will often mess up because the child rows are not longer logically related, although the surrogate foreign key still thinks they are... danger, will robinson!!

    and all the arguments about saving space are likewise feeble... they might have made sense years ago when disk space was a thousand buck per megabyte, but nowadays you can get a terabyte for a hunnert bucks, so in my mind, saving space, especially if it involves needless extra processing, is a non-starter

    finally, roac made a point which also needs emphasizing -- surrogate keys often force additional joins -- you need to make the join in order to retrive the related row's natural key, whereas if the foreign key had been a natural key in the first place, the additional join is not necessary

    so surrogate keys definitely have their downside

    still, i uses dem too
    Last edited by r937; 06-29-11 at 18:37.
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  11. #11
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Quote Originally Posted by r937 View Post
    note that the oft-repeated "advantage" of being able to change a (part of a) natural key without affecting table relationships is a red herring
    Agree that this is a non-issue in most practical situations, but then so is this:
    Quote Originally Posted by r937 View Post
    if you have a multi-part natural key, but the child table relies only on a surrogate foreign key, and you then update (part of) the natural key, you will often mess up because the child rows are not longer logically related, although the surrogate foreign key still thinks they are
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  12. #12
    Join Date
    Mar 2007
    Location
    Holmestrand, Norway
    Posts
    332
    r937: You are definitely right that disk space is not nearly as expensive as before. It is not the cost of the space I am worried about, but rather the IO cost. With larger rows, you have less rows in each page. Thus, when you are reading quite a few rows, you'll often end up reading more pages, which (if not cached) includes more IO from the disk system. In nearly all the cases I've seen SQL Servers with performance issues, it is due to lack of IO and/or memory.
    Ole Kristian Velstadbråten Bangås - Virinco - MSSQL.no - Facebook - Twitter

  13. #13
    Join Date
    Dec 2007
    Location
    London, UK
    Posts
    741
    if I have a natural key that is the catenation of, say, six fields, nearing 25 total characters, every index associated with that table carries-along that 25-character overhead (I'm assuming).
    No, you'd be wrong to assume that. I think you are referring here to a clustered index. The natural key doesn't have to be part of a clustered index and a clustered index doesn't have to be on a key (natural or otherwise). So this is not an argument for or against natural keys.

  14. #14
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Then what would you make the clustered index, if not your unique natural key?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  15. #15
    Join Date
    Sep 2001
    Location
    Chicago, Illinois, USA
    Posts
    601
    dportas, if you go back to my original posting, I am talking about Primary Keys, which are unique and clustered. I am assuming, and please correct me if I am incorrect in assuming this point, that Primary Keys, if defined, are used as the basis of all other indexes. Is that correct, or am I wrong on that?
    Ken

    Maverick Software Design

    (847) 864-3600 x2

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •