Can anyone point me to some guidelines as to when you should use a pseudo-key like an autoincrementing number as the primary key of a table instead of a longer text value (which will also be unique).
My instinct is that use of the pseudo-key makes the database smaller (and perhaps faster) since less space is required for foreign key storage and less data must be compared for joins. It also allows changes to the associated text value without concerns about relationships, so I've gotten into a habit of doing this almost universally.
There is a plenty of discussion on this topic. you may want to search using a good search engine like google. But here is my two cents worth:
A primary key must be part of the data the table is representing. Good examples of a Primary key are: Emp_ID, Product_ID, SS_NO, tel_no, etc. They are naturally unique keys within the enterprise or the domain.
An auto-incrementing ID, though it satisfies the requirement of a primary key, is a wrongway to use it. Those keys do not have any meaning of their own. Worse yet, the value of such PK varies depending on the order of creation! When you export a table and import, you may not be able to use that key at all (it happens on Microsoft products!). Every table in your design must have a natural PK, which is part of the data itself. Period. No need for external intervention.
If you must have a auto-identity-key, it must be a surrogate key, with no bearing on the referential integrity.
Coming to the type of the key. It is tempting to think an INT/LONG is more efficient than a CHAR. Almost all commercial databases have mastered the art of storing/searching string-type index files. So performance should not be a concern. No real world data are Integers!! SS_Nos, Tel_nos, Zip codes, Emp_IDs, Product_Ids,.. none of them are pure numeric. Show me one real world key that is pure numeric. Probably dates, using the julian convention, can be expressed as longs. But you need to use a function to convert them to be of any use in the real world. So, in my opininon, you can never have numeric keys all the time.
You make some valid points, but I think David Cressey's March 25 post here titled "Surrogate keys" describes the real world a little better. There will be some cases where you cannot have a naturally unique key, sometimes caused by the errors of human behavior and sometimes unavoidably by the nature of the data. So how about a hybrid primary key, where you combine a mostly unique natural key with a meaning-free surrogate value.
Beware of the pitfalls of those "naturally unique" keys. Some of them tend to be not unique! E.g. the SS_NO you mentioned: The authorities succeed in assigning the SAME SS_NO to different individuals every now and then. And then you're really screwed if you've designed your database with the SS_NO as PK.
Also note that all of the examples quoted are surrogate keys. Emp_ID is used as a key because there's no natural key that uniquely identifies an individual person. Even phone numbers are just an arbitrarily assigned code.
They can be treated as natural keys because we trust the system that generates them to enforce their uniqueness. In the case of employee or product codes that's because we control the system. If you don't trust the system then you have three alternatives:
a) Use surrogate keys
b) Assume uniqueness but add some error-handling code
c) Assume uniqueness and hope you get another job before anyone finds
I believe that the discussion is ignoring the situation. After all, everything is relative :-)
The whole point of this discussion is what is fast enough and what is fastest. For instance, if you wanted to generate a char(32) key for every row in a table, as opposed to a numeric( 32 ) obvious differences occur when the table exceeds certain row limits, (i.e. index depth becomes greater than 3). If you wanted to use a table with 300 rows and decided that you really didn't want to do an additional surrogate key, this might make a lot of sense. But if you are talking about a million rows, or worse, hundreds of millions, or even worse, a higly scalable design with the capability of partitioning the data amongst multiple databases supporting hundreds of billions of rows, then (as most women will agree) SIZE DOES MATTER!
So when you are facing the impossible (something I specialize in every day) then you have to examine every facet of choosing a key. Ultimately, the volumetrics of the data are a driving factor in a decision about surrogate key usage. My advice is when you have the worst case, use an exact numeric datatype (they will compare faster than a comparably sized character datatype), they are typically 1/2 the size of a equally sized character datatype (numeric(38,0) occupies 17 bytes of storage while char(38) always uses 38 bytes.