# Thread: Relating two tables using a PK and a UK

1. Registered User
Join Date
Apr 2002
Posts
75

## Unanswered: Relating two tables using a PK and a UK

Hi all!

Is it possible to relate two tables (in the same db) using the PK (int) from one table to a UK (int, allownulls=No) from another?

I've tried, but had no success. Both tables are empty, so there is no way of having a null in the UK column that would prevent relating it to a PK. Just wanted to know if such relation is even possible before spending additional time on figuring out how to do it...

Thanks a lot!

2. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
This sounds like an assignment, so I need to hedge my answer a bit.

From a RA (Relational Algebra) perspective no, this is a transgression.

From a practical perspective, usually. It depends on what you expect to achieve.

Please explain the real-world problem that you're trying to solve. Once we understand that, we can give you a better answer (and practical advice).

-PatP

3. Registered User
Join Date
Apr 2002
Posts
75

I have a compound PK in one table; so, instead of bringing the several columns that make up the PK over to the other table, I would like to create a unique key column (identity) and use it for relating that table to other tables. In one particular case, that identity column needs to be part of a compound PK.
Please refer to the attached dbforums.PNG for the diagram.
(All id fields are int, with allownulls=no)

Thanks!

4. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Is there any reason that you couldn't make your Unique Key into a Surrogate Key? This is basically just a unique column that doesn't allow NULL values, which you don't appear to allow anyway. Then the answer becomes an unequivocal "yes" from any perspective.

-PatP

5. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Hold on a second... The traditional definition for a UK allows NULL values, but your definition does not. I either misread or misunderstood what you meant in your first post.

Yes, you already have a SK (Surrogate Key), so you're fine!

-PatP

6. Registered User
Join Date
Apr 2002
Posts
75
Thanks Pat!
Got it fixed: the only thing that was preventing me from creating the relationship was that, besides setting the field to identity and allownulls=no, I needed to run ALTER TABLE name ADD UNIQUE (key)...
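
For anyone else who hits this: here is a minimal sketch of the whole pattern, using SQLite through Python's sqlite3 standard module (table and column names are invented for illustration), where the CREATE UNIQUE INDEX plays the role of the ALTER TABLE ... ADD UNIQUE step:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Parent table: a compound PK plus a single surrogate column.
conn.execute("""
    CREATE TABLE parent (
        part_a       INTEGER NOT NULL,
        part_b       INTEGER NOT NULL,
        surrogate_id INTEGER NOT NULL,
        PRIMARY KEY (part_a, part_b)
    )""")
# Equivalent of ALTER TABLE parent ADD UNIQUE (surrogate_id):
conn.execute("CREATE UNIQUE INDEX ux_parent_surrogate ON parent (surrogate_id)")

# The child references the single UNIQUE column, not the compound PK.
conn.execute("""
    CREATE TABLE child (
        child_id  INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent (surrogate_id)
    )""")

conn.execute("INSERT INTO parent VALUES (1, 1, 100)")
conn.execute("INSERT INTO child VALUES (1, 100)")      # OK: parent 100 exists
fk_rejected = False
try:
    conn.execute("INSERT INTO child VALUES (2, 999)")  # no such parent
except sqlite3.IntegrityError:
    fk_rejected = True
```

Once the surrogate column has a unique index, the foreign key to it is accepted and enforced, exactly as with the compound PK.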

7. Registered User
Join Date
Jan 2013
Posts
359

## Wrong mindset

I have a table with a compound PK in one table; so, instead of bringing the several columns that make up the PK to the other table, I would like to create a Unique key column [sic] (IDENTITY) and use it for relating the table to other tables. In one particular case, that identity column [sic] needs to be part of a compound PK.
No. You want to build a fake pointer chain, just like you did with assembly language programs and pre-relational network databases in the 1960's. This will screw you up if the schema has any changes. IDENTITY cannot ever be a key by definition; it is the count of physical insertion attempts to one physical table on one physical disk. It is also not a column; by definition it is a table property, completely separate from the concept of a column (no NULLs, no computations possible).

If the extra typing bothers you, use a macro or a global replace; text editors are good at this.

8. World Class Flame Warrior
Join Date
Jun 2003
Location
Ohio
Posts
12,595
Using a surrogate key does not "screw you up if the schema has any changes".
vasilyok, your use of a surrogate key here is both appropriate and efficient.

9. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
At least in my opinion, the Identity property has nothing to do with a PK (Primary Key). The Identity property in MS-SQL neither guarantees a column can be a PK nor forbids that column from being a PK.

Please bear with me, but I need to "speak geek" to make the next few points.

Not all collections of data have an NK (Natural Key). This presents a problem, since RA (Relational Algebra) can't represent that kind of data: without a PK you can't have RA. This is one opportunity for an SK (Surrogate Key), since by adding a member with a purely arbitrary unique value to every tuple (row), the data becomes manageable using RA.

Even when an NK exists, there are some serious practical reasons that make an SK desirable. Particularly when using application frameworks, having a single-column PK that can be determined from the relation (table) name is a HUGE benefit to overall system complexity. This makes it easy for the framework to deal with any table that it finds, since the framework can quickly infer the PK from the naming standard, which makes it easy to write/deploy/manage code and usually allows that code to survive schema modifications as long as those modifications don't change any hard-coded references to the schema.

Ok, now I can revert back to English. Thanks for bearing with me!

This means that as long as you enforce any NK for the table with an appropriate unique constraint, you can use a surrogate key as the primary key with no harm from an RA perspective and significant improvements in application robustness and utility, as well as database performance, size, etc.
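
For what it's worth, the pattern reads roughly like this in code (a sketch using SQLite via Python's sqlite3; all table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,    -- surrogate key: what apps join on
        tax_id      TEXT NOT NULL UNIQUE,   -- natural key: still enforced
        full_name   TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE dental_plan (
        employee_id INTEGER NOT NULL REFERENCES employees (employee_id),
        plan_code   TEXT NOT NULL
    )""")

conn.execute("INSERT INTO employees (tax_id, full_name)"
             " VALUES ('123-45-6789', 'Cindy Lou Who')")
# Look the surrogate up by the natural key, then use it for the join.
(emp_id,) = conn.execute(
    "SELECT employee_id FROM employees WHERE tax_id = '123-45-6789'").fetchone()
conn.execute("INSERT INTO dental_plan VALUES (?, 'BASIC')", (emp_id,))
row = conn.execute("""
    SELECT e.full_name, d.plan_code
      FROM employees AS e
      JOIN dental_plan AS d ON d.employee_id = e.employee_id""").fetchone()
```

The UNIQUE constraint keeps the NK honest while every other table joins on the compact surrogate.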

I'm with blindman: I would use the SK, while enforcing the NK too.

-PatP

10. Registered User
Join Date
Jan 2013
Posts
359

## Why we do not use IDENTITY

Why would anyone think that an unpredictable count of physical access attempts (not even successes!) on one particular piece of hardware, against one particular table, in software from one vendor, is part of an RDBMS and of correct data modeling?

Let's define what characteristics you want in an identifier.

1) An identifier has to be unique. Otherwise, it is not an identifier. Ideally, it should be an industry standard (validation and verification are discussed shortly). IDENTITY by itself only generates exact numeric values, but doing math on them makes no sense. This comes from UNIX files on 16-bit hardware (read: PDP-11), which were based on magnetic tape file systems. Each physical record gets a record number at the start, then the fields are written in sequence within the record, just as the records are in physical sequence within the file. Since the records are strings of varying length, searches are done by reading the record number and moving the read/write heads as needed. If a record has been dropped, its number is set negative but the record is left in place for searching. The algorithms for this kind of searching are based on square roots and Fibonacci numbers; they are fun to look at, like learning how to take the n-th root of a number by hand, but I digress.

2) An identifier should be created with the entity, or before the entity exists, but not afterward. This is straight OO theory, not RDBMS. Without an identifier, you simply cannot put an entity into the database. As an example of an identifier coming into existence before the entity, think about a book that has not been published yet. It can be assigned an ISBN (International Standard Book Number), a title, an author, a price, and everything else while the publisher is waiting to get the manuscript.

But a future release book does not behave like a real book. You cannot put it in a box and ship it. You cannot get a review of the book either -- at least not an honest review ;-). It is not the same kind of thing as a real published book.

3) It should be verifiable within itself. That means that when I see a particular kind of identifier, I ought to know if it is syntactically correct. For example, I know that ISBN10: 0-486-60028-9 has the correct number of digits and that the check digit is correct for a proper old style International Standard Book Number. Later on I can find out that it identifies the Dover Books edition of AN INVESTIGATION OF THE LAWS OF THOUGHT by George Boole.
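
Point 3 can be made concrete with a short check-digit routine (a sketch in Python; the ISBN-10 rule is that the digits, weighted 10 down to 1, must sum to a multiple of 11):

```python
def isbn10_check(isbn: str) -> bool:
    """Validate an ISBN-10 check digit ('X' counts as 10 in the last position)."""
    chars = [c for c in isbn if c.isdigit() or c.upper() == "X"]
    if len(chars) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), chars):
        total += weight * (10 if ch.upper() == "X" else int(ch))
    return total % 11 == 0

print(isbn10_check("0-486-60028-9"))  # True: the Boole edition cited above
```

A single transposed or mistyped digit makes the weighted sum fall off the multiple of 11, so the identifier is verifiable within itself; no IDENTITY value offers anything comparable.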

4) An identifier should have repeatable verification against the reality that you are trying to capture in your data model. It is not tied to the hardware of one machine.

Exactly what verification means can be a bit fuzzy. At one extreme, prison inmates are moved by taking their fingerprints at control points and courts want DNA evidence for convictions. At the other end of the spectrum, retail stores will accept your check on the assumption that you look like your driver's license photograph. :crying:

IDENTITY is an exposed PHYSICAL locator. What does that phrase mean? The value is created by looking at the internal state of one particular piece of hardware at the time a PHYSICAL record containing a row is inserted into storage. Its purpose is to locate the row without any regard to what the data means.

Think about using a pointer or a track/sector number (Oracle ROWID); same thing, different mechanism. But SQL does not have a pointer data type or the mechanisms to handle pointer operators, garbage collection and housekeeping, so 25+ years ago the original Sybase SQL Server exposed an integer that maps back to the contiguous UNIX storage model used under the covers. It made cursors easier for the programmers and for the early SQL engines built on file systems on simple 16-bit machines. It was pretty fast on that hardware, too. But smell the coffee: this is the 21st century, the hardware is 32 or 64 bit, and it is sophisticated compared to the stuff we had in 1970. Multi-word operations are built in now.

IDENTITY is not an attribute in the data model and can never be one, because it does not exist in the reality from which you derive your data model. Sure, I lose all the advantages of an abstract data model and SQL set-oriented programming, carry extra data, destroy the portability of my code and still have to maintain all the data integrity among keys and constraints. But this is easier to write than a real normalized RDBMS schema.

11. Registered User
Join Date
Jan 2013
Posts
359

## Why we do not use IDENTITY Part II; quotes from Codd

A non-SQL programmer is too often expected to produce a correct and usable database. The smart ones will get some help and beg for training, but most of them simply start programming SQL as if it were their native language. They grab at GUIDs, IDENTITY, ROWID and other proprietary auto-numbering "features" in SQL products to imitate either a record number (sequential file system mindset), pointers and links (disk file mindset) or an OID (OO mindset), since they know that model.

They write code with cursors to mimic record-at-a-time file handling. They write to temp tables to mimic scratch files in a series of procedural steps. They use dynamic SQL and let the user figure out how the system should work on the fly. Etc.

Good database designers look for industry standards for their keys. Try to tell the IRS you do not have anybody's SSN at tax time, but "Cindy Lou Who" was the 42nd employee put into the Personnel table and you have the IDENTITY value to prove it. Try to sell a car without a VIN -- using IDENTITY for this would be like identifying a vehicle by its current local parking space number (think about it: that is exactly what an IDENTITY value is).

Finally, it is not possible to have a table with an IDENTITY PRIMARY KEY in anything higher than 2NF. No facts about the IDENTITY are stored, so everything in the table will have a transitive dependency on the candidate key.

We know that we have to use industry standard codes, either de facto or de jure. We also appreciate the fact that we can exchange data with the rest of the world. We appreciate the fact that someone else will maintain and define these codes. A trusted external source is a good thing to have.

If you do not have an industry standard or natural key and have to design your own codes, it is hard work to do it right. I know that newbie programmers want to start coding first and think later. It does not work that way, and all the "wish magic" in the world will not change that fact. I have several chapters in my books on how to design encoding schemes, but that is another topic.

The first practical consideration is that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other SQL products. Newbies actually think they will never port code! Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such crap nobody else wants their application. Otherwise, you will port code; you will share data with some other database; data does not exist in isolation.

But let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a table property which belongs to the PHYSICAL table implementation, not the LOGICAL data inside the table.

Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.

Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

to put a few rows into the table, and notice that the IDENTITY sequentially numbered them in the PHYSICAL order in which they were presented. If you delete a row, the gap in the sequence is not filled in, and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in sequential disk files in the 1950's, as I said before. A utility program would "pack" or "compress" the records that were flagged as deleted or unused, to move the empty space to the PHYSICAL end of the PHYSICAL file. Files do not have references to other files, so the only worry is that a program will use a hardwired record number.
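
The gap behavior can be reproduced with SQLite's AUTOINCREMENT through Python's sqlite3 (an assumed stand-in for IDENTITY; the table name follows the snippet above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar"
             " (id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT, b TEXT, c TEXT)")
for n in (1, 2, 3):
    conn.execute("INSERT INTO foobar (a, b, c) VALUES (?, ?, ?)",
                 (f"a{n}", f"b{n}", f"c{n}"))
conn.execute("DELETE FROM foobar WHERE id = 3")  # drop the last row
conn.execute("INSERT INTO foobar (a, b, c) VALUES ('a4', 'b4', 'c4')")
ids = [r[0] for r in conn.execute("SELECT id FROM foobar ORDER BY id")]
print(ids)  # [1, 2, 4]: the deleted number is never reused
```

The sequence continues from the highest number ever issued, just as described: the gap stays.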

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time.

There are (n!) ways to number (n) rows, so which one did you pick? Why? The answer has been to use whatever the PHYSICAL order of the result set happened to be. That non-relational phrase "PHYSICAL order" again!

But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result back in a different PHYSICAL order. Indexes and statistics are not part of the logical model.

Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.

How do you verify that an entity has the right key when you use an IDENTITY? If I use a VIN for a vehicle, I can go to the parking lot and read it off the dashboard. If I use a UPC for a candy bar, I can read the bar code on the wrapper. But why is little 'Cindy Lou Who' employee 42? If I turn her upside down, will I find that she has 42 tattooed somewhere? :w00t: If I call an external trusted source, will they know that she is employee 42? No, they will want her driver's license, tax id number or something.

In the Relational Model, you do not invent a key in the storage. You discover a key (and the other attributes) in the real world and model it. If you create your own encoding for a key, then you have to maintain it, provide audit trails and do all the work that an industry standard organization would do for you.

If I lean on a mouse button, I can insert the same data with a new IDENTITY over and over. 'Cindy Lou Who' is now employed two times and none of my reports are right! Now I have to add a trigger or a UNIQUE constraint on her tax identification number to prevent this, thus making the IDENTITY redundant. But we were assuming that we use only the IDENTITY as a key, so we are screwed.
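
You can watch this happen with any auto-numbering feature; here is a sketch using SQLite through Python's sqlite3 (invented names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# With only the auto-number as a key, "leaning on the mouse button" works:
conn.execute("""
    CREATE TABLE personnel_bad (
        emp_nbr INTEGER PRIMARY KEY,   -- auto-number only, no real key
        tax_id  TEXT NOT NULL
    )""")
for _ in range(2):                     # the same person, inserted twice
    conn.execute("INSERT INTO personnel_bad (tax_id) VALUES ('123-45-6789')")
dup_count = conn.execute("SELECT COUNT(*) FROM personnel_bad").fetchone()[0]
print(dup_count)  # 2: 'Cindy Lou Who' is employed twice

# Only a constraint on the real key stops the duplicate:
conn.execute("""
    CREATE TABLE personnel_good (
        emp_nbr INTEGER PRIMARY KEY,
        tax_id  TEXT NOT NULL UNIQUE   -- the natural key is enforced
    )""")
conn.execute("INSERT INTO personnel_good (tax_id) VALUES ('123-45-6789')")
dup_blocked = False
try:
    conn.execute("INSERT INTO personnel_good (tax_id) VALUES ('123-45-6789')")
except sqlite3.IntegrityError:
    dup_blocked = True
```

The auto-number never objects to the second row; the UNIQUE constraint on the natural key does.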

Newbies often design tables without bothering to look for a relational key, so they are surprised when they build a data warehouse and nobody else has any idea what they are doing in their subsystem.

'Cindy Lou Who' now has two rows in Personnel. When we sign her up for the Dental Plan, we get the row with 42. When we sign her up for the Bowling Team, we get the row with 43. We find our error, and delete the row with 42 because we have a row that was created later and we assume it is more current.

Another common way to get this is to have two procedures, one for inserting a new employee to the Dental Plan and one for inserting a new employee to the Bowling Team. Both procedures create a row in Personnel since they use only IDENTITY as a key.

Well, now we have an orphan row in Personnel. In fact, in such systems, you will find a lot of orphans. I worked for a company that used GUIDs for OIDs substitutes and our software fell apart in about a year of actual use by a client.

Another cute way to destroy data integrity:

BEGIN TRANSACTION;
DELETE FROM Foobar
 WHERE foobar_id = <constant>;
INSERT INTO Foobar
VALUES (<row expression identical to what was deleted>);
COMMIT;

Logically this should do nothing in an RDBMS, but since IDENTITY has gaps, it trashes the data. When one query uses the IDENTITY and another uses the real key, you are like a man with two watches: you are never sure what time it is.
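
Here is that trap reproduced (a sketch with SQLite's AUTOINCREMENT via Python's sqlite3, standing in for IDENTITY; names follow the snippet above):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute("CREATE TABLE foobar"
             " (foobar_id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT)")
conn.execute("INSERT INTO foobar (a) VALUES ('a1')")
(old_id,) = conn.execute(
    "SELECT foobar_id FROM foobar WHERE a = 'a1'").fetchone()

conn.execute("BEGIN")
conn.execute("DELETE FROM foobar WHERE foobar_id = ?", (old_id,))
conn.execute("INSERT INTO foobar (a) VALUES ('a1')")  # identical row expression
conn.execute("COMMIT")

(new_id,) = conn.execute(
    "SELECT foobar_id FROM foobar WHERE a = 'a1'").fetchone()
print(old_id, new_id)  # 1 2: a logical no-op, but the auto-number moved
```

Any query or report still keyed on the old number now silently misses the row.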

Finally, an appeal to authority, with a quote from Dr. Codd: "...Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them..." (Codd, E. (1979), "Extending the Database Relational Model to Capture More Meaning", ACM Transactions on Database Systems, 4(4), pp. 397-434; quote at pp. 409-410).

This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. an applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References: Codd, E. (1979), "Extending the Database Relational Model to Capture More Meaning", ACM Transactions on Database Systems, 4(4), pp. 397-434.

12. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
Originally Posted by Pat Phelan
This means that as long as you enforce any NK for the table with an appropriate unique constraint, you can use a surrogate key as the primary key with no harm from an RA perspective and significant improvements in application robustness and utility, as well as database performance, size, etc.
I understand that you've got a great theoretical argument going for supporting natural keys (when a Natural Key exists). As you can see from my prior post, I support them too!

I don't have the level of trust that you do... Natural Keys break, sometimes by accident and sometimes by design. I can't afford to have my systems down for extended periods of time while I rebuild schemas and then recode applications to support the latest legislative nightmare. By declaring the Natural Key as an AK (Alternate Key) and creating an SK (Surrogate Key) that will be used by my apps for their own purposes, I can dodge the bullet of a broken NK easily. This way my apps have full control of the PK that they depend upon internally, but the NK is still protected by a UNIQUE CONSTRAINT so that relational integrity is maintained.

You are quite correct that the database engines that support the Identity Property all do it slightly differently. Microsoft has made the Identity a table level property, for good technical reasons. You can use CREATE SEQUENCE and bind the sequences in intriguing ways that shoot that whole concept to poppycock. It is a straw man argument anyway and has no real merit as long as an NK is enforced too.

Any system that relies on any meaning for system-generated unique keys is a disaster waiting to happen in my eyes. Those keys should be unique, and ABSOLUTELY NOTHING ELSE, no formatting, no sequence, no meaning of any kind whatsoever. I used to like GUID values because they were pseudo-randomly distributed. The problem I kept hitting is that they are just so blasted big that they aren't all that practical for what I (or my clients) need. As long as they are unique, I'm Ok with whatever hat the magician pulls the unique values from!

I see IDENTITY or SEQUENCE values as a convenient way to generate SKs. They have a bad habit of being ascending/sequential, but that brings its own benefits, as it makes a lot of primitive operations at the database engine more efficient, such as the neat way that Microsoft dodged the bullet of the "insertion point hot spot" that severely throttled MS-SQL 6.5. For the performance that it buys me, I can forgive the fact that the numbers are ascending.

BTW, do you have anything that supports the idea of garbage collection, file packing, or anything like that? I haven't seen any of the major database engines ever do that unless you count the VSAM file operations that are done as part of zOS DB2 grooming which I see as just normal file system maintenance.

-PatP

13. Registered User
Join Date
Jan 2013
Posts
359
Natural Keys break, sometimes by accident and sometimes by design.
That is the nice part of industry standards! They do not “break”, they “mutate”! In the last decade we have seen the old ISBN-10 become the ISBN-13 for books; the 12-digit UPC has become the 13-digit EAN on bar codes in retail (and the 14-digit GTIN). This is the part about verification; there is a trusted source for everyone and usually software for the migrations. And a community finding errors, too.

Compare that to the poor bastard who moved or restored a T-SQL schema and re-numbered the IDENTITY columns on his tables. Anyone who has worked with T-SQL for over 10 years has seen it at least once! I think it is called the “Second Law of Backups and Stupid People” (the first law is “Backup? What backup?”). There is no ON UPDATE CASCADE here.

I worked with voter registration for the State of Texas a few years ago. We had a truly natural key: (longitude, latitude). We sent out workers with cell phones that had a GPS app. The GIS system sends out the voter rosters. The worker comes to each house, asks some simple questions and pushes a button. This ties into the 9-1-1 database (NOT “nine-eleven”; this is the emergency response system that assigned uniform street addresses). From this we compute the Congressional voting district, water district, school board district, state election districts, local municipal precincts and one or two others I cannot remember. They are all different.

14. Resident Curmudgeon
Join Date
Feb 2004
Location
In front of the computer
Posts
15,579
How do you deal with four people who were intentionally assigned the same SSN? Or when the Social Security Administration "fixed" that by adding a suffix of a letter and one or two digits? Then when they fixed their fix by assigning different SSN values so that each of those person's SSN became three different values depending on time?

How do you deal with people that have no SSN? Newborn infants are one good example, since it takes at least 90 days to get an SSN assigned to them. NRAs (Non-Resident Aliens) are another good example. Do we just ignore the services provided to those folks, and if so who will pay for those services?

If you ask me, these are both flat out breaks in the proposed NK (Natural Key) of a national identifier like SSN. These aren't evolution, they are failures.

VINs are another great NK... Until those break too. I used to collect antique cars. I had a dickens of a time registering my Volvo 544-D because it had the same VIN as my Berkeley roadster. Two different vehicles, two different manufacturers, two different years, two different countries of origin, but the same VIN.

I agree that Natural Keys ought to be solid, stable, and secure. They should be tied to the object that they represent. The real world works with that concept a very high percentage of the time, but I need PK (Primary Key) values that always work.

Most of the "high end" database engines have dealt with the "insert hot spot" problem in one way or another. For all of those database engines, ascending keys are a critical factor in getting high performance. At least in my experience, NK values tend to be much more random than ascending! This pretty much keeps true NK-for-PK systems out of the high-performance part of the market.

New tables are added to the databases that I manage with almost every release. Legacy applications using the frameworks that I've built can cope with those new tables because they can infer the PK from the table name. They can safely insert/update/delete from those tables because the relational integrity is maintained by UNIQUE CONSTRAINTS within the database, not by anything within the application or the framework (although I have considered extending the framework to handle this). I'm not willing to put the work into making that happen when the PK (Primary Key) is an NK!

You have great points, Joe, and although a few of them are debatable I agree with most of them. The problem is that I live where the rubber meets the road... Failures in my apps cost lots of money and can cost lives. I build schemas and apps that work exactly the way you think that they ought to, but I have a "fail safe" built in that allows the management team to authorize temporarily overriding the defined NK if/when it breaks, and an escalation process that makes that livable for our end users.

-PatP

15. World Class Flame Warrior
Join Date
Jun 2003
Location
Ohio
Posts
12,595