Page 1 of 5 123 ... LastLast
Results 1 to 15 of 72
  1. #1
    Join Date
    Jan 2005
    Posts
    29

    Question Unanswered: Keeping PK's unique across (potentially) disconnected sites.

    Hi All,

    I'd like to throw this idea 'out there' to see if I'm missing something I'll later regret.

    I'm looking to resolve a scalability issue within our point-of-sale program. Currently the PK on transactional tables (sales and orders) is created by the application layer using a 'MAX(PKCol) + 1' mechanism. Obviously this requires that all users of the system, whether they're local or remote, have current data at any time they wish to insert. It's this limitation I'd like to remove. Most sites are using MS SQL Server 2000. No sites use anything specific to a later version.

    By having a PK that can be generated independently of a 'master' database we can overcome this issue. The PK values will need to be unique within a 'group' of shops and able to be generated by a program operating at any level. From 'head office' which manages a number of shops, to the server at a given shop and even the register / till itself should be able to create ID's while disconnected from the server (using a local database).

    It seems there's three main ways to accomplish this:
    - Identities,
    - MachineID, CurrentPK composite.
    - GUID's

    Identities: I've ruled out identities as I believe the administration overhead of dealing with them makes them impractical (there may be several hundred registers and therefore as many ranges to be set up within a group).

    MachineID, CurrentPK composite: The MachineID references a Machine table which has an entry for each ethernet MAC address which connects to the database. The reason I chose to store the MAC in another table rather than simply using it as column is that I'm fetching it from sysprocesses.net_address(nchar(12)) and believe it's computationally cheaper to use an int than a text column. This mechanism means that we can still expose the PK to the user in some cases (eg: InvoiceNumber printed on a receipt). When the local database is not up to date (usually due to network problems) there will be cases where the CurrentPK will be duplicated but kept unique since it's coupled with the new MachineID. The big drawback to this method is that all current code will need to be revised to deal with the composite keys (this will be a significant amount of development).

    GUIDs: Ugly to look at and time-consuming to type. They're not something which you'd expose to a user unmodified so realistically this means altering existing code to use a new 'user friendly' number where the PK is currently exposed to them. The use of GUIDs rule-out the use of clustered indecies on tables they're the PK for lest most inserts cause a page split. The splits would also necessitate more frequent index defrags / rebuilds. Using a non-clustered index incurs a penalty Vs a non-fragmented clustered one (doesn't it?) so while this avoids page-splits it comes at a cost.

    After all that I think the best solution is to use GUIDs with a non-clustered index for each of the PK's. While it might not be the fastest of the options (slower reads/joins Vs composite PK) it will be significantly faster to develop while maintaining acceptable performance.

    Thoughts?

  2. #2
    Join Date
    Feb 2007
    Posts
    62
    What you have is by far the worst way of doing it and good on you for having the guts to actually address it!

    Not sure why the PKs need to be human readable but personally I would go with an identity field combined with PK of register (or group or machine) ID. If you are worried about composite PKs and your code, try a computed column something like this:

    create table a(
    x int identity(1,1) not null,
    regid varchar(10) not null,
    pk as isnull(x+regid,0)primary key
    )

    You can also check out some of the replication features which naturally have to manage PKs & identities. I wouldn't underestimate the cost saving of using in built in features over reams of coding.

  3. #3
    Join Date
    Jan 2005
    Posts
    29
    Thanks for taking the time to read that essay LoztInSpace.

    You think GUID pk's with non-clustered indexes is the worst choice of the options presented? If I've understood you right, what's so bad about that option?

    The idea of a computed column as the PK I think will be very useful in reducing development time in a composite key solution (thanks!).

    My concern with using identities is the management overhead associated with using them. In my (limited) experience with allowing SQL server to do the range management I've run into issues; current range running out while connectivity was down, replication process failing to allocate a new range etc. While these issues may well be a symptom of a lack of understanding on my part I get the impression (from reading about folk with similar issues) that the automatic range management stuff is 'flakey'. I'm also unsure how easy / feasible it would be to configure when the addition or removal of machines within a group of shops would a reasonably common occurrence.

    The keys don't _need_ to be human readable but since they're shown to the user in a few cases currently it adds some work to give the user something to identify the sale, order, etc.

    Looks like the current front-runner then is a composite of MachineID + CurrentPK used to form the new computed PK column.

  4. #4
    Join Date
    Feb 2007
    Posts
    62
    You do mean identity + machine name? I meant that select max(pk)+1 is the worst. I'm on the fence with GUIDs.

    Hope it goes well.
    Last edited by LoztInSpace; 04-18-08 at 01:19.

  5. #5
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Raw, unprocessed GUIDs are most likely your best primary key for many reasons.

    While GUIDs do move the insert "hot spot" around, and at first that causes more page splits than living with IRL and the latch issues associated with it. What little you loose in terms of page splits you can recover many, many times over in terms of nearly obliterating INSERT contention and the massive increase in parallelization of queries.

    GUIDs are ugly to look at, and miserable to type. They strongly contribute to the value of barcode! They also go a LONG way toward preventing many of the common processing errors caused by sequences. Nobody notices a "missing" GUID, and they quickly learn to find other (usually better) ways to ensure that data is not lost (batch totals are a great substitute).

    There are many reasons that Microsoft likes GUIDs so much. The rest of the computing universe seems to be adopting them as well. GUIDs solve so many problems so well that their ugly factor just doesn't matter!

    -PatP

  6. #6
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Absolutely GUIDs. This is exactly what they were designed to do.

    Quote Originally Posted by karlmdv
    GUIDs: Ugly to look at...
    Like warthogs, they do not find you very attractive either.

    Quote Originally Posted by karlmdv
    ...and time-consuming to type.
    If you find yourself typing in surrogate keys, you are not using them properly.

    Quote Originally Posted by karlmdv
    They're not something which you'd expose to a user
    If you are exposing surrogate keys to the users, you are not using them properly.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  7. #7
    Join Date
    Jan 2005
    Posts
    29
    Quote Originally Posted by LoztInSpace
    You do mean identity + machine name?
    Nope I meant good ol' "requires a round-trip to the DB" Max(ID)+1 and MachineID only because of the issues I've hit with identities in the past. If I can resolve the issues, great, but currently they're show-stoppers.

    Quote Originally Posted by Pat Phelan
    While GUIDs do move the insert "hot spot" around, and at first that causes more page splits than living with IRL and the latch issues associated with it. What little you loose in terms of page splits you can recover many, many times over in terms of nearly obliterating INSERT contention
    Can I assume when you talk about initial page splits you're referring to using a clustered index on the GUID column Pat? I see that once the table is using a large number of pages the number of splits will reduce as the average free space per page increases but doesn't this lead to a highly fragmented, slow index (when compared to a non-clustered)?

    I'd guess that the fragmentation can be dealt with by re-building the index and you can guard against future page splits using a low fillfactor. Is this what you were suggesting? What fillfactor would you use as a starting point in this situation?

    One last thing, what's IRL mean in the context you used it?

    Blindman, thanks for your reply. While I'd like to re-write a lot of this 'ere POS system so that issues like exposing surrogate keys to a user are eliminated, time constraints only allow me to chip away at them. As it stands the reason I'd end up typing a surrogate keys is to locate a sale when investigating an issue as no natural key is currently stored.

  8. #8
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    IRL is Insert Row Locking, a phenomenon that has many different names (even within SQL Server), but is a constant issue for any product that forces all of the insert operations to occur in one place. There are many mechanisms to cope with this problem, but all of them serialize the insert process somehow and that is a deathnell for high volume inserts such as loading data into a datawarehouse.

    Try to measure the time lost to sparse pages (caused by the page splits needed to support multiple insert points). It can be measured, if you have sufficiently accurate tools. While sequences and dense pages might be better for heavy read operations (such as OLAP), GUIDs are far better for the OLTP processing.

    Since 95% of our user queries are based on something other than the PK and the JOIN operations have no measurable difference between a sequence and a GUID, I don't get too excited.

    Fillfactor doesn't seem to matter unless you have both very small rows (so many fit on one page), and outrageous transaction rates (over 800 inserts per second). Just sit tight and watch... Only fix these things if they are broken!

    -PatP

  9. #9
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    I didn't see your SQL version. 2005 has sequential GUIDS.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  10. #10
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by karlmdv
    Can I assume when you talk about initial page splits you're referring to using a clustered index on the GUID column Pat? I see that once the table is using a large number of pages the number of splits will reduce as the average free space per page increases but doesn't this lead to a highly fragmented, slow index (when compared to a non-clustered)?

    I'd guess that the fragmentation can be dealt with by re-building the index and you can guard against future page splits using a low fillfactor. Is this what you were suggesting?
    Fragmentation is largely irrelevent for singleton reads & writes. Larger returns, perhaps involving table and index scans, and in certain cases index seeks, are affected a lot.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  11. #11
    Join Date
    Jul 2003
    Location
    San Antonio, TX
    Posts
    3,662
    Fragmentation equally applies to clustered and nonclustered indexes. Both are based on BTREE structure. It is only bad when occurs on intermediate level pages, and ONLY affects ordered index scans (again, both clustered and nonclustered).
    "The data in a record depends on the Key to the record, the Whole Key, and
    nothing but the Key, so help me Codd."

  12. #12
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Have I misunderstood? A scan of the leaf level is unaffected by fragmentation?
    Testimonial:
    pootle flump
    ur codings are working excelent.

  13. #13
    Join Date
    Jul 2003
    Location
    San Antonio, TX
    Posts
    3,662
    I think you did. It affects ORDERED SCAN (re-read the above post).
    "The data in a record depends on the Key to the record, the Whole Key, and
    nothing but the Key, so help me Codd."

  14. #14
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    So what do you mean by "It is only bad when occurs on intermediate level pages"? [sic]
    Testimonial:
    pootle flump
    ur codings are working excelent.

  15. #15
    Join Date
    Jul 2003
    Location
    San Antonio, TX
    Posts
    3,662
    Fragmentation on intermediate pages increases IO on seek operations.
    "The data in a record depends on the Key to the record, the Whole Key, and
    nothing but the Key, so help me Codd."

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •