Page 1 of 2 12 LastLast
Results 1 to 15 of 27
  1. #1
    Join Date
    Sep 2009
    Posts
    21

    Would it be a good idea to index on an identity column?

    Is it a good idea to place a clustered index on an identity column? I am not sure why someone would do that

    Im asking as I am looking at a table with a million rows(identity column is PK) with data of a studentID number and a classID number along with datetime and other misc data. Then an identity column added on with that someone just seemed to have tacked on to be unique I guess...

    Im tempted to just combine studentID and classID and make that the PK and then create a clusted index based on that...

    EDIT: Asking as a view is generated from this table that does a clustered index scan 10 times and a couple of UNIONS that make generating this view excessively slow....
    Last edited by fff398; 10-27-09 at 17:53.

  2. #2
    Join Date
    Sep 2009
    Posts
    44
    Normally it's pretty reasonable to put a clustered index on your primary key. Assuming that by identity you mean the column is based on a sequence or something, that shouldn't matter one way or the other.

    A clustered index just means that the leaves of the index point straight into the data, and that the data is arranged, roughly, around the index. This means that you eliminate one seek when scanning the index and then reading the data. You may also get the data sorted by the index for free.

    It could also mean that a range scan could require a further (in terms of how far the drive head travels) seek than with an unclustered index. But if you're doing 10 scans, switching between clustered or not clustered shouldn't make much of a difference.

  3. #3
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Personally I don't think it's such a good idea to have the index on an identity field - the advantage of using a clustered index is to group records with the same values in certain fields together for faster access but the identity field will always be unique for each record so there's no real point (that I can see). Might be better if you give us an example bit of SQL that's running slowly along with the table definition (including indexes).

  4. #4
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Identities are SQL Server term. They aren't actually necessarily unique (i.e. they are not a constraint).
    Mike - that is an advantage under certain requirements, and even then I would only consider it for a composite key. I cannot remember ever putting a clustered index on a non unique combination of columns.
    Also, clustering on an ever increasing value has advantages for high insert applications.

    I'm a fan of natural keys, but when I use surrogates I always use identities, and when I use identities I ALWAYS put the clustered index on the identity column.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  5. #5
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by pootle flump View Post
    Identities are SQL Server term.
    but they're common to most databases in one form or another:
    • Sybase has identity fields
    • SQL Server also has identity fields
    • MySQL uses auto-increment
    • Oracle uses sequences
    Quote Originally Posted by pootle flump View Post
    They aren't actually necessarily unique (i.e. they are not a constraint).
    True but they nearly always are set up that way. if I found a database where an identity value was duplicated in a table then I'd think there was a bug somewhere. If it turned out that this was a feature then I'd start to have a long hard look at the rest of the design. Even the Wiki on identity columns states:
    An Identity column is a column ( also known as a field ) in a database table that (1) uniquely identifies every row in the table, and (2) is made up of values generated by the database.
    perhaps you should make an amendment

    Quote Originally Posted by pootle flump View Post
    Also, clustering on an ever increasing value has advantages for high insert applications.
    Possibly but it might also cause hot spots as well - anyway the issue is read performance here.

    Quote Originally Posted by pootle flump View Post
    I'm a fan of natural keys, but when I use surrogates I always use identities, and when I use identities I ALWAYS put the clustered index on the identity column.
    I'm also a fan of natural keys. An example might be having an account's positions clustered by accountNo - this might make queries on an account very fast as all the positions for an account are likely to be on the same data page. If I had a clustered index on an identity field then the query would run slower as it would need to access lot's of data pages to get the same data. I suppose if you could guarantee that the records you're after were all inserted one after the other then there might be some advantage in using a clustered index on an identity field - even then I'd still probably prefer a datetime field etc.

  6. #6
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    1) I didn't say they weren't.
    2) I'm emphasising this because many people think they are. And Wikipedia is wrong.
    3) Hot spots are no issue since SQL Server 7. I can provide lots of links if you like - the very fastest insert performance in SQL Server is on a monotonically increasing index (surrogate or natural) - FACT
    4) If you have an identity then you are using surrogates. If you are using surrogates, you may not have such natural columns to cluster on. If you do have such columns, and you cluster on those, then you will get the benefit of range scans, but you will not the benefit of Merge Joins. You will also decrease the density of your nonclustered data pages.

    Your stated advantage for Clustered Indexes is an advantage in some situations. It is not the only consideration for clustered index selection.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  7. #7
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    II.
    Then change the wiki!
    III.
    I guess it just depends on what RDBMS you're using - Sybase still has issues on hot spots but recommends using clustered indexes to spread the inserts across the table ie not putting the clustered index on an identity field:
    Create a clustered index to distribute the updates across the data pages in the table. Like partitioning, this solution creates multiple insertion points for the table. However, it also introduces overhead for maintaining the physical order of the table’s rows.
    Would the index you describe make it faster to insert than having no clustered index at all (on SQL Server)? - just curious

    IV.
    You'd still have fields you'd normally want to grab around ie a customer or an orderNo. It doesn't matter that there's no natural unique field as we're just trying to grab the data in fewer hits. If there's no field we want to grab the data around then there's no point in adding the clustered index - at least in order to improve select performance.

  8. #8
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by mike_bike_kite View Post
    Would the index you describe make it faster to insert than having no clustered index at all (on SQL Server)? - just curious
    Yes - it is a common misconception that heaps are fastest for insert performance.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  9. #9
    Join Date
    Mar 2009
    Location
    Sydney, Australia
    Posts
    258
    fff398

    1 Where the Identity is the PK or AK (not the FK), yes, you need an unique index to enforce uniqueness.

    2 Where the Clustered Index is a sequential number such as Identity, or a consecutively increasing DateTime, yes, there is a hotspot at the insertion of the last row, which affects concurrency (not speed of insert). Therefore such columns should be NonClustered indices, never Clustered. Both MS and Sybase have done a fair amount of engineering to alleviate this problem, so it is improved, but it remains a problem, and can be avoided. (MS may be fastest with insert speed, but the concurrency problem is not something that engineering can fix.)

    2.1 Where the Identity column has an Unique index, yes, there is a hotspot created on that index (concurrency, not insert speed). Which is yet another reason why they should be minimised.

    3 The best Clustered Index is one that distributes the data evenly (ie. avoids creating a single concurrent insert hotspot), and is used in range queries (you can always get one or the other, seldom both).

    4 In your case, it looks like the third Id column was added by a developer who could not handle compound keys (sadly becoming a bit common these days). Compound keys are particularly good for Clustered indices. Choose (StudentId, ClassId) or (ClassId, StudentId) based on which ever you have more range queries on (eg. ClassId services the most common range queries), or whichever concurrent insert hotpot you wish to avoid (eg. StudentId provides the best concurrent distribution).

    -----

    In MS and Sybase, Heaps are definitely the fastest insert performance (but they too are a hotspot for concurrent inserts). The new lockscheme in fact is a Heap plus NCIs for that very reason; no Clustered Index.
    Last edited by Derek Asirvadem; 11-07-09 at 04:48.
    Regards
    Derek Asirvadem (Formerly DerekA)
    Information Architect / Senior Sybase DBA
    Copyright © 2009 Software Gems Pty Ltd

    I answer questions from the Original Poster only. If you have a genuine question, as the moderators have requested, start a new thread.

    http://www.softwaregems.com.au

  10. #10
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by Derek Asirvadem View Post
    In MS and Sybase, Heaps are definitely the fastest insert performance (but they too are a hotspot for concurrent inserts).
    Perhaps for Sybase (I would not know) but not for SQL Server 2000+.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  11. #11
    Join Date
    Mar 2009
    Location
    Sydney, Australia
    Posts
    258
    Pootle

    BTW, last time I implemented on MS, it was 2005; since they produce a "new codeline" with every release, my info may be dated.

    We are talking insert speed, not concurrency issues, right. In terms of physical resources in a machine, using any known piece of software, inserting into a single storage structure (a Heap) has a certain cost, a specific resource use. Think about it. How is it possible, in terms of the physical resources in that machine, using the exact same piece of software, to insert into a Heap plus one additional B-Tree data structure, which may be Clustered, faster than inserting to the Heap alone. Consider not only the pure insertion of the row, but also the unavoidable insertion to the B-Tree (reasonable index page splits; a tiny amount for the rare index level addition), and the transaction log (2.5 times the writes to data rows in the rule of thumb we use over here). To me, is defies the laws of physics, it is simply not possible.

    In a Relational database, heaps are useless, but that is a separate point. I used heaps for message and queue tables in MS circa 2005, and as expected, they were faster than Heap + CI or Heap + NCI. MS, being the bastard child of Sybase, is identical in the areas discussed in this thread.

    BTW, I agree with most of your points in this thread.
    Regards
    Derek Asirvadem (Formerly DerekA)
    Information Architect / Senior Sybase DBA
    Copyright © 2009 Software Gems Pty Ltd

    I answer questions from the Original Poster only. If you have a genuine question, as the moderators have requested, start a new thread.

    http://www.softwaregems.com.au

  12. #12
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Hi Derek

    All other things being equal, then yes.
    However, inserting in to a heap does incur a cost that is not incurred when inserting into a clustered index.

    When inserting in to a new page in a heap, the storage engine inserts into the first empty page in the file. Finding this page is the additional cost of inserting in to a heap. When inserting in to an index, the next available page is used.

    Also, there is only any change to the b-tree when a new data page is created. This may or may not result in a new level too, though as you point out this is proportionately rare. If there is no page created then the new value by definition falls within the ranges already present in the B-Tree. I might not have explained this well - please tell me if not.

    It's probably worth pointing out that I'm not saying inserting in to a heap is significantly more costly, just that it is measurably so.

    Again, I can provide references if required.

    EDIT - added the word in blue
    Last edited by pootle flump; 11-09-09 at 09:39.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  13. #13
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by pootle flump
    It's probably worth pointing out that I'm not saying inserting in to a heap is significantly more costly, just that it is measurably so.
    Would it be possible to run a few tests just to prove this one way or the other? It still sounds unlikely to me to - I come from a Sybase background too. You could use an existing average sized table with approx 1m rows then insert approx 1m rows into a fresh table with the following indexes:
    • No indexes
    • NCI
    • CI on identity field
    • CI on natural key
    • NCI + 2 other indexes
    • CI + 2 other indexes
    Average sized is difficult to judge but say 10-15 fields with no text or blob type fields might be best. You'll need to make sure the data being inserted is reasonably random (ie not skewed). You might want to run this a couple of times. It would be particularly interesting to run similar tests for other RDBMS but guess that might not be possible. It would obviously be best to run it on a spare server with no other processes

  14. #14
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Feel free to post your results when you are done
    Testimonial:
    pootle flump
    ur codings are working excelent.

  15. #15
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Hey Poots,

    If you've got any literature on this subject then I'd be intrigued to read.
    I've never heard this, but after reading your post above, it seems relatively logical..
    George
    Home | Blog

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •