If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > General > Database Concepts & Design > Would it be a good idea to index on an identity column?

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 10-27-09, 15:25
fff398 fff398 is offline
Registered User
 
Join Date: Sep 2009
Posts: 21
Would it be a good idea to index on an identity column?

Is it a good idea to place a clustered index on an identity column? I am not sure why someone would do that

Im asking as I am looking at a table with a million rows(identity column is PK) with data of a studentID number and a classID number along with datetime and other misc data. Then an identity column added on with that someone just seemed to have tacked on to be unique I guess...

Im tempted to just combine studentID and classID and make that the PK and then create a clusted index based on that...

EDIT: Asking as a view is generated from this table that does a clustered index scan 10 times and a couple of UNIONS that make generating this view excessively slow....

Last edited by fff398; 10-27-09 at 16:53.
Reply With Quote
  #2 (permalink)  
Old 11-04-09, 09:51
scooby_at_work scooby_at_work is offline
Registered User
 
Join Date: Sep 2009
Posts: 44
Normally it's pretty reasonable to put a clustered index on your primary key. Assuming that by identity you mean the column is based on a sequence or something, that shouldn't matter one way or the other.

A clustered index just means that the leaves of the index point straight into the data, and that the data is arranged, roughly, around the index. This means that you eliminate one seek when scanning the index and then reading the data. You may also get the data sorted by the index for free.

It could also mean that a range scan could require a further (in terms of how far the drive head travels) seek than with an unclustered index. But if you're doing 10 scans, switching between clustered or not clustered shouldn't make much of a difference.
Reply With Quote
  #3 (permalink)  
Old 11-04-09, 10:11
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Personally I don't think it's such a good idea to have the index on an identity field - the advantage of using a clustered index is to group records with the same values in certain fields together for faster access but the identity field will always be unique for each record so there's no real point (that I can see). Might be better if you give us an example bit of SQL that's running slowly along with the table definition (including indexes).
Reply With Quote
  #4 (permalink)  
Old 11-04-09, 10:29
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Identities are SQL Server term. They aren't actually necessarily unique (i.e. they are not a constraint).
Mike - that is an advantage under certain requirements, and even then I would only consider it for a composite key. I cannot remember ever putting a clustered index on a non unique combination of columns.
Also, clustering on an ever increasing value has advantages for high insert applications.

I'm a fan of natural keys, but when I use surrogates I always use identities, and when I use identities I ALWAYS put the clustered index on the identity column.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #5 (permalink)  
Old 11-04-09, 11:48
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by pootle flump View Post
Identities are SQL Server term.
but they're common to most databases in one form or another:
  • Sybase has identity fields
  • SQL Server also has identity fields
  • MySQL uses auto-increment
  • Oracle uses sequences
Quote:
Originally Posted by pootle flump View Post
They aren't actually necessarily unique (i.e. they are not a constraint).
True but they nearly always are set up that way. if I found a database where an identity value was duplicated in a table then I'd think there was a bug somewhere. If it turned out that this was a feature then I'd start to have a long hard look at the rest of the design. Even the Wiki on identity columns states:
Quote:
An Identity column is a column ( also known as a field ) in a database table that (1) uniquely identifies every row in the table, and (2) is made up of values generated by the database.
perhaps you should make an amendment
Quote:
Originally Posted by pootle flump View Post
Also, clustering on an ever increasing value has advantages for high insert applications.
Possibly but it might also cause hot spots as well - anyway the issue is read performance here.
Quote:
Originally Posted by pootle flump View Post
I'm a fan of natural keys, but when I use surrogates I always use identities, and when I use identities I ALWAYS put the clustered index on the identity column.
I'm also a fan of natural keys. An example might be having an account's positions clustered by accountNo - this might make queries on an account very fast as all the positions for an account are likely to be on the same data page. If I had a clustered index on an identity field then the query would run slower as it would need to access lot's of data pages to get the same data. I suppose if you could guarantee that the records you're after were all inserted one after the other then there might be some advantage in using a clustered index on an identity field - even then I'd still probably prefer a datetime field etc.
Reply With Quote
  #6 (permalink)  
Old 11-04-09, 12:17
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
1) I didn't say they weren't.
2) I'm emphasising this because many people think they are. And Wikipedia is wrong.
3) Hot spots are no issue since SQL Server 7. I can provide lots of links if you like - the very fastest insert performance in SQL Server is on a monotonically increasing index (surrogate or natural) - FACT
4) If you have an identity then you are using surrogates. If you are using surrogates, you may not have such natural columns to cluster on. If you do have such columns, and you cluster on those, then you will get the benefit of range scans, but you will not the benefit of Merge Joins. You will also decrease the density of your nonclustered data pages.

Your stated advantage for Clustered Indexes is an advantage in some situations. It is not the only consideration for clustered index selection.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #7 (permalink)  
Old 11-04-09, 12:47
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
II.
Then change the wiki!
III.
I guess it just depends on what RDBMS you're using - Sybase still has issues on hot spots but recommends using clustered indexes to spread the inserts across the table ie not putting the clustered index on an identity field:
Quote:
Create a clustered index to distribute the updates across the data pages in the table. Like partitioning, this solution creates multiple insertion points for the table. However, it also introduces overhead for maintaining the physical order of the table’s rows.
Would the index you describe make it faster to insert than having no clustered index at all (on SQL Server)? - just curious
IV.
You'd still have fields you'd normally want to grab around ie a customer or an orderNo. It doesn't matter that there's no natural unique field as we're just trying to grab the data in fewer hits. If there's no field we want to grab the data around then there's no point in adding the clustered index - at least in order to improve select performance.
Reply With Quote
  #8 (permalink)  
Old 11-05-09, 02:35
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Quote:
Originally Posted by mike_bike_kite View Post
Would the index you describe make it faster to insert than having no clustered index at all (on SQL Server)? - just curious
Yes - it is a common misconception that heaps are fastest for insert performance.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #9 (permalink)  
Old 11-07-09, 03:35
Derek Asirvadem Derek Asirvadem is offline
Registered User
 
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 258
fff398

1 Where the Identity is the PK or AK (not the FK), yes, you need an unique index to enforce uniqueness.

2 Where the Clustered Index is a sequential number such as Identity, or a consecutively increasing DateTime, yes, there is a hotspot at the insertion of the last row, which affects concurrency (not speed of insert). Therefore such columns should be NonClustered indices, never Clustered. Both MS and Sybase have done a fair amount of engineering to alleviate this problem, so it is improved, but it remains a problem, and can be avoided. (MS may be fastest with insert speed, but the concurrency problem is not something that engineering can fix.)

2.1 Where the Identity column has an Unique index, yes, there is a hotspot created on that index (concurrency, not insert speed). Which is yet another reason why they should be minimised.

3 The best Clustered Index is one that distributes the data evenly (ie. avoids creating a single concurrent insert hotspot), and is used in range queries (you can always get one or the other, seldom both).

4 In your case, it looks like the third Id column was added by a developer who could not handle compound keys (sadly becoming a bit common these days). Compound keys are particularly good for Clustered indices. Choose (StudentId, ClassId) or (ClassId, StudentId) based on which ever you have more range queries on (eg. ClassId services the most common range queries), or whichever concurrent insert hotpot you wish to avoid (eg. StudentId provides the best concurrent distribution).

-----

In MS and Sybase, Heaps are definitely the fastest insert performance (but they too are a hotspot for concurrent inserts). The new lockscheme in fact is a Heap plus NCIs for that very reason; no Clustered Index.
__________________
Regards
Derek Asirvadem (Formerly DerekA)
Information Architect / Senior Sybase DBA
Copyright © 2009 Software Gems Pty Ltd

I answer questions from the Original Poster only. If you have a genuine question, as the moderators have requested, start a new thread.

http://www.softwaregems.com.au

Last edited by Derek Asirvadem; 11-07-09 at 03:48.
Reply With Quote
  #10 (permalink)  
Old 11-08-09, 13:35
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Quote:
Originally Posted by Derek Asirvadem View Post
In MS and Sybase, Heaps are definitely the fastest insert performance (but they too are a hotspot for concurrent inserts).
Perhaps for Sybase (I would not know) but not for SQL Server 2000+.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #11 (permalink)  
Old 11-09-09, 07:24
Derek Asirvadem Derek Asirvadem is offline
Registered User
 
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 258
Pootle

BTW, last time I implemented on MS, it was 2005; since they produce a "new codeline" with every release, my info may be dated.

We are talking insert speed, not concurrency issues, right. In terms of physical resources in a machine, using any known piece of software, inserting into a single storage structure (a Heap) has a certain cost, a specific resource use. Think about it. How is it possible, in terms of the physical resources in that machine, using the exact same piece of software, to insert into a Heap plus one additional B-Tree data structure, which may be Clustered, faster than inserting to the Heap alone. Consider not only the pure insertion of the row, but also the unavoidable insertion to the B-Tree (reasonable index page splits; a tiny amount for the rare index level addition), and the transaction log (2.5 times the writes to data rows in the rule of thumb we use over here). To me, is defies the laws of physics, it is simply not possible.

In a Relational database, heaps are useless, but that is a separate point. I used heaps for message and queue tables in MS circa 2005, and as expected, they were faster than Heap + CI or Heap + NCI. MS, being the bastard child of Sybase, is identical in the areas discussed in this thread.

BTW, I agree with most of your points in this thread.
__________________
Regards
Derek Asirvadem (Formerly DerekA)
Information Architect / Senior Sybase DBA
Copyright © 2009 Software Gems Pty Ltd

I answer questions from the Original Poster only. If you have a genuine question, as the moderators have requested, start a new thread.

http://www.softwaregems.com.au
Reply With Quote
  #12 (permalink)  
Old 11-09-09, 07:53
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Hi Derek

All other things being equal, then yes.
However, inserting in to a heap does incur a cost that is not incurred when inserting into a clustered index.

When inserting in to a new page in a heap, the storage engine inserts into the first empty page in the file. Finding this page is the additional cost of inserting in to a heap. When inserting in to an index, the next available page is used.

Also, there is only any change to the b-tree when a new data page is created. This may or may not result in a new level too, though as you point out this is proportionately rare. If there is no page created then the new value by definition falls within the ranges already present in the B-Tree. I might not have explained this well - please tell me if not.

It's probably worth pointing out that I'm not saying inserting in to a heap is significantly more costly, just that it is measurably so.

Again, I can provide references if required.

EDIT - added the word in blue
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.

Last edited by pootle flump; 11-09-09 at 08:39.
Reply With Quote
  #13 (permalink)  
Old 11-09-09, 08:13
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by pootle flump
It's probably worth pointing out that I'm not saying inserting in to a heap is significantly more costly, just that it is measurably so.
Would it be possible to run a few tests just to prove this one way or the other? It still sounds unlikely to me to - I come from a Sybase background too. You could use an existing average sized table with approx 1m rows then insert approx 1m rows into a fresh table with the following indexes:
  • No indexes
  • NCI
  • CI on identity field
  • CI on natural key
  • NCI + 2 other indexes
  • CI + 2 other indexes
Average sized is difficult to judge but say 10-15 fields with no text or blob type fields might be best. You'll need to make sure the data being inserted is reasonably random (ie not skewed). You might want to run this a couple of times. It would be particularly interesting to run similar tests for other RDBMS but guess that might not be possible. It would obviously be best to run it on a spare server with no other processes
Reply With Quote
  #14 (permalink)  
Old 11-09-09, 08:18
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Feel free to post your results when you are done
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #15 (permalink)  
Old 11-09-09, 08:29
gvee gvee is offline
www.gvee.co.uk
 
Join Date: Jan 2007
Location: UK
Posts: 10,156
Hey Poots,

If you've got any literature on this subject then I'd be intrigued to read.
I've never heard this, but after reading your post above, it seems relatively logical..
__________________
George
Twitter | Blog
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On