Since I moved here we have upgraded our 4 SQL Servers to 2005 64-bit. All are 2-proc dual-core hyper-threaded boxes with 4GB RAM, running Windows 2003. SQL Server compatibility level is 90. SP1 and the CTP for SP2 are installed; we are holding off on the final SP2 for now. MAXDOP is 0 (we have also tried 2). Up to a score or so databases per machine, between 8 and 300 GB in size. The workload is offline, reasonably heavy-duty ETL batch processing.
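For reference, switching MAXDOP is just the standard sp_configure dance (the 2 here is simply the value we tried):

    -- 'max degree of parallelism' is an advanced option, so expose it first
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- 0 = let SQL Server use all CPUs; 2 = cap each query at two
    EXEC sp_configure 'max degree of parallelism', 2;
    RECONFIGURE;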
Since upgrading we have had trouble with the CPUs intermittently maxing out at 100% usage. This happens when intensive routines are run. We know that many of these routines are crappy (SELECT... INTO creating tables with tens or hundreds of millions of rows, cursors, etc.) and we are rewriting the code, but the estimated completion of the code review and rewrite is the end of next year, so we can't wait that long. The same poor code ran (albeit slowly) on SS2K but is causing the server to lock up now that it runs on SS2K5. Symptoms include connection attempts timing out and routines running very slowly. Existing connections can continue to make requests, but these are very slow. If we get on the box, the CPUs sit at a never-wavering 100% utilisation. This continues until we intervene, often by restarting the service. We have left a maxed-out box alone over a weekend and it was still in the same state when we came in on Monday. Nothing to glean from the error logs (AFAIK).
Obviously the long-term fix is to get the code sorted, but that is an ongoing process (in fact it is one of the reasons I was hired). Our immediate concern is to get the servers back up to the sort of performance we were getting before we upgraded.
Does anyone have any ideas?
As ever, eternal gratitude and a crate of virtual beer for any help.
If you have a test system that you do not mind killing processes on, you can try picking off individual connections until the CPU recovers. If you have a trace running at the time, you should catch what that process was doing, and you will be able to target your fixes a little better.
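Something along these lines would do it (sysprocesses still works on 2005, and the spid here is just an example):

    -- See which connections are chewing the CPU
    SELECT spid, cpu, physical_io, status, loginame, hostname
    FROM master..sysprocesses
    ORDER BY cpu DESC;

    -- Then pick one off and watch whether the CPU recovers
    KILL 59;  -- example spid taken from the list above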
Just Profiler. Capture the RPC:Completed and SP:Completed events, which should give you the reads, writes, duration, and number of CPU ticks used. More importantly, the TextData field will tell you what query was being run at the time. Whether the procedure ends naturally or because you killed it, it should show in the trace. You may get a lot of garbage to sort through as well, so consider restricting the trace to an individual SPID when you go to kill it.
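If you'd rather script it than click through Profiler, a server-side trace along these lines captures the same events (the file path and the SPID filter value are placeholders):

    DECLARE @TraceID int, @maxsize bigint, @on bit;
    SET @maxsize = 100;  -- MB per trace file
    SET @on = 1;

    -- Create the trace; SQL Server appends .trc to the file name
    EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\traces\cpu_hunt', @maxsize;

    -- RPC:Completed (event 10) and SP:Completed (event 43) with
    -- TextData(1), SPID(12), Duration(13), Reads(16), Writes(17), CPU(18)
    EXEC sp_trace_setevent @TraceID, 10, 1, @on;
    EXEC sp_trace_setevent @TraceID, 10, 12, @on;
    EXEC sp_trace_setevent @TraceID, 10, 13, @on;
    EXEC sp_trace_setevent @TraceID, 10, 16, @on;
    EXEC sp_trace_setevent @TraceID, 10, 17, @on;
    EXEC sp_trace_setevent @TraceID, 10, 18, @on;
    EXEC sp_trace_setevent @TraceID, 43, 1, @on;
    EXEC sp_trace_setevent @TraceID, 43, 12, @on;
    EXEC sp_trace_setevent @TraceID, 43, 13, @on;
    EXEC sp_trace_setevent @TraceID, 43, 16, @on;
    EXEC sp_trace_setevent @TraceID, 43, 17, @on;
    EXEC sp_trace_setevent @TraceID, 43, 18, @on;

    -- Cut down the garbage: only trace the SPID you are about to kill
    EXEC sp_trace_setfilter @TraceID, 12, 0, 0, 59;  -- column 12 = SPID, 59 = example

    EXEC sp_trace_setstatus @TraceID, 1;  -- 1 = start, 0 = stop, 2 = close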
And how is the Prod code better than the Dev code? Time to do a code refresh from Prod?
Sorry - to clarify: the project is to move a whole host of legacy, procedural code acting on file-based hierarchical databases over to SQL Server. Some of the early work was not so hot. Some of these databases were put into testing and found very wanting (even before these specific problems), and so have not been released to prod.
We have installed a few gig of extra memory in one of the prod boxes. SP2 is due to be installed shortly... (the hotfix so early on made us a bit wary).
Based on this: http://www.sqlteam.com/item.asp?ItemID=26788
we have rebuilt all indexes and updated stats with a full scan on one db, and found a three-fold improvement for the ETL routine on it. We are in the process of running this on all our other databases, but at 18+ hours per db (one has been running for over 24 hours now) it will take a little while to cascade throughout our environment. We haven't actually had a crash since I posted (one load failed, but that was down to data corruption in the base table index), but we have had some very sluggish periods (bulk inserts in 1000-row batches ranging from a few hundred milliseconds to 10 minutes per batch). These are all (effectively) offline systems, so we can limit access to ensure priority work does not have to compete for resources.
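For anyone wanting to do the same, per table it boils down to something like this (the table name is made up):

    -- 2005 replacement for DBCC DBREINDEX: rebuild every index on the table
    ALTER INDEX ALL ON dbo.BigEtlTable REBUILD;

    -- Recompute statistics from every row rather than a sample
    UPDATE STATISTICS dbo.BigEtlTable WITH FULLSCAN;

We went table by table with FULLSCAN rather than just running sp_updatestats, since that only samples.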
If we get a crash again I will set up the traces - for now we are concentrating on trying some of the suggested fixes/optimisations.
Ok, you asked for my comments, but you won't like them!
First, more RAM. There is no bottom to SQL 2005; it can use everything I've been able to throw at it, and it gets faster every time I add RAM. This helps a ton in almost every case.
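One thing to double-check once the RAM is in (assuming the settings are still at their defaults): that SQL Server is actually allowed to grow into it. The cap below is just an example for an 8GB box.

    -- 'max server memory' is an advanced option
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Example: cap at 6GB, leaving roughly 2GB for the OS
    EXEC sp_configure 'max server memory (MB)', 6144;
    RECONFIGURE;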
Next, the reindexing is a great idea. It helps a lot too.
My guess is that you've got a couple of tables with really poor choices of clustered index, and you may have some really bad choices of PRIMARY KEY definitions. These are usually the "big hammer" design problems. Set your primary key to a single column (either an IDENTITY or a GUID) if you can, and whatever you use, smaller (in bytes) is better. Make sure that your clustered index presents rows in the order in which they are most frequently processed, regardless of whether that order is unique or even seems to make sense from an ER perspective. If you can get composite sub-keys out of the clustered index, that is a bonus!
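A sketch of what I mean, with invented table and column names:

    -- Narrow surrogate PRIMARY KEY, deliberately NOT the clustered index
    CREATE TABLE dbo.LoadStaging (
        LoadStagingID int IDENTITY(1,1) NOT NULL,
        BatchID       int           NOT NULL,
        ProcessedOn   datetime      NOT NULL,
        Payload       varchar(200)  NOT NULL,
        CONSTRAINT PK_LoadStaging PRIMARY KEY NONCLUSTERED (LoadStagingID)
    );

    -- Cluster on the order the ETL actually walks the rows; it does not
    -- need to be unique or to make sense from an ER point of view
    CREATE CLUSTERED INDEX CIX_LoadStaging_Batch
        ON dbo.LoadStaging (BatchID, ProcessedOn);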
Without more detailed knowledge of what you have, what you are doing, and what are your "pain points", this is the best general advice I can give.