Results 1 to 13 of 13
  1. #1
    Join Date
    Jul 2010
    Location
    Norway
    Posts
    45

    Unanswered: SQL 2005 R2 - High CPU usage, locate the database or query causing the trouble...

    Hello

    SQL Server 2005 [not 8 as in title, but I have 2008 R2, so I can use that management studio and connect to the 2005...], where sqlserv.exe (the process) is going crazy in CPU usage, total databases is roughly 30.

    The sqlserv.exe is set to high priority in task manager (high cpu priority).

    The sqlserv.exe never slows down, always 85-99% CPU usage.

    sqlserv.exe runs at 10.6 GB memory, static, this is the limit set within management studio > server properties > memory > 10.000 MB.

    Last, but not least; a software update has been done by a company that delivers the softare (and the software's database is most likely the one causing the whole sql server to use a lot of cpu).
    A reboot was done after the company updated their software on several servers (excluding the sql server, the database remained the same and was untouched).
    So... I figured I wanted to do some research myself, hence this post.

    To begin with, searched in activity monitor and task scheduler (made sure it is sqlserv.exe, nothing else consuming CPU). Activity monitor has no scheduled jobs that takes more than a few seconds, and these run not that often (one job runs every 5 minute, rest runs at night).

    Checked other servers, where the tools are installed, those are running as expected (at first glance)...

    Back to SQL, checking for long running queries/which database uses how much CPU:

    Code:
    SELECT TOP(100)
      total_worker_time/execution_count AS AvgCPU  
    , total_worker_time AS TotalCPU
    , total_elapsed_time/execution_count AS AvgDuration  
    , total_elapsed_time AS TotalDuration  
    , (total_logical_reads+total_physical_reads)/execution_count AS AvgReads 
    , (total_logical_reads+total_physical_reads) AS TotalReads
    , execution_count   
    , st.encrypted
    , ISNULL(CAST(st.dbid AS VARCHAR(200)),'') + ' - ' + ISNULL(CAST(st.objectid AS VARCHAR(200)), '') as 'dbid'
    , ISNULL(DB_NAME(st.dbid),'') + ISNULL(OBJECT_NAME(st.objectid), '') as 'db'
    , query_plan
    ,SUBSTRING(st.TEXT, 0, 64) as 'Start query'
    , SUBSTRING(st.TEXT, qs.statement_start_offset+1, 1024) as 'query'
    FROM sys.dm_exec_query_stats AS qs  
    cross apply sys.dm_exec_sql_text(qs.sql_handle) AS st  
    cross apply sys.dm_exec_query_plan (qs.plan_handle) AS qp 
    ORDER BY total_worker_time DESC, 1 DESC
    Query above selects databasename and CPU cost/execution time, etc from exec_query_stats.

    Results in these rows:
    http://s27.postimg.org/bmd0d4rw1/Lon...ng_Queries.png

    Question 1:
    Why is the first row returned when ordered by total_worker_time (CPU) as shown in the image above, this query:
    Code:
      create procedure sys.sp_tablecollations  (      @object nvarchar(4000)  )  as      select           colid               = s_tcv.colid,          name                = s_tcv.name,          tds_collation       = s_tcv.tds_collation_28,          "collation"         = s_tcv.collation_28      from          sys.spt_tablecollations_view s_tcv      where          s_tcv.object_id = object_id(@object, 'local')      order by colid
    This is strange to me, a "create procedure" is being called every now and then? Or does it just look like it? There's something wrong with the query itself?


    Stats stored in "dm_exec_query_stats" are reliable? Google'd of course, and one guy said it was only cleaned on reboots, and clearing buffer and statistics does not help, is this the case?
    From when is these stats stored, upon every query ran, the CPU time etcd

    Side note:
    Funny thing, in my eyes... If you look at the image, the highest "average duration", row number 20 in the image, has a database ID 32767 (but the DB-name could not be retrieved, hence the blank value)... While the query for the row 20 is:
    Code:
      ------------------------------- xp_logininfo ----------------
    That's funny, must be something hidden going on!


    Additional things I've done
    SQL Profiler, I have ran it, for these events:
    RCP: Completed
    SP:Completed
    SQL:BatchCompleted
    SQL:BatchStarting

    Without "luck", the largest duration when it ran for 30 seconds was 5 queries running at 8000 duration (8 seconds)... So there must be a different event causing the problem.

    Tried sortin the profiler after starting time, seeing if there are too many events occuring within fraction of a second (bringing the CPU to it's knees):
    http://s21.postimg.org/adz6ya02e/CPUProfiler.jpg
    But, with the events I am listening to above, there's not much going on here that takes up CPU.

    PS: My "instinct" tells me there must be some compiling or recompiling of queries, something awful going on...

    At the moment going over this http://mssqlwiki.com/tag/cpu-usage-100-sqlservr-exe/, and later found another article and came up with this little query:
    Code:
    select spid, waittime, lastwaittype, dbid, DB_NAME(dbid) as 'dbname', uid, cpu, physical_io, memusage, status, hostprocess, cmd, loginame  from sys.sysprocesses  
    WHERE SPID < 51
    ORDER by cpu DESC
    Resulting rows, from the above query:
    http://s2.postimg.org/l0m50w6bd/CPUProcess.png
    The CPU and disk (physical_inputoutput) has some huge numbers, but I do not know if these are "default" (it's a standard thing, that these should have large numbers, which just gives you an "image" of that disk is used a lot/not and so is CPU...

    Though, I do note the "LOGMGR_QUEUE", its status is "SUSPENDED"... Looking it up at the moment.

    Question 2
    Suggestions? What should be my next move?


    Edit:
    After a bit more research, found several nice articles and joined them together into this:

    Code:
    SELECT getdate() as 'RunTime', 
    a.status, a.database_id, a.user_id, a.blocking_session_id, a.wait_type, a.wait_time, a.last_wait_type, a.cpu_time, a.total_elapsed_time, a.reads, a.writes, a.logical_reads,
    st.text as batch,
    SUBSTRING(st.text,statement_start_offset / 2+1 , 
    ( (CASE WHEN a.statement_end_offset = -1 
    THEN (LEN(CONVERT(nvarchar(max),st.text)) * 2) 
    ELSE a.statement_end_offset END)  - a.statement_start_offset) / 2+1)  as current_statement
    ,qp.query_plan
    FROM sys.dm_exec_requests a 
    CROSS APPLY sys.dm_exec_sql_text(a.sql_handle) as st
    CROSS APPLY sys.dm_exec_query_plan(a.plan_handle) as qp
    LEFT JOIN sys.dm_os_tasks as dt on a.session_id = dt.session_id
    order by CPU_time desc
    The query above displays currently running queries and their CPU time, CPU usage, starting time and the query itself as text.

    Question 3:
    But wondering, how does DM_OS_TASKS work in the query above?

    From MSDN:
    "Returns one row for each task that is active in the instance of SQL Server"

    If I join on "DM_OS_TASKS", the amount of rows returned are increased, greatly. From 5 to 41. See image:
    http://s29.postimg.org/n06fqo1ue/Suspendedqueries.jpg

    The query returned though, that seems to be messing it up, I have copied and ran one occasion of it, and it took 125 seconds...

    Note: if I wait a few minutes, the run-time in the image above is different, but there's still roughly ~40-70 queries going on, where 90% of them are instances of the "long running query".
    So, I believe I have found the right query (80% sure)...


    But my assumptions of why there's so mayn more rows, when joining on DM_OS_TASKS is due to MAXDOP, max degree of parallelism. A query considered "large" by SQL, splits the work on several threads per instance of the query.
    And there seem to be a few instances of the query * amount of threads per instance = a lot rows... Assumption plain wrong?
    Last edited by ManyTimes; 01-07-15 at 10:40. Reason: Reworded a lot

  2. #2
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    There's two reasons why a SQL Server may go to 100 % CPU. A few very large queries, or many many small queries. So let's have a look at what you have so far (which is quite a bit, actually).

    The number one query from your first query on sys.dm_exec_query_Stats is a system procedure. For myself, I tend to not pay attention to system processes or system procedures. There is not a lot you can do about them. This one however smells like an ORM layer gone mad. If they are constantly having to refresh some of the metadata about the tables in their application, this could be a fair sized problem. Fortunately, the metadata is relatively small, so these queries should not be racking up a large number of logical reads. Re-order this query by total logical reads, and see who bubbles up to the top.

    Go through the profiler trace, and see if you can get a feel for how often some queries are run. Here's a query that I built up over some time to help decipher a profiler trace (someday I will have to redo this for Extended Events)
    Code:
    set nocount on
    declare @length int
    set @length = 50
    --drop table #temp
    select count(*) as instances, 
    substring (case when textdata like 'exec%' and charindex (' ', textdata, 6) > 0 then replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, charindex (' ', textdata, 6)), char(10), ' '), '	', ' '), char(13), ' ')
    		when textdata like 'exec%' then replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, @length), char(10), ' '), '	', ' '), char(13), ' ')
                    when textdata like '%N''exec %' then substring (replace (replace (replace (substring (textdata, charindex('N''exec ', textdata) + 7, @length), char(10), ' '), '	', ' '), char(13), ' '), 1, charindex(' ', substring (textdata, charindex('N''exec ', textdata) + 7, @length)))
    		when textdata like 'sp[_]%' then replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, charindex (' ', textdata)), char(10), ' '), '	', ' '), char(13), ' ')
    		else replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, @length), char(10), ' '), '	', ' '), char(13), ' ') end, 1, @length) as formattedtextdata,
    	avg (isnull(cpu, 0)) as AvgCPU, max(isnull(cpu, 0)) as MaxCPU, sum(isnull(cpu, 0)) as TotalCPU, 
    	avg(isnull(reads, 0)) as AvgReads, max(isnull(reads, 0)) as MaxReads, sum(isnull(reads,0)) as TotalReads,
    	avg(isnull(writes, 0)) as AvgWrites, max(isnull(writes, 0)) as MaxWrites, sum(isnull(writes,0)) as TotalWrites,
    	avg(isnull(duration, 0)) as AvgDuration, max(isnull(duration, 0)) as MaxDuration, sum(isnull(duration, 0)) as TotalDuration into #temp
    from tracetable
    where EventClass in (10, 12)
    group by	substring (case when textdata like 'exec%' and charindex (' ', textdata, 6) > 0 then replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, charindex (' ', textdata, 6)), char(10), ' '), '	', ' '), char(13), ' ')
    		when textdata like 'exec%' then replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, @length), char(10), ' '), '	', ' '), char(13), ' ')
                    when textdata like '%N''exec %' then substring (replace (replace (replace (substring (textdata, charindex('N''exec ', textdata) + 7, @length), char(10), ' '), '	', ' '), char(13), ' '), 1, charindex(' ', substring (textdata, charindex('N''exec ', textdata) + 7, @length)))
    		when textdata like 'sp[_]%' then replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, charindex (' ', textdata)), char(10), ' '), '	', ' '), char(13), ' ')
    		else replace (replace (replace (substring (ltrim(substring (textdata, 1, @length)), 1, @length), char(10), ' '), '	', ' '), char(13), ' ') end, 1, @length)
    
    -- 'Usage by Login Name'
    select 'Login', count(*) as "Total Queries", substring(LoginName, 1, 30) as LoginName, avg(cpu) as "Avg CPU", max(cpu) as "Max CPU", sum(cpu) as "Total CPU", 
    	avg(reads) as "Avg Reads", max(reads) as "Max Reads", sum(reads) as "Total Reads", 
    	avg(writes) as "Avg Writes", max(writes) as "Max Writes", sum(writes) as "Total Writes",
    	avg(Duration) as "Avg Duration", max(Duration) as "Max Duration", sum(Duration) as "Total Duration"
    from tracetable
    where spid is not null
      and eventclass in (10, 12)
    group by loginname
    order by "total reads" desc 
    
    -- 'Usage by Host Name'
    select 'Host', count(*) as "Total Queries", substring(isnull (HostName, ''), 1, 30) as "HostName", avg(cpu) as "Avg CPU", max(cpu) as "Max CPU", sum(cpu) as "Total CPU", 
    	avg(reads) as "Avg Reads", max(reads) as "Max Reads", sum(reads) as "Total Reads", 
    	avg(writes) as "Avg Writes", max(writes) as "Max Writes", sum(writes) as "Total Writes",
    	avg(Duration) as "Avg Duration", max(Duration) as "Max Duration", sum(Duration) as "Total Duration"
    from tracetable
    where spid is not null
      and eventclass in (10, 12)
    group by hostname
    order by "total reads" desc 
    
    -- 'Usage by Application Name'
    select 'Application', count(*) as "Total Queries", convert(varchar(50), isnull(ApplicationName, '')) as ApplicationName, avg(cpu) as "Avg CPU", max(cpu) as "Max CPU", sum(cpu) as "Total CPU", 
    	avg(reads) as "Avg Reads", max(reads) as "Max Reads", sum(reads) as "Total Reads", 
    	avg(writes) as "Avg Writes", max(writes) as "Max Writes", sum(writes) as "Total Writes",
    	avg(Duration) as "Avg Duration", max(Duration) as "Max Duration", sum(Duration) as "Total Duration"
    from tracetable
    where spid is not null
      and eventclass in (10, 12)
    group by ApplicationName
    order by "total reads" desc 
    
    -- Get the top results
    select top 100 *
    from #temp
    order by totalreads desc
    This will give you an idea of who your top talkers are, and what queries are beating you up the most. You wil most likely want to load your trace table on a dev machine, so the load from this query does not drag down your production server. The string operations, and grouping on them takes a little while.

    As for the SPIDs below 51, I would not spend much time on those. The CPU numbers on sysprocesses are cumulative over the life of the instance (since the last reboot). High CPU numbers here are relatively normal.

    If you suspect that compilations are a problem, you can break out perfmon, and monitor for SQL Statistics : SQL Compilations/Sec, and compare that to SQL Statistics:Batch Requests/Sec. If you are over say 10% compilations, you might have a problem, but that is just a ballpark figure.

    As to joining to sys.dm_os_tasks, you should see a row in sys.dm_os_tasks for each thread a process has spawned, so you are likely seeing a lot of parallel processing going on. This is not always a bad thing, of course. If you have a lot of cores on this machine, you could set MAXDOP to something less than all of the available processors. This will keep queries from gobbling up the whole box for each run. Another thing I have seen suggested is resetting "cost threshold for parallelism" to 20 or 25 from the default of 5. I have not played with this so much, so you will have to do a bit of your own science on this to see if it helps you (looks like you are OK with that, anyway). From what I have seen, there are no hard rules for what the cost threshold should be. It seems to be dependent on your particular workload.
    Last edited by MCrowley; 01-07-15 at 10:40. Reason: Trying to remove spurious smiley

  3. #3
    Join Date
    Jul 2010
    Location
    Norway
    Posts
    45
    Oh a reply, been busy last 30 minutes editing my post with my findings and try's and fails.

    Trying your query out tomorrow, calling it a day. :P

    1.
    ORDER BY total_logical_reads DESC, 1 DESC
    Ordering by logical reads, produced the same top row.

    2.
    SPIDS below 51, yes, figured that out. Those are just numbers that boils up over time anyways.

    3.
    Used perfmon for thread ID...but did not get anywhere... So, I see, should have looked for SQL Statistics Compilations/Sec, trying that tomorrow, enough work for today.


    Nice idea, MAXDOP is set to 0, letting SQL decide. Might reduce/increase it so it never gets THIS busy due to a single query.

    Remember I looked up cost treshold a few months back (or a year) and changed it, on a different server though.

    Rarely on "this side" of SQL-server, so thanks again!
    Last edited by ManyTimes; 01-07-15 at 11:02.

  4. #4
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    Quote Originally Posted by ManyTimes View Post
    Side note:
    Funny thing, in my eyes... If you look at the image, the highest "average duration", row number 20 in the image, has a database ID 32767 (but the DB-name could not be retrieved, hence the blank value)... While the query for the row 20 is:
    DBID = 32767
    This ID is reserved for the "Resource Database" (internal to the SQL engine)

    Pretty sure you can just exclude it for your purposes (I highly doubt you'll be able to do anything with it anyway!)
    George
    Home | Blog

  5. #5
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    The suspended threads are interesting, but probably not for the reason you think. The CX Packet wait is usually benign, and can never be eliminated, unless you go with a MAXDOP of 1, which is usually a bad idea. What concerns me about the suspended queries is the fact that the same aggregate query is being run on all three sessions. What is this query that starts off "select count(*), max(docmaster......"? I am not sure if they have created their own identity function, or what. With three of those running at the same time, I would suspect this query, and see what can be done about it.

  6. #6
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    At the risk of introducing one of these: red herring

    But what Mcrowley pointed out sounds promising.
    http://s29.postimg.org/n06fqo1ue/Suspendedqueries.jpg
    Look at the wait times, is there a [b]locking issue?

    Would be interested in seeing the whole code block, too.
    George
    Home | Blog

  7. #7
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    Blocking would be listed under a different wait type (LCK_M_IS for example). The CX packet waits are just a symptom of parallelism, although there is a chance that it could reflect bad statistics. Brent Ozar had a nice article about just this wait type.

  8. #8
    Join Date
    Jul 2010
    Location
    Norway
    Posts
    45
    gvee:
    ID 32767 is reserved, alright, thanks! I think it is funny though, a top running query, which just consisted of a 1 line comment, ran from a reserved db or not!

    The bottle-neck is the query that is running, it just keeps on running on parallel tasks, and all it does it counts a column, returns the min value of that column and the max value of that column, but in a MESS of IF, sub-selects, really strange thing. Not "man-made", but seem like it is auto-generated, so +1 to you MCrowley for mentioning ORM...

    Although I cannot really know, unless I stop the query, or change the way queries are executed... Hence my next efforts: Changing MAXDOP and Threshold!


    Question
    So, how would I best setup the value of MAXDOP, which is now currently 0?
    Task manager shows 4 CPU's.

    The CPU is:
    http://ark.intel.com/products/64591/...-GTs-Intel-QPI

    #Cores 6
    #Threads 12 per core

    Of couse I have read Brent Ozar's "MAXDOP": http://www.brentozar.com/archive/201...-query-slower/
    And look of it, 8 seems like a nice number, to begin with, but it also states max cores per CPU... So 6?

    I could just "try and fail", executing the query, see if it runs faster or if the CPU usage goes down... Basically it... I'll send a short e-mail to the support team of the application, if they have some pointers, because its "their" tool and database!



    Side note: to the "red herring"; fully aware of that it is hard to help, when you're not on site. But, helped indeed!
    Last edited by ManyTimes; 01-08-15 at 08:24.

  9. #9
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    One of the oddities of the system tables that deal with the procedure cache is that the query text reported is not just the name of the procedure, as you may expect, but the whole create procedure syntax. I am not sure why they did it that way, but there it is. The xp_logininfo procedure is just a result of the act of logging in to the server, and accessing a database. As a side note, how often are they logging in/out? Are they logging off after each statement? That could put a pretty invisible load on the server.

    Is this a virtual server? I can't think of another way to account for a 6 core processor (which would appear as 12 after Hyperthreading) can appear as only 4 CPUs in Task Manager. With so few (4) cores to work with, playing with the MAXDOP may not get you very far. I would start by setting it to 2, so you could potentially have 2 of these queries running without starving each other, but I would expect the overall performance to go down.

    As for the query, could an index on the column they are getting a max/min on help at all? Or are they going to a bunch of other tables as well? I am assuming here that the parallel workload is due to a table scan on this table (without seeing all the sub-selects, IFs, and other assorted sub-functions). If an index is already there, you can check the stats on it with the STATS_DATE function.
    Last edited by MCrowley; 01-08-15 at 11:17. Reason: Added a question.

  10. #10
    Join Date
    Sep 2005
    Posts
    8
    You could try deploying the sp_whoisactive to your server and use the following:

    EXEC master.dbo.sp_WhoIsActive
    @get_plans = 1,
    @get_outer_command = 1,
    @get_transaction_info = 1,
    @get_locks = 1,
    @get_additional_info = 1

  11. #11
    Join Date
    Sep 2005
    Posts
    8

    who is active

    sp_whoisactive will be beneficial for determining what is using how much of what resources.

  12. #12
    Join Date
    Jul 2010
    Location
    Norway
    Posts
    45
    MCrowley:
    Yes, this is a virtual server.

    The query goes to a bunch of other tables, access-tables, usertables... so, ignored indexes, it would "just" fasten up the query.


    Changing the maxdop and threshold value, I observed the following drops in CPU-usage:

    Some tests:
    Code:
    MAXDOP                  THRESHOLD             CPU%
    0                             5                             88-99
    2                             1500                        50-85
    2                             20                           75-90
    6                             25                           66-99    
    6                             50                           50-97
    3                             25                           99-100
    2                             80                           99-100
    6                             30                           55-97
    6                             150                         55-98
    6                             500                         28-80
    6                             250                         55-78
    6                             175                         28-90
    The estimations are done over a period of ~10 minutes, at my servers and databases, and after each change I have waited ~2 minutes to make sure executing the "infiite queries" are running with new values.

    CPU-usage
    Fully aware of that less CPU-usage does not imply better performance. Less CPU is consumed, same queries runs, results in longer execution time.
    But in my case; 4 queries running infinitely, so I believe blocking 2 CPU's and freeing the 2 other CPU's will result in... power saving and performance degradation! :P


    Conclusion:
    Settled down with
    - maxdop 6, threshold 175

    The query that has gone "mad", has a cost of 573, and I am aware of that a high threshold executes many queries in serial instead of parallel, which might not be ideal.

    Waiting to see if users next week complains about performance or not.

    Next challenge
    I'll try to forcefully quit these queries, see if that changes anything (if I dare to actually stop them), and if that works out, creating a script that runs every 1 minute to kill those running queries.


    >>MCrowley: "I would suspect this query, and see what can be done about it."
    Yes, indeed correct. It is the query, and I looked at what can be done with it: nothing. There's no stored procedures nor functions... The queries are within the programs themselves, and I got no clue which program or service is making the call, and the title of this thread is locating the query/database, so... Guess my "research" ends here.

    Thanks a ton for helping and bringing thoughts, very much appreciated!
    Last edited by ManyTimes; 01-09-15 at 11:26. Reason: Typos, rewording...

  13. #13
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    Should I ask if there is an index on the username and value (separate indexes for each won,t help) or if the cardinality of such an index is low enough for an index to help? How does the query plan for those four queries look?

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •