  1. #1
    Join Date
    Dec 2006
    Posts
    43

    Unanswered: Quad-CPU database server running at 100% - any pointers?

    Hi guys,

    We are experiencing some severe performance problems with our production application. The bottleneck seems to be the database server, a quad-Xeon 2.0GHz machine with 2GB of RAM running Oracle 9i, which often ends up having all 4 CPUs running at 100% with the oracle.exe processes being the main culprits.

    Our application is Java EE based, running on GlassFish with a dynamic database connection pool (10 to 100 connections). We see loads of between 10 and 150 simultaneous users, with an average of around 50. We implemented aggressive caching of a lot of data in this latest release, but surprisingly it doesn't seem to give the expected results.

    I am no DBA and unfortunately have no competent resource at hand to solve this problem. I know my description is very vague, but maybe someone could give me some pointers on how to improve the performance of our database server? Specifics, statistics, documentation, anything!

    Thanks a bunch,
    GB
    Last edited by gbilodeau; 12-21-06 at 04:39.

  2. #2
    Join Date
    Nov 2006
    Location
    Indianapolis
    Posts
    90
    Your best bet is to get a DBA and look into the details - If it were me I'd start with these questions:

    1) I'm assuming it's a Windows database box...
    2) What's the history? Is this a new app? Did it suddenly start performing slowly? After an upgrade? What has changed?
    3) Since the CPUs are pegged, I'm assuming most application functions are slow
    (and not hit-or-miss, application-wise)?
    4) Quantify the performance - any difference between no users, 10 users, 100 users?
    5) Is the 9i database patched?

    A DBA would run statspack every 5 to 15 minutes during the hot periods, look for top SQL, look at system and session waits, session activity.

    Could be a lot of things - missing or wrong statistics, poor execution plans, app jobs in infinite loops...
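
    The statspack routine described above looks roughly like this (a sketch; it assumes the PERFSTAT schema from spcreate.sql is already installed):

    Code:
    -- as PERFSTAT, take a snapshot at the start of a busy period
    execute statspack.snap;
    -- wait 5 to 15 minutes, then take another
    execute statspack.snap;
    -- generate a report between the two snapshot ids
    @?/rdbms/admin/spreport.sql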

  3. #3
    Join Date
    Sep 2004
    Location
    London, UK
    Posts
    565
    Dare I mention bind variables and SQL parsing?

    Try querying v$system_event to see what the instance has spent its time waiting for.
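
    For example, something like this (a minimal sketch; the idle events filtered out here are the common ones, and wait times are in centiseconds in 9i):

    Code:
    select event, total_waits, time_waited, average_wait
    from v$system_event
    where event not in ('SQL*Net message from client', 'rdbms ipc message',
                        'pmon timer', 'smon timer')
    order by time_waited desc;

    With the idle events excluded, the real waits float to the top.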

  4. #4
    Join Date
    Dec 2006
    Posts
    43

    v$system_event

    Hi guys and thanks for the tips.

    Following the suggestion to look at the v$system_event table, I've run the query listed here: http://www.praetoriate.com/oracle_ti...ent_events.htm

    What seems to come out of this:
    - the "db file scattered read" and "db file sequential read" events seem to have the highest "total_waits" and "time_waited" values (by far)
    - the "log file parallel write" (by far) and "db file parallel write" events have the highest "total_timeouts"
    - the "control file heartbeat" event has the highest "average_wait" (by far)

    What should I make of this? From what I understand the database is scattered. How can I "defrag" it?

    Thanks again!

  5. #5
    Join Date
    Dec 2003
    Location
    Oklahoma, USA
    Posts
    354
    GB, I would have to echo "tomcat's" questions... and add a couple of my own.

    1) Are there ANY other apps running on this machine? For instance, is the "database server" also functioning as a PDC, BDC, DHCP server?

    2) How do you know it is the database server causing the problem? Did this application "perform well" in a development environment? Is the new production environment similar to the development environment?

    3) Have you made any attempts to tune the SQL statements contained in your Java app?

    4) Also, how many database instances are running on this one machine? If there is just one, 2GB should be fine, however, with two or more, you'd be pushing it especially on a Windows box... Since you said the "running at 100% with the oracle.exe PROCESSES" I assume you might have more than one running on this machine.
    Last edited by joebednarz; 12-26-06 at 15:28.
    JoeB
    save disk space, use smaller fonts

  6. #6
    Join Date
    Dec 2006
    Posts
    43
    Hi, here are some answers - maybe they will help in diagnosing the problem before a DBA can come (if that miracle ever happens!):

    1) Yes, it's running on Win2000.

    2) This is a version 2.0 application, and version 1.0 was running in the same environment although on a different application server. Performance was already slow in V1 - it's impossible to put numbers on whether V2 is faster or slower.

    Another thing that has changed is that a 2nd Oracle instance was added for development - on the actual production database server, for lack of other hardware. Only 3 developers and a part-time tester connect to it. Could this really cause a big performance loss?

    3) Both applications, which are hosted on a different application server, run equally slow. The database server is dedicated to running Oracle - however as I said the development instance is running on it as well as the production instance.

    4) Performance (obviously) degrades as the number of users increases, although CPUs can be almost maxed with even a low number of users.

    5) No idea

    Some more info that we collected recently:

    - Stats were recalculated a few weeks ago
    - Using the Oracle Management Console, we have followed open sessions and noticed that "Long Operations" always involved queries targeting 3 specific tables. These tables are somewhat large (300,000, 1,200,000 and 1,400,000 records respectively, and growing) and are queried on different fields.

    Any comments would be greatly appreciated!

    Thanks a lot,
    GB

    Quote Originally Posted by Indy_tomcat
    Your best bet is to get a DBA and look into the details - If it were me I'd start with these questions:

    1) I'm assuming it's a Windows database box...
    2) What's the history? Is this a new app? Did it just suddenly start performing slow? After an upgrade? What has changed?
    3) Since CPU's are pegged, I'm assuming most application functions are slow?
    (and not hit or miss, application wise)
    4) Quantify the performance - any difference between no users, 10 users, 100 users?
    5) Is the 9i database patched?

    A DBA would run statspack every 5 to 15 minutes during the hot periods, look for top SQL, look at system and session waits, session activity.

    Could be a lot of things - missing or wrong statistics, poor execution plans, app jobs in infinite loops...

  7. #7
    Join Date
    Dec 2006
    Posts
    43
    Hi,

    1) Nope, just Oracle on Win2000. As you correctly spotted, the production instance and a development instance are both running on it, for lack of hardware.

    2) A quad-CPU machine maxed at 100% is the main symptom! But seriously, applications do perform slowly in the development environment as well, though this is surely related to 1).

    3) As much as we could. We cached most of the "reference model" and tuned some SQL statements, although one heavily-used feature is, by its inherent nature, database intensive: it queries 3 tables with 300,000, 1,200,000 and 1,400,000 records (and growing), with joins and varying where clauses.

    4) Yup, well spotted. Can this really have such an impact?

    We'll try to move the development instance to another server. Otherwise would you have any other suggestions?

    Thanks a lot for your precious help!
    GB

    Quote Originally Posted by joebednarz
    GB, I would have to echo "tomcat's" questions... and add a couple of my own.

    1) Are there ANY other apps running on this machine? For instance, is the "database server" also functioning as a PDC, BDC, DHCP server?

    2) How do you know it is the database server causing the problem? Did this application "perform well" in a development environment? Is the new production environment similar to the development environment?

    3) Have you made any attempts to tune the SQL statements contained in your Java app?

    4) Also, how many database instances are running on this one machine? If there is just one, 2GB should be fine, however, with two or more, you'd be pushing it especially on a Windows box... Since you said the "running at 100% with the oracle.exe PROCESSES" I assume you might have more than one running on this machine.

  8. #8
    Join Date
    Nov 2006
    Location
    Indianapolis
    Posts
    90
    You're really starting out from scratch on tuning this - there's nothing to compare against.
    There are tons of possible things to look at; my first thoughts would be:

    See how large those big tables are in # of blocks, gigs
    Statspack is best but the "shotgun" approach sql below may help identify the larger SQL
    Look for SQL with high logical I/O per row or block returned

    Code:
    col TOTAL_LIO for 999,999,999,999,999
    col TOTPERC for 999.99
    col LIOS for 999,999,999,999
    
    select a.hash_value, a.exes, a.pios, a.lios, a.nrows, a.lio_exe, a.lio_row,
      round(a.lios/b.totlio,4) * 100 "LIO_PERC"
    from
    (select * from
    (select 
      hash_value, 
      sum(executions) exes, 
      sum(disk_reads) pios, 
      sum(buffer_gets) lios, 
      sum(rows_processed) nrows, 
      round(sum(buffer_gets)/(sum(executions)),2) "LIO_EXE", 
      round(sum(buffer_gets) / decode (sum(rows_processed), 0, 1, sum(rows_processed)),2) "LIO_ROW"
    from gv$sqlarea where executions>0 
    group by hash_value order by 4 desc) where rownum<21) a,
    (select sum(buffer_gets) "TOTLIO" from gv$sql where executions>0) b
    order by lio_row desc;
    Check for # of checkpoints per hour (< 10 would be best)

    Code:
    select trunc(completion_time,'HH'), count(*) from v$archived_log
    where completion_time > trunc(sysdate)-2
    group by trunc(completion_time,'HH')
    order by trunc(completion_time,'HH');

  9. #9
    Join Date
    Dec 2006
    Posts
    43
    Hi,

    More and more and it looks like we really need a DBA for this! In the mean time thanks a bunch for helping me out.

    I tried to get some statistics on the problematic tables and here's what I could get:

    - T1: 12901 used blocks, 0 empty blocks, 0 chained blocks, 85 extent count
    - T2: 2890 used blocks, 0 empty blocks, 0 chained blocks, 39 extent count
    - T3: 10605 used blocks, 0 empty blocks, 0 chained blocks, 82 extent count

    Block size is 8192 so I suppose this adds up to 105MB, 24MB and 87MB.

    Is there anything suspicious about this? The fact that there are no empty blocks seems strange to me - maybe it's spending too much time extending the table?

    Once again, thanks a bunch!
    GB

    Quote Originally Posted by Indy_tomcat
    You're really starting out from scratch on tuning this, there's nothing to compare against
    There are tons of possible things to look at, my first thoughts would be:

    See how large those big tables are in # of blocks, gigs
    Statspack is best but the "shotgun" approach sql below may help identify the larger SQL
    Look for SQL with high logical I/O per row or block returned

  10. #10
    Join Date
    Dec 2006
    Posts
    43

    Got a statspack report

    Ok guys, I've installed STATSPACK, took 2 snapshots 15 minutes apart and generated the attached report.

    Following some general advice on the Internet, it looks like these things should worry me:

    - Execute to parse % is quite low (4%)
    - Parse CPU to Parse Elapsd % is also quite low (16%)
    - % Non-Parse CPU could be higher (75%)
    - some queries use huge in clauses (select... from... where x in ...) which result in lots of buffer gets

    We do use queries with big "in clauses" which are built dynamically using Java string concatenation. Is this something we should try to change?

    What should I make of all this?

    Once again, thanks a million!
    GB
    Attached Files

  11. #11
    Join Date
    Sep 2004
    Location
    London, UK
    Posts
    565
    "db file sequential read" tends to mean index reads, which can be CPU-intensive. Could any of the top SQL statements in the "SQL ordered by Gets" section perhaps be using indexes when they shouldn't?

    Low Execute to Parse ratio suggests the app is parsing a lot of statements and never executing them, which would appear to be wasted effort unless I'm missing something.
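
    One way to see this instance-wide (a sketch) is to compare the parse and execute counters in v$sysstat:

    Code:
    select name, value
    from v$sysstat
    where name in ('parse count (total)', 'parse count (hard)', 'execute count');

    If 'parse count (total)' is close to 'execute count', nearly every execution is paying for a parse.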

    It's odd that most of the database time seems to have been spent on events generally regarded as idle ("SQL*Net message from client" and "virtual circuit status") when the box is maxing out 4 CPUs. I wonder whether these are actually indicating a problem. e.g. in Cary Millsap's "Optimizing Oracle Performance", he is analysing a trace file and he comments:

    Because of a database driver deficiency, this application actually submitted each parse call to the database two times ... Even though the parse calls were routinely inexpensive, ... the response times for the overall user action suffered brutally from the tremendous number of unnecessary "SQL*Net message from client" executions...
    In a footnote he adds,
    Lots of drivers provide an option to behave this way. The extra parse is used to produce a "describe" of the SQL being parsed, so that the driver can provide more informative error messages for the developer. Even the Perl DBI behaves this way by default.
    ...and in a later chapter:
    The remedy for this event is to eliminate as many unnecessary database calls as possible. For example, eliminate redundant parse calls. Use Oracle's array processing features to manipulate many rows per database call instead of just one.
    The book also mentions a case where a SQL*Net misconfiguration caused excessive "SQL*Net message from client" wait times.

    Of course this could all be a total red herring and there was simply a SQL*Plus session sitting idle, most likely the one that was used to start the Statspack report (except that there were apparently 7,868 waits of 131 ms each, rather than one long one, so I'm not so sure).

    Quote Originally Posted by gbilodeau
    We do use queries with big "in clauses" which are built dynamically using Java string concatenation. Is this something we should try to change?
    Yes, especially if there are a lot of them.
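
    For illustration (table and column names here are hypothetical), the difference looks like this:

    Code:
    -- concatenated literals: every distinct list is a brand new statement to parse
    select * from goods where code in (101, 205, 317);
    select * from goods where code in (102, 99);

    -- bind variables: one shared cursor, re-executed with different values
    select * from goods where code in (:b1, :b2, :b3);

    -- or stage the ids in a global temporary table and join,
    -- which avoids variable-length IN-lists altogether
    insert into gtt_ids (id) values (:b1);
    select g.* from goods g join gtt_ids t on t.id = g.code;

    Padding the IN-list to a few fixed sizes (say 5, 10, 20 binds) also keeps the number of distinct statements small.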
    Last edited by WilliamR; 12-27-06 at 12:22.

  12. #12
    Join Date
    Nov 2006
    Location
    Indianapolis
    Posts
    90
    I agree with William - I would look at execution plans and indexes on the big-table SQLs. The problem is I/O (doesn't mass disk I/O on Windows tend to use lots of CPU?)

    Looks like the app is opening a new cursor (parses=executions) for each sql it runs; this costs cpu & network, and does not scale. If it is, only a coding change can fix that.
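
    A quick way to confirm that from the shared pool (a sketch) is to look for statements whose parse calls track their executions:

    Code:
    select parse_calls, executions, substr(sql_text, 1, 60) sql_text
    from v$sqlarea
    where executions > 0
      and parse_calls >= executions
    order by parse_calls desc;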

    How slow/busy was the system when this statspack was taken? Doesn't seem like it would kill a 4 cpu box.

    Top sqls were only run once each, not much else close to these two:
    954071772 - 1.4 million lios, 1765 pios.
    634999307 - 24k lios, 14k pios

    I would see if these exact SQLs are run more, or if they're similar to top SQL from other time periods (do we need to tune an exact SQL, or a common recurring SQL plan?)

  13. #13
    Join Date
    Dec 2006
    Posts
    43

    New statspack report

    Hi guys,

    Once again well spotted, the previous statspack report was run on the wrong database. Here attached is the new report, which was run on the actual production database.

    Things are a little different. The Buffer Hit % is lower (94%) and the Execute to Parse % is abysmally low (0.27%). The wait events are much different too: CPU time is now taking 94% instead of the previous 13%. What should I make of this Execute to Parse % ?

    There are also a TON of "SQL*Net message from client" wait events, over 800,000 waits. What is this, should I worry about this? One of you has suggested that it could be an open SQLPlus session - I did leave SQLPlus open between the 2 "exec statspack.snap" commands. Could this have had this effect?

    Thanks a lot!
    GB
    Attached Files

  14. #14
    Join Date
    Jul 2003
    Posts
    2,296
    Dood, you HAVE to patch this to the latest 9i release.
    9.2.0.1 is not going to cut it.

    Tune these queries and the others listed in the buffer section.
    All three are causing not only buffer I/O but also physical I/O.
    Make sure they are using indexes (which they probably are not).
    If they're not, either create appropriate indexes or rewrite the SQL to use existing ones.
    Code:
                                                         CPU      Elapsd
      Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
    --------------- ------------ -------------- ------ -------- --------- ----------
          1,482,505            1    1,482,505.0   96.3    16.41     38.53  954071772
    SELECT /*+FIRST_ROWS*/
      AG.Code as Goods_code, AG.NAME as Goods_name, AG.Status as Goods_status,
      AG.IS_CRC_ACTIVE as Goods_is_crcActive, SP.NAME as supplier_name,
      SP.Supplier_Number as supplier_code, SP.EMAIL as supplier_email,
      SP.ID as supplier_id, AGS.Status as supplier_status, AGS.I

             24,998            1       24,998.0    1.6     3.09     31.90  634999307
    select
      RESULT.TG_ID as "promoPlaceId", tg.name as "promoplace_name",
      tg.is_active as "promoplace_active", tg.user_id as "user",
      tg.create_date as "create_date", ... as "has_ponton", ... as "has_gondol",
      tg.STORE_ID as "shopId", ... as "onPonton", 'tg' as "TYPE",
      RESULT.ID as "id", RESULT.NEED_ID as

             13,567            1       13,567.0    0.9     2.39     52.70 4045768620
    select
      agrd.CODE, agrd.SUPPLIER_ID, agrd.GRAPPE_ID, t.TARIF, t.VAT,
      ags.BUY_DIRECTION, ags.STATUS, ags.IS_ACTIVE,
      ags.IS_PICK
    from AGOODS_REGDELIVERY agrd
      left outer join GOODS_SUPPLIER ags on
        (agrd.CODE = ags.CODE and agrd.SUPPLIER_ID = ags.SUPPLIER_ID)
      left outer join TARIF t on (agrd.AUCH
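
    To check the access paths on 9.2, something like this should work (the statement here is a shortened, hypothetical stand-in for the full SQL text from the report):

    Code:
    explain plan for
    select /*+ FIRST_ROWS */ ag.code, ag.name
    from AGOODS ag;

    select * from table(dbms_xplan.display);

    A TABLE ACCESS FULL line against one of the big three tables would point at a missing or unusable index.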
    Last edited by The_Duck; 12-28-06 at 12:41.
    - The_Duck
    you can lead someone to something but they will never learn anything ...

  15. #15
    Join Date
    Jul 2003
    Posts
    2,296
    1. First patch to the latest version which I believe is 9.2.0.7 or 9.2.0.8

    2. * removed as looking at wrong statspack *

    3. statspack states you could benefit by increasing your buffer cache to 480M. Right now it is 256M so I see no reason to not increase as statspack suggests.
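
    If you want to sanity-check that recommendation yourself, v$db_cache_advice (populated when the db_cache_advice parameter is ON) shows estimated physical reads at candidate cache sizes; a minimal sketch:

    Code:
    select size_for_estimate, size_factor, estd_physical_reads
    from v$db_cache_advice
    where name = 'DEFAULT'
      and block_size = 8192
    order by size_for_estimate;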

    4. check the sizing of your shared memory objects. statspack didn't show any sign that an increase would improve much as of yet.
    Code:
    select * from v$sgastat
     where name = 'free memory'
       and pool = 'shared pool';

    select
       shared_pool_size_for_estimate,
       shared_pool_size_factor,
       estd_lc_time_saved,
       estd_lc_memory_object_hits
    from v$shared_pool_advice;

    5. Statspack states you will get improvement in PGA memory by increasing your pga_aggregate_target to 384M. Right now it is 192M so doubling would be a good start.
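
    Assuming the instance uses an spfile and SGA_MAX_SIZE leaves headroom, both changes can be made online in 9i (the sizes are just the ones Statspack suggested):

    Code:
    alter system set pga_aggregate_target = 384M scope = both;
    alter system set db_cache_size = 480M scope = both;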
    Last edited by The_Duck; 12-28-06 at 12:40.
    - The_Duck
    you can lead someone to something but they will never learn anything ...
