  1. #1
    Join Date: Aug 2004 · Location: France · Posts: 754

    Unanswered: Strange performance behaviour

    Hello everyone,

    I'm currently facing a problem but can't figure out where it comes from.

    Here is the background: I have an OCCI client app which performs many inserts/updates on a 9i server (running Sun Solaris). Each time I execute an insert on table A, I execute an update on table B. I'm using bind variables, I commit only once per second (no autocommit), and I send queries in batches (50 at a time for inserts, 20 at a time for updates). In this configuration, I generally do about 500 inserts/updates per second.

    It seems to go fine for a while, but if I insert/update many rows, it comes to a point where the Oracle process corresponding to the OCCI connection (the one used for the inserts and updates) takes all the CPU, and the app becomes very slow. This lasts for about 5 minutes, and then it works as before.

    I performed alter session set max_dump_file_size=unlimited and alter session set events '10046 trace name context forever, level 12' at the beginning of the client app, and ran tkprof on the resulting trace afterwards. In one run I do about 100000 inserts/updates, and in the other 800000.
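    For the record, here is exactly what I ran (the trace file name under udump and the report name are just examples, not the real names):

    Code:
    -- at the beginning of the client session
    ALTER SESSION SET max_dump_file_size = unlimited;
    ALTER SESSION SET events '10046 trace name context forever, level 12';

    -- afterwards, on the server, on the generated trace file:
    -- tkprof ora_12345.trc 100000.txt sys=no waits=yes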

    Result for insert, 100000: see attached file '100000.txt'
    Result for insert, 800000: see attached file '800000.txt'

    In these files, I deleted the data for all queries except the inserts, since nothing about the other queries seemed abnormal to me. I kept the summaries at the end.

    What bothers me is the 'current' reads: 200,000 for 100,000 inserts compared to 37,600,000 for 800,000 inserts!

    According to the docs, 'current' reads are 'non-consistent' reads. Does that mean disk reads (because the needed blocks weren't in the buffer cache)?

    In case it helps you understand the phenomenon: I have 2 indexes on the table on which I perform the inserts.

    Does anyone have an idea of what is going on?

    Regards,

    RBARAER
    Attached Files

  2. #2
    Join Date: Mar 2002 · Location: Reading, UK · Posts: 1,137
    Things you might want to try

    1) Repeat the test with one index, and then with no indexes, on the table. This might point to an index hotspot. Hotspots often happen when you have an index on a date column which is populated with sysdate. One cure is to reverse the byte order of the index keys.

    2) Check the initrans, freelist and freelist groups settings on the table (and indexes). If they are set to low values, increasing them may help.

    3) Check out the undo tablespace and the associated views like v$undostat during the load, as it does seem to wait a lot on undo extension. You might want to switch to manual undo configuration just for this job. Is it growing the undo datafiles?

    4) Run os monitoring tools to see which disks are being hit during your inserts, it could be you need to 'rebalance' your tablespaces, redo logs etc. if there is a hot spot.

    5) Are there other jobs running which are hitting the same table?

    6) Check out the redo log and archived redo log disks to see if they are the bottleneck. Redo generation can be reduced using the APPEND hint.
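    A quick sketch of 3) and 6) (table names taken from your posts; note the APPEND hint only takes the direct path for INSERT ... SELECT statements, not for single-row binds from a client):

    Code:
    -- 3) undo activity in 10-minute buckets during the load
    SELECT begin_time, end_time, undoblks, txncount, nospaceerrcnt
      FROM v$undostat
     ORDER BY begin_time;

    -- 6) direct-path insert, generates much less redo/undo
    --    (transaction_copy here is just a hypothetical copy table)
    INSERT /*+ APPEND */ INTO transaction_copy
    SELECT * FROM transaction;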

    Alan

  3. #3
    Join Date: Mar 2002 · Location: Reading, UK · Posts: 1,137
    Something else which might point to the indexes...

    http://www.jlcomp.demon.co.uk/faq/slowdown.html

  4. #4
    Join Date: Aug 2004 · Location: France · Posts: 754
    Thanks Alan, this article was interesting. I'm going to try with only one index. However, it would not be practical to drop the second one, as it is the one used by my PK constraint. And the first one is useful for some important queries... so I hope I'll be able to keep it anyway.

    My app will insert/update data in real time (it isn't batch processing), and it may go up to about 1000 inserts per second in peak periods, which may last several minutes (the flow of data will vary a lot from time to time).

    2) Check initrans, freelist and freelist group settings on the table (and indexes). If set to a low value maybe increasing it will help.
    They should be at their default values, since I did not specify them at table or index creation time.

    BTW, my indexes were created as follows:

    CREATE UNIQUE INDEX indx_cp_Transaction
    ON bcadmin.transaction (CodeValeur, DateHeure, CodeReferentiel, Numero)
    PCTFREE 10
    STORAGE
    (
    INITIAL 75738K
    NEXT 37869K
    PCTINCREASE 0
    )
    TABLESPACE indx00
    ;

    CREATE INDEX index_transac_dateref
    ON transaction(dateheure, codereferentiel)
    PCTFREE 10
    STORAGE
    (
    INITIAL 2000000
    NEXT 1000000
    PCTINCREASE 0
    )
    TABLESPACE indx00
    ;

    I did this on the advice of "DB Tools" (when I trialed it), but now I wonder whether such high values for INITIAL and NEXT are good. What do you think? (The table will eventually hold several million rows.)

    Moreover, my DB_BLOCK_SIZE is currently 2K (certainly not the best choice), and it's the same for every tablespace in my db. I read in an article pointed out on this forum (http://www.dbazine.com/burleson2.shtml) that it is good to have a big block size for indexes, so I plan to create a tablespace with 32K blocks for the indexes. Do you think it would help in this situation?
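    If I understand the doc correctly, the commands would be something like this (the datafile path and sizes are only placeholders, a db_32k_cache_size must be configured first, and 32K blocks are not supported on every platform):

    Code:
    -- a buffer cache for the non-standard block size must exist first
    ALTER SYSTEM SET db_32k_cache_size = 16M;

    CREATE TABLESPACE indx32k
    DATAFILE '/u02/oradata/mydb/indx32k01.dbf' SIZE 500M
    BLOCKSIZE 32K;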

    Thanks & Regards,

    RBARAER

  5. #5
    Join Date: Mar 2002 · Location: Reading, UK · Posts: 1,137
    1) Copy the table and try it with no indexes, so you can identify whether one particular index is the issue, or whether it is an index issue at all. Then you can try things like increasing initrans on the index, or compression, or reversing the key bytes on the index which is causing the problem.

    2) Increasing block size on the index will help.

    3) Your db cache settings will also have an effect, so on your prod database with a large cache the slowdown may be much less pronounced.

    4) Insert with the APPEND hint will help a lot if you have a redo/archived log issue.

    5) Look at your OS stats on disk IO. Remember Oracle can only guess whether a physical disk IO is happening, as there might be OS-level buffering. You may find a disk hotspot.

    6) Partitioning may help, especially if you partition both indexes.

    7) Do you have queries which select from your table on CodeValeur alone and no other criteria? If you don't have such queries, try one index on (DateHeure, CodeReferentiel, CodeValeur, Numero) and other possible combinations of the same columns. Also look at whether you can use compression.

    You may find you need to try out a lot of combinations of the above before you find a near-optimal solution. And don't forget that you may have to do it all again when you get your PROD box.
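    For 7), something along these lines (just a sketch, not a recommendation for that exact column order; you would also have to disable the PK constraint before dropping its unique index, and COMPRESS 2 compresses on the two leading columns):

    Code:
    DROP INDEX index_transac_dateref;
    DROP INDEX indx_cp_Transaction;

    CREATE UNIQUE INDEX indx_cp_Transaction
    ON bcadmin.transaction (DateHeure, CodeReferentiel, CodeValeur, Numero)
    COMPRESS 2
    TABLESPACE indx00;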

    Alan

  6. #6
    Join Date: Aug 2004 · Location: France · Posts: 754
    Thank you very much Alan for these suggestions, I will try them and see how I can come to a good solution.

    Concerning point 1:

    1) initrans is 1 for the table, and 2 for the indexes. You said it could be good to increase them, but what values could I reasonably try? The Oracle doc says that it should generally not be changed:
    In general, you should not change the INITRANS value from its default.
    Each transaction that updates a block requires a transaction entry in the block. The size of a transaction entry depends on your operating system.
    This parameter ensures that a minimum number of concurrent transactions can update the block and helps avoid the overhead of dynamically allocating a transaction entry.
    So what would be the gain in changing it ?

    2) Could you please tell me how to reverse the byte order in the index? I haven't seen that in the Oracle doc.

    3) I understand that compressing the index would decrease disk I/O when accessing it, but isn't there a performance drawback anyway?

    Concerning point 7:

    4) I know that both indexes are somewhat redundant, but when I tested the performance of a particular query, I found that using the index on (dateheure, codereferentiel) proved better than using the index on (CodeValeur, DateHeure, CodeReferentiel, Numero). Since I don't have any query on CodeValeur only, you suggest that I keep only one index, in this order: (DateHeure, CodeReferentiel, CodeValeur, Numero). Am I correct?

    Thanks & Regards,

    RBARAER

  7. #7
    Join Date: Mar 2002 · Location: Reading, UK · Posts: 1,137
    1) Try a value of 2 or 3 for initrans if you have more than one process inserting (i.e. in parallel).

    2) Use the REVERSE keyword in CREATE INDEX if you have a hotspot; don't do it otherwise.

    3) The reduced IO generally overcomes any penalty.

    4) Yes, try one index on (DateHeure, CodeReferentiel, CodeValeur, Numero) and also try other permutations of these 4 columns; compression will help if the leading columns have repeated values.
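    For 1) and 2), the syntax is simply (a sketch using the index definition you posted; note ALTER TABLE ... INITRANS only affects newly formatted blocks):

    Code:
    -- reverse-key version of the secondary index, only if it is a hotspot
    CREATE INDEX index_transac_dateref
    ON transaction (dateheure, codereferentiel)
    REVERSE
    TABLESPACE indx00;

    -- raise INITRANS for new blocks of the table
    ALTER TABLE transaction INITRANS 3;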

    Alan

  8. #8
    Join Date: Aug 2003 · Location: Where the Surf Meets the Turf @Del Mar, CA · Posts: 7,776
    Ready, Fire, Aim!
    Never confuse movement with progress.
    Going around in circles is movement, but most folks don't consider it to be progress.

    AFAIK, nobody knows what is actually the root cause of the problem.
    Rather than taking a professional or scientific approach to collect facts,
    the plan is to make guesses at what might be the problem & hope to get lucky;
    while at the same time hoping that you don't make a bad situation worse.

    Rhetorical question -
    Why is time and effort better spent shooting in the dark, rather than collecting facts that actually & directly show where the delays occur?
    You can lead some folks to knowledge, but you can not make them think.
    The average person thinks he's above average!
    For most folks, they don't know, what they don't know.
    Good judgement comes from experience. Experience comes from bad judgement.

  9. #9
    Join Date: Aug 2004 · Location: France · Posts: 754
    Dear anacedent,

    Excuse me, but even if I understand that you can say such things when somebody doesn't look for facts at all, I don't understand your reaction here. You generally advise people in such a situation to enable tracing and run tkprof on the trace file, so as to see where the time is being spent.

    I did that: just have a look at my first post, and you'll see part of the resulting tkprof reports. I'm not "shooting in the dark and hoping to get lucky"; I'm looking for facts and trying to analyze them correctly. I thought that was what you advocated.

    But now I just don't know how to analyze the results further and find the cause: I've identified the culprit SQL, but I still don't know what the cause is.

    AlanP just gave me some interesting directions for finding the root of the problem. Thank you again, Alan.

    Now, if you have a more precise idea of where the problem might come from, I'll be glad to have your opinion.

    Best regards,

    RBARAER

  10. #10
    Join Date: Mar 2002 · Location: Reading, UK · Posts: 1,137
    Anacedent, what makes you think there is only one problem? And why do you think that, once you have found what the problem is, there is only one solution? If you have 10 possible solutions, I would suggest, IMHO, that you need to try them all out, as Oracle is not the simple product it was 10 years ago. There are so many variables, from the client, network, server hardware, OS and database configuration to the IO subsystem, that it is impossible to guess the ideal solution without trying things out.

    If you look at the trace files you will notice a number of potential problems. Some of them may or may not be significant for this particular query, but even if they are not, it would be better to fix as many of them as possible before you have issues with them.

    Alan

  11. #11
    Join Date: Aug 2003 · Location: Where the Surf Meets the Turf @Del Mar, CA · Posts: 7,776
    >This lasts for about 5 minutes, and then works as before.
    >Execute 16000 935.39 1231.49 180 1738010 37640367 800000
    The line immediately above shows, in round numbers, a difference of about 300 seconds between CPU and elapsed time, which is about 5 minutes.
    EXACTLY what is happening during these FIVE minutes?
    Code:
    SELECT DECODE(request,0,'Holder: ','Waiter: ')||sid sess, 
             id1, id2, lmode, request, type
        FROM V$LOCK
       WHERE (id1, id2, type) IN
                 (SELECT id1, id2, type FROM V$LOCK WHERE request>0)
       ORDER BY id1, request
    /
    If the SQL above is run "repeatedly" during the slowdown, it might show useful information. Then again it might always return 0 rows & prove useless.

    My main problem with the shotgun approach is that even if the problem does not recur any time soon, you'll never really know whether you solved it, or if/when it will come back to bite you.

    It is your system & you have to live with the choices you make.
    I'm just stating what & why I'd do things differently.

    > it is impossible to guess at the ideal solution without trying them out.
    Doing this on a production server is IMO not a wise course of action.
    The DBs I manage, and have managed, "run the company".
    When the DB is down, the company is essentially out of business.
    I am not foolish enough to "bet the company" on some free advice from somebody who has nothing to lose if the outcome is less than desirable.

  12. #12
    Join Date: Aug 2004 · Location: France · Posts: 754
    Be assured that I understand your point, anacedent, but please recognize that your reaction here was not really fair: I try to do my best, and I follow most of your advice. I forgot to mention it, but I did think to run your query above (taken from another thread) several times during the problem, and got nothing. Thanks anyway.

    I tried simply dropping the index on (DateHeure, CodeReferentiel), and got a much better result; see the attached file (with 1 200 000 inserts). The "current" value is much less strange. It seems that the bad time was being spent waiting on "log buffer space", "undo segment extension" and "free buffer waits" (when compared to '800000.txt').

    Anacedent, Alan, what is your opinion? Could you please explain the meaning of each of these 3 wait events?

    Thanks & regards,

    RBARAER
    Attached Files

  13. #13
    Join Date: Aug 2003 · Location: Where the Surf Meets the Turf @Del Mar, CA · Posts: 7,776
    >but I had the reflex to try your query above (taken from another thread) several times during the problem, but got nothing.
    Then it will do NO good to increase INITRANS, because you have no sessions waiting for one.

  14. #14
    Join Date: Aug 2003 · Location: Where the Surf Meets the Turf @Del Mar, CA · Posts: 7,776
    Code:
    ********************************************************************************
    
    OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse      111      0.01       0.02          0          0          0           0
    Execute  84110   1504.59    1826.47     224780    7002029    3165339     2400007
    Fetch        0      0.00       0.00          0          0          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total    84221   1504.60    1826.49     224780    7002029    3165339     2400007
    
    Misses in library cache during parse: 0
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      SQL*Net message to client                   84542        0.00          1.05
      SQL*Net message from client                 84541       58.33        402.22
      db file sequential read                    224699        1.81        104.09
      SQL*Net break/reset to client                 200        0.01          0.10
      SQL*Net more data from client               22001        0.03          1.51
      undo segment extension                         23        0.00          0.00
      log file sync                                 446        1.00         61.13
      log buffer space                               24        0.62          2.53
      latch free                                      3        0.02          0.07
      log file switch completion                      4        0.29          0.73
      db file scattered read                          6        0.01          0.02
    >It seems that the bad time was being spent waiting for "log buffer space", "undo segment extension" and "free buffer waits" (when compared to '800000.txt').
    Combined, these three account for only slightly more than 1 minute of time.
    Why are you fixated on them?
    The way I read these statistics, the code in question does NOT appear to me to be running directly on the DB server itself.
    Would I be correct to say that the code is running from a remote client?
    If so, what happens if/when the code run directly on the DB server system?

  15. #15
    Join Date: Aug 2004 · Location: France · Posts: 754
    Then it will do NO good to increase INITRANS, because you have no sessions waiting for one.
    Thanks.

    Would I be correct to say that the code is running from a remote client?
    Yes you are, hence the large SQL*Net waits from the client. These would certainly be reduced a little by running the app on the server, but I've read in the Oracle doc that it is strongly recommended not to run any app other than Oracle on the server machine. Don't you agree with that?

    Combined these three account for only slightly more than 1 minute of time.
    No, sorry, just look at '800000.txt' in my first post, which is the tkprof result from when I encountered the problem:

    Code:
    OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse      111      0.06       0.14          5          8          0           0
    Execute  56110   1438.90    1888.38      19594    4561269   38525462     1600007
    Fetch        0      0.00       0.00          0          0          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total    56221   1438.96    1888.53      19599    4561277   38525462     1600007
    
    Misses in library cache during parse: 9
    
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      SQL*Net message to client                   56373        0.00          0.36
      SQL*Net message from client                 56372       59.16        201.56
      SQL*Net break/reset to client                 200        0.00          0.09
      SQL*Net more data from client                5825        0.00          0.20
      log buffer space                             1303        0.74        135.58
      log file sync                                 306        1.00         77.89
      undo segment extension                    4490153        0.44         34.40
      latch free                                      2        0.02          0.02
      log file switch completion                      5        0.44          1.18
      free buffer waits                             180        1.07        102.39
      db file sequential read                     19518        1.70         23.38
      buffer busy waits                               1        0.00          0.00
      db file scattered read                          6        0.03          0.08
    As you can see, these 3 put together represent 273 seconds of wait time, that is 4 minutes 33 seconds, compared to barely 3 seconds in '12000000_one_index.txt'. And considering only the insert query:
    Code:
    Elapsed times include waiting on following events:
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      SQL*Net more data from client                5825        0.00          0.20
      SQL*Net message to client                   16000        0.00          0.11
      SQL*Net message from client                 16000        0.33         29.45
      log buffer space                              899        0.74         90.66
      undo segment extension                    4489506        0.44         34.40
      log file sync                                  35        0.49          8.37
      latch free                                      1        0.02          0.02
      log file switch completion                      5        0.44          1.18
      free buffer waits                             155        1.07         85.67
      db file sequential read                       180        1.70         11.56
    They represent 210 seconds of wait time: 3 minutes 30 seconds for this query alone. So of course I'm focusing on these 3 events! Don't you think I'm right?

    If you compare the CPU time for this insert query between '800000.txt' and '12000000_one_index.txt', you will see 935 vs 287, even though in the 2nd case I insert 50% more data.

    So my conclusion, which may not be the right one, is that the 3 wait events mentioned above are the visible side of the problem, and that they result in a great CPU overhead.

    Now, this is not sufficient for me to clearly identify the root of the problem and solve it. Having read the article Alan pointed me to, I think I may be encountering the same sort of problem, and the fact that dropping the index on (DateHeure, CodeReferentiel) seems to have solved it supports this idea. However, I'd really like to have your opinion, both of you, and I'd also like to understand what these 3 wait events suggest to you: what is their exact meaning, and what actions can result in such high values, so that I can better analyze tkprof results in the future.

    Thanks & Best Regards,

    RBARAER
    Last edited by RBARAER; 11-25-04 at 06:15.
