  1. #1
    Join Date
    Apr 2006
    Posts
    6

    Unanswered: long checkpoint duration/disk flush

    Hello,

    This is my first post to this forum, so please excuse me if I don't supply all the necessary information. Just let me know what additional information you need.

    I am running Informix IDS v9.4FC5 on an HP9000 w/4 processors using the HP/UX 11i O/S. We rebooted our server the other day and immediately upon initialization of Informix we noticed long checkpoint durations. Our checkpoint duration for the past few months has been either 0 or 1 second. Now at peak load our checkpoint durations are around 30 seconds. My users are complaining vehemently!

    I have made some changes to the configuration file based on some of the posts I read on this forum regarding "long checkpoint duration". The configuration changes have NOT corrected the problem. Please find, below, the Shared Memory configuration settings before and after my attempt to correct the problem.

    What I can't understand is that NOTHING changed!...we simply rebooted the server. We did not change any Informix configuration settings, add dbspaces, etc. Does anyone have any ideas?


    # Shared Memory Parameters

    LOCKS 100000 # Maximum number of locks
    BUFFERS 256000 # Maximum number of shared buffers
    NUMAIOVPS 2 # Number of IO vps
    PHYSBUFF 32 # Physical log buffer size (Kbytes)
    LOGBUFF 32 # Logical log buffer size (Kbytes)
    CLEANERS 128 (changed from 2) # Number of buffer cleaner processes
    SHMBASE 0x0 # Shared memory base address
    SHMVIRTSIZE 128000 # initial virtual shared memory segment size
    SHMADD 20000 # Size of new shared memory segments (Kbytes)
    SHMTOTAL 0 # Total shared memory (Kbytes). 0=>unlimited
    CKPTINTVL 300 # Check point interval (in sec)
    LRUS 128 (changed from 8) # Number of LRU queues
    LRU_MAX_DIRTY 2.000000 # LRU percent dirty begin cleaning limit
    LRU_MIN_DIRTY 1.000000 # LRU percent dirty end cleaning limit
    TXTIMEOUT 0x12c # Transaction timeout (in sec)
    STACKSIZE 64 # Stack size (Kbytes)

  2. #2
    Join Date
    Dec 2003
    Location
    North America
    Posts
    146
    Have you looked at your online.log file to see if anything changed?

    Perhaps someone (other than yourself) with adequate permissions edited the onconfig file & bouncing the IDS instance brought those unintended changes into effect.

  3. #3
    Join Date
    Apr 2006
    Posts
    6
    Thank you for your reply.

    There are only 2 people with adequate permissions (myself included) and we have not changed the HP/UX configuration or the Informix configuration files for well over a year. Things have been working just fine, so we didn't feel there was a need to tune the kernel or database.

  4. #4
    Join Date
    May 2004
    Location
    New York
    Posts
    248
    Please post your full $onconfig and online.log files, plus the output of onstat -p and onstat -F.

  5. #5
    Join Date
    Apr 2006
    Posts
    6

    Requested Information

    I have attached the online.log and onconfig files. Below you will find the results of the "onstat -p" and "onstat -F" commands.



    onstat -p:

    IBM Informix Dynamic Server Version 9.40.FC5 -- On-Line -- Up 1 days 07:02:36 -- 713280 Kbytes

    Profile
    dskreads  pagreads  bufreads    %cached  dskwrits  pagwrits  bufwrits   %cached
    14706284  10531140  5983189370  99.75    838472    3262021   298477547  99.72

    isamtot     open      start      read        write      rewrite  delete  commit  rollbk
    3074331191  82737328  378778894  1466602748  141150282  1202813  97196   11705   0

    gp_read  gp_write  gp_rewrt  gp_del  gp_alloc  gp_free  gp_curs
    0        0         0         0       0         0        0

    ovlock  ovuserthread  ovbuff  usercpu   syscpu   numckpts  flushes
    0       0             0       94814.64  5168.81  301       901

    bufwaits  lokwaits  lockreqs   deadlks  dltouts  ckpwaits  compress  seqscans
    1483966   0         209545907  0        0        1944      9181288   2230521

    ixda-RA  idx-RA  da-RA    RA-pgsused  lchwaits
    1252985  27670   8772545  10052847    16553892



    onstat -F:

    IBM Informix Dynamic Server Version 9.40.FC5 -- On-Line -- Up 1 days 07:03:03 -- 713280 Kbytes


    Fg Writes  LRU Writes  Chunk Writes
    0          0           391595

    address           flusher  state  data
    c0000000238d5860  0        I      0 = 0X0
    c0000000238d6098  1        I      0 = 0X0
    states: Exit Idle Chunk Lru

  6. #6
    Join Date
    Dec 2003
    Location
    North America
    Posts
    146
    Are you sure you changed the number of LRUs & CLEANERS as stated in your original post, or did you revert to your original onconfig settings? The onconfig you posted shows 2 CLEANERS & 8 LRUs rather than 128 of each, and LRU_MAX_DIRTY of 10% and LRU_MIN_DIRTY of 5% rather than 2% and 1%.

    I scanned the online log & I can't find any reference to any changes in number of LRU's or CLEANERS ... these changes would usually be found a few lines after "Informix Dynamic Server started" in the online.log.

    Perhaps you're editing the wrong onconfig file, or someone changed the INFORMIXDIR (or ONCONFIG) env variable so the engine is reading a different onconfig at start-up.

    Looks like you're doing all chunk writes and there are no LRU writes, so all page flushing is occurring only at the 5-minute checkpoint interval and at physical log flush when 75% full. No writing (LRU writes) is occurring between checkpoints, so you have to find a way to increase your LRU writes.

    Based on your onconfig you have 32000 buffers per LRU (256K/8) and you start flushing one LRU when 3200 (10%) are dirty and stop flushing that LRU when 1600 (5%) are left dirty.

    Based on no LRU writes, potentially you have between 12800 & 25600 dirty buffers to flush every 5 minutes. 256K/8 LRUs = 32000 buffers per LRU. At the top end, 32000 * .10 (start at 10%) = 3200, then 3200 * 8 LRUs = 25600 buffers. At the bottom end, 32000 * .05 (stop at 5%) = 1600, then 1600 * 8 LRUs = 12800 buffers, and that's a lot.
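
    The arithmetic above can be sketched in a few lines of Python. This is only a calculator for the numbers quoted in this post (BUFFERS 256000, 8 LRUs, 10%/5%). The function name is made up for illustration, and the rounding down is an assumption, not a statement of how the engine rounds internally.

```python
def lru_flush_thresholds(buffers, lrus, max_dirty_pct, min_dirty_pct):
    """Return (buffers per LRU queue, start-cleaning count, stop-cleaning count).

    Hypothetical helper: it only reproduces the forum arithmetic and
    rounds down, which is an assumption about rounding.
    """
    per_queue = buffers // lrus
    start = int(per_queue * max_dirty_pct / 100)
    stop = int(per_queue * min_dirty_pct / 100)
    return per_queue, start, stop


# Values quoted in this post: BUFFERS 256000, LRUS 8,
# LRU_MAX_DIRTY 10, LRU_MIN_DIRTY 5.
per_queue, start, stop = lru_flush_thresholds(256000, 8, 10, 5)
print(per_queue, start, stop)   # 32000 3200 1600
print(stop * 8, start * 8)      # 12800 25600
```

    Multiplying the per-queue counts by the number of LRUs gives the 12800 to 25600 range quoted above.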

    This document written by Informix guru Art Kagel helped me

    http://www.prstech.com/tips/art_kagel_tuning_tips.shtml
    Last edited by mjldba; 04-20-06 at 08:46.

  7. #7
    Join Date
    Apr 2006
    Posts
    6
    Since the configuration changes I documented in my initial posting did NOT decrease the checkpoint durations, we reverted to the original settings. I included the "online.log" that shows the 4 months of checkpoints before the driver installation and immediately after the driver was installed. Therefore, the configuration changes I made would not have been reflected in the log file. However, I have attached the current log file that DOES include those changes.

    I will definitely read the information you referred to. My apologies for not explaining the information I provided more clearly.

  8. #8
    Join Date
    Dec 2003
    Location
    North America
    Posts
    146
    Tuning is far from a science ... very site specific. I would never insinuate my way is perfect but I've had good results by forcing more LRU writes (better for OLTP) than chunk writes (better for batch env) and I never see foreground writes.

    I have 248000 buffers, 127 LRUs & CLEANERS, LRU_MIN_DIRTY= 0, LRU_MAX_DIRTY = 1, CHECKPOINTS every 4 hours, and PHYSFILE = 20000.

    Monitor the # of buffers per LRU and the current total # of dirty buffers using the last of 3 lines of onstat -R.
    You can monitor the physical log (PHYSFILE) using onstat -l looking at the 3rd & 4th line under "Physical Logging" near the top.

    I let PHYSFILE automatically flush dirty buffers when it hits 75% full, LRU writes are constant and small (flushing starts when 19 buffers are dirty) and checkpoint times are usually <= 1 second (never more than 2 seconds) which is unnoticeable.
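
    To see where the "19 buffers" figure comes from, and how it compares with the two configurations posted earlier in this thread, here is a rough Python sketch. The helper name and the dictionary labels are invented for illustration, and rounding down is an assumption. The parameter values themselves are the ones quoted in the posts.

```python
def start_stop_per_queue(buffers, lrus, max_dirty_pct, min_dirty_pct):
    """Hypothetical helper: dirty-buffer counts per LRU queue at which
    cleaning would start and stop, rounded down (an assumption)."""
    per_queue = buffers // lrus
    start = int(per_queue * max_dirty_pct / 100)
    stop = int(per_queue * min_dirty_pct / 100)
    return per_queue, start, stop


# (BUFFERS, LRUS, LRU_MAX_DIRTY, LRU_MIN_DIRTY) as quoted in the thread.
configs = {
    "posted onconfig": (256000, 8, 10, 5),
    "attempted change": (256000, 128, 2, 1),
    "settings in this post": (248000, 127, 1, 0),
}

results = {name: start_stop_per_queue(*cfg) for name, cfg in configs.items()}
for name, (per_queue, start, stop) in results.items():
    print(f"{name}: {per_queue} per LRU, start at {start}, stop at {stop}")
```

    With many small queues and a low LRU_MAX_DIRTY, each queue starts cleaning after only a handful of dirty buffers, which is what keeps the LRU writes constant and small.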

    I'm using IDS 9.30.UC6 in an AIX env, and you're using IDS 9.4 in an HP env so you have greater flexibility & granularity available with your LRU_MIN_DIRTY & LRU_MAX_DIRTY parameters.

    Some parameters hurt you when they're set too big, some hurt you when they're set too small, and some hurt you when they're not modified in unison so you've got to figure out what works best at your site.
    Unfortunately, bouncing the engine is necessary so you'll get one chance per day to make incremental changes in a controlled fashion.

    here's another site worth checking http://www.oninit.com/

    good luck
    Last edited by mjldba; 04-20-06 at 09:51.

  9. #9
    Join Date
    Apr 2006
    Posts
    6
    First, I just wanted to thank you for taking time to provide me with this valuable information. I have a few questions as a result of your last post.

    You stated you have checkpoints every 4 hours; you mean 4 minutes, right? The number that defines the checkpoint interval is in seconds.

    I know what information to look at in the "onstat -R" and "onstat -l" output. But what should I be looking FOR? In your experience, what information from each of the "onstat" outputs should concern me?

    Thanks again.

  10. #10
    Join Date
    Dec 2003
    Location
    North America
    Posts
    146
    You're very welcome & if I can help you a little, or point you in the right direction, then I've done my part to lend a hand. I've had no formal training aside from learning some lessons the hard way, reading documentation, and using BB resources like the two I listed, and I just found this one in some notes:

    http://docs.rinet.ru/InforSmes/

    Like I said, I let the physical log take care of large buffer flushing when it hits 75% full rather than scheduling checkpoints, so I have automatic checkpoints set up for 14,400 seconds = 4 hours. I use this method so that if system activity is light a checkpoint will take place during lunch time & the next will occur at quitting time or just beyond. I know it sounds rather unorthodox but it was a method suggested in one of the URLs & it works for me.

    Juggling # of buffers, # of LRUs & CLEANERS, and LRU MIN/MAX parameters got me out of hot water with long checkpoints. Initially, I tried doing checkpoints every 2 minutes, then every one minute & results were inconsistent if activity was heavy 'cause sometimes I still had 10-15 second checkpoints.

    I have 2 telnet sessions running all day. One displays the summary line from onstat -R
    (onstat -R -r 2 | grep queued) to monitor the # of dirty buffers; the total number never exceeds 1400 'cause LRU writes are taking care of flushing a small number of buffers at a time.
    I use onstat -l -r 2 | grep 200035 which is the number beneath phybegin (it's unique in the whole onstat -l output) to monitor how full the physical log is.

    My physical log is 20000KB, which is 5000 4k pages, and flushes will automatically occur when 3750 are dirty; I find this to be a manageable number.
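
    As a quick check of those numbers, here is a small Python sketch. The function name is hypothetical. The 75% trigger and the 4k page size are the figures from this post, and the page size is left as a parameter because it differs between platforms.

```python
def physical_log_flush_point(physfile_kb, page_kb, full_fraction=0.75):
    """Return (pages in the physical log, pages used when a flush is triggered).

    Hypothetical helper: full_fraction=0.75 is the 75% figure quoted in
    this thread, and page_kb must match the platform's page size.
    """
    pages = physfile_kb // page_kb
    trigger = int(pages * full_fraction)
    return pages, trigger


# PHYSFILE 20000 (KB) with 4k pages, as described in this post.
pages, trigger = physical_log_flush_point(20000, 4)
print(pages, trigger)   # 5000 3750
```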

    There's a useful script available at the oninit site: choose Informix Database,
    download (left side menu), scripts, Health Check. It's kind of basic but it may point you in the right direction

  11. #11
    Join Date
    Dec 2003
    Location
    North America
    Posts
    146
    The Art Kagel document includes this reference regarding HP:

    "HP/UX PA RISC has only four shared memory segment registers. If you have to access more than four shared memory segments concurrently, your process will become very slow. Informix needs access to at least: 1-Resident segment, 1-Virtual segment, and for each process 1-Message segment. If your engine has to add more than one additional virtual segment, the CPU VPs will become bogged down. If your application is already using its own shared memory and the engine allocates even one additional virtual segment, your application will bog down switching between the engine's shared memory and its own. "

    onstat -g seg will show how many "V" segments you have and, if you have more than 1, will indicate SHMVIRTSIZE may be too small, causing allocation of additional segments using SHMADD.
    Last edited by mjldba; 04-20-06 at 14:52.

  12. #12
    Join Date
    Apr 2006
    Posts
    6
    I have found where my problem lies. However, I have no idea how to correct it. I turned on the "TRACEFUZZYCKPT" to monitor the checkpoint performance more closely. The problem is not the number of dirty buffers to be flushed. The problem is with the "dskflush()". This is the area of the checkpoint that is taking ALL the time. Only one problem...I can't find any information regarding this problem. I have no idea what to do to correct it.

  13. #13
    Join Date
    Dec 2003
    Location
    North America
    Posts
    146

  14. #14
    Join Date
    Feb 2009
    Posts
    51
    Hi, ckercher
    I think your problem is the very small PHYSFILE: only 6 MB! That is remarkably small.
    Increase it to 200 MB or 300 MB.
    How much memory does your server have?
    I see that you have only a 512 MB buffer pool.
    If you can increase it, do so.

  15. #15
    Join Date
    Sep 2009
    Posts
    1
