This is my first post to this forum, so please excuse me if I don't supply all the necessary information. Just let me know what additional information you need.
I am running Informix IDS v9.4FC5 on a HP9000 w/4 processors using the HP/UX 11i O/S. We rebooted our server the other day and immediately upon initialization of Informix we noticed long checkpoint durations. Our checkpoint duration for the past few months has been either 0 or 1 second. Now at peak load our checkpoint durations are around 30 seconds. My users are complaining vehemently!
I have made some changes to the configuration file based on some of the posts I read on this forum regarding "long checkpoint duration". The configuration changes have NOT corrected the problem. Please find, below, the Shared Memory configuration settings before and after my attempt to correct the problem.
What I can't understand is that NOTHING changed!...we simply rebooted the server. We did not change any Informix configuration settings, add dbspaces, etc. Does anyone have any ideas?
# Shared Memory Parameters
LOCKS 100000 # Maximum number of locks
BUFFERS 256000 # Maximum number of shared buffers
NUMAIOVPS 2 # Number of IO vps
PHYSBUFF 32 # Physical log buffer size (Kbytes)
LOGBUFF 32 # Logical log buffer size (Kbytes)
CLEANERS 128 (changed from 2) # Number of buffer cleaner processes
SHMBASE 0x0 # Shared memory base address
SHMVIRTSIZE 128000 # initial virtual shared memory segment size
SHMADD 20000 # Size of new shared memory segments (Kbytes)
SHMTOTAL 0 # Total shared memory (Kbytes). 0=>unlimited
CKPTINTVL 300 # Check point interval (in sec)
LRUS 128 (changed from 8) # Number of LRU queues
LRU_MAX_DIRTY 2.000000 # LRU percent dirty begin cleaning limit
LRU_MIN_DIRTY 1.000000 # LRU percent dirty end cleaning limit
TXTIMEOUT 0x12c # Transaction timeout (in sec)
STACKSIZE 64 # Stack size (Kbytes)
There are only 2 people with adequate permissions (myself included) and we have not changed the HP/UX configuration or the Informix configuration files for well over a year. Things have been working just fine, so we didn't feel there was a need to tune the kernel or database.
Are you sure you changed the number of LRUs & CLEANERS as stated in your original post, or did you revert back to your original onconfig settings? The onconfig you posted shows 2 CLEANERS & 8 LRUs rather than 128 of each, and an LRU_MAX_DIRTY of 10% and LRU_MIN_DIRTY of 5% rather than 2% and 1%.
I scanned the online log & I can't find any reference to any changes in number of LRU's or CLEANERS ... these changes would usually be found a few lines after "Informix Dynamic Server started" in the online.log.
Perhaps you're editing the wrong onconfig file, or someone changed the INFORMIXDIR env variable so it's pointing to the wrong onconfig at start-up.
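One way to check which onconfig the engine actually read at start-up is to compare your environment with what the running engine reports. This is just a sketch, assuming the standard Informix environment variables are set; onstat -c dumps the configuration as the running engine sees it, so any mismatch with the file you've been editing will show up here.

```shell
# Show the environment the engine start-up would have used ...
echo "INFORMIXDIR=$INFORMIXDIR"
echo "ONCONFIG=$ONCONFIG"

# ... then the values the RUNNING engine actually holds.
# If these differ from the file you edited, you edited the wrong file.
onstat -c | grep -E '^(CLEANERS|LRUS|LRU_MAX_DIRTY|LRU_MIN_DIRTY)'
```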
It looks like you're doing all chunk writes and there are no LRU writes, so all page flushing is occurring only at the 5-minute checkpoint interval and at physical log flush when it's 75% full. No writing (LRU writes) is occurring between checkpoints, so you have to find a way to increase your LRU writes.
Based on your onconfig you have 32000 buffers per LRU (256K/8) and you start flushing one LRU when 3200 (10%) are dirty and stop flushing that LRU when 1600 (5%) are left dirty.
Since there are no LRU writes, you potentially have between 12800 and 25600 dirty buffers to flush every 5 minutes: 256K / 8 LRUs = 32000 buffers per LRU, then 32000 * 0.10 (start at 10%) = 3200, then 32000 * 0.05 (stop at 5%) = 1600, then 1600 * 8 LRUs = 12800 up to 3200 * 8 LRUs = 25600 buffers, and that's a lot.
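That arithmetic can be double-checked with a quick shell snippet (values taken from the posts above: 256000 buffers, 8 LRUs, LRU_MAX_DIRTY 10, LRU_MIN_DIRTY 5):

```shell
# Worst-case dirty buffers at each checkpoint, given the original
# onconfig values quoted earlier in the thread.
BUFFERS=256000
LRUS=8
PER_LRU=$(( BUFFERS / LRUS ))        # 32000 buffers per LRU queue
START=$(( PER_LRU * 10 / 100 ))      # 3200: LRU_MAX_DIRTY 10% trigger
STOP=$(( PER_LRU * 5 / 100 ))        # 1600: LRU_MIN_DIRTY 5% floor
echo "dirty at checkpoint: between $(( STOP * LRUS )) and $(( START * LRUS ))"
# prints: dirty at checkpoint: between 12800 and 25600
```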
This document written by Informix guru Art Kagel helped me
Since the configuration changes I documented in my initial posting did NOT decrease the checkpoint durations, we reverted to the original settings. The online.log I included shows the 4 months of checkpoints before the driver installation and immediately after the driver was installed, so the configuration changes I made would not have been reflected in that log file. However, I have attached the current log file that DOES include those changes.
I will definitely read the information you referred to. My apologies for not explaining the information I provided more clearly.
Tuning is far from an exact science ... it's very site specific. I would never claim my way is perfect, but I've had good results by forcing more LRU writes (better for an OLTP env) than chunk writes (better for a batch env), and I never see foreground writes.
I have 248000 buffers, 127 LRUs & CLEANERS, LRU_MIN_DIRTY= 0, LRU_MAX_DIRTY = 1, CHECKPOINTS every 4 hours, and PHYSFILE = 20000.
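Those settings, written out in onconfig form (values are from the sentence above; the comments are mine):

```
BUFFERS        248000   # shared buffers
CLEANERS       127      # page-cleaner threads, one per LRU queue
LRUS           127      # LRU queues
LRU_MAX_DIRTY  1        # start cleaning an LRU at 1% dirty
LRU_MIN_DIRTY  0        # stop cleaning when 0% dirty
CKPTINTVL      14400    # checkpoint interval: 4 hours, in seconds
PHYSFILE       20000    # physical log size (Kbytes)
```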
Monitor the # of buffers per LRU and the current total # of dirty buffers using the last of the 3 lines of onstat -R output.
You can monitor the physical log (PHYSFILE) using onstat -l, looking at the 3rd & 4th lines under "Physical Logging" near the top.
I let PHYSFILE automatically flush dirty buffers when it hits 75% full, LRU writes are constant and small (flushing starts when 19 buffers are dirty), and checkpoint times are usually <= 1 second (never more than 2 seconds), which is unnoticeable.
I'm using IDS 9.30.UC6 in an AIX env, and you're using IDS 9.4 in an HP env, so you have greater flexibility & granularity available with your LRU_MIN_DIRTY & LRU_MAX_DIRTY parameters.
Some parameters hurt you when they're set too big, some hurt you when they're set too small, and some hurt you when they're not modified in unison so you've got to figure out what works best at your site.
Unfortunately, bouncing the engine is necessary so you'll get one chance per day to make incremental changes in a controlled fashion.
First, I just wanted to thank you for taking time to provide me with this valuable information. I have a few questions as a result of your last post.
You stated you have checkpoints every 4 hours; you mean 4 minutes, right? The value that defines the checkpoint interval (CKPTINTVL) is in seconds.
I know what information to look at in the "onstat -R" and "onstat -l" output. But what should I be looking FOR? In your experience, what information from each of the "onstat" outputs should concern me?
You're very welcome, and if I can help you a little, or point you in the right direction, then I've done my part to lend a hand. I've had no formal training aside from learning some lessons the hard way, reading documentation, and using BB resources like the two I listed, and I just found this one in some notes:
Like I said, I let the physical log take care of large buffer flushing when it hits 75% full rather than scheduling checkpoints, so I have automatic checkpoints set up for 14,400 seconds = 4 hours. I use this method so that if system activity is light, a checkpoint will take place during lunch time & the next will occur at quitting time or just beyond. I know it sounds rather unorthodox, but it was a method suggested in one of the URLs & it works for me.
Juggling # of buffers, # of LRUs & CLEANERS, and LRU MIN/MAX parameters got me out of hot water with long checkpoints. Initially, I tried doing checkpoints every 2 minutes, then every one minute & results were inconsistent if activity was heavy 'cause sometimes I still had 10-15 second checkpoints.
I have 2 telnet sessions running all day; one displays the line from onstat -R
(onstat -R -r 2 | grep queued) to monitor the # of dirty buffers, and the total number never exceeds 1400 'cause LRU writes are taking care of flushing small numbers of buffers.
I use onstat -l -r 2 | grep 200035, where 200035 is the number beneath phybegin (it's unique in the whole onstat -l output), to monitor how full the physical log is.
My physical log is 20000 KB, which is 5000 4k pages, and flushes will automatically occur when 3750 are dirty; I find this to be a manageable number.
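The physical-log numbers above work out as follows (a quick check using the values stated in the post: a 20000 KB log, 4 KB pages, and the 75% auto-flush threshold):

```shell
PHYSFILE_KB=20000
PAGE_KB=4
PAGES=$(( PHYSFILE_KB / PAGE_KB ))   # 5000 pages in the physical log
FLUSH_AT=$(( PAGES * 75 / 100 ))     # 3750 pages: the 75% auto-flush point
echo "$PAGES pages, flush at $FLUSH_AT"
# prints: 5000 pages, flush at 3750
```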
There's a useful script available at the oninit site: choose Informix Database, download (left side menu), scripts, Health Check. It's kind of basic, but it may point you in the right direction.
The Art Kagel document includes this reference regarding HP:
"HP/UX PA RISC has only four shared memory segment registers. If you have to access more than four shared memory segments concurrently, your process will become very slow. Informix needs access to at least: 1-Resident segment, 1-Virtual segment, and for each process 1-Message segment. If your engine has to add more than one additional virtual segment, the CPU VPs will become bogged down. If your application is already using its own shared memory and the engine allocates even one additional virtual segment, your application will bog down switching between the engine's shared memory and its own. "
onstat -g seg will show how many "V" segments you have; if you have more than 1, that indicates SHMVIRTSIZE may be too small, causing allocation of additional segments of SHMADD size.
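A rough one-liner for that check (a sketch only; it assumes the virtual segments appear as a standalone V in the class column of onstat -g seg output, which may vary by IDS version):

```shell
# Count virtual shared-memory segments; a count > 1 suggests
# SHMVIRTSIZE is too small and SHMADD segments are being allocated.
onstat -g seg | grep -cw V
```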
I have found where my problem lies; however, I have no idea how to correct it. I turned on TRACEFUZZYCKPT to monitor the checkpoint performance more closely. The problem is not the number of dirty buffers to be flushed. The problem is with dskflush(); this is the part of the checkpoint that is taking ALL the time. Only one problem ... I can't find any information regarding it, and I have no idea what to do to correct it.
I think your problem is the very small PHYSFILE: only 6 MB!
Increase it to 200 MB or 300 MB.
How much memory does your server have?
I see that you have only a 512 MB buffer pool.
If you can increase it, do so.