Unanswered: Performance issues with PostgreSQL 9.2 on RHEL 6
We're doing performance testing of an OLTP product built on Java (hosted on JBoss/Tomcat) and PostgreSQL. It is supposed to process around 9,000 user requests per hour (each request takes several seconds because of the large data volume being submitted) on a RHEL VM with 32 GB RAM and 16 cores.
It works fine up to around 6,000-7,000 requests, but when we send around 9,000 requests, Postgres starts throwing exceptions in its log. The errors we get are one or more of:
* Could not fork new process for connection. Resource unavailable
* Checkpoints are happening too frequently
* Unable to write log
* Other similar errors
Consequently, on the Java side we get an error stating that it can't obtain more connections from the database. The checkpoint error went away after we changed the frequency at which checkpoints are taken, but the fork-related error remains.
Another interesting point is that CPU utilization only reaches a maximum of about 50%, and that is exactly when this error starts appearing.
Any idea why RHEL is not allowing Postgres to fork more processes for the new connections when 50% of the CPU is still available? This is a dedicated DB server for this product and no other applications will run on it.
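One thing worth checking (an assumption based on the symptoms, not confirmed by the logs above): a fork failure at 50% CPU usually means a resource limit, not a hardware limit. On RHEL 6, `/etc/security/limits.d/90-nproc.conf` caps non-root users at 1024 processes by default, and since PostgreSQL forks one backend per connection, a high `max_connections` can hit that ceiling long before the CPU is saturated. A quick diagnostic sketch, run as the OS user that owns the server (assumed here to be `postgres`):

```shell
# Show the per-user process limit in effect for this shell.
ulimit -u

# Count how many processes the postgres user is already running;
# errors are ignored if the user does not exist on this machine.
ps -u postgres --no-headers 2>/dev/null | wc -l

# RHEL 6 ships a default nproc cap for non-root users; show it if present.
cat /etc/security/limits.d/90-nproc.conf 2>/dev/null || true
```

If the process count approaches the `ulimit -u` value under load, raising the `nproc` limit for the postgres user (or pooling connections so fewer backends are needed) would be the next thing to try.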
Help on this will be really appreciated. We have been stuck on this for about a week now and have tried various permutations and combinations. Any voodoo artists out there who know the black magic that runs this are welcome to please share their spells.
Have you looked at the output of the pg_stat_bgwriter view? It will give you an idea of what is triggering the checkpoint writes. If the checkpoints are being driven by timing, you'll want to adjust settings to make the checkpoint writes more frequent (and thus smaller). With a checkpoint completion target of 0.9, time-triggered checkpoint writes would only leave 18 seconds to complete before collisions with new WAL files could happen.
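The triggering reason can be read directly from the view, e.g. `SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;`. For 9.2 specifically, checkpoint spacing is controlled by `checkpoint_segments` and `checkpoint_timeout` (`max_wal_size` did not exist until 9.5). An illustrative postgresql.conf fragment; the values are assumptions to tune from, not settings taken from this thread:

```
# postgresql.conf (PostgreSQL 9.2) -- illustrative values only
checkpoint_segments = 32            # more WAL segments between xlog-driven checkpoints
checkpoint_timeout = 10min          # spacing for time-driven checkpoints
checkpoint_completion_target = 0.9  # spread writes over 90% of the interval
log_checkpoints = on                # log what triggers each checkpoint
```

With `log_checkpoints = on`, the server log will say whether each checkpoint was started by "time" or by "xlog", which removes the guesswork.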
Even with 32 GB of main memory, the shared_buffers size seems a bit large. The minimum for 1,400 connections would be 2.8 GB. I've seen reports that large systems work well with 6-15% of available RAM; 15% of (32-14) GB would be 2.7 GB, so setting it to 3-4 GB may be helpful. This would also allow you to increase effective_cache_size to the 17-18 GB range, which could also help.
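Assuming the 32 GB box is dedicated to PostgreSQL, the sizing argument above translates to something like the following postgresql.conf fragment; the exact numbers are illustrative, not measured for this workload:

```
# postgresql.conf -- sizing sketch for a dedicated 32 GB server
shared_buffers = 4GB             # ~12% of RAM, within the 6-15% range suggested above
effective_cache_size = 17GB      # planner hint: OS page cache plus shared_buffers
```

Note that effective_cache_size does not allocate memory; it only tells the planner how much caching to expect, so it is safe to set it optimistically.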