I'm looking for some help with a scalability problem we're having. We have a large database (1800+ tables) and some very complex batch processes. These batch processes are broken up into thousands of jobs and run in separate threads by a multithreaded Open Client app we run alongside ASE.
We are finding that when we add CPUs to the box, the scalability simply isn't there. Throughput scales roughly linearly at 1, 2 and 3 CPUs, falls away from linear after 3 CPUs, and actually deteriorates after 6 CPUs. We see similar results whether we run 10, 32, 64 or 128 threads (Open Client connections running the batch jobs).
We typically see 85% of max throughput at 3 CPUs, max throughput at 6 CPUs, and back down to 82% of max throughput with 8 and 10 CPUs.
We have essentially eliminated application-level blocking: the threads very rarely block each other, and we don't see any deadlocks.
We have 8 hdisks in a RAID 10 configuration for the log and 24 hdisks, also RAID 10, for the data. From extensive testing we believe it may be a log IO issue, but we're having trouble getting guidance from IBM on how to tell whether we're IO-bound on the log array.
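In case it's useful to anyone looking at this: on the ASE side (while we wait on IBM for the OS-level view) we've been checking per-device IO through the MDA proxy tables. This assumes MDA is installed and 'enable monitoring' / 'deviceio statistics active' are on; the device-name filter below is just a placeholder for our actual log device names:

```sql
-- Per-device IO counters (cumulative since monitoring was enabled).
-- A high DevSemaphoreWaits relative to DevSemaphoreRequests suggests
-- contention on the device; IOTime is total time spent doing IO on it.
select LogicalName,
       Reads, Writes, IOTime,
       DevSemaphoreRequests, DevSemaphoreWaits
from master..monDeviceIO
where LogicalName like 'log%'   -- adjust to your log device names
order by IOTime desc
```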
If anyone has any thoughts they would be most appreciated. Please email me if you would like to see the full sp_sysmon report. Please see details of box spec, mda wait info and abbreviated sp_sysmon output below.
Thanks very much in advance,
System Model: IBM,7040-681
Processor Type: PowerPC_POWER4
Number Of Processors: 11
Processor Clock Speed: 1904 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 1 DOMAIN A
Memory Size: 7168 MB
Good Memory Size: 7168 MB
Firmware Version: IBM,RG040719_regatta
Adaptive Server Enterprise/12.5.3/EBF 12146/P/RS6000/AIX 5.1/ase125x/1883/64-bit/FBO/Thu Nov 11 22:31:37 2004
IBM ESS 800 -128 Disks
mda Wait Output... ASE server: XXXXXXXX (12.5.3/EBF 12146)
Sampling period: 25-Jan-2005 11:40:10 - 11:42:10 (120 seconds)
Wait event times for: entire ASE server
WaitSecs NrWaits WaitEvent WtEvtID
----------- ----------- -------------------------------------------------- --------
3955 1404053 waiting for disk write to complete 52
1192 147340 waiting while no network read or write is required 179
1118 2256657 waiting on run queue after sleep 215
959 228421 waiting for semaphore 150
495 5620 waiting for incoming network data 250
480 8 xact coord: pause during idle loop 19
298 100170 waiting for disk write to complete 51
279 151559 waiting for network send to complete 251
152 48647 waiting for lock on PLC 272
143 80819 wait to acquire latch 41
131 5 hk: pause for some time 61
127 31542 wait for buffer read to complete 29
120 4 wait until an engine has been offlined 104
120 1 waiting for date or time in waitfor command 260
117 32 waiting while allocating new client socket 178
117 4 checkpoint process idle loop 57
Total #spids in ASE server: 87 (system: 12; user: 75)
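For anyone who wants to reproduce the wait table above: it comes from the MDA tables, roughly as below (assumes MDA is installed; the 120-second window was produced by diffing two snapshots, which is omitted here for brevity — the raw counters are cumulative):

```sql
-- Server-wide wait events, worst first. WaitTime units can vary by
-- version; check master..monTableColumns if in doubt.
select w.WaitTime, w.Waits, i.Description, w.WaitEventID
from master..monSysWaits w,
     master..monWaitEventInfo i
where w.WaitEventID = i.WaitEventID
order by w.WaitTime desc
```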
Increasing the number of CPUs/engines does not necessarily guarantee a performance increase.
From the sysmon output, you can see that the engines are less than 50% busy. As you increase the number of engines, the degree of parallelism increases, but so does the overhead ASE incurs managing all the engines and the run queue on each engine.
You could try binding processes to engines to increase throughput.
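In 12.5 that binding is done through engine groups and execution classes. A rough sketch, in case it helps — the engine numbers, group, class and login names here are made up, so substitute your own:

```sql
-- Put engines 0-2 into an engine group (sp_addengine creates the
-- group if it doesn't exist), define an execution class on that
-- group, and bind the batch login to the class.
sp_addengine 0, 'batch_engines'
go
sp_addengine 1, 'batch_engines'
go
sp_addengine 2, 'batch_engines'
go
sp_addexeclass 'batch_class', 'HIGH', 0, 'batch_engines'
go
-- 'LG' = bind by login; could also bind by application ('AP').
sp_bindexeclass 'batchuser', 'LG', NULL, 'batch_class'
go
```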
Hi there Tad, our server-side multithreaded app is a parent/child pair: the parent is an OpenServer (basically a controller process), and the child is a multithreaded OpenClient. The OpenClient process has one thread that polls the DB for jobs and dispatches each one to a newly spawned thread, or queues it on an existing thread if the thread pool is maxed out. Each thread has its own connection to the database and executes its jobs.
Thanks very much for your input so far. With more understanding of our Open Client/Server app, if you have any more ideas, we'd love to hear them.