We've got an ordinary MSA replication setup, one repserver replicating to two replicates.
What we've been seeing for some time now is that replication to one of the two replicates is much much slower than to the other. Ironically, the one that's always behind is more powerful than the other. Now, we do run reports on this box, but even when there are no reports running, if data builds up in the stable queues, the offending server still seems to process it more slowly.
The configs of the replicate servers, both within Sybase and at the Unix level, are very similar, except that the slow server is more powerful, with more memory, and has a slightly 'increased' config to suit.
I've run a trace on the repserver_maint user process and it appears to be processing sql fairly quickly, usually with around 2-3 seconds between command executions.
Thinking about it, I should do the same on the other replicate and see how that one is performing. I'll have to wait for the next opportunity when we're seeing a backlog in the queues.
Otherwise, if anyone has any other ideas or suggestions then I'd love to hear about them.
Repserver box is a Sun-Fire V440, 4GB RAM, dual 1062MHz US-IIIi's (Sol 8) (Repserver 12.6)
Good Replicate is a Sun-Fire 880, 16GB RAM, 8-way 1200MHz US-III+'s (Sol 8) (ASE 12.5.4)
Bad Replicate is a Netra-1280 (T12), 24GB RAM, 12-way 1200MHz US-III+'s (Sol 8) (ASE 12.5.4)
ASE's have the same ESD level.
Can you check what the slow server was executing at that time? It might be a case where, for a particular table, an index exists on the fast replicate but not on the slow one. (Normally that should not be the case; I am only assuming it because I once faced an issue where an index on the replicate side had been dropped by mistake.) Also check that the indexes on the slow side are not corrupted.
Can you get the showplan for the maint user that was applying transactions on both sides?
Many thanks for your input. At times when it's been going slow I've examined the sql running, the query plans (where possible), indexes, and the last update on the statistics and all looks fine (with the exception of certain queries which have no suitable index and are updated on all columns, but both replicates have to deal with the same thing anyway). I've even tried injecting object statistics direct from the primary using optdiag and it's made no difference. Not only that, but I've also seen this happen immediately after both replicates have been refreshed (dump & load) from the primary. So it's not indexes, and it's not statistics. Either way, both replicates are processing the same thing with recent copies of the primary db. This is why I'm so stumped.
We've also monitored the repserver using admin who,sqt and tuned the value for sqt_max_cachesize. All to no avail.
We've rebuilt replication several times for one reason or another and we do this the same way for each server, so the connections and setup are exactly the same. So I guess maybe it's some quirk with repserver, or perhaps the Netra architecture doesn't perform as well as the Sun-Fire 880?
The only other thing I can add, more for information, is that we've added around 30 table rep defs with replicate minimal columns to help improve performance and this has given a tremendous improvement (but yes, we're still seeing the one server lagging behind).
At this stage I'm thinking I'll need to write a script to monitor the actual throughput for each replicate...
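A throughput monitor along those lines could be sketched roughly as below. This is only a minimal sketch, not the script from this thread: it assumes you can already sample, for each replicate, a cumulative count of commands applied (where that counter comes from is left open), and the names `Sample` and `throughput` are hypothetical. The readings in the example are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One reading of a cumulative 'commands applied' counter (hypothetical source)."""
    taken_at: float  # seconds since some fixed starting point
    applied: int     # cumulative commands applied at that moment


def throughput(samples: List[Sample]) -> List[float]:
    """Return commands per second for each pair of consecutive samples."""
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        elapsed = cur.taken_at - prev.taken_at
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order readings
        rates.append((cur.applied - prev.applied) / elapsed)
    return rates


if __name__ == "__main__":
    # Made-up readings, purely to show the shape of the comparison.
    fast = [Sample(0, 0), Sample(60, 6000), Sample(120, 12600)]
    slow = [Sample(0, 0), Sample(60, 1800), Sample(120, 3000)]
    print("fast replicate:", throughput(fast))
    print("slow replicate:", throughput(slow))
```

Sampling both replicates on the same schedule makes the two series directly comparable, which matters more here than the absolute numbers.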
Since you mentioned that you have changed repdefs to minimal columns, check if you have any autocorrection enabled in one site for any set of tables in subscriptions but not the other site. That could kill your performance for sure.
KevR/trvishi, many thanks for your input. Very useful!
Yes, there's only the one repserver.
Auto correction is definitely off, we simply haven't used it at all.
With regard to the datacentres, that's a very good point! In fact, they're in different datacentres, but surprisingly it's the lagging server which is actually local to the primary. That's not to say that the network isn't the problem: it could be that there's a faulty or overloaded switch local to the primary which is causing the problem, whereas the route to the other site might be OK. I'll look into that and do some benchmarking to each replicate.
trvishi - thanks for the offer, I'll have a look around the web and will certainly get back to you if I can't find anything.
As they're in different datacentres the network is certainly worth a look. I'd start by ftping a largish file from the repserver host to both the replicate hosts and timing the transfer. If there's a big difference between the two, maybe it's not the ASE or repserver config at all, and it's possible that there's not much you can do at the database level. In which case it could be network or SAN config.
I've worked at sites that use NetBackup and we've used the backup network for heavily replicated systems. It's obviously busy at night, but where most of the rep work was done during the day, we would utilise the quiet network during business hours. We'd set up an additional listener on the alternative interface, give that one to the repserver in its interfaces file and replicate without touching the app-facing network. Don't know if you have a separate backup network at your site, but maybe worth a try if you do?
Transfer rate to the DR site is 6.0MB/s
Transfer rate to the local (slow replicate) site is 12.6MB/s
Repserver box is also on the same subnet as the slow replicate.
Still a mystery.
On the bright side, I've written a script to monitor throughput, so maybe that will give some clues as to what's going on.
I've already compared the two ASE configs - all looks good to me. They are not configured the same, but they are properly configured for their respective hardware environments - number of processors, memory, etc. I might revisit it for good measure.
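If the comparison is revisited, one way to make it less eyeball-dependent is to load each server's parameters into a dictionary and list only the ones that differ. The sketch below assumes the two configs have already been extracted into name/value pairs by whatever means you prefer; the function name `config_diff` is hypothetical and the values shown are made up for illustration.

```python
from typing import Dict, Tuple


def config_diff(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {parameter: (value_in_a, value_in_b)} for every parameter that differs.

    A parameter present on only one side is reported with "<missing>" for the other.
    """
    diff = {}
    for name in sorted(set(a) | set(b)):
        va = a.get(name, "<missing>")
        vb = b.get(name, "<missing>")
        if va != vb:
            diff[name] = (va, vb)
    return diff


if __name__ == "__main__":
    # Illustrative values only; they are NOT the real settings of either replicate.
    good = {"max online engines": "8", "number of locks": "500000"}
    bad = {"max online engines": "12", "number of locks": "500000"}
    for name, (g, b) in config_diff(good, bad).items():
        print(f"{name}: good={g} bad={b}")
```

Differences that simply track the hardware (engines, memory) can then be ticked off, leaving a short list of anything unexpected.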
Sorry, I wasn't very clear with that, here's what I should have put:
repserver to fast replicate: 6.0MB/s
repserver to slow replicate: 12.6MB/s
That was a straight scp of a 200MB file.
OK. So, that doesn't help.
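To put those two figures side by side, here is the arithmetic as a small sketch. It takes the 200MB file size and the two measured rates at face value; the helper name `transfer_seconds` is just for illustration.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time in seconds to move size_mb at a steady rate_mb_per_s."""
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    size_mb = 200.0
    fast_rate = 6.0    # repserver to fast replicate, MB/s
    slow_rate = 12.6   # repserver to slow replicate, MB/s

    print("fast replicate:", round(transfer_seconds(size_mb, fast_rate), 1), "s")
    print("slow replicate:", round(transfer_seconds(size_mb, slow_rate), 1), "s")
    print("slow/fast bandwidth ratio:", round(slow_rate / fast_rate, 1))
```

In other words, the link to the slow replicate moved the file in roughly half the time, so raw bandwidth does not explain the lag.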
I guess then we have to go back to the basics.
Is there any difference in the type of replication between the two paths? I.e. is one a warm-standby replication and the other a table-to-table replication? If so, post which one is warm-standby and which one is table-to-table (fast/slow).