Hi all,

I'm doing some testing of Postgres 9.0 archiving and streaming replication between a couple of Solaris 10 servers. Recently I was trying to test how well the standby server catches up after an outage, and a question arose.

It seems that if the primary cannot reach the standby while it is trying to archive a WAL file, it attempts the copy three times and then logs that the file could not be archived because there were too many failures. See:

ssh: connect to host 172.18.131.212 port 22: Connection timed out
lost connection
LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: scp pg_xlog/000000010000000000000006 postgres@172.18.131.212:/postgres/postgres/9.0-pgdg/primary_archive
ssh: connect to host 172.18.131.212 port 22: Connection timed out
lost connection
LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: scp pg_xlog/000000010000000000000006 postgres@172.18.131.212:/postgres/postgres/9.0-pgdg/primary_archive
ssh: connect to host 172.18.131.212 port 22: Connection timed out
lost connection
LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: scp pg_xlog/000000010000000000000006 postgres@172.18.131.212:/postgres/postgres/9.0-pgdg/primary_archive
WARNING: transaction log file "000000010000000000000006" could not be archived: too many failures


But then the primary repeats this whole three-attempt cycle another 49 times! So 150 attempts in all.
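
To make the counting concrete, here is a rough sketch of the pattern as I read it from the log. It is only a model of what I observed, not PostgreSQL's actual code. The names ATTEMPTS_PER_CYCLE, OBSERVED_CYCLES and count_attempts are my own, and I'm assuming each cycle ends with the same "too many failures" warning.

```python
# Model of the retry pattern seen in the log; NOT PostgreSQL's implementation.
# The constant and function names are hypothetical, chosen for illustration.
ATTEMPTS_PER_CYCLE = 3   # failed copies logged before the "too many failures" warning
OBSERVED_CYCLES = 50     # the first cycle plus the 49 repeats I counted


def count_attempts(cycles, attempts_per_cycle, archive_ok):
    """Return (attempts, warnings) after running up to `cycles` retry cycles.

    archive_ok is a callable standing in for the scp copy; it returns True
    when the copy succeeds.
    """
    attempts = 0
    warnings = 0
    for _ in range(cycles):
        for _ in range(attempts_per_cycle):
            attempts += 1
            if archive_ok():
                # Segment archived, so no further retries are needed.
                return attempts, warnings
        # Every attempt in this cycle failed (assumed: one warning per cycle).
        warnings += 1
    return attempts, warnings


# Standby unreachable for the whole test, so every copy fails.
total, warned = count_attempts(OBSERVED_CYCLES, ATTEMPTS_PER_CYCLE, lambda: False)
print(total, warned)
```

With every copy failing, this gives the 150 attempts mentioned above. The open question is whether the 3 and the 50 are fixed or can be changed.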

What I need to know is whether these numbers are configurable. Can the timing of the retries be controlled? And how long is it before the primary stops retrying altogether?

Any help appreciated. Thanks!
Dan