  1. #1
    Join Date
    Mar 2009
    Posts
    3

    Unanswered: DB2 hadr on standby hangs after disconnect from primary

    Hi,
    We are running DB2 v8.2 fixpack 14 on Linux SLES 10. HADR is configured in SYNC mode with a HADR_TIMEOUT value of 60 seconds. During periods of high activity, the primary database loses its connection to the standby database (not sure why), which blocks update transactions for 60 seconds. The only way to re-establish the HADR connection is to go to the standby server and issue a db2_kill, because the deactivate of the standby database "hangs". A db2start and activate of the standby database then re-establishes the connection, and all is well until the next time this happens. Has anyone else experienced something similar? Is it correct that DB2 on the primary does not automatically re-establish the connection, and why does the standby database seem to be in this "hanging" state when we try to deactivate it?

  2. #2
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Very few people run in SYNC mode, because NEARSYNC provides basically the same level of redundancy with less overhead.

    I would check the db2diag.log on both machines.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  3. #3
    Join Date
    Mar 2009
    Posts
    3
    Thanks Marcus, unfortunately the db2diag.log files are not very helpful.

    Primary log file after 60 second timeout
    2009-03-19-12.42.57.525017+120 I2619992E513 LEVEL: Error
    PID : 4052 TID : 47398647731840PROC : db2hadrp (BPRP) 0
    INSTANCE: db2inst1 NODE : 000 DB : BPRP
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
    MESSAGE : Did not receive anything through HADR connection for the duration of
    HADR_TIMEOUT. Closing connection.
    DATA #1 : Hexdump, 4 bytes
    0x00007FFFD38B887C : 3D00 0000 =...

    2009-03-19-12.42.57.525227+120 I2620506E338 LEVEL: Severe
    PID : 4052 TID : 47398647731840PROC : db2hadrp (BPRP) 0
    INSTANCE: db2inst1 NODE : 000 DB : BPRP
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
    RETCODE : ZRC=0x00000000=0=PSM_OK "Unknown"

    2009-03-19-12.42.57.525308+120 E2620845E354 LEVEL: Event
    PID : 4052 TID : 47398647731840PROC : db2hadrp (BPRP) 0
    INSTANCE: db2inst1 NODE : 000 DB : BPRP
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
    CHANGE : HADR state set to P-RemoteCatchupPending (was P-Peer)


    Status in the snapshot shows : Disconnected

    Nothing appears in the standby db2diag.log until the database is deactivated.

    Secondary log file after deactivate of database
    2009-03-19-13.13.04.353260+120 I19685252E394 LEVEL: Warning
    PID : 19375 TID : 47600804660864PROC : db2agent (BPRP) 0
    INSTANCE: db2insta NODE : 000 DB : BPRP
    APPHDL : 0-1632 APPID: *LOCAL.db2insta.090319111304
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21151
    MESSAGE : Info: HADR Startup has begun.

    2009-03-19-13.13.04.360058+120 I19685647E413 LEVEL: Error
    PID : 2123 TID : 47600804660864PROC : db2redom (BPRP) 0
    INSTANCE: db2insta NODE : 000 DB : BPRP
    APPHDL : 0-1653
    FUNCTION: DB2 UDB, recovery manager, sqlpshrScanNext, probe:1450
    RETCODE : ZRC=0x80100003=-2146435069=SQLP_LINT "Interrupt from application"
    DIA8003C The interrupt has been received.

    2009-03-19-13.13.04.360729+120 I19686061E413 LEVEL: Error
    PID : 2123 TID : 47600804660864PROC : db2redom (BPRP) 0
    INSTANCE: db2insta NODE : 000 DB : BPRP
    APPHDL : 0-1653
    FUNCTION: DB2 UDB, recovery manager, sqlpPRecReadLog, probe:1275
    RETCODE : ZRC=0x80100003=-2146435069=SQLP_LINT "Interrupt from application"
    DIA8003C The interrupt has been received.

    2009-03-19-13.13.06.272529+120 I19686475E413 LEVEL: Error
    PID : 2123 TID : 47600804660864PROC : db2redom (BPRP) 0
    INSTANCE: db2insta NODE : 000 DB : BPRP
    APPHDL : 0-1653
    FUNCTION: DB2 UDB, recovery manager, sqlpPRecReadLog, probe:1280
    RETCODE : ZRC=0x80100003=-2146435069=SQLP_LINT "Interrupt from application"
    DIA8003C The interrupt has been received.

    Nothing else gets written to the log file until I do the db2_kill. In the past I have waited longer than 10 minutes for my deactivate command to respond.

    I will try setting the mode to NEARSYNC before our month-end change freeze and see what happens. During this period the HADR heartbeat breaks almost every day on the busy database. The less busy databases are not affected.

  4. #4
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    You probably need to do some network tests between your primary and standby. If the machines are located too far apart (different buildings or cities) or go through too many network hops, you may need HADR asynchronous mode. Or perhaps there is just a problem with your particular network that could be fixed. Doing some file transfer tests (large and small files) to determine the speed may suffice, but I would also consult with your network staff.
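
    As a lighter-weight complement to the file transfer tests, here is a minimal Python sketch that times repeated TCP connects from one server to the other. It measures connection round-trip time only, not throughput, so it does not replace a real transfer test. The function names are made up for illustration, and the host and port you pass in are placeholders: point it at any service already listening on the other machine (SSH, for example) rather than at the HADR port itself.

```python
import socket
import statistics
import time


def connect_latency_ms(host, port, attempts=5, timeout=3.0):
    """Time TCP connects to host:port and return each duration in milliseconds."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # create_connection raises OSError if the connect fails or times out.
        with socket.create_connection((host, port), timeout=timeout):
            samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Reduce a list of latency samples to min, median, and max."""
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Example call (hypothetical target, not executed here):
#   print(summarize(connect_latency_ms("10.65.1.50", 22, attempts=20)))
```

    A large gap between the median and the maximum during busy periods would be one hint that the link, or the machine at the other end, stalls intermittently.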

    For high volume HADR applications I use a private Ethernet connection between the HADR primary and standby servers. This requires an extra NIC on each machine (we actually use a pair of bonded NICs on each machine for redundancy) that are hooked together without any other network connections (if they are close enough together you can simply use a crossover cable without any switch, router, hub, etc). That way the HADR log traffic between primary and standby has its own private network and cannot be slowed down by any other traffic on the network.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  5. #5
    Join Date
    Aug 2010
    Posts
    11
    Many years later.....

    DB2 9.7 fp5 workgroup ed.

    I get a pretty much identical problem and identical db2diag.log messages. I've done many network tests to verify that the network between the servers is fine.

    In the db2diag.log on the HADR server it looks like this:


    2012-07-03-15.05.41.742040-300 I413488E540 LEVEL: Error
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
    MESSAGE : Did not receive anything through HADR connection for the duration of
    HADR_TIMEOUT. Closing connection.
    DATA #1 : Hexdump, 4 bytes
    0x00007FE39CFFD0AC : B200 0000 ....

    2012-07-03-15.05.41.742140-300 I414029E365 LEVEL: Severe
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
    RETCODE : ZRC=0x00000000=0=PSM_OK "Unknown"

    2012-07-03-15.05.41.742185-300 E414395E381 LEVEL: Event
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
    CHANGE : HADR state set to S-RemoteCatchupPending (was S-Peer)

    2012-07-03-15.05.41.842540-300 I414777E393 LEVEL: Warning
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32201
    MESSAGE : Info: HADR Socket send buffer size, SO_SNDBUF: 16384 bytes

    2012-07-03-15.05.41.842624-300 I415171E396 LEVEL: Warning
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32251
    MESSAGE : Info: HADR Socket receive buffer size, SO_RCVBUF: 87380 bytes

    2012-07-03-15.05.41.986606-300 I415568E399 LEVEL: Warning
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe:30532
    MESSAGE : Info: HADR Socket send buffer size adjusted to, SO_SNDBUF: 16384 byte

    2012-07-03-15.05.41.986677-300 I415968E414 LEVEL: Warning
    PID : 28370 TID : 140615568320256PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 738 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe:30534
    MESSAGE : Info: HADR Socket receive buffer size adjusted to, SO_RCVBUF: 87380
    bytes

  6. #6
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    You need to post the HADR section of the database configuration on both the primary and standby.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  7. #7
    Join Date
    Aug 2010
    Posts
    11
    HADR database role = PRIMARY
    HADR local host name (HADR_LOCAL_HOST) = 10.1.8.102
    HADR local service name (HADR_LOCAL_SVC) = 50092
    HADR remote host name (HADR_REMOTE_HOST) = 10.65.1.50
    HADR remote service name (HADR_REMOTE_SVC) = 50092
    HADR instance name of remote server (HADR_REMOTE_INST) = cs90prd
    HADR timeout value (HADR_TIMEOUT) = 120
    HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
    HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 0




    HADR database role = STANDBY
    HADR local host name (HADR_LOCAL_HOST) = 10.65.1.50
    HADR local service name (HADR_LOCAL_SVC) = 50092
    HADR remote host name (HADR_REMOTE_HOST) = 10.1.8.102
    HADR remote service name (HADR_REMOTE_SVC) = 50092
    HADR instance name of remote server (HADR_REMOTE_INST) = cs90prd
    HADR timeout value (HADR_TIMEOUT) = 120
    HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
    HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 0

  8. #8
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    HADR database role = PRIMARY
    HADR local host name (HADR_LOCAL_HOST) = 10.1.8.102
    HADR local service name (HADR_LOCAL_SVC) = 58000
    HADR remote host name (HADR_REMOTE_HOST) = 10.65.1.50
    HADR remote service name (HADR_REMOTE_SVC) = 58001
    HADR instance name of remote server (HADR_REMOTE_INST) = cs90prd
    HADR timeout value (HADR_TIMEOUT) = 120
    HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
    HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 0

    HADR database role = STANDBY
    HADR local host name (HADR_LOCAL_HOST) = 10.65.1.50
    HADR local service name (HADR_LOCAL_SVC) = 58001
    HADR remote host name (HADR_REMOTE_HOST) = 10.1.8.102
    HADR remote service name (HADR_REMOTE_SVC) = 58000
    HADR instance name of remote server (HADR_REMOTE_INST) = cs90prd
    HADR timeout value (HADR_TIMEOUT) = 120
    HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
    HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 0

    Notice I have changed the local and remote service names (ports):

    1. Make sure they are different from the port number used for any DB2 instance as indicated in the "db2 get dbm cfg" output (including the cs90prd instance). HADR needs its own ports.
    2. Make sure they are not used by any other process and are unique to the above database. Two HADR databases in the same instance must have different HADR port numbers.
    3. Flip/flop them as shown above.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390
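
    The port rules listed above can be sketched as a small Python check. This is only an illustration under stated assumptions: `check_hadr_ports` is a made-up helper, the dictionaries are filled in by hand from the "db2 get db cfg" output, and the instance port 50000 is an assumed value, not one taken from this thread. It covers rules 1 and 3 only; whether some other process already holds a port (rule 2) has to be checked on the machines themselves.

```python
def check_hadr_ports(primary, standby, instance_ports):
    """Return a list of problems with a primary/standby HADR port pairing.

    primary and standby are dicts holding HADR_LOCAL_SVC and HADR_REMOTE_SVC;
    instance_ports is the set of ports used by DB2 instances on those hosts.
    """
    problems = []
    # Rule 3: each side's remote port must be the other side's local port.
    if primary["HADR_REMOTE_SVC"] != standby["HADR_LOCAL_SVC"]:
        problems.append("primary HADR_REMOTE_SVC != standby HADR_LOCAL_SVC")
    if standby["HADR_REMOTE_SVC"] != primary["HADR_LOCAL_SVC"]:
        problems.append("standby HADR_REMOTE_SVC != primary HADR_LOCAL_SVC")
    # Rule 1: HADR must not reuse a port that a DB2 instance listens on.
    for role, cfg in (("primary", primary), ("standby", standby)):
        if cfg["HADR_LOCAL_SVC"] in instance_ports:
            problems.append(f"{role} HADR_LOCAL_SVC is also a DB2 instance port")
    return problems


# Ports from the suggested configuration; the instance port 50000 is assumed.
primary = {"HADR_LOCAL_SVC": 58000, "HADR_REMOTE_SVC": 58001}
standby = {"HADR_LOCAL_SVC": 58001, "HADR_REMOTE_SVC": 58000}
print(check_hadr_ports(primary, standby, instance_ports={50000}))  # prints []
```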

  9. #9
    Join Date
    Aug 2010
    Posts
    11
    I disconnected HADR, changed the cfg, bounced the inst on the primary, and started everything back up. I will monitor the diag log and hope for the absence of this error:

    "Did not receive anything through HADR connection for the duration of HADR_TIMEOUT. Closing connection"

  10. #10
    Join Date
    Aug 2010
    Posts
    11
    I seem to get a lot more HADR errors in the log now, even though it's apparently working well enough to stay in sync.

    I let a ping run all night and it didn't miss a packet, so I still must rule out any network troubles. Packets: Sent = 20330, Received = 20330, Lost = 0

    Also, when I'm using the SSH console to the HADR system, I frequently have problems where the characters I type into the terminal don't echo for several seconds, sometimes minutes, as if the whole box grinds to a halt. CPU/RAM both seem to be acceptable. It eventually recovers and then repeats.

    While the standby is in its hung state, if I examine db2pd -hadr from the primary, I see that the standby is not rolling through transactions. Usually after only a few minutes the standby begins to roll through the transactions again, the box becomes responsive, and HADR gets back in sync.

    2012-07-04-03.50.35.775083-300 I470423E445 LEVEL: Error
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrRecvMsgS, probe:30080
    MESSAGE : HADR standby recv error:
    DATA #1 : Hexdump, 4 bytes
    0x00007FE3AE7F8E20 : 0100 0000 ....

    2012-07-04-03.50.35.775192-300 I470869E431 LEVEL: Severe
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20215
    RETCODE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
    "Communication with HADR partner was lost"

    2012-07-04-03.50.35.775452-300 E471301E381 LEVEL: Event
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
    CHANGE : HADR state set to S-RemoteCatchupPending (was S-Peer)

    2012-07-04-03.50.35.775549-300 I471683E393 LEVEL: Warning
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32201
    MESSAGE : Info: HADR Socket send buffer size, SO_SNDBUF: 16384 bytes

    2012-07-04-03.50.35.775608-300 I472077E396 LEVEL: Warning
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32251
    MESSAGE : Info: HADR Socket receive buffer size, SO_RCVBUF: 87380 bytes

    2012-07-04-03.50.35.776600-300 I472474E461 LEVEL: Severe
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
    MESSAGE : Failed to connect to primary. rc:
    DATA #1 : Hexdump, 4 bytes
    0x00007FE3AE7FD110 : 1900 0F81 ....


    2012-07-04-04.02.24.544269-300 I474211E397 LEVEL: Severe
    PID : 28370 TID : 140615861921536PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 839 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
    RETCODE : ZRC=0x810F0019=-2129723367=SQLO_CONN_REFUSED "Connection refused"
    Last edited by jsnyder9; 07-04-12 at 15:10.

  11. #11
    Join Date
    Aug 2010
    Posts
    11
    I'm seeing the original error as well. To get to this point I invoke a process which begins a reasonable number of transactions on the primary. The secondary server starts to hang up (shell and typing become very sluggish). The diag log starts to show the error:


    2012-07-04-12.20.55.918603-300 E535161E371 LEVEL: Event
    PID : 3388 TID : 140737001809664PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 48 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
    CHANGE : HADR state set to S-Peer (was S-NearlyPeer)

    2012-07-04-12.43.47.170795-300 I535533E540 LEVEL: Error
    PID : 3388 TID : 140737001809664PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 48 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
    MESSAGE : Did not receive anything through HADR connection for the duration of
    HADR_TIMEOUT. Closing connection.
    DATA #1 : Hexdump, 4 bytes
    0x00007FFFE2FFD0AC : 9200 0000 ....

    2012-07-04-12.43.47.170930-300 I536074E365 LEVEL: Severe
    PID : 3388 TID : 140737001809664PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 48 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
    RETCODE : ZRC=0x00000000=0=PSM_OK "Unknown"

    2012-07-04-12.43.47.171002-300 E536440E381 LEVEL: Event
    PID : 3388 TID : 140737001809664PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 48 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
    CHANGE : HADR state set to S-RemoteCatchupPending (was S-Peer)

    2012-07-04-12.45.29.876347-300 I536822E393 LEVEL: Warning
    PID : 3388 TID : 140737001809664PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 48 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32201
    MESSAGE : Info: HADR Socket send buffer size, SO_SNDBUF: 16384 bytes

    2012-07-04-12.45.29.879122-300 I537216E396 LEVEL: Warning
    PID : 3388 TID : 140737001809664PROC : db2sysc
    INSTANCE: cs90prd NODE : 000
    EDUID : 48 EDUNAME: db2hadrs (CS90PRD)
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32251
    MESSAGE : Info: HADR Socket receive buffer size, SO_RCVBUF: 87380 bytes
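
    To see how long the pair stays in peer state between drops, the state-change records in excerpts like the one above can be pulled out with a short Python sketch. It assumes each record begins with a timestamp line in the format shown in this thread and that the CHANGE line follows within the same record. The function name is made up, and the trailing UTC offset (for example -300) is ignored on the assumption that all entries in one log share it.

```python
import re
from datetime import datetime

# A db2diag record starts with a timestamp such as 2012-07-04-12.43.47.171002-300
TIMESTAMP = re.compile(r"^\s*(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})[+-]\d+\s")
STATE_CHANGE = re.compile(r"HADR state set to (\S+) \(was (\S+)\)")


def hadr_state_changes(log_text):
    """Return (timestamp, new_state, old_state) for each HADR state change."""
    changes = []
    current = None  # timestamp of the record currently being read
    for line in log_text.splitlines():
        stamp = TIMESTAMP.match(line)
        if stamp:
            current = datetime.strptime(stamp.group(1), "%Y-%m-%d-%H.%M.%S.%f")
            continue
        change = STATE_CHANGE.search(line)
        if change and current is not None:
            changes.append((current, change.group(1), change.group(2)))
    return changes
```

    Run against the excerpt above, it would report the move into S-Peer at 12:20:55 and the drop to S-RemoteCatchupPending at 12:43:47, which is about 23 minutes in peer state before the timeout.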

  12. #12
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by jsnyder9
    CPU/RAM both seem to be acceptable.

    Quote Originally Posted by jsnyder9
    The secondary server starts to hang up (shell and typing become very sluggish)
    I find these statements contradictory.

    My guess would be that the standby server is unable to keep up with the incoming transaction rate. If that's the case, you can either tune or upgrade the standby server, or try changing the HADR synchronization mode to ASYNC.

  13. #13
    Join Date
    May 2003
    Location
    USA
    Posts
    5,737
    Quote Originally Posted by jsnyder9
    Also, when I'm using the SSH console to the HADR system, I frequently have problems where the characters I type into the terminal don't echo for several seconds, sometimes minutes, as if the whole box grinds to a halt. CPU/RAM both seem to be acceptable. It eventually recovers and then repeats.
    I would recommend that you get this problem resolved first, which apparently is not related to HADR.
    M. A. Feldman
    IBM Certified DBA on DB2 for Linux, UNIX, and Windows
    IBM Certified DBA on DB2 for z/OS and OS/390

  14. #14
    Join Date
    Aug 2010
    Posts
    11
    When looking at db2top bottleneck on the secondary it looks like this, even when no transactions are occurring.


    => SessionCpu 191 100.00% 22.276299 db2replay
    => IO r/w 191 100.00% 33 db2replay
    => Memory 191 100.00% 640.0K db2replay

    --refresh

    => SessionCpu 191 100.00% 25.793601 db2replay
    => IO r/w N/A 0% 0 N/A
    => Memory 191 100.00% 640.0K db2replay

    --refresh and repeat

  15. #15
    Join Date
    Jun 2003
    Location
    Toronto, Canada
    Posts
    5,516
    Provided Answers: 1
    Quote Originally Posted by jsnyder9
    When looking at db2top bottleneck on the secondary it looks like this
    So? It tells you that of all DB2 processes it's db2replay that consumes most CPU time and memory. Would you expect anything else?
