I know this is an old thread, but does anyone have any ideas on this? I'm seeing almost the same issue (only once so far; it just started). I get all of the errors above but haven't had the service stop or a cluster failover yet. I'm concerned that if the problem gets worse, we will start having failovers.
I've done some reading and have a couple of theories:
1. Cluster networking looks like it could be tweaked. Both the private and public connections are set up for internal and all communications. I'm wondering if this could cause problems with the heartbeat in certain circumstances? (Although it has worked fine for over 2 years.)
2. We upgraded our RAM from 16GB to 32GB a few weeks ago. I read a claim that it's possible to cache too much data in memory: if major updates are made to tables held in memory, then when it comes time to commit those changes to disk, the I/O subsystem can become saturated and unresponsive. Could that be what's happening here?
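One way I could test theory 2 would be to temporarily cap SQL Server's memory back near the pre-upgrade level and see whether the errors stop. A rough sketch of what that might look like is below. The 16384 value is purely my assumption based on the old 16GB figure, not a recommendation, and I haven't tried this yet. Depending on how memory is configured on the instance, the change may not take effect until the service is restarted.

```sql
-- Sketch only: cap the memory SQL Server will use so less dirty data can pile up in cache.
-- Assumption: 16384 (MB) is a placeholder taken from the old 16GB; choose a value that suits the server.

-- 'max server memory (MB)' is an advanced option, so advanced options must be visible first.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO

-- Set the upper limit, in megabytes.
EXEC sp_configure 'max server memory (MB)', 16384
RECONFIGURE
GO

-- Running sp_configure with only the option name shows the configured and running values.
EXEC sp_configure 'max server memory (MB)'
GO
```

If the errors go away with the cap in place, that would at least point toward the memory upgrade rather than the cluster networking.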
I'm still digging but I see no other issues on this server. There doesn't appear to be any memory pressure and there are no other problems in any of the logs etc.
The only other difference from the post above is that my server is a SQL Server 2000 Enterprise cluster, not SQL Server 2005.