Re: [Linux-cluster] node kicked out of cluster

On Wed, Feb 23, 2005 at 03:12:22PM -0800, Daniel McNeil wrote:
> On Tue, 2005-02-22 at 00:35, Patrick Caulfield wrote:
> > On Mon, Feb 21, 2005 at 05:34:23PM -0800, Daniel McNeil wrote:
> > > My latest test ran 49 hours before a node got kicked out.
> My test is doing a bunch of mount/umount's, so that is causing
> the service join/leave for the DLM lock space and file system
> mount group.  
> Are you saying leaving a DLM lock space causes a DLM recovery and
> that is what is leading to the 'No response to messages'?

Indirectly, we think.

> How does DLM hogging a cpu lead to 'no response'?
> BTW, this is running on 2 proc machines, so hogging one cpu
> still leaves one available.

Hmm, that makes it more, er, interesting.

> Who is not responding to what message?

It's cman that has sent a message to another node, to process the join. It
hasn't got back an ACK for that message after max_retries (default 3) attempts.
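
Roughly, the retry behaviour looks like the sketch below. This is illustrative
Python, not the actual cman code: the function and the two callbacks are
made-up names, and the only value taken from the description above is the
max_retries default of 3. It also assumes max_retries counts total send
attempts.

```python
def send_with_retries(send, wait_for_ack, max_retries=3):
    """Send a message, resending until an ACK arrives or attempts run out.

    send         -- hypothetical callable that transmits the message once
    wait_for_ack -- hypothetical callable that returns True if an ACK
                    arrived before its own timeout expired
    Returns the number of attempts used; raises TimeoutError if no ACK
    was received after max_retries attempts.
    """
    for attempt in range(1, max_retries + 1):
        send()
        if wait_for_ack():
            return attempt
    # No ACK after the last attempt: the sender gives up on the peer.
    raise TimeoutError(f"No response to messages after {max_retries} attempts")
```

If the other node is too busy to ACK within every one of those windows, the
sender gives up, which would match the 'No response to messages' symptom.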



