[Linux-cluster] Kernel messages causing node to be fenced out

Fri Jan 26 15:09:51 UTC 2007

Hi,

We have a setup with two HP DL360 nodes connected to an MSA500 disk array
via SCSI cables. We are running RHEL4 Update 3 (RH4U3), and our product has
an active/passive design. The active/passive failover is managed internally
by the product.

Every now and then, one of the nodes logs the kernel messages below, after
which the other node fences it. This causes a failover of our product.

Jan 19 13:38:58 n1 kernel: FS1 move flags 0,1,0 ids 0,2,0
Jan 19 13:38:58 n1 kernel: FS1 move use event 2
Jan 19 13:38:58 n1 kernel: FS1 recover event 2 (first)
Jan 19 13:38:58 n1 kernel: FS1 add nodes
Jan 19 13:38:58 n1 kernel: FS1 total nodes 1
Jan 19 13:38:58 n1 kernel: FS1 rebuild resource directory
Jan 19 13:38:58 n1 kernel: FS1 rebuilt 0 resources
Jan 19 13:38:58 n1 kernel: FS1 recover event 2 done
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,0,1 ids 0,2,2
Jan 19 13:38:58 n1 kernel: FS1 process held requests
Jan 19 13:38:58 n1 kernel: FS1 processed 0 requests
Jan 19 13:38:58 n1 kernel: FS1 recover event 2 finished
Jan 19 13:38:58 n1 kernel: FS1 move flags 1,0,0 ids 2,2,2
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,1,0 ids 2,5,2
Jan 19 13:38:58 n1 kernel: FS1 move use event 5
Jan 19 13:38:58 n1 kernel: FS1 recover event 5
Jan 19 13:38:58 n1 kernel: FS1 add node 2
Jan 19 13:38:58 n1 kernel: FS1 total nodes 2
Jan 19 13:38:58 n1 kernel: FS1 rebuild resource directory
Jan 19 13:38:58 n1 kernel: FS1 rebuilt 7409 resources
Jan 19 13:38:58 n1 kernel: FS1 purge requests
Jan 19 13:38:58 n1 kernel: FS1 purged 0 requests
Jan 19 13:38:58 n1 kernel: FS1 mark waiting requests
Jan 19 13:38:58 n1 kernel: FS1 marked 0 requests
Jan 19 13:38:58 n1 kernel: FS1 recover event 5 done
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,0,1 ids 2,5,5
Jan 19 13:38:58 n1 kernel: FS1 process held requests
Jan 19 13:38:58 n1 kernel: FS1 processed 0 requests
Jan 19 13:38:58 n1 kernel: FS1 resend marked requests
Jan 19 13:38:58 n1 kernel: FS1 resent 0 requests
Jan 19 13:38:58 n1 kernel: FS1 recover event 5 finished
Jan 19 13:38:58 n1 kernel: FS1 send einval to 2
Jan 19 13:38:58 n1 kernel: FS1 send einval to 2
Jan 19 13:38:58 n1 kernel: FS1 unlock ff9b0297 no id
Jan 19 13:38:59 n1 kernel:  -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2

Then the other node reports "missed too many heartbeats" and fences it. It
does some minor recovery work, and everything is fine afterwards.

Is this a bug? The two nodes do not appear to be doing much when this
happens.
We have seen this on another similar setup (two DL360s and an MSA500). It
seems to happen quite regularly.
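
To put a number on "quite regularly", a rough sketch like the one below could
count the failed punlock exits per node and timestamp in a syslog file. It
assumes the log lines have the same shape as the excerpt above, and it treats
the "-2" after "ex punlock" as an error return, which is my reading of the
messages rather than something confirmed. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches syslog kernel lines shaped like the excerpt, e.g.
#   Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)

# Assumption: "<pid> ex punlock <negative number>" marks a failed unlock.
PUNLOCK_FAIL_RE = re.compile(r"^\d+ ex punlock -\d+$")


def count_punlock_failures(lines):
    """Count failed punlock exits, keyed by (host, timestamp)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a kernel line in the expected format
        if PUNLOCK_FAIL_RE.match(match.group("msg").strip()):
            counts[(match.group("host"), match.group("stamp"))] += 1
    return counts
```

Feeding it the lines of /var/log/messages from each node (for example
`count_punlock_failures(open("/var/log/messages"))`) would show whether the
bursts line up with the fencing events.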

I remember seeing something similar mentioned on a mailing list, and
Patrick Caulfield answered:

"If you're running the cman from RHEL4 Update 3 then there's a bug in there
you might be hitting.

You'll need to upgrade all the nodes in the cluster to get rid of it. I
can't tell for sure if it is that problem you're having without seeing more
kernel messages though."

Any ideas?

Thanks.

-- 
Coman ILIUT

Mitel Networks
Ottawa, ON