
[Linux-cluster] Re: SMP and GFS



On Mon, Oct 03, 2005 at 11:51:51AM -0500, David Teigland wrote:
> On Sun, Oct 02, 2005 at 12:23:05PM +0200, Axel Thimm wrote:
> > On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval
> > > d6c30333 fr 1 r 1        2
> > > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> > > Jul 14 14:19:35 atmail-2 last message repeated 2 times
> 
> > I found similar log sniplets on a RHEL4U1 machine with dual Xeons (HP
> > Proliant). The machine crashed with a kernel panic shortly after
> > telling the other nodes to leave the cluster (sorry the staff was
> > under pressure and noone wrote down the panic's output):
> > 
> > Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> > Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> 
> These "einval" messages from the dlm are not necessarily bad and are not
> directly related to the "removing from cluster" messages below.  The
> einval conditions above can legitimately occur during normal operation and
> the dlm should be able to deal with them.  Specifically they mean that:
> 
>   1. node A is told that the lock master for resource R is node B
>   2. the last lock is removed from R on B
>   3. B gives up mastery of R
>   4. A sends lock request to B
>   5. B doesn't recognize R and returns einval to A
>   6. A starts over
> 
> The message "send einval to..." is printed on B in step 5.
> The message "req reply einval..." is printed on A in step 6.
> 
> This is an unfortunate situation, but not lethal.  That said, a spike in
> these messages may indicate that something is amiss (and that a "removing
> from cluster" may be on the way).  Or, maybe the gfs load has struck the
> dlm in a particularly sore way.
> 
> > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
> > Missed too many heartbeats (P:kernel)
> > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster :
> > No response to messages (P:kernel)
> 
> After this happens, the dlm will often return an error (like -EINVAL) to
> lock_dlm.  It's not the same thing as above.  Lock_dlm will always panic
> at that point since it can no longer acquire locks for gfs.

At the time all of this happened, the three-node cluster zs01 to zs03
had only zs01 active, with nfs and samba exports (both with negligible
activity at that time of day) and a proprietary backup solution
(Veritas' NetBackup). The latter generated 25-30 MB/s of network
traffic on the interface the cluster heartbeat was also running on. The
backup had already been running for a couple of hours.

Could that be the root of the evil? Delayed or dropped cman UDP packets?

Can the same scenario explain the (silent!) desyncing of GFS later on,
after all nodes were rebooted?
-- 
Axel.Thimm at ATrpms.net


