[Linux-cluster] weird happenings on my cluster and another panic.

Thu Oct 26 14:53:12 UTC 2006

OK, thanks for the breakdown. So basically, I just need to rebuild all 
of my packages from the SRPMS at 
ftp://updates.redhat.com:/enterprise/4AS/en/RHGFS/SRPMS
and try again, right?

regards,
Jason

On Thu, Oct 26, 2006 at 10:14:31AM -0400, Lon Hohberger wrote:
> On Wed, 2006-10-25 at 20:56 -0400, jason at monsterjam.org wrote:
> > OK, I was just logging into the 2 nodes of my cluster, tf1 and tf2, and noticed that tf1 was NOT 
> > available via ssh, but tf2 was. tf1 was pingable, but that was it. I looked on tf2 and 
> > noticed that it had taken over the cluster virtual IP address:
> > 
> > 2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
> >     link/ether 00:11:43:d7:c9:c6 brd ff:ff:ff:ff:ff:ff
> >     inet 192.168.1.6/24 brd 192.168.1.255 scope global eth0
> >     inet 192.168.1.7/32 scope global eth0
> >     inet6 fe80::211:43ff:fed7:c9c6/64 scope link 
> >        valid_lft forever preferred_lft forever
> 
> Well, I can walk through what happened here.
> 
> > Oct 25 20:26:00 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats
> 
> Node died for some reason.
> 
> > Oct 25 20:26:00 tf2 fenced[4091]: tf1 not a cluster member after 0 sec post_fail_delay
> > Oct 25 20:26:00 tf2 fenced[4091]: fencing node "tf1"
> > Oct 25 20:26:04 tf2 kernel: e100: eth2: e100_watchdog: link down
> > Oct 25 20:26:08 tf2 fenced[4091]: fence "tf1" success
> 
> ^^ Fence recovery.
> 
> > Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Trying to acquire journal lock...
> ...
> > Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Done
> 
> ^^ GFS recovery
> 
> > Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> Magma Event: Membership Change 
> > Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> State change: tf1 DOWN 
> 
> ^^ Rgmanager recovery
> 
> > Oct 25 20:26:27 tf2 clurgmgrd[4903]: <notice> Starting stopped service Apache Service 
> > Oct 25 20:26:29 tf2 httpd: httpd startup succeeded
> > Oct 25 20:26:29 tf2 clurgmgrd[4903]: <notice> Service Apache Service started 
> > Oct 25 20:26:36 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> > Oct 25 20:28:08 tf2 kernel: e100: eth2: e100_watchdog: link down
> > Oct 25 20:28:10 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> > Oct 25 20:29:40 tf2 kernel: CMAN: node tf1 rejoining
> 
> ^^ CMAN restarted on tf1 (rebooted)
> 
> 
> > Oct 25 20:34:25 tf2 kernel: CMAN: too many transition restarts - will die
> > Oct 25 20:34:25 tf2 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
> 
> Argh.  That's not good.  I *think* this is a bug in CMAN-kernel in U3,
> which was fixed in U4.
> 
> > Oct 25 20:34:26 tf2 kernel: lock_dlm:  Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
> > Oct 25 20:34:26 tf2 kernel: lock_dlm:  assertion:  "!error"
> ...
> > Oct 25 20:34:26 tf2 kernel: ------------[ cut here ]------------
> ...
> > Oct 25 20:34:26 tf2 kernel:  <0>Fatal exception: panic in 5 seconds
> > 
> > And now tf2 is unreachable too.
> > Ideas? Suggestions?
> 
> The panic above is a bug in the dlm-kernel rpm/package; I don't know
> much more than that.  When a machine panics, it stops responding to
> things over the network.
> 
> -- Lon
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
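
For anyone wanting to retrace the walkthrough quoted above against their own /var/log/messages, here is a rough Python sketch that tags syslog lines with the recovery phases Lon pointed out (membership loss, fence recovery, GFS recovery, rgmanager recovery, rejoin, CMAN failure, dlm panic). The phase labels, the regular expressions, and the function names `classify` and `timeline` are my own assumptions based only on the log lines in this thread; they are not part of any Red Hat Cluster Suite tool, and other releases may word their messages differently.

```python
import re

# Hypothetical phase labels and patterns, derived only from the log
# excerpts quoted in this thread. Order matters: first match wins.
PHASES = [
    ("membership loss", re.compile(r"CMAN: removing node")),
    ("fence recovery", re.compile(r"fenced\[\d+\]:")),
    ("GFS recovery", re.compile(r"kernel: GFS:")),
    ("rgmanager recovery", re.compile(r"clurgmgrd\[\d+\]:")),
    ("node rejoin", re.compile(r"CMAN: node \S+ rejoining")),
    ("CMAN failure", re.compile(
        r"CMAN: (too many transition restarts|we are leaving the cluster)")),
    ("dlm panic", re.compile(r"lock_dlm:|Fatal exception")),
]


def classify(line):
    """Return the phase label for one syslog line, or None if unrecognized."""
    for label, pattern in PHASES:
        if pattern.search(line):
            return label
    return None


def timeline(lines):
    """Return (timestamp, phase) pairs, collapsing consecutive repeats."""
    events = []
    for line in lines:
        phase = classify(line)
        if phase is None:
            continue  # e.g. the e100 link up/down noise
        stamp = line[:15]  # classic syslog prefix, "Oct 25 20:26:00"
        if not events or events[-1][1] != phase:
            events.append((stamp, phase))
    return events


# A few lines copied from the thread (mail line-wrapping undone).
SAMPLE = [
    'Oct 25 20:26:00 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats',
    'Oct 25 20:26:00 tf2 fenced[4091]: fencing node "tf1"',
    'Oct 25 20:26:04 tf2 kernel: e100: eth2: e100_watchdog: link down',
    'Oct 25 20:26:08 tf2 fenced[4091]: fence "tf1" success',
    'Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Done',
    'Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> State change: tf1 DOWN',
    'Oct 25 20:29:40 tf2 kernel: CMAN: node tf1 rejoining',
    'Oct 25 20:34:25 tf2 kernel: CMAN: too many transition restarts - will die',
    'Oct 25 20:34:26 tf2 kernel: lock_dlm:  Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c',
]

if __name__ == "__main__":
    for stamp, phase in timeline(SAMPLE):
        print(stamp, phase)
```

To run it against a real log, pass the lines of an open file to `timeline()` instead of `SAMPLE`. It only orders the events; it says nothing about why tf1 missed its heartbeats in the first place.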

-- 
================================================
|    Jason Welsh   jason at monsterjam.org        |
| http://monsterjam.org    DSS PGP: 0x5E30CC98 |
|    gpg key: http://monsterjam.org/gpg/       |
================================================