[Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11 nodes?

rhurst at bidmc.harvard.edu rhurst at bidmc.harvard.edu
Wed Mar 21 00:36:59 UTC 2007


I ran a series of reboots, and this problem is totally reproducible.  Should I be opening a ticket at Red Hat Support on this?

The problem shows up immediately with 'service rgmanager stop': it hangs in its sleep loop forever, even though every node in the cluster reports that the node's state changed to down.  Worse than that, it also hangs all GFS I/O, and the load average on every node starts to spike (>9.00) -- I can see gfs_scand racing away in top.
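
If it helps anyone reproduce or diagnose this, something along these lines should capture the cluster and GFS state on the hung node before resorting to a power reset (a sketch only, assuming the stock RHEL4 Cluster Suite tools; the GFS mount point /gfs01 is just a placeholder for one of ours):

# cluster membership and service-group state as cman sees it
cman_tool status
cman_tool services          # or: cat /proc/cluster/services
clustat                     # rgmanager's view of members and services

# which processes are wedged, and in what kernel function
ps axo pid,stat,wchan:32,comm | egrep 'gfs|dlm|clurgmgrd'

# GFS activity counters on an affected mount
gfs_tool counters /gfs01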

It only gets fixed when I manually 'power reset' the node; then I get the 'Missed too many heartbeats' messages followed by fencing.  Help.
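
For the record, the manual 'power reset' from iLO is essentially what fenced ends up doing through the fence agent; something like the following should work from any node (a sketch only -- the iLO hostname and credentials are placeholders, and the exact fence_ilo options should be checked against the man page shipped with our fence package):

# hard-reset app1 through its iLO, as fenced would
fence_ilo -a app1-ilo -l Administrator -p 'secret' -o reboot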


Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 · Fax: 617-754-8730 · Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.

-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of rhurst at bidmc.harvard.edu
Sent: Tue 3/20/2007 11:39 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11 nodes?
 
Troublingly, this behavior did not occur prior to our Mar 2nd up2date
on our RHEL GFS/CS subscription.  I rebooted an application server
(app1) in an 11-node cluster, and from its console I could see it hang
on 'service cman stop'.  Consequently, ALL GFS I/O was blocked on ALL
nodes (see the manual stop sequence sketched after the package list
below).  All servers are configured the same:

AMD64 dual CPU/duo core HP DL385, 8GB RAM, dual hba (PowerPath)
# uname -r
2.6.9-42.0.10.ELsmp

ccs-1.0.7-0
cman-1.0.11-0
dlm-1.0.1-1
fence-1.32.25-1
GFS-6.1.6-1
magma-1.0.6-0
magma-plugins-1.0.9-0
rgmanager-1.9.54-1
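
Next I plan to step through the shutdown by hand instead of letting the init scripts run at reboot, to see exactly which stop step wedges.  This is the stop order as I understand it from the RHEL4 Cluster Suite init scripts (we do not run clvmd; add it after gfs if you do):

service rgmanager stop      # stop/relocate managed services (hangs here for me)
service gfs stop            # unmount the GFS filesystems
service fenced stop         # leave the fence domain
service cman stop           # leave the cluster
service ccsd stop           # stop the config daemon last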

My central syslog server showed that all nodes registered the membership
change, yet the stop on app1 continued to hang.
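
One thing I can gather from the hung node next time is a task-state dump, assuming magic SysRq is enabled (the stack traces land in the kernel log, so they should show up on the central syslog server too):

echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
echo t > /proc/sysrq-trigger         # dump all task stacks to the kernel log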

Mar 20 11:06:18 app1 shutdown: shutting down for system reboot 
Mar 20 11:06:18 app1 init: Switching to runlevel: 6 
Mar 20 11:06:19 app1 rgmanager: [1873]: <notice> Shutting down Cluster
Service Manager...  
Mar 20 11:06:20 app1 clurgmgrd[11220]: <notice> Shutting down  
Mar 20 11:06:20 net2 clurgmgrd[30893]: <info> State change: app1 DOWN 
Mar 20 11:06:20 app3 clurgmgrd[11092]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db1 clurgmgrd[8351]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db3 clurgmgrd[8279]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db2 clurgmgrd[10875]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app6 clurgmgrd[10959]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app4 clurgmgrd[11146]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app2 clurgmgrd[10835]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app5 clurgmgrd[11198]: <info> State change: app1 DOWN  
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN  
Mar 20 11:12:26 net2 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats
Mar 20 11:12:26 db2 kernel: CMAN: removing node app1 from the cluster :
Missed too many heartbeats 
Mar 20 11:12:26 db3 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app4 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app5 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app6 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app3 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 db1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app2 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0
sec post_fail_delay 
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1" 
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success 


I issued a 'power reset' on its HP iLO management port to
hardware-reboot the server around 11:12.  That is when the net1 server
attempted to fence app1, after it went missing.  Here are net1's syslog
entries for that event:

Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> Magma Event: Membership
Change 
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0
sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
Mar 20 11:15:45 net1 kernel: CMAN: node app1 rejoining
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> Magma Event: Membership
Change 
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> State change: app1 UP 


Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 · Fax: 617-754-8730 · Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.
