[Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11 nodes?
rhurst at bidmc.harvard.edu
Tue Mar 20 15:39:11 UTC 2007
This is troubling: we had not seen this behavior before the Mar 2nd
up2date on our RHEL GFS/CS subscription. I rebooted an application
server (app1) in an 11-node cluster and, judging from its console, it
hung on 'service cman stop'. As a result, ALL GFS I/O was blocked on ALL
nodes. All servers are configured the same:
AMD64 dual CPU/dual core HP DL385, 8GB RAM, dual HBA (PowerPath)
# uname -r
2.6.9-42.0.10.ELsmp
ccs-1.0.7-0
cman-1.0.11-0
dlm-1.0.1-1
fence-1.32.25-1
GFS-6.1.6-1
magma-1.0.6-0
magma-plugins-1.0.9-0
rgmanager-1.9.54-1
My central syslog server showed that all nodes registered the membership
change, yet the 'service cman stop' on app1 continued to hang:
Mar 20 11:06:18 app1 shutdown: shutting down for system reboot
Mar 20 11:06:18 app1 init: Switching to runlevel: 6
Mar 20 11:06:19 app1 rgmanager: [1873]: <notice> Shutting down Cluster Service Manager...
Mar 20 11:06:20 app1 clurgmgrd[11220]: <notice> Shutting down
Mar 20 11:06:20 net2 clurgmgrd[30893]: <info> State change: app1 DOWN
Mar 20 11:06:20 app3 clurgmgrd[11092]: <info> State change: app1 DOWN
Mar 20 11:06:20 db1 clurgmgrd[8351]: <info> State change: app1 DOWN
Mar 20 11:06:20 db3 clurgmgrd[8279]: <info> State change: app1 DOWN
Mar 20 11:06:20 db2 clurgmgrd[10875]: <info> State change: app1 DOWN
Mar 20 11:06:20 app6 clurgmgrd[10959]: <info> State change: app1 DOWN
Mar 20 11:06:20 app4 clurgmgrd[11146]: <info> State change: app1 DOWN
Mar 20 11:06:20 app2 clurgmgrd[10835]: <info> State change: app1 DOWN
Mar 20 11:06:20 app5 clurgmgrd[11198]: <info> State change: app1 DOWN
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net2 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 db2 kernel: CMAN: removing node app1 from the cluster : Missed too many heartbeats
Mar 20 11:12:26 db3 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app4 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app5 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app6 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app3 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 db1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app2 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0 sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
I issued a 'power reset' on its HP iLO management port to hardware
reboot the server around 11:12. That is when the net1 server attempted
to fence app1, after it went missing. Here are net1's syslog entries for
that event:
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> Magma Event: Membership Change
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0 sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
Mar 20 11:15:45 net1 kernel: CMAN: node app1 rejoining
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> Magma Event: Membership Change
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> State change: app1 UP
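
To make the timeline in net1's log easier to read, the intervals between the key events can be worked out with a short script. This is only a sketch: the helper names (stamp, first, timeline_gaps) are made up for illustration, the year 2007 is assumed from the message date because syslog timestamps omit it, and the log excerpt is copied from net1's entries above with wrapped lines rejoined.

```python
from datetime import datetime

# Excerpt of net1's syslog entries from this message (wrapped lines rejoined).
LOG = """\
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
Mar 20 11:15:45 net1 kernel: CMAN: node app1 rejoining
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> State change: app1 UP
"""


def stamp(line, year=2007):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp.

    The year is an assumption (syslog does not record it).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first(marker, lines):
    """Return the timestamp of the first line containing `marker`."""
    return next(stamp(line) for line in lines if marker in line)


def timeline_gaps(log_text):
    """Return the gaps, in whole seconds, between the key cluster events."""
    lines = log_text.splitlines()
    down = first("State change: app1 DOWN", lines)
    removed = first("has been removed from the cluster", lines)
    fence_start = first('fencing node "app1"', lines)
    fence_done = first('fence "app1" success', lines)
    up = first("State change: app1 UP", lines)

    def seconds(a, b):
        return int((b - a).total_seconds())

    return {
        "down_to_cman_removal": seconds(down, removed),
        "fence_duration": seconds(fence_start, fence_done),
        "down_to_up": seconds(down, up),
    }


if __name__ == "__main__":
    for name, secs in timeline_gaps(LOG).items():
        print(f"{name}: {secs} s")
```

On the excerpt above this gives 366 s (about six minutes) between rgmanager reporting app1 DOWN and CMAN removing the node, and 70 s for the fence operation itself.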
Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts 02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.