[Linux-cluster] rgmanager stuck, hung on futex

aberoham at gmail.com
Mon Dec 11 18:04:53 UTC 2006


I have a five-node cluster, RHEL4.4 with the latest errata. One node is telling
me "Timed out waiting for a response from Resource Group Manager" when I run
clustat, and if I strace one of its clurgmgrd PIDs it appears to be stuck in a
futex locking call ---

root at bamf01:~
(1)>clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  bamf01                                   Online, Local, rgmanager
  bamf02                                   Online
  bamf03                                   Online, rgmanager
  bamf04                                   Online, rgmanager
  bamf05                                   Online, rgmanager


Other nodes are fine:

root at bamf03:/etc/init.d
(0)>clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  bamf01                                   Online, rgmanager
  bamf02                                   Online
  bamf03                                   Online, Local, rgmanager
  bamf04                                   Online, rgmanager
  bamf05                                   Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  goat-design          bamf05                         started
  cougar-compout       bamf05                         started
  cheetah-renderout    bamf01                         started
  postgresql-blur      bamf04                         started
  tiger-jukebox        bamf01                         started
  hartigan-home        bamf01                         started


cman_tool status on the rgmanager-failed node (bamf01) matches cman_tool status
on the other nodes except for the "Active subsystems" count. The difference is
that the node with the failed rgmanager is running a service that uses GFS, so
it has +4 active subsystems, two DLM lock spaces for a gfs fs and two gfs mount
groups --

root at bamf01:~
(0)>cman_tool status
Protocol version: 5.0.1
Config version: 34
Cluster name: bamf
Cluster ID: 1492
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 5
Expected_votes: 5
Total_votes: 5
Quorum: 3
Active subsystems: 8
Node name: bamf01
Node ID: 2
Node addresses: 10.0.19.21

root at bamf05:~
(0)>cman_tool status
Protocol version: 5.0.1
Config version: 34
Cluster name: bamf
Cluster ID: 1492
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 5
Expected_votes: 5
Total_votes: 5
Quorum: 3
Active subsystems: 5
Node name: bamf05
Node ID: 4
Node addresses: 10.0.19.25


root at bamf01:~
(0)>cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 4 5 3]

DLM Lock Space:  "Magma"                             4   5 run       -
[1 2 4 5]

DLM Lock Space:  "gfs1"                              5   6 run       -
[2]

GFS Mount Group: "gfs1"                              6   7 run       -
[2]

User:            "usrm::manager"                     3   4 run       -
[1 2 4 5]


root at bamf04:~
(0)>cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 4 5 3]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 4 5 3]

DLM Lock Space:  "Magma"                             4   5 run       -
[1 2 4 5]

User:            "usrm::manager"                     3   4 run       -
[1 2 4 5]


An strace of the two running clurgmgrd processes on an OK node shows this:

root at bamf05:~
(1)>ps auxw |grep clurg |grep -v grep
root      7988  0.0  0.0  9568  376 ?        S<s  Dec08   0:00 clurgmgrd
root      7989  0.0  0.0 58864 5012 ?        S<l  Dec08   0:35 clurgmgrd

root at bamf05:~
(0)>strace -p 7988
Process 7988 attached - interrupt to quit
wait4(7989,
[nothing]

root at bamf05:~
(0)>strace -p 7989
Process 7989 attached - interrupt to quit
select(7, [4 5 6], NULL, NULL, {7, 760000}) = 0 (Timeout)
socket(PF_FILE, SOCK_STREAM, 0)         = 12
connect(12, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"}, 110) = 0
write(12, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
read(12, "\1\0\0\0\0\0\0\0|o_\0\0\0\0\0\0\0\0\0", 20) = 20
close(12)                               = 0
[snip]
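
For what it's worth, that loop on the healthy node is just clurgmgrd waking up
from select() and making a short query to ccsd over /var/run/cluster/ccsd.sock.
To rule out ccsd itself on bamf01, I can at least check that the socket is
present, the daemon is up, and a simple connection works (ccs_test is from the
ccs package -- assuming I have its syntax right):

  ls -l /var/run/cluster/ccsd.sock
  ps auxw |grep ccsd |grep -v grep
  ccs_test connect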


An strace of the clurgmgrd PIDs on the failed node shows:

root at bamf01:~
(0)>ps auxw |grep clurg |grep -v grep
root      7982  0.0  0.0  9568  376 ?        S<s  Dec08   0:00 clurgmgrd
root      7983  0.0  0.0 61592 7220 ?        S<l  Dec08   1:03 clurgmgrd

root at bamf01:~
(0)>strace -p 7982
Process 7982 attached - interrupt to quit
wait4(7983,
[nothing]

root at bamf01:~
(0)>strace -p 7983
Process 7983 attached - interrupt to quit
futex(0x522e28, FUTEX_WAIT, 5, NULL
[nothing]

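If it would help, the next thing I can try (assuming gdb and, ideally, the
rgmanager debuginfo are installed on bamf01) is attaching gdb to the stuck PID
and pulling a backtrace of every thread, to see which lock it is actually
blocked on:

  gdb -p 7983
  (gdb) info threads
  (gdb) thread apply all bt
  (gdb) detach

Detaching afterwards should leave the process running as before.
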
Abe