[Linux-cluster] GFS hangs, nodes die

Sun Aug 19 09:53:39 UTC 2007

Hi Marc!

Thanks for your help. As I restarted everything now, I can't check this. 
I will do when it's crahsing again (I will do some tests now). I 
realised that one node did hang with kernel panic. Attached is the 
screenshot.

regards
sebastian

Marc Grimme wrote:
> Hello Sebastian,
> what do gfs_tool counters on the fs tell you?
> And ps axf? Do you have a lot of "D" processes?
> Regards Marc.
> On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
>   
>> Dear list,
>>
>> this is the tragical story of my cluster running rhel/csgfs 4u5: the
>> cluster in generally is running fine, but when I increase the load to a
>> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
>> (not reacting any more, but no sign of kernel panic), the others can't
>> access the gfs resource.
>> Gfs is set up as a rgmanager service with failover domain for each node
>> (same problem also exists when mounting via /etc/fstab).
>>
>> Who is willing to provide a happy end?
>>
>> Thanks, Sebastian
>> **
>>
>> This is what /var/log/messages gives me (on nearly all nodes):
>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
>> status for RG gfs-2
>> and e.g.
>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
>> cluster lock: Connection timed out
>>
>> [root at compute-0-3 ~]# cat /proc/cluster/status
>> Protocol version: 5.0.1
>> Config version: 53
>> Cluster name: dtm
>> Cluster ID: 741
>> Cluster Member: Yes
>> Membership state: Cluster-Member
>> Nodes: 10
>> Expected_votes: 11
>> Total_votes: 10
>> Quorum: 6
>> Active subsystems: 8
>> Node name: compute-0-3
>> Node ID: 4
>> Node addresses: 10.1.255.252
>>
>> [root at compute-0-6 ~]# cat /proc/cluster/services
>> Service          Name                              GID LID State     Code
>> Fence Domain:    "default"                           3   2 recover 4 -
>> [1 2 6 10 9 8 3 7 4 11]
>> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
>> [1 2 6 10 9 8 3 7 4 11]
>> DLM Lock Space:  "Magma"                            12   5 recover 0 -
>> [1 2 6 10 9 8 3 7 4 11]
>> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
>> [10 9 8 7 2 3 6 4 1 11]
>> GFS Mount Group: "homeneu"                          18   7 recover 0 -
>> [10 9 8 7 2 3 6 4 1 11]
>> User:            "usrm::manager"                    11   4 recover 0 -
>> [1 2 6 10 9 8 3 7 4 11]
>>
>> [root at compute-0-10 ~]# cat /proc/cluster/dlm_stats
>> DLM stats (HZ=1000)
>>
>> Lock operations:       4036
>> Unlock operations:     2001
>> Convert operations:    1862
>> Completion ASTs:       7898
>> Blocking ASTs:           52
>>
>> Lockqueue        num  waittime   ave
>> WAIT_RSB        3778     28862     7
>> WAIT_CONV         75       482     6
>> WAIT_GRANT      2171      7235     3
>> WAIT_UNLOCK      153      1606    10
>> Total           6177     38185     6
>>
>> [root at compute-0-10 ~]# cat /proc/cluster/sm_debug
>> sevent state 7
>> 02000012 sevent state 9
>> 00000003 remove node 5 count 10
>> 01000011 remove node 5 count 10
>> 0100000c remove node 5 count 10
>> 01000007 remove node 5 count 10
>> 02000012 remove node 5 count 10
>> 0300000b remove node 5 count 10
>> 00000003 recover state 0
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Picture 8.png
Type: image/png
Size: 23680 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070819/9112cced/attachment.png>