[Linux-cluster] Again DLM messages and high load
Bas van der Vlies
basv at sara.nl
Wed Oct 25 11:14:02 UTC 2006
Riaan van Niekerk wrote:
>
> Bas van der Vlies wrote:
>> We are using:
>> GFS : CVS 1.0.3 stable
>> kernel : 2.6.17.11-sara1
>> NFS-daemons : 128
>> GFS-servers : 5
>>
>> This node was the master. When the messages below were displayed, the
>> load rose to the number of NFS daemons and NFS stopped working. We
>> had to reboot the node:
>> Oct 25 03:12:31 ifs4 kernel: dlm: lisa_vg5_lv1: cancel reply ret 0
>> Oct 25 03:12:31 ifs4 kernel: lock_dlm: unlock sb_status 0 2,a45325d flags 0
>> Oct 25 03:12:31 ifs4 kernel: dlm: lisa_vg5_lv1: process_lockqueue_reply id a50a027c state 0
>>
>> I had to reboot the master node (ifs4). When the node went down, the
>> other nodes re-elected another master. Three nodes use the same master
>> and one node has a different master. Is this OK?
>>
>> node 1,2,4
>> Fence Domain: "default" 1 2 run - [5 2 1 3 4]
>>
>> node 3:
>> Fence Domain: "default" 1 2 run - [2 5 1 3 4]
>>
>> cman_tool nodes is the same for all nodes.
>>
>>
>> Regards
>>
>
> good day Bas
>
> We had the EXACT same symptom (load average rising to number of NFSDs,
> NFS then becomes unresponsive - these processes actually become
> defunct), happening about 2x a week.
>
> We have had a service request open with Red Hat for the past 3 months.
> Our biggest problem was capturing the sysrq T output, which we could
> not provide: the problem always surfaced so quickly and, this being a
> production server, our biggest concern was getting the service back up
> rather than capturing debugging info. We therefore could not take the
> issue further.
>
> We still had this problem with the DLM/GFS kernel modules accompanying
> kernel 2.6.9-42.0.EL
>
> We loaded the DLM/GFS kernel modules accompanying kernel 2.6.9-42.0.2.EL:
> GFS-kernel-smp-2.6.9-60.1
> dlm-kernel-smp-2.6.9-44.2
> a week and a half ago, and since then we have not seen this or two other
> problem symptoms.
>
> The bugzilla entry we were tracking (some assertion failures):
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199673
> It was the only significant change between the kernel modules for
> 42.EL and 42.0.2.EL.
> The DLM errata is in http://rhn.redhat.com/errata/RHBA-2006-0702.html
>
> I am not sure how these versions map to the CVS versions, or if our NFSD
> problem is indeed solved. However, it has never stayed away this long.
>
> If our NFS problem does occur again, I will let you know.
>
> greetings
> Riaan
Riaan,
Thanks for the info. We also had this problem several times a week
with the previous versions. We now use the latest version from CVS
STABLE and hit this bug again, this time after an uptime of 50 days ;-)
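
Since capturing the sysrq T output before a reboot was the sticking point
mentioned above, a small watchdog might help. The following is only a
minimal sketch, assuming Python is available on the node and the script
runs as root. The threshold of 100 and the function names are
hypothetical; /proc/sysrq-trigger is the standard Linux interface, and the
task dump ends up in the kernel log.

```python
import os

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux procfs file


def load_exceeds(load1, threshold):
    """Return True when the 1-minute load average has reached the threshold."""
    return load1 >= threshold


def dump_task_states(path=SYSRQ_TRIGGER):
    """Ask the kernel for a SysRq-T task dump (needs root)."""
    with open(path, "w") as f:
        f.write("t")


def check_once(threshold, read_load=os.getloadavg, dump=dump_task_states):
    """Check the load once; trigger a dump if it is at or above threshold.

    read_load and dump are injectable so the logic can be tried out
    without touching the real /proc file.
    """
    load1 = read_load()[0]
    if load_exceeds(load1, threshold):
        dump()
        return True
    return False


if __name__ == "__main__":
    # Hypothetical threshold: somewhat below the 128 NFS daemons,
    # so the dump is taken while the node still responds.
    check_once(threshold=100)
```

Run from cron every minute, this would write the task states to the
kernel log as soon as the load approaches the number of NFS daemons, so
the information survives even if the node is rebooted right away.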
Regards
--
********************************************************************
* *
* Bas van der Vlies e-mail: basv at sara.nl *
* SARA - Academic Computing Services phone: +31 20 592 8012 *
* Kruislaan 415 fax: +31 20 6683167 *
* 1098 SJ Amsterdam *
* *
********************************************************************