[Linux-cluster] Again DLM messages and high load

Bas van der Vlies basv at sara.nl
Wed Oct 25 11:14:02 UTC 2006


Riaan van Niekerk wrote:
> 
> Bas van der Vlies wrote:
>> We are using:
>>  GFS         : CVS 1.0.3 stable
>>  kernel      : 2.6.17.11-sara1
>>  NFS-daemons : 128
>>  GFS-servers : 5
>>
>> This node was the master. When the messages below appeared, the load 
>> rose to the number of NFS daemons and NFS stopped working. We had to 
>> reboot the node:
>>  Oct 25 03:12:31 ifs4 kernel: dlm: lisa_vg5_lv1: cancel reply ret 0
>>  Oct 25 03:12:31 ifs4 kernel: lock_dlm: unlock sb_status 0 2,a45325d flags 0
>>  Oct 25 03:12:31 ifs4 kernel: dlm: lisa_vg5_lv1: process_lockqueue_reply id a50a027c state 0
>>
>> I had to reboot the master node (ifs4). When the node went down, the 
>> other nodes elected another master. Three nodes use the same master and 
>> one node has a different master. Is this OK?
>>
>> node 1,2,4
>> Fence Domain:    "default" 1   2 run       - [5 2 1 3 4]
>>
>> node 3:
>> Fence Domain:    "default" 1   2 run       - [2 5 1 3 4]
>>
>> cman_tool nodes is the same for all nodes.
>>
>>
>> Regards
>>
> 
> Good day Bas,
> 
> We had the EXACT same symptom (load average rising to the number of 
> NFSDs, after which NFS becomes unresponsive; these processes actually 
> become defunct), happening about twice a week.
> 
> We have had a service request open with Red Hat for the past 3 months. 
> Our biggest problem was capturing the sysrq T output, which we could not 
> provide: the problem always surfaced very quickly and, this being a 
> production server, our priority was getting the service back up rather 
> than capturing debugging info. We therefore could not take the issue 
> further.
> 
> We still had this problem with the DLM/GFS kernel modules accompanying 
> kernel 2.6.9-42.0.EL.
> 
> A week and a half ago we loaded the DLM/GFS kernel modules accompanying 
> kernel 2.6.9-42.0.2.EL:
> GFS-kernel-smp-2.6.9-60.1
> dlm-kernel-smp-2.6.9-44.2
> Since then we have not seen this symptom or two other problem symptoms.
> 
> The bugzilla entry we were tracking (some assertion failures):
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199673
> It was the only significant change between the kernel modules for 42.EL 
> and 42.0.2.EL.
> The DLM erratum is at http://rhn.redhat.com/errata/RHBA-2006-0702.html
> 
> I am not sure how these versions map to the CVS versions, or if our NFSD 
> problem is indeed solved. However, it has never stayed away this long.
> 
> If our NFS problem does occur again, I will let you know.
> 
> greetings
> Riaan

Riaan,

  Thanks for the info. We also had this problem several times a week 
with the previous versions. We now use the latest version from CVS 
STABLE and have hit this bug again, this time after an uptime of 50 days ;-)
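
For what it is worth, the two fence domain lines quoted above can be compared with a small script. This is only a sketch under assumptions: the function names are made up for illustration, the bracketed part of each line is assumed to be the list of member node IDs, and the script says nothing about whether the ordering itself matters.

```python
import re


def fence_members(line):
    """Return the node IDs listed in square brackets on a status line.

    Assumption: the bracketed field is the member list, as in the
    'Fence Domain' lines quoted in this thread.
    """
    match = re.search(r"\[([\d\s]+)\]", line)
    if match is None:
        raise ValueError(f"no member list found in: {line!r}")
    return [int(token) for token in match.group(1).split()]


def compare_members(line_a, line_b):
    """Report whether two lines list the same nodes, and in the same order."""
    a, b = fence_members(line_a), fence_members(line_b)
    return {"same_set": set(a) == set(b), "same_order": a == b}


# The two lines exactly as quoted above (nodes 1, 2, 4 versus node 3).
nodes_124 = 'Fence Domain:    "default" 1   2 run       - [5 2 1 3 4]'
node_3 = 'Fence Domain:    "default" 1   2 run       - [2 5 1 3 4]'

result = compare_members(nodes_124, node_3)
print(result)  # {'same_set': True, 'same_order': False}
```

So both lists contain the same five nodes and differ only in the order of the first two entries.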

Regards





-- 
--
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv at sara.nl      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************



