[Linux-cluster] GFS2 directory hangs on one node CentOS 5.3

Mon Sep 28 10:56:24 UTC 2009

>Hi,
>
>On Mon, 2009-09-28 at 12:13 +0200, Libor Tomsik wrote:
>> Hi,
>> >Hi,
>> >
>> >On Sat, 2009-09-26 at 18:29 +0200, Libor Tomsik wrote:
>> >> Hi all,
>> >>
>> >> I'm having a strange issue with a two-node cluster based on Xen
>> >> virtual hosts with a shared disk on CLVM. Both servers run Apache,
>> >> and one is considered a hot backup. On that node, AWStats reports are
>> >> computed from the Apache custom logs stored on the shared device. Web
>> >> data, logs, configs and AWStats results are in different directories
>> >> within the same GFS2 volume.
>> >>
>> >> Everything works fine, but sometimes (in the production environment,
>> >> damn) the directory with the logs freezes on the spare node running
>> >> AWStats. All commands such as ls, cd and mc on that directory go into
>> >> state D. On the other node everything works fine. Other directories
>> >> seem unaffected too.
>> >>
>> >> I can neither umount the fs nor remount it ro and back to rw, since
>> >> there are "running" processes in D state.
>> >>
>> >> Can someone give me some advice on how to prevent this problem? And
>> >> how to recover from it? It is a production system with an SLA :(
>> >> Next time, I'll try to take a lock dump on both nodes.
>> >>
>> >> Kernel is 2.6.18-128.1.10.el5xen, gfs2-utils-0.1.53-1.el5_3.2,
>> >> kmod-gfs2-xen-1.92-1.1.el5_2.2
>> >>
>> >> Regards
>> >>
>> >> Libor
>> >>
>> >That sounds to me like there is a lot of activity from both nodes
>> >relating to the same directory. Can you split the logs of the two nodes
>> >into two different directories? That will probably solve the problem.
>> >
>> Actually there is just one Apache writing, on one server. Well, in many
>> threads. Maybe this is the problem? I have about 40 sites hosted
>> there, so 2x40 separate log files.
>> The second node just reads this directory periodically.
>>
>That can still cause a problem. The second node will require a shared
>lock on the directory, so if there is any file creation going on, it
>will be dramatically slowed down by that. Is it possible to stop the
>second node's I/O to check that?
>
Yes, it is possible to generate the statistics on the Apache node instead
of the second one, if that might help. But then where is the advantage of
a clustered fs accessible from all nodes?
>
>There shouldn't really be a big issue with lots of threads provided they
>are all on the same node as is the case here,
>
>Steve.
>
>
>> >This kind of problem is tricky to debug since the glock dumps will tell
>> >you what state the glocks are currently in, and not what has been
>> >happening in the past.
>> >
Is it somehow possible to free these locks by force? I mean, kill the dead
processes and remount the fs on the affected node? Is there some recovery
solution?
>> >In the upstream code we've now got GFS2 tracepoints which will help in
>> >tracking down issues like this, but those are not in RHEL yet,
>> >
>> >Steve.
>> >
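Since a glock dump only shows the current state, one way to get some history is to save timestamped copies of the dump at regular intervals and compare them after a hang. Below is a minimal sketch of that idea, not a tested procedure. It assumes debugfs is mounted and that the dump is readable at a path like /sys/kernel/debug/gfs2/<cluster>:<fs>/glocks; the path, the function names and the output file naming are all assumptions for illustration, and it needs a reasonably recent Python.

```python
import os
import shutil
import time


def snapshot_name(fs_id, when):
    """Build a file name such as glocks-<fs>-<UTC timestamp>.txt (hypothetical scheme)."""
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(when))
    # ':' and '/' are awkward in file names, so replace them.
    safe = fs_id.replace(":", "_").replace("/", "_")
    return "glocks-%s-%s.txt" % (safe, stamp)


def take_snapshot(glocks_path, fs_id, out_dir, when=None):
    """Copy the current glock dump into out_dir and return the new file's path."""
    when = time.time() if when is None else when
    target = os.path.join(out_dir, snapshot_name(fs_id, when))
    with open(glocks_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Example call (assumed path and names, adjust to the real cluster and fs):
# take_snapshot("/sys/kernel/debug/gfs2/mycluster:logs/glocks",
#               "mycluster:logs", "/var/tmp/glock-snapshots")
```

Run from cron on both nodes, and written to a local disk rather than the GFS2 volume itself, this would leave a series of dumps from before and during a hang.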
>> >> --
>> >> Linux-cluster mailing list
>> >> Linux-cluster redhat com
>> >> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> Regards
>>
>> Libor.
>>
Libor
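
To see which processes are stuck when the directory freezes, the state field in /proc/<pid>/stat can be scanned for "D" (uninterruptible sleep). The sketch below is only an illustration of that check: the function names are made up, and it assumes a Linux /proc layout and a reasonably recent Python. It only reports the stuck processes and does nothing to release them.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the last ')'.
    left, right = line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process exited or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


# Example use on the affected node:
# for pid, comm in uninterruptible_processes():
#     print(pid, comm)
```

Logging this output together with the time of each check would show when the first command (ls, cd, mc or the AWStats run) got stuck.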