[Linux-cluster] Freeze with cluster-2.03.11

Sat Mar 28 00:36:15 UTC 2009

On Sat, 28 Mar 2009, Kadlecsik Jozsef wrote:

> On Fri, 27 Mar 2009, Bob Peterson wrote:
> 
> > Perhaps you should change your post_fail_delay to some very high
> > number, recreate the problem, and when it freezes force a
> > sysrq-trigger to get call traces for all the processes.
> > Then also you can look at the dmesg to see if there was a kernel
> > panic or something on the node that would otherwise be
> > immediately fenced.
> 
> I enabled more kernel debugging and netconsole, and captured the attached
> console log. I hope it gives the required info.

I should get some sleep - but could it be that I hit the potential
deadlock mentioned here:

commit 4787e11dc7831f42228b89ba7726fd6f6901a1e3

gfs-kmod: workaround for potential deadlock. Prefault user pages

The bug uncovered in 461770 does not seem fixable without a massive
change to how gfs works.  There is a lock ordering mismatch between
the process address space lock and the glocks. The only good way to
avoid this in all cases is to not hold the glock for so long, which
is what gfs2 does. This is impossible without completely changing
how gfs does locking.  Fortunately, this is only a problem when you
have multiple processes sharing an address space, and are doing IO
to a gfs file with a userspace buffer that's part of an mmapped gfs
file. In this case, prefaulting the buffer's pages immediately
before acquiring the glocks significantly shortens the window for
this deadlock. Closing the window any more causes a large
performance hit.

Mailman does mmap files...
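
To make the workaround in the commit message concrete, here is a rough
user-space sketch of the idea: touch every page of the I/O buffer first,
and only then enter the path that takes the lock. This is not the
gfs-kmod code. The names prefault_buffer() and write_with_prefault() are
made up for illustration, and a plain write() stands in for the gfs I/O
path that acquires the glocks inside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte from every page spanned by
 * buf[0..len) so that each page is faulted in before any lock is taken.
 * Returns the number of pages touched.
 */
size_t prefault_buffer(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;   /* fallback is an assumption */
    const volatile unsigned char *p = buf;
    uintptr_t addr = (uintptr_t)buf;
    size_t touched = 0;
    size_t off = 0;

    while (off < len) {
        (void)p[off];                 /* volatile read forces the page in */
        touched++;
        /* advance to the first byte of the next page */
        off += page - ((addr + off) & (page - 1));
    }
    return touched;
}

/*
 * Hypothetical wrapper: prefault, then do the I/O. In gfs the glocks are
 * taken inside the kernel I/O path; write() is only a stand-in here.
 */
ssize_t write_with_prefault(int fd, const void *buf, size_t len)
{
    prefault_buffer(buf, len);        /* faults happen here, no lock held */
    return write(fd, buf, len);       /* lock-taking path starts here */
}
```

As in the commit message, this only narrows the window: a page can
still be evicted between the prefault and the actual copy, so the fault
(and the lock ordering problem) remains possible, just less likely.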

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary