[Linux-cluster] Freeze with cluster-2.03.11

Sun Mar 29 18:38:41 UTC 2009

Hi,

On Sat, 28 Mar 2009, Wendy Cheng wrote:

> Kadlecsik Jozsef wrote:
> > > I don't see a strong evidence of deadlock (but it could) from the 
> > > thread backtraces. However, assuming the cluster worked before, you 
> > > could have overloaded the e1000 driver in this case. There are 
> > > suspicious page faults but memory is very "ok". So one possibility 
> > > is that GFS had generated too many sync requests that flooded the 
> > > e1000. As the result, the cluster heart beat missed its interval.
> > 
> > It's a possibility. But it assumes also that the node freezes >because< it
> > was fenced off. So far nothing indicates that.
> 
> Re-read your console log. There are many foot-prints of spin_lock - that's
> worrisome. Hit a couple of "sysrq-w"  next time when you have hangs, other
> than sysrq-t. This should give traces of the threads that are actively on CPUs
> at that time. Also check your kernel change log (to see whether GFS has any
> new patch that touches spin lock that doesn't in previous release).
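
When no keyboard is attached to the console, the SysRq dumps mentioned above
can also be requested by writing the command letter to /proc/sysrq-trigger
as root. What follows is a minimal sketch, not a tested procedure for this
cluster: the helper name sysrq_dump and the SYSRQ_TRIGGER and SYSRQ_DELAY
variables are hypothetical, and what each key dumps depends on the kernel
version (see the kernel's sysrq documentation). The output goes to the kernel
log, so it should also reach netconsole if that is configured.

```shell
#!/bin/sh
# Hypothetical helper: send one or more SysRq command letters to the kernel.
# SYSRQ_TRIGGER defaults to the real kernel interface; point it at a scratch
# file for a dry run. SYSRQ_DELAY is the pause (seconds) between dumps.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

sysrq_dump() {
    for key in "$@"; do
        # Each write of a single letter triggers the matching SysRq action.
        printf '%s\n' "$key" > "$SYSRQ_TRIGGER" || return 1
        # Give the kernel log a moment to flush before the next dump.
        sleep "${SYSRQ_DELAY:-1}"
    done
}

# Example (as root, during a hang):
#   sysrq_dump w w t
```

Repeating a key a couple of times, as suggested, makes it easier to see
whether the same threads stay stuck between dumps.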

I went through the git changelogs yesterday but could not spot anything 
suspicious; however, I'm not a filesystem expert at all. The patch titled

gfs-kernel: Bug 450209: Create gfs1-specific lock modules + minor fixes to 
build with 2.6.27

hit me hard, as according to the description it was *not* tested in a 
cluster environment when it replaced dlm behind gfs.

I made the decision and we downgraded, as we could not delay any longer:

cluster-2.03.11 -> cluster-2.01.00
linux-2.6.27.21 -> linux-2.6.23.17

The e1000 and e1000e drivers are the newest ones. The aoe driver is from 
aoe6-59 because aoe6-69 does not support 2.6.23.17. We did not downgrade 
openais or LVM2. Tomorrow we'll move mailman back to GFS.

There are three different netconsole log recordings at 
http://www.kfki.hu/~kadlec/gfs/; that's all I could do. If there are 
patches, I'll try to test them on one of the nodes, but it can't be the 
one which runs the mailman queue manager, and so far I could not find any 
other way to crash the system at will than running it. That's a debugging 
problem still to be solved.

> BTW, I do have opinions on other parts of your postings but don't have 
> time to express them now. Maybe I'll say something when I finish my 
> current chores :)

I'd definitely like to read your opinion!

We'll reorganize one of the AOE blades by backing up the GFS volume and 
creating a smaller one to make space for a new GFS2 test volume.

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary