Kadlecsik Jozsef wrote:
I don't see strong evidence of a deadlock (though there could be one) in the thread backtraces. However, assuming the cluster worked before, you could have overloaded the e1000 driver in this case. There are suspicious page faults, but memory is very "ok". So one possibility is that GFS generated too many sync requests, which flooded the e1000. As a result, the cluster heartbeat missed its interval.
It's a possibility. But it also assumes that the node freezes >because< it was fenced off. So far nothing indicates that.
Re-read your console log. There are many footprints of spin_lock - that's worrisome. Next time you have a hang, hit "sysrq-w" a couple of times in addition to sysrq-t. This should give traces of the threads that are actively on CPUs at that time. Also check your kernel change log (to see whether GFS has any new patch touching spin locks that wasn't in the previous release).
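For reference, a minimal sketch of requesting the same dumps from a root shell through /proc/sysrq-trigger, in case a shell is still responsive during the hang (if not, the keyboard Alt-SysRq combination is the fallback). The function names and the SYSRQ_TRIGGER override are hypothetical and exist only for illustration and dry runs; what the "w" key dumps depends on the kernel version.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any GFS or cluster tooling.
# SYSRQ_TRIGGER can be pointed at a scratch file for a dry run.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Write one SysRq command key (for example w or t) to the trigger file.
sysrq_send() {
    printf '%s' "$1" > "$SYSRQ_TRIGGER"
}

# Send the same key COUNT times, DELAY seconds apart, so that
# successive dumps can be compared to see which threads stay stuck.
sysrq_repeat() {
    key=$1
    count=$2
    delay=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        sysrq_send "$key"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
    done
}
```

Run as root, something like "sysrq_repeat w 3 5" followed by "sysrq_send t" would request three "w" dumps five seconds apart and then one full task dump; the output goes to the kernel log and console, not to the shell.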
BTW, I do have opinions on other parts of your postings but don't have time to express them now. Maybe I'll say something when I finish my current chores :) ... Need to rush out now. Good luck with your debugging!