
[Linux-cluster] GFS2 interesting death with error



Saw an interesting and different GFS2 death this morning that I wanted to pass along in case anyone has insights. We have not seen any of the "hanging in dlm_posix_lock" failures since running fsck early Sunday morning. In any case, I'm pretty confident those are triggered by the creation & deletion of ".lock" files within Dovecot. This was something completely different, and it left some potentially useful debug info in the logs.

Things were running fine when the machine "post2" abruptly died. The following was found to have been inscribed upon its stone logs:

Nov  5 10:56:28 post2 kernel: original: gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov  5 10:56:28 post2 kernel: pid : 27197
Nov  5 10:56:28 post2 kernel: lock type: 2 req lock state : 3
Nov  5 10:56:28 post2 kernel: new: gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov  5 10:56:28 post2 kernel: pid: 27197
Nov  5 10:56:28 post2 kernel: lock type: 2 req lock state : 3
Nov  5 10:56:28 post2 kernel: G: s:SH n:2/2053b f:s t:SH d:EX/0 l:0 a:0 r:4
Nov  5 10:56:28 post2 kernel: H: s:SH f:H e:0 p:27197 [procmail] gfs2_rindex_hold+0x32/0x153 [gfs2]
Nov  5 10:56:28 post2 kernel:   I: n:23/132411 t:8 f:0x00000010
Nov  5 10:56:28 post2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Nov  5 10:56:32 post2 kernel: Kernel BUG at ...ir/build/BUILD/gfs2-kmod-1.92/_kmod_build_/glock.c:950
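
For anyone wanting to compare dumps like this across crashes, here is a minimal sketch of splitting the G:/H:/I: lines above into their key:value fields. The helper name parse_glock_line is hypothetical and not part of GFS2 or any cluster tool; it assumes only the syslog layout shown in this excerpt, and it does not try to interpret what each field means.

```python
import re

# Matches syslog lines such as "... kernel: G: s:SH n:2/2053b ...".
# Assumption: the record tag is a single letter G, H or I, as in the excerpt.
LINE_RE = re.compile(r"kernel:\s+(?P<tag>[GHI]):\s+(?P<body>.*)$")


def parse_glock_line(line):
    """Return (tag, fields, extras) for a G:/H:/I: line, or None otherwise.

    fields holds the single-letter key:value pairs; extras keeps any
    remaining tokens (process name, call site) in their original order.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = {}
    extras = []
    for token in match.group("body").split():
        key, sep, value = token.partition(":")
        if sep and len(key) == 1 and key.isalpha():
            fields[key] = value
        else:
            extras.append(token)
    return match.group("tag"), fields, extras


if __name__ == "__main__":
    sample = [
        "Nov  5 10:56:28 post2 kernel: G: s:SH n:2/2053b f:s t:SH d:EX/0 l:0 a:0 r:4",
        "Nov  5 10:56:28 post2 kernel: H: s:SH f:H e:0 p:27197 [procmail] "
        "gfs2_rindex_hold+0x32/0x153 [gfs2]",
        "Nov  5 10:56:28 post2 kernel:   I: n:23/132411 t:8 f:0x00000010",
    ]
    for entry in sample:
        print(parse_glock_line(entry))
```

Lines that are not G:/H:/I: records (the "original:", "pid" and "lock type" lines) return None, so the whole excerpt can be fed through unchanged.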

The fact that it died in procmail indicates that the failure occurred while writing mail to someone's Inbox. The system wasn't heavily loaded -- the load averages were a little below 1.0 at the time of the crash.

Also interesting is what happened next. The load average on post1 (the only other node) shot up over 100, as numerous processes were blocked. It spent several minutes with an administrative process using 100% of a CPU -- I believe it was dlm_recoverd though I'm not 100% certain. Then, just as the load average had come back down to 15-20 and functionality was returning, it abruptly hung. At this point I reset both cluster nodes and all was well.

Anyway, if you've seen anything like this or have a clue as to the cause, I'd love to hear it. Looks like more lock-related glitchiness in our relatively lock-intensive environment.

Thanks,
Allen

--
Allen Belletti
allen isye gatech edu                             404-894-6221 Phone
Industrial and Systems Engineering                404-385-2988 Fax
Georgia Institute of Technology

