[Linux-cluster] dlm kernel bug (node crashed)

Fabrizio Lippolis Fabrizio.Lippolis at AurigaInformatica.it
Tue Mar 21 11:35:05 UTC 2006


I have set up a linux cluster consisting of two nodes and a disk array 
formatted with GFS. I have configured the cluster to run a mysql process 
where the data files are on the disk array. Though it isn't yet in a 
production environment seems that everything works fine. When I kill the 
mysql process running on one machine, the other mounts the shared 
filesystem and starts the mysql process itself.

Yesterday I experienced a crash of the node running the mysql process 
appareantly with no reason. I had to brutally switch off the node and 
restart since it was completely locked. Looking in the /var/log/messages 
file I found the following:

Mar 20 17:30:00 AICLSRV02 kernel: CMAN: Being told to leave the cluster 
by node 2
Mar 20 17:30:00 AICLSRV02 kernel: CMAN: we are leaving the cluster.
Mar 20 17:30:00 AICLSRV02 kernel: SM: 0000002c sm_stop: SG still joined
Mar 20 17:30:00 AICLSRV02 kernel: SM: 0100002d sm_stop: SG still joined
Mar 20 17:30:00 AICLSRV02 kernel: SM: 0200004f sm_stop: SG still joined
Mar 20 17:30:00 AICLSRV02 kernel: SM: 03000044 sm_stop: SG still joined
Mar 20 17:30:00 AICLSRV02 clurgmgrd[24620]: <warning> #67: Shutting down 
uncleanly
Mar 20 17:30:00 AICLSRV02 ccsd[4847]: Cluster manager shutdown. 
Attemping to reconnect...
Mar 20 17:30:01 AICLSRV02 kernel: t 179
Mar 20 17:30:01 AICLSRV02 kernel: clvmd add node 2
Mar 20 17:30:01 AICLSRV02 kernel: clvmd total nodes 2
Mar 20 17:30:01 AICLSRV02 kernel: clvmd rebuild resource directory
Mar 20 17:30:01 AICLSRV02 kernel: clvmd rebuilt 0 resources
Mar 20 17:30:01 AICLSRV02 kernel: clvmd purge requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd purged 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd mark waiting requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd marked 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd recover event 179 done
Mar 20 17:30:01 AICLSRV02 kernel: clvmd move flags 0,0,1 ids 176,179,179
Mar 20 17:30:01 AICLSRV02 kernel: clvmd process held requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd processed 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd resend marked requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd resent 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: clvmd recover event 179 finished
Mar 20 17:30:01 AICLSRV02 kernel: Magma move flags 1,0,0 ids 173,173,173
Mar 20 17:30:01 AICLSRV02 kernel: Magma move flags 0,1,0 ids 173,181,173
Mar 20 17:30:01 AICLSRV02 kernel: Magma move use event 181
Mar 20 17:30:01 AICLSRV02 kernel: Magma recover event 181
Mar 20 17:30:01 AICLSRV02 kernel: Magma add node 2
Mar 20 17:30:01 AICLSRV02 kernel: Magma total nodes 2
Mar 20 17:30:01 AICLSRV02 kernel: Magma rebuild resource directory
Mar 20 17:30:01 AICLSRV02 kernel: Magma rebuilt 0 resources
Mar 20 17:30:01 AICLSRV02 kernel: Magma purge requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma purged 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma mark waiting requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma marked 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma recover event 181 done
Mar 20 17:30:01 AICLSRV02 kernel: Magma move flags 0,0,1 ids 173,181,181
Mar 20 17:30:01 AICLSRV02 kernel: Magma process held requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma processed 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma resend marked requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma resent 0 requests
Mar 20 17:30:01 AICLSRV02 kernel: Magma recover event 181 finished
Mar 20 17:30:01 AICLSRV02 kernel: Magma send einval to 2
Mar 20 17:30:01 AICLSRV02 kernel: Magma (24620) req reply einval 14b026a 
fr 2 r 2 usrm::rg="mysql
Mar 20 17:30:01 AICLSRV02 kernel: lk 2,66e728 id 0 -1,3 10001
Mar 20 17:30:01 AICLSRV02 kernel: 18109 qc 2,66e728 -1,3 id 1c00ec sts 0 0
Mar 20 17:30:01 AICLSRV02 kernel: 7078 lk 2,66e725 id 0 -1,3 10001
Mar 20 17:30:01 AICLSRV02 kernel: 18109 qc 2,66e725 -1,3 id 2202ce sts 0 0
Mar 20 17:30:01 AICLSRV02 kernel: 7078 lk 2,66e284 id 0 -1,3 10001
Mar 20 17:30:01 AICLSRV02 kernel: 18109 qc 2,66e284 -1,3 id 1801d9 sts 0 0

[...]

Mar 20 17:30:04 AICLSRV02 kernel: 7544 lk 2,14c8 id 0 -1,3 10000
Mar 20 17:30:04 AICLSRV02 kernel:
Mar 20 17:30:04 AICLSRV02 kernel: lock_dlm:  Assertion failed on line 
411 of file /usr/src/build/573241-i686/BUILD/smp/src/dlm/lock.c
Mar 20 17:30:04 AICLSRV02 kernel: lock_dlm:  assertion:  "!error"
Mar 20 17:30:04 AICLSRV02 kernel: lock_dlm:  time = 1565317077
Mar 20 17:30:04 AICLSRV02 kernel: AICLSRV01: num=2,14c8 err=-22 cur=-1 
req=3 lkf=10000
Mar 20 17:30:04 AICLSRV02 kernel:
Mar 20 17:30:04 AICLSRV02 kernel: ------------[ cut here ]------------
Mar 20 17:30:04 AICLSRV02 kernel: kernel BUG at 
/usr/src/build/573241-i686/BUILD/smp/src/dlm/lock.c:411!
Mar 20 17:30:04 AICLSRV02 kernel: invalid operand: 0000 [#1]
Mar 20 17:30:04 AICLSRV02 kernel: SMP
Mar 20 17:30:04 AICLSRV02 kernel: Modules linked in: lock_dlm(U) gfs(U) 
lock_harness(U) dlm(U) cman(U) parport_pc lp parport autofs4 rfcomm 
l2cap bluetooth sunrpc dm_mod video button battery ac md5 ipv6 uhci_hcd 
ehci_hcd hw_random tg3 floppy ext3 jbd cciss sd_mod scsi_mod
Mar 20 17:30:04 AICLSRV02 kernel: CPU:    1
Mar 20 17:30:04 AICLSRV02 kernel: EIP:    0060:[<f8964bf5>]    Not 
tainted VLI
Mar 20 17:30:04 AICLSRV02 kernel: EFLAGS: 00010296   (2.6.11-1.1369_FC4smp)
Mar 20 17:30:04 AICLSRV02 kernel: EIP is at do_dlm_lock+0x1b7/0x21d 
[lock_dlm]
Mar 20 17:30:04 AICLSRV02 kernel: eax: 00000004   ebx: 00000000   ecx: 
c035ea4c   edx: 00000296
Mar 20 17:30:04 AICLSRV02 kernel: esi: f27f4d80   edi: ffffffea   ebp: 
00000002   esp: f4bb2d88
Mar 20 17:30:04 AICLSRV02 kernel: ds: 007b   es: 007b   ss: 0068
Mar 20 17:30:04 AICLSRV02 kernel: Process lsof (pid: 7544, 
threadinfo=f4bb2000 task=c36cea80)
Mar 20 17:30:04 AICLSRV02 kernel: Stack: f8969e75 f27f4d80 00000002 
000014c8 00000000 ffffffea ffffffff 00000003
Mar 20 17:30:04 AICLSRV02 kernel:        00010000 00000003 00000000 
f5ee4000 00000001 00010000 202000d0 20202020
Mar 20 17:30:04 AICLSRV02 kernel:        20203220 20202020 20202020 
34312020 00183863 f5736200 00000003 00000008
Mar 20 17:30:04 AICLSRV02 kernel: Call Trace:
Mar 20 17:30:04 AICLSRV02 kernel:  [<f8964cff>] lm_dlm_lock+0x52/0x5e 
[lock_dlm]
Mar 20 17:30:04 AICLSRV02 kernel:  [<f8964cad>] lm_dlm_lock+0x0/0x5e 
[lock_dlm]
Mar 20 17:30:04 AICLSRV02 kernel:  [<f8adcffc>] gfs_lm_lock+0x3d/0x5c [gfs]
Mar 20 17:30:04 AICLSRV02 kernel:  [<f8ad2039>] 
gfs_glock_xmote_th+0xae/0x1d3 [gfs]
Mar 20 17:30:04 AICLSRV02 kernel:  [<f8ad163c>] rq_promote+0x126/0x150 [gfs]
Mar 20 17:30:05 AICLSRV02 kernel:  [<f8ad1840>] run_queue+0xee/0x113 [gfs]
Mar 20 17:30:05 AICLSRV02 kernel:  [<f8ad2aea>] gfs_glock_nq+0x93/0x144 
[gfs]
Mar 20 17:30:05 AICLSRV02 kernel:  [<c016f43b>] link_path_walk+0x53/0xe0
Mar 20 17:30:05 AICLSRV02 kernel:  [<f8ad3196>] 
gfs_glock_nq_init+0x18/0x2d [gfs]
Mar 20 17:30:05 AICLSRV02 kernel:  [<f8aea0fe>] gfs_getattr+0x3b/0x5b [gfs]
Mar 20 17:30:05 AICLSRV02 kernel:  [<f8aea0c3>] gfs_getattr+0x0/0x5b [gfs]
Mar 20 17:30:05 AICLSRV02 kernel:  [<c016a42b>] vfs_getattr+0x40/0xa2
Mar 20 17:30:05 AICLSRV02 kernel:  [<c016a4b5>] vfs_stat+0x28/0x3a
Mar 20 17:30:05 AICLSRV02 kernel:  [<c0142c36>] 
audit_syscall_entry+0x132/0x160
Mar 20 17:30:05 AICLSRV02 kernel:  [<c016aa30>] sys_stat64+0xf/0x28
Mar 20 17:30:05 AICLSRV02 kernel:  [<c01086f9>] do_syscall_trace+0xef/0x123
Mar 20 17:30:05 AICLSRV02 kernel:  [<c0104025>] syscall_call+0x7/0xb
Mar 20 17:30:05 AICLSRV02 kernel: Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 
89 6c 24 08 89 74 24 04 c7 04 24 28 a6 96 f8 e8 6e cf 7b c7 c7 04 24 75 
9e 96 f8 e8 62 cf 7b c7 <0f> 0b 9b 01 a0 a4 96 f8 c7 04 24 3c a5 96 f8 
e8 7a c5 7b c7 66
Mar 20 17:30:05 AICLSRV02 kernel:  <0>Fatal exception: panic in 5 seconds



On the other node, I see this log:

Mar 20 17:29:03 AICLSRV01 kernel: CMAN: removing node AICLSRV02 from the 
cluster : Missed too many heartbeats
Mar 20 17:29:03 AICLSRV01 fenced[2092]: AICLSRV02 not a cluster member 
after 0 sec post_fail_delay
Mar 20 17:29:03 AICLSRV01 fenced[2092]: fencing node "AICLSRV02"
Mar 20 17:29:03 AICLSRV01 fence_manual: Node AICLSRV02 needs to be reset 
before recovery can procede.  Waiting for AICLSRV02 to rejoin the 
cluster or for manual acknowledgement that it has been reset (i.e. 
fence_ack_manual -n AICLSRV02)


Any suggestion about what happened? Any way to avoid it? Thanks in advance.


Fabrizio Lippolis




More information about the Linux-cluster mailing list