[Linux-cluster] Oops


I wasn't sure whether to send this to LKML or here, but DLM seems
involved.  Please let me know if I should repost it somewhere else.

It's a vanilla 2.6.21 kernel patched with cluster-2.00.00 (plus the
three extra exports for GFS1).  Config attached.  The machine froze
during the morning updatedb cronjob, which ran a recursive find over
the shared GFS filesystem.  Two other nodes doing the same at the
same time are still up.

I experienced a similar hang with cluster-1 not long ago, though that
one locked up only the cluster software, not the whole machine.

Please ask if I've left out any necessary information.

clvm: 2.02.26
libdevmapper: 1.02.19
openais: 0.80.2
Otherwise a stock Debian Etch system.

kernel BUG at kernel/workqueue.c:212!
invalid opcode: 0000 [#1]
Modules linked in: button ac battery ipv6 gfs lock_nolock lock_dlm gfs2 dlm configfs loop evdev i2c_piix4 pcspkr psmouse rtc serio_raw sworks_agp agpgart i2c_core xfs dm_mirror dm_snapshot ide_generic dm_round_robin dm_emc dm_multipath dm_mod sd_mod ide_disk ata_generic libata serverworks ohci_hcd generic qla2xxx firmware_class scsi_transport_fc scsi_mod usbcore tg3 ide_core thermal processor fan
CPU:    2
EIP:    0060:[<c012f476>]    Not tainted VLI
EFLAGS: 00010213   (2.6.21gfs-xeon #2)
EIP is at queue_work+0x2f/0x49
eax: dfb176e4   ebx: 00000002   ecx: f7e66a80   edx: dfb176e0
esi: 00000002   edi: e2bfa080   ebp: 00000000   esp: f7a91bb4
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process dlm_recv/2 (pid: 10261, ti=f7a90000 task=c196aa50 task.ti=f7a90000)
Stack: f798d434 f7c5a980 c026dc79 ab0ee1c1 e2bfa080 dfaea000 f798d434 00200000 
       00000020 00000000 c1b6bd80 0101e520 e2bfa080 e2bfa080 c0272f90 000000d0 
       0000000e f7c5a980 00000000 00000039 00000000 00000000 00000000 00000286 
Call Trace:
 [<c026dc79>] tcp_rcv_established+0x53a/0x7d1
 [<c0272f90>] tcp_v4_do_rcv+0x28/0x2c5
 [<c0275306>] tcp_v4_rcv+0x81b/0x88d
 [<c02957a8>] packet_rcv_spkt+0x0/0x150
 [<c024035d>] dev_hard_start_xmit+0x1be/0x21d
 [<c025ccef>] ip_local_deliver+0x187/0x230
 [<c025cb2f>] ip_rcv+0x409/0x442
 [<c02958ed>] packet_rcv_spkt+0x145/0x150
 [<c011b434>] __wake_up+0x32/0x43
 [<c023ff15>] netif_receive_skb+0x2dc/0x350
 [<f8879cfa>] tg3_poll+0x5b6/0x82f [tg3]
 [<c0241a00>] net_rx_action+0x9d/0x1a8
 [<c012608e>] __do_softirq+0x66/0xcc
 [<c0126137>] do_softirq+0x43/0x51
 [<c010648f>] do_IRQ+0x5c/0x71
 [<c010474b>] common_interrupt+0x23/0x28
 [<c0134e03>] down_read_trylock+0x10/0x1d
 [<f8c9d90a>] dlm_receive_message+0xa2/0xc0b [dlm]
 [<c023870d>] sock_common_recvmsg+0x3e/0x54
 [<c02371ff>] sock_recvmsg+0xec/0x107
 [<f8c9fe36>] dlm_process_incoming_buffer+0x11a/0x18c [dlm]
 [<f8ca3e4c>] receive_from_sock+0x124/0x217 [dlm]
 [<c010648f>] do_IRQ+0x5c/0x71
 [<f8ca3b4e>] process_recv_sockets+0xf/0x15 [dlm]
 [<c012f559>] run_workqueue+0x85/0x125
 [<f8ca3b3f>] process_recv_sockets+0x0/0x15 [dlm]
 [<c012fde7>] worker_thread+0xf9/0x124
 [<c011d23f>] default_wake_function+0x0/0xc
 [<c012fcee>] worker_thread+0x0/0x124
 [<c013248a>] kthread+0xb2/0xdc
 [<c01323d8>] kthread+0x0/0xdc
 [<c0104993>] kernel_thread_helper+0x7/0x10
Code: 64 8b 35 04 00 00 00 f0 0f ba 2a 00 19 c0 31 db 85 c0 75 2c 8d 41 08 39 41 08 8b 1d f4 94 39 c0 0f 45 de 8d 42 04 39 42 04 74 04 <0f> 0b eb fe 8b 01 f7 d0 8b 04 98 e8 34 ff ff ff bb 01 00 00 00 
EIP: [<c012f476>] queue_work+0x2f/0x49 SS:ESP 0068:f7a91bb4
Kernel panic - not syncing: Fatal exception in interrupt

Attachment: config.gz
Description: Binary data

