
[Linux-cluster] lock_dlm - unable to handle kernel NULL pointer dereference



Hi,

We have a cluster of two RH 2.6.7 SMP machines using GFS, and we experience
random stability issues.
Every two days or so, a lock_dlm error message is dumped to the log (see
below). Then either both machines become unable to access the GFS file system
(ls, df, ... hang), or a random process that was accessing a file hangs on
one of the machines (always a different process: tar, gzip, mv, ...) and
cannot be terminated.
At this point the only thing we can do is reboot both nodes.

We haven't found a way to reproduce this problem; it seems to happen
randomly.

We have tried the following to eliminate the problem (without success or
improvement):

- Shut down machine A and run all services on machine B
- Shut down machine B and run all services on machine A
- Disable heavy I/O on both machines (mainly full daily backups)

The error message is the following:

------

Sep 13 15:05:43 L1_OAS56_B kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000005
Sep 13 15:05:43 L1_OAS56_B kernel: printing eip:
Sep 13 15:05:43 L1_OAS56_B kernel: c013a1f6
Sep 13 15:05:43 L1_OAS56_B kernel: *pde = 17aea001
Sep 13 15:05:43 L1_OAS56_B kernel: Oops: 0002 [#1]
Sep 13 15:05:43 L1_OAS56_B kernel: SMP
Sep 13 15:05:43 L1_OAS56_B kernel: Modules linked in: nfsd exportfs ipv6 autofs e1000 af_packet parport_pc parport ohci_hcd ehci_hcd lock_dlm dlm cman gfs lock_harness dm_mod floppy uhci_hcd usbcore thermal processor fan button battery asus_acpi ac ext3 jbd loop ide_cd cdrom qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod i2o_block i2o_core
Sep 13 15:05:43 L1_OAS56_B kernel: CPU: 2
Sep 13 15:05:43 L1_OAS56_B kernel: EIP: 0060:[<c013a1f6>] Not tainted
Sep 13 15:05:43 L1_OAS56_B kernel: EFLAGS: 00010083 (2.6.7)
Sep 13 15:05:43 L1_OAS56_B kernel: EIP is at find_get_pages+0x41/0x5a
Sep 13 15:05:43 L1_OAS56_B kernel: eax: 00000001 ebx: d6d2de4c ecx: 00000010 edx: 00000004
Sep 13 15:05:43 L1_OAS56_B kernel: esi: f274a724 edi: e00f2240 ebp: d6d2ddfc esp: d6d2dde4
Sep 13 15:05:43 L1_OAS56_B kernel: ds: 007b es: 007b ss: 0068
Sep 13 15:05:43 L1_OAS56_B kernel: Process lock_dlm (pid: 1575, threadinfo=d6d2c000 task=f7b945c0)
Sep 13 15:05:43 L1_OAS56_B kernel: Stack: f274a728 d6d2de4c 00000000 00000010 d6d2de44 f274a724 d6d2de18 c01441ed
Sep 13 15:05:43 L1_OAS56_B kernel: f274a724 00000000 00000010 d6d2de4c 00000000 d6d2dea0 c01444d0 d6d2de44
Sep 13 15:05:43 L1_OAS56_B kernel: f274a724 00000000 00000010 c3207870 00000000 d6d2c000 00000000 00000000
Sep 13 15:05:43 L1_OAS56_B kernel: Call Trace:
Sep 13 15:05:43 L1_OAS56_B kernel: [<c0106c6b>] show_stack+0x80/0x96
Sep 13 15:05:43 L1_OAS56_B kernel: [<c0106e02>] show_registers+0x15f/0x1ae
Sep 13 15:05:43 L1_OAS56_B kernel: [<c0106f77>] die+0x8d/0xfb
Sep 13 15:05:43 L1_OAS56_B kernel: [<c0117e86>] do_page_fault+0x270/0x579
Sep 13 15:05:43 L1_OAS56_B kernel: [<c0106911>] error_code+0x2d/0x38
Sep 13 15:05:43 L1_OAS56_B kernel: [<c01441ed>] pagevec_lookup+0x2c/0x35
Sep 13 15:05:43 L1_OAS56_B kernel: [<c01444d0>] truncate_inode_pages+0x71/0x29f
Sep 13 15:05:43 L1_OAS56_B kernel: [<fa9bdc40>] gfs_inval_buf+0x45/0x88 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel: [<fa9cd06b>] inode_go_inval+0x45/0x4f [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel: [<fa9c9ec3>] drop_bh+0x15f/0x1d6 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel: [<fa9cb4bd>] gfs_glock_cb+0x167/0x1f4 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel: [<fa928ace>] process_complete+0x103/0x34c [lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel: [<fa928ee2>] dlm_async+0x1cb/0x290 [lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel: [<c0104291>] kernel_thread_helper+0x5/0xb
Sep 13 15:05:43 L1_OAS56_B kernel:
Sep 13 15:05:43 L1_OAS56_B kernel: Code: f0 ff 40 04 83 c2 01 39 ca 72 f2 c6 46 10 01 fb 83 c4 10 5b

------
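
For anyone who wants to compare this trace against other oops reports, the
call-trace entries can be pulled out of the syslog text with a short script.
The following is only a minimal Python sketch: the helper name
parse_call_trace and the regular expression are illustrative assumptions
based on the oops format shown above, not part of any GFS or kernel tooling.
Entries without a [module] tag are labelled "kernel" purely as a convention
of this sketch.

```python
import re

# Matches entries such as "[<fa9bdc40>] gfs_inval_buf+0x45/0x88 [gfs]".
# Assumption: 8-digit i386 addresses, as in the log above. The optional
# trailing "[module]" may sit on the next line if the mailer wrapped it.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{8})>\]\s+"
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def parse_call_trace(log_text):
    """Return (symbol, module) tuples in the order they appear in the text."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        # Frames with no module tag belong to the core kernel image.
        frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames

# Three entries copied from the trace, including the mailer's line wrapping.
sample = (
    "Sep 13 15:05:43 L1_OAS56_B kernel: [<c01444d0>] truncate_inode_pages+0x71/0x29f Sep 13\n"
    "15:05:43 L1_OAS56_B kernel: [<fa9bdc40>] gfs_inval_buf+0x45/0x88 [gfs] Sep\n"
    "13 15:05:43 L1_OAS56_B kernel: [<fa9cd06b>] inode_go_inval+0x45/0x4f [gfs]\n"
)

for symbol, module in parse_call_trace(sample):
    print(f"{module:10s} {symbol}")
```

Because the pattern keys on the "[<address>] symbol+0xoff/0xsize" shape
rather than on line boundaries, it works on the wrapped log text as well as
on one-entry-per-line output.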

Any idea what's wrong or what we should check next?
Is it possible to "unlock" the machines after such an error without reboot?

The release version is DEVEL.1090589850.

Thanks for your help,

Stéphane Messerli

stephane messerli urbanet ch
Senior Support & Project Engineer, Technology Europe
24/7 Real Media (NASDAQ: TFSM)
Route de la Pierre
1024 Ecublens
Switzerland
tel. +41 21 695 97 46
fax +41 21 695 97 01



