
[Linux-cluster] kernel oops on mount and sendmsg failed: -22



I have a two-node cluster: one node (node A) runs Linux kernel 2.6.11.12, while the other (node B) runs 2.6.18. Both are running cman_tool version 5.0.1. I start node A first, then node B joins. Node A can mount the GFS file systems, but when node B tries to mount them, it gets a kernel oops, which is pasted at the end of this email (see "KERNEL OOPS output"). So I reboot node B and try to rejoin, but it doesn't seem able to communicate with node A correctly, as if the cluster were in some stale state (see "node B rejoin kernel messages"). Looking at node A, it appears to have received the join message but apparently never sent an ack or similar, and then node A simply leaves the cluster (see "node A kernel messages").

I suspect the problem is that the two nodes run different versions of the cluster software (even though --version reports the same on both), but the newest -rSTABLE no longer compiles against 2.6.11.12. What is the recommended solution for a cluster that must run different kernel versions?

tia,
dan

---

<KERNEL OOPS output>

BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c
printing eip:
c01825e6
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core usbcore tg3 thermal processor fan unix
CPU:    2
EIP:    0060:[<c01825e6>]    Tainted: GF     VLI
EFLAGS: 00010293   (2.6.18 #1)
EIP is at do_add_mount+0x66/0x130
eax: 0000000c   ebx: f3843f24   ecx: c24fbac0   edx: f443f550
esi: df907200   edi: 00000000   ebp: 00000000   esp: f3843df4
ds: 007b   es: 007b   ss: 0068
Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000)
Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d df907200 f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe 00000000 c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 df98330c
Call Trace:
[<c018321d>] do_mount+0x33d/0x760
[<c0175080>] link_path_walk+0x80/0x100
[<c01507e3>] __handle_mm_fault+0x233/0x980
[<c0150a86>] __handle_mm_fault+0x4d6/0x980
[<c0147cdf>] __alloc_pages+0x4f/0x2f0
[<c0147fad>] __get_free_pages+0x2d/0x40
[<c0181ed7>] copy_mount_options+0x47/0x130
[<c01836dd>] sys_mount+0x9d/0xe0
[<c01031fb>] syscall_call+0x7/0xb
Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44
EIP: [<c01825e6>] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4

<node B rejoin kernel messages>
CMAN: Waiting to join or form a Linux-cluster
CMAN: sending membership request (message repeated 30 times)
CMAN: Been in JOINWAIT for too long - giving up
CMAN: sendmsg failed: -22

<node A kernel messages>
CMAN: node blade14 rejoining
CMAN: too many transition restarts - will die
CMAN: we are leaving the cluster. Inconsistent cluster view
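
For reference, the -22 in "sendmsg failed: -22" looks like a negated errno value, which is how kernel code conventionally reports errors. A minimal sketch for decoding it, assuming Linux errno numbering (the helper name decode_kernel_errno is made up for illustration and is not part of any cluster tool):

```python
import errno
import os


def decode_kernel_errno(code: int) -> str:
    """Map a kernel-style negative return code to its errno name and message.

    Assumption: the value follows the usual kernel convention of returning
    -errno on failure, so -22 corresponds to errno 22.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


# The value reported by node B in "CMAN: sendmsg failed: -22"
print(decode_kernel_errno(-22))
```

On Linux, errno 22 is EINVAL (invalid argument). That only says sendmsg rejected its arguments; it does not by itself say why.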

