
[Linux-cluster] GFS2 + NFS crash BUG: Unable to handle kernel NULL pointer dereference



Hello everyone!

I've set up a cluster in order to use GFS2. The cluster works really well ;)
Then I exported the GFS2 filesystem via NFS to share it with machines outside the cluster. Reading works OK, but as soon as I try to write to it, the filesystem seems to hang:

root@file03:~# mount filepro01:/mnt/gfs /mnt/tmp -o soft
root@file03:~# ls /mnt/tmp/
algo  caca  caca2  testa
root@file03:~# mkdir /mnt/tmp/otracosa

At this point NFS stopped working. On the NFS client I can see:

[11132241.127470] nfs: server filepro01 not responding, timed out

However, the directory was indeed created, and the other node can continue using the GFS2 filesystem locally. Looking at the logs on the NFS server (filepro01), I found some nasty things. This first part is the filesystem being mounted, which looks OK:

[6234925.738508] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[6234925.787305] NFSD: starting 90-second grace period
[6234925.825811] GFS2 (built Feb  7 2011 16:11:33) installed
[6234925.826698] GFS2: fsid=: Trying to join cluster "lock_dlm", "wtn_cluster:file01"
[6234925.886991] GFS2: fsid=wtn_cluster:file01.0: Joined cluster. Now mounting FS...
[6234925.975113] GFS2: fsid=wtn_cluster:file01.0: jid=0, already locked for use
[6234925.975116] GFS2: fsid=wtn_cluster:file01.0: jid=0: Looking at journal...
[6234926.075105] GFS2: fsid=wtn_cluster:file01.0: jid=0: Acquiring the transaction lock...
[6234926.075152] GFS2: fsid=wtn_cluster:file01.0: jid=0: Replaying journal...
[6234926.076200] GFS2: fsid=wtn_cluster:file01.0: jid=0: Replayed 8 of 9 blocks
[6234926.076204] GFS2: fsid=wtn_cluster:file01.0: jid=0: Found 1 revoke tags
[6234926.076649] GFS2: fsid=wtn_cluster:file01.0: jid=0: Journal replayed in 1s
[6234926.076800] GFS2: fsid=wtn_cluster:file01.0: jid=0: Done
[6234926.076945] GFS2: fsid=wtn_cluster:file01.0: jid=1: Trying to acquire journal lock...
[6234926.078723] GFS2: fsid=wtn_cluster:file01.0: jid=1: Looking at journal...
[6234926.257645] GFS2: fsid=wtn_cluster:file01.0: jid=1: Done
[6234926.258187] GFS2: fsid=wtn_cluster:file01.0: jid=2: Trying to acquire journal lock...
[6234926.260966] GFS2: fsid=wtn_cluster:file01.0: jid=2: Looking at journal...
[6234926.549636] GFS2: fsid=wtn_cluster:file01.0: jid=2: Done
[6234930.789787] ipmi message handler version 39.2

And when we try to write from the NFS client, bang:

[6235083.656954] BUG: unable to handle kernel NULL pointer dereference at 00000024
[6235083.656973] IP: [<ee2d6c1e>] gfs2_drevalidate+0xe/0x200 [gfs2]
[6235083.656992] *pdpt = 0000000001831027 *pde = 0000000000000000
[6235083.657003] Oops: 0000 [#1] SMP
[6235083.657012] last sysfs file: /sys/module/dlm/initstate
[6235083.657018] Modules linked in: ipmi_msghandler xenfs gfs2 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dlm configfs nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd lru_cache lp parport [last unloaded: scsi_transport_iscsi]
[6235083.657090]
[6235083.657095] Pid: 1497, comm: nfsd Tainted: G W 2.6.38-2-virtual #29~lucid1-Ubuntu /
[6235083.657103] EIP: 0061:[<ee2d6c1e>] EFLAGS: 00010282 CPU: 0
[6235083.657115] EIP is at gfs2_drevalidate+0xe/0x200 [gfs2]
[6235083.657120] EAX: eb9d7180 EBX: eb9d7180 ECX: ee2ec000 EDX: 00000000
[6235083.657127] ESI: eb924580 EDI: 00000000 EBP: c1dc5c68 ESP: c1dc5c20
[6235083.657133]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[6235083.657139] Process nfsd (pid: 1497, ti=c1dc4000 task=c1b18ca0 task.ti=c1dc4000)
[6235083.657145] Stack:
[6235083.657150] c1dc5c28 c0627afd c1dc5c68 c0242314 00000000 c1dc5c7c ee2dba0c ee2c02d0
[6235083.657170] 00000001 eb924580 c1a47038 c1dc5cb0 eb9d7188 00000004 14a2fc97 eb9d7180
[6235083.657190] eb924580 00000000 c1dc5c7c c023a18f eb9d7180 eb924580 eb925000 c1dc5ca0
[6235083.657210] Call Trace:
[6235083.657220]  [<c0627afd>] ? _raw_spin_lock+0xd/0x10
[6235083.657230]  [<c0242314>] ? __d_lookup+0xf4/0x150
[6235083.657242]  [<ee2dba0c>] ? gfs2_permission+0xcc/0x120 [gfs2]
[6235083.657253]  [<ee2c02d0>] ? gfs2_check_acl+0x0/0x80 [gfs2]
[6235083.657263]  [<c023a18f>] d_revalidate+0x1f/0x60
[6235083.657271]  [<c023a2e2>] __lookup_hash+0xa2/0x180
[6235083.657284]  [<edd8e266>] ? encode_post_op_attr+0x86/0x90 [nfsd]
[6235083.657292]  [<c023a4c3>] lookup_one_len+0x43/0x80
[6235083.657303]  [<edd8d13f>] compose_entry_fh+0x9f/0xe0 [nfsd]
[6235083.657315]  [<edd8e491>] encode_entryplus_baggage+0x51/0xb0 [nfsd]
[6235083.657327]  [<edd8e795>] encode_entry+0x2a5/0x2f0 [nfsd]
[6235083.657338]  [<edd8e820>] nfs3svc_encode_entry_plus+0x40/0x50 [nfsd]
[6235083.657349]  [<edd8366d>] nfsd_buffered_readdir+0xfd/0x1a0 [nfsd]
[6235083.657361]  [<edd8e7e0>] ? nfs3svc_encode_entry_plus+0x0/0x50 [nfsd]
[6235083.657372]  [<edd852a0>] nfsd_readdir+0x70/0xb0 [nfsd]
[6235083.657383]  [<edd8bd58>] nfsd3_proc_readdirplus+0xd8/0x200 [nfsd]
[6235083.657394]  [<edd8e7e0>] ? nfs3svc_encode_entry_plus+0x0/0x50 [nfsd]
[6235083.657405]  [<edd7f3a3>] nfsd_dispatch+0xd3/0x210 [nfsd]
[6235083.657423]  [<edd0fd83>] svc_process_common+0x2e3/0x590 [sunrpc]
[6235083.657438]  [<edd1c86d>] ? svc_xprt_received+0x2d/0x40 [sunrpc]
[6235083.657452]  [<edd1cd0b>] ? svc_recv+0x48b/0x750 [sunrpc]
[6235083.657465]  [<edd1010c>] svc_process+0xdc/0x140 [sunrpc]
[6235083.657474]  [<c0627010>] ? down_read+0x10/0x20
[6235083.657483]  [<edd7fa54>] nfsd+0xb4/0x140 [nfsd]
[6235083.657493]  [<c0143b9e>] ? complete+0x4e/0x60
[6235083.657503]  [<edd7f9a0>] ? nfsd+0x0/0x140 [nfsd]
[6235083.657513]  [<c0173354>] kthread+0x74/0x80
[6235083.657520]  [<c01732e0>] ? kthread+0x0/0x80
[6235083.657528]  [<c010af3e>] kernel_thread_helper+0x6/0x10
[6235083.657533] Code: 8b 53 08 e8 75 d4 0a d2 f7 d0 89 03 31 c0 5b 5d c3 8d b6 00 00 00 00 8d bf 00 00 00 00 55 89 e5 57 56 53 83 ec 3c 3e 8d 74 26 00 <f6> 42 24 40 89 c3 b8 f6 ff ff ff 74 0d 83 c4 3c 5b 5e 5f 5d c3
[6235083.657652] EIP: [<ee2d6c1e>] gfs2_drevalidate+0xe/0x200 [gfs2] SS:ESP 0069:c1dc5c20
[6235083.865070] CR2: 0000000000000024
[6235083.865077] ---[ end trace 2dfc9195648a185b ]---
[6235099.205542] dlm: connecting to 2
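A rough reading of the oops, offered as an assumption rather than a confirmed diagnosis: the faulting bytes "<f6> 42 24 40" decode on i386 as "testb $0x40,0x24(%edx)", and EDX is 00000000, so the CPU tried to read address 0x24. That matches both "NULL pointer dereference at 00000024" and CR2. Because i386 kernels pass the first function arguments in EAX and EDX, EDX should hold the second argument of gfs2_drevalidate (the nameidata pointer), and the call trace shows the call arriving through lookup_one_len from nfsd's READDIRPLUS encoding, a path that appears to supply no nameidata. The following bytes "b8 f6 ff ff ff" load -10 (-ECHILD) into EAX, which looks like an RCU-walk flag check on that pointer. The sketch below only redoes this arithmetic in user space; effective_address, decode_imm32_le, struct fake_nameidata and revalidate_guard are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Effective address of an i386 "disp8(%reg)" operand: base register plus
 * the sign-extended 8-bit displacement, wrapping at 32 bits. */
static uint32_t effective_address(uint32_t base, int8_t disp8)
{
    return base + (uint32_t)(int32_t)disp8;
}

/* Decode the little-endian imm32 of "b8 xx xx xx xx" (mov $imm32,%eax)
 * as a signed value without relying on implementation-defined casts. */
static int32_t decode_imm32_le(const uint8_t b[4])
{
    assert(b != NULL);
    uint32_t v = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                 ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    if (v <= (uint32_t)INT32_MAX)
        return (int32_t)v;
    return -(int32_t)(UINT32_MAX - v) - 1;
}

/* Hypothetical stand-in for the nameidata argument: only the flags word
 * that the faulting instruction reads is modelled. */
struct fake_nameidata {
    unsigned int flags;
};

#define FAKE_RCU_FLAG 0x40u /* the bit tested by "testb $0x40" */
#define FAKE_ECHILD   10    /* -10 is the value loaded by "b8 f6 ff ff ff" */

/* NULL-safe shape of the check: test the pointer before reading flags,
 * since the lookup_one_len() path in the trace reaches it with EDX == 0. */
static int revalidate_guard(const struct fake_nameidata *nd)
{
    if (nd != NULL && (nd->flags & FAKE_RCU_FLAG))
        return -FAKE_ECHILD;
    return 0; /* proceed with normal revalidation */
}
```

Fed the values from the oops, effective_address(0x00000000, 0x24) gives 0x24 and decode_imm32_le on the bytes {0xf6, 0xff, 0xff, 0xff} gives -10. revalidate_guard only illustrates the shape of a NULL-safe check (test the pointer before its flags); whether an actual gfs2 fix looks like this is not established here.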


Is this a bug?
Is it a known issue?
Are there any workarounds?

The GFS2+NFS server is a Xen guest running Ubuntu 10.04 with kernel 2.6.38-2-virtual.
# gfs2_tool version
gfs2_tool 3.0.12 (built Jul  5 2011 16:52:20)
Copyright (C) Red Hat, Inc.  2004-2010  All rights reserved.
# cman_tool version
6.2.0 config 2011070805

Here's the cluster.conf file too, just in case ;)
<?xml version="1.0"?>
<cluster name="wtn_cluster" config_version="2011070805">
  <quorumd interval="5" tko="6" label="filepro-qdisk" votes="1"/>
  <cman expected_votes="3" two_node="0"/>
  <totem consensus="72000" token="60000"/>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="filepro01" votes="1" nodeid="1">
      <fence>
        <method name="xen">
          <device name="xen" nodename="filepro01" U="abcdefghijk" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="filepro02" votes="1" nodeid="2">
      <fence>
        <method name="xen">
          <device name="xen" nodename="filepro02" U="qwertyuiop" action="reboot"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="xen" agent="fence_xen"/>
  </fencedevices>
</cluster>
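For reference, the vote and timeout arithmetic behind this cluster.conf can be checked with a small sketch. It assumes cman's usual majority rule (quorum = expected_votes / 2 + 1 when two_node is off), corosync's documented minimum of consensus >= 1.2 * token, and qdiskd declaring a node dead after tko missed cycles of interval seconds. The function names are hypothetical helpers, not part of any cluster tool.

```c
#include <assert.h>
#include <stdbool.h>

/* Total votes: the two nodes (votes="1" each) plus <quorumd votes="1"/>. */
static unsigned expected_votes(unsigned node_votes_total, unsigned qdisk_votes)
{
    return node_votes_total + qdisk_votes;
}

/* Assumed cman rule for a non-two_node cluster: a strict majority. */
static unsigned quorum(unsigned expected)
{
    return expected / 2 + 1;
}

/* corosync minimum consensus >= 1.2 * token, in integers to avoid floats. */
static bool consensus_ok(unsigned long token_ms, unsigned long consensus_ms)
{
    return consensus_ms * 5 >= token_ms * 6;
}

/* qdiskd eviction time: tko missed cycles of interval seconds each. */
static unsigned qdisk_eviction_s(unsigned interval_s, unsigned tko)
{
    return interval_s * tko;
}
```

With these values quorum works out to 2, so one surviving node plus the quorum disk (1 + 1 votes) stays quorate; consensus is exactly 1.2 * token; and qdiskd would evict a silent node after about 30 seconds.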

Thanks in advance :)

--
Javi Polo
Administrador de Sistemas
Tel  93 734 97 70
Fax 93 734 97 71
jpolo wtransnet com

