[Linux-cluster] GFS2 + NFS crash BUG: Unable to handle kernel NULL pointer dereference

Fri Jul 8 16:22:18 UTC 2011

Hi,

On Fri, 2011-07-08 at 17:41 +0200, Javi Polo wrote:
> Hello everyone!
> 
> I've set up a cluster in order to use GFS2. The cluster works really well ;)
> Then I exported the GFS2 filesystem via NFS to share it with machines
> outside the cluster. Reads work OK, but as soon as I try to write to
> it, the filesystem seems to hang:
> 
> root@file03:~# mount filepro01:/mnt/gfs /mnt/tmp -o soft
> root@file03:~# ls /mnt/tmp/
> algo  caca  caca2  testa
> root@file03:~# mkdir /mnt/tmp/otracosa
> 
> At this point, NFS stopped working. On the NFS client I can see:
> 
> [11132241.127470] nfs: server filepro01 not responding, timed out
> 
> however, the directory was indeed created, and the other node can 
> continue using the gfs2 filesystem (locally)
> On the NFS server (filepro01) looking at the logs I found some nasty 
> things. This first part is mounting the filesystem, which is OK:
> 
Currently we don't recommend exporting a GFS2 filesystem over NFS while
it is also being used locally. That will hopefully change in the future.
In the meantime, I'd suggest using the localflocks mount option on all
the mounts (be aware that fcntl/flock locking is then local to each
node) to avoid problems that you are otherwise likely to hit during
recovery. Also...
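
One way to check the localflocks suggestion on each node is to scan the
mount table for gfs2 entries that lack the option. The following is only
a minimal sketch: it assumes the mount table is in /proc/mounts format
and that gfs2 lists localflocks among the displayed options when it is
set. The function name and the device names in the sample are
hypothetical.

```python
def gfs2_mounts_missing_localflocks(mounts_text):
    """Return mount points of gfs2 filesystems mounted without localflocks.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "gfs2" and "localflocks" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical sample; the device names are made up for illustration.
SAMPLE_MOUNTS = """\
/dev/drbd0 /mnt/gfs gfs2 rw,relatime,hostdata=jid=0 0 0
/dev/drbd1 /mnt/gfs_b gfs2 rw,relatime,localflocks 0 0
proc /proc proc rw,nosuid 0 0
"""

if __name__ == "__main__":
    print(gfs2_mounts_missing_localflocks(SAMPLE_MOUNTS))
```

On a real node the input would be the contents of /proc/mounts, and any
mount point returned would need to be remounted with localflocks added.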

> [6234925.738508] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state 
> recovery directory
> [6234925.787305] NFSD: starting 90-second grace period
> [6234925.825811] GFS2 (built Feb  7 2011 16:11:33) installed
> [6234925.826698] GFS2: fsid=: Trying to join cluster "lock_dlm", 
> "wtn_cluster:file01"
> [6234925.886991] GFS2: fsid=wtn_cluster:file01.0: Joined cluster. Now 
> mounting FS...
> [6234925.975113] GFS2: fsid=wtn_cluster:file01.0: jid=0, already locked 
> for use
> [6234925.975116] GFS2: fsid=wtn_cluster:file01.0: jid=0: Looking at 
> journal...
> [6234926.075105] GFS2: fsid=wtn_cluster:file01.0: jid=0: Acquiring the 
> transaction lock...
> [6234926.075152] GFS2: fsid=wtn_cluster:file01.0: jid=0: Replaying 
> journal...
> [6234926.076200] GFS2: fsid=wtn_cluster:file01.0: jid=0: Replayed 8 of 9 
> blocks
> [6234926.076204] GFS2: fsid=wtn_cluster:file01.0: jid=0: Found 1 revoke tags
> [6234926.076649] GFS2: fsid=wtn_cluster:file01.0: jid=0: Journal 
> replayed in 1s
> [6234926.076800] GFS2: fsid=wtn_cluster:file01.0: jid=0: Done
> [6234926.076945] GFS2: fsid=wtn_cluster:file01.0: jid=1: Trying to 
> acquire journal lock...
> [6234926.078723] GFS2: fsid=wtn_cluster:file01.0: jid=1: Looking at 
> journal...
> [6234926.257645] GFS2: fsid=wtn_cluster:file01.0: jid=1: Done
> [6234926.258187] GFS2: fsid=wtn_cluster:file01.0: jid=2: Trying to 
> acquire journal lock...
> [6234926.260966] GFS2: fsid=wtn_cluster:file01.0: jid=2: Looking at 
> journal...
> [6234926.549636] GFS2: fsid=wtn_cluster:file01.0: jid=2: Done
> [6234930.789787] ipmi message handler version 39.2
> 
That all looks ok, but...

> and when we try to write from nfs client, bang:
> 
> [6235083.656954] BUG: unable to handle kernel NULL pointer dereference 
> at 00000024
> [6235083.656973] IP: [<ee2d6c1e>] gfs2_drevalidate+0xe/0x200 [gfs2]
> [6235083.656992] *pdpt = 0000000001831027 *pde = 0000000000000000
> [6235083.657003] Oops: 0000 [#1] SMP
> [6235083.657012] last sysfs file: /sys/module/dlm/initstate
> [6235083.657018] Modules linked in: ipmi_msghandler xenfs gfs2 ib_iser 
> rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp 
> libiscsi scsi_transport_iscsi dlm configfs nfsd
> exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd lru_cache lp 
> parport [last unloaded: scsi_transport_iscsi]
> [6235083.657090]
> [6235083.657095] Pid: 1497, comm: nfsd Tainted: G        W   
> 2.6.38-2-virtual #29~lucid1-Ubuntu /
> [6235083.657103] EIP: 0061:[<ee2d6c1e>] EFLAGS: 00010282 CPU: 0
> [6235083.657115] EIP is at gfs2_drevalidate+0xe/0x200 [gfs2]

This should not happen. The faulting address is 0x24, so it looks like
we are dereferencing a NULL pointer and trying to read a field 0x24 bytes
into a structure. Does the fs have POSIX ACLs enabled, or SELinux, or
something else using xattrs?
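
To make the reading of the oops above concrete: when a structure pointer
is NULL, the faulting address equals the offset of the field being
accessed, and the IP line gives the function, the offset of the faulting
instruction within it, and the function's size. The sketch below pulls
those values out of the two relevant log lines. The regular expressions
and the decode_oops name are illustrative assumptions, not part of any
kernel tooling, and the sample joins the line-wrapped BUG message back
into one line.

```python
import re

# "IP: [<address>] symbol+0xoffset/0xsize [module]"
IP_RE = re.compile(
    r"IP: \[<([0-9a-f]+)>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)
# "NULL pointer dereference at <address>"
ADDR_RE = re.compile(r"NULL pointer dereference\s+at ([0-9a-f]+)")


def decode_oops(text):
    """Extract the fault offset and faulting function from oops text."""
    addr = ADDR_RE.search(text)
    ip = IP_RE.search(text)
    if addr is None or ip is None:
        return None
    return {
        # With a NULL base pointer, the address is the field offset.
        "fault_offset": int(addr.group(1), 16),
        "function": ip.group(2),
        "insn_offset": int(ip.group(3), 16),
        "function_size": int(ip.group(4), 16),
        "module": ip.group(5),
    }


SAMPLE_OOPS = (
    "[6235083.656954] BUG: unable to handle kernel NULL pointer "
    "dereference at 00000024\n"
    "[6235083.656973] IP: [<ee2d6c1e>] gfs2_drevalidate+0xe/0x200 [gfs2]\n"
)

if __name__ == "__main__":
    print(decode_oops(SAMPLE_OOPS))
```

For the log in this thread that gives a field offset of 0x24 (36 bytes)
and a fault 0xe bytes into gfs2_drevalidate, i.e. very early in the
function. Which structure and field that corresponds to depends on the
kernel build and is not determined by this sketch.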

Steve.