[Linux-cluster] GFS1 filesystem consistency error

James Chamberlain jamesc at exa.com
Thu Aug 6 21:02:46 UTC 2009


Hi all,

While I'm waiting for a gfs_fsck to complete, I thought I'd send this
in the list's direction and ask whether anyone else has any thoughts
about it.

Aug  6 04:51:13 s12n01 kernel: 493 [RAIDarray.mpp]FastT1:1:0:3 Cmnd failed-retry the same path. vcmnd SN 18724261484 pdev H3:C0:T1:L3 0x00/0x00/0x00 0x00020000 mpp_status:2
Aug  6 04:51:13 s12n01 kernel: 493 [RAIDarray.mpp]FastT1:1:0:3 Cmnd failed-retry the same path. vcmnd SN 18724261488 pdev H3:C0:T1:L3 0x00/0x00/0x00 0x00020000 mpp_status:2
Aug  6 04:51:13 s12n01 kernel: 493 [RAIDarray.mpp]FastT1:1:0:3 Cmnd failed-retry the same path. vcmnd SN 18724261490 pdev H3:C0:T1:L3 0x00/0x00/0x00 0x00020000 mpp_status:2
[...]
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2: fatal: filesystem consistency error
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2:   inode = 4918461516/4918461516
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2:   function = dinode_dealloc
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2:   file = /builddir/build/BUILD/gfs-kernel-2.6.9-75/smp/src/gfs/inode.c, line = 529
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2:   time = 1249549274
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2: about to withdraw from the cluster
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2: waiting for outstanding I/O
Aug  6 05:01:14 s12n01 kernel: GFS: fsid=s12:scratch13.2: telling LM to withdraw
Aug  6 05:01:15 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Trying to acquire journal lock...
Aug  6 05:01:15 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Looking at journal...
Aug  6 05:01:15 s12n02 kernel: GFS: fsid=s12:scratch13.1: jid=2: Trying to acquire journal lock...
Aug  6 05:01:15 s12n02 kernel: GFS: fsid=s12:scratch13.1: jid=2: Busy
Aug  6 05:01:15 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Acquiring the transaction lock...
Aug  6 05:01:15 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Replaying journal...
Aug  6 05:01:21 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Replayed 10050 of 11671 blocks
Aug  6 05:01:21 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: replays = 10050, skips = 472, sames = 1149
Aug  6 05:01:21 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Journal replayed in 7s
Aug  6 05:01:21 s12n03 kernel: GFS: fsid=s12:scratch13.0: jid=2: Done
Aug  6 05:01:21 s12n01 kernel: lock_dlm: withdraw abandoned memory
Aug  6 05:01:21 s12n01 kernel: GFS: fsid=s12:scratch13.2: withdrawn
Aug  6 05:01:21 s12n01 kernel:   mh_magic = 0x01161970
Aug  6 05:01:22 s12n01 kernel:   mh_type = 4
Aug  6 05:01:22 s12n01 kernel:   mh_generation = 133
Aug  6 05:01:22 s12n01 kernel:   mh_format = 400
Aug  6 05:01:22 s12n01 kernel:   mh_incarn = 0
Aug  6 05:01:22 s12n01 kernel:   no_formal_ino = 4918461516
Aug  6 05:01:22 s12n01 kernel:   no_addr = 4918461516
Aug  6 05:01:22 s12n01 kernel:   di_mode = 0664
Aug  6 05:01:22 s12n01 kernel:   di_uid = 690
Aug  6 05:01:22 s12n01 kernel:   di_gid = 2017
Aug  6 05:01:22 s12n01 kernel:   di_nlink = 0
Aug  6 05:01:22 s12n01 kernel:   di_size = 0
Aug  6 05:01:22 s12n01 kernel:   di_blocks = 119
Aug  6 05:01:22 s12n01 kernel:   di_atime = 1248334920
Aug  6 05:01:22 s12n01 kernel:   di_mtime = 1249549274
Aug  6 05:01:22 s12n01 kernel:   di_ctime = 1249549274
Aug  6 05:01:22 s12n01 kernel:   di_major = 0
Aug  6 05:01:22 s12n01 kernel:   di_minor = 0
Aug  6 05:01:22 s12n01 kernel:   di_rgrp = 4918433973
Aug  6 05:01:22 s12n01 kernel:   di_goal_rgrp = 4918433973
Aug  6 05:01:22 s12n01 kernel:   di_goal_dblk = 27528
Aug  6 05:01:22 s12n01 kernel:   di_goal_mblk = 27528
Aug  6 05:01:22 s12n01 kernel:   di_flags = 0x00000000
Aug  6 05:01:22 s12n01 kernel:   di_payload_format = 0
Aug  6 05:01:22 s12n01 kernel:   di_type = 1
Aug  6 05:01:22 s12n01 kernel:   di_height = 0
Aug  6 05:01:22 s12n01 kernel:   di_incarn = 0
Aug  6 05:01:22 s12n01 kernel:   di_pad = 0
Aug  6 05:01:22 s12n01 kernel:   di_depth = 0
Aug  6 05:01:22 s12n01 kernel:   di_entries = 0
Aug  6 05:01:22 s12n01 kernel:   no_formal_ino = 0
Aug  6 05:01:22 s12n01 kernel:   no_addr = 0
Aug  6 05:01:22 s12n01 kernel:   di_eattr = 0
Aug  6 05:01:22 s12n01 kernel:   di_reserved =
Aug  6 05:01:22 s12n01 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Aug  6 05:01:22 s12n01 last message repeated 2 times
Aug  6 05:01:22 s12n01 kernel: 00 00 00 00 00 00 00 00
Aug  6 05:01:35 s12n01 clurgmgrd: [6943]: <err> clusterfs:gfs-scratch13: Mount point is not accessible!
Aug  6 05:01:35 s12n01 clurgmgrd[6943]: <notice> status on clusterfs:gfs-scratch13 returned 1 (generic error)
Aug  6 05:01:35 s12n01 clurgmgrd[6943]: <notice> Stopping service scratch13
Aug  6 05:01:35 s12n01 clurgmgrd: [6943]: <info> Removing IPv4 address 10.14.12.5 from bond0
Aug  6 05:01:45 s12n01 clurgmgrd: [6943]: <err> /scratch13 is not a directory
Aug  6 05:01:45 s12n01 clurgmgrd[6943]: <notice> stop on nfsclient:nfs-scratch13 returned 2 (invalid argument(s))
Aug  6 05:01:45 s12n01 clurgmgrd[6943]: <crit> #12: RG scratch13 failed to stop; intervention required
Aug  6 05:01:45 s12n01 clurgmgrd[6943]: <notice> Service scratch13 is failed

I don't think the FastT messages I've included above caused the
problem, since I've seen them before without the file system crashing.
It's not great to be getting those sorts of messages, but the file
system didn't go down until ten minutes after they appeared.  I can't
rule them out, though.
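
In case anyone wants to line the two event streams up, this is
roughly how I'm pulling them out of syslog; /var/log/messages is the
stock location and may differ on your systems:

    # Pull the mpp path-retry and GFS events into one time-ordered stream.
    # /var/log/messages is an assumption; adjust for your syslog setup.
    grep -hE 'RAIDarray\.mpp|GFS: fsid=s12:scratch13' /var/log/messages* \
        | sort -k1,1M -k2,2n -k3,3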

A bit later, I tried to disable the scratch13 service so I could work
on its associated file system.  The "clusvcadm -d" stop phase failed,
as shown below, but the service was marked disabled anyway.  Any
thoughts?

Aug  6 05:50:10 s12n01 clurgmgrd[6943]: <notice> Stopping service scratch13
Aug  6 05:50:10 s12n01 clurgmgrd: [6943]: <err> /scratch13 is not a directory
Aug  6 05:50:10 s12n01 clurgmgrd[6943]: <notice> stop on nfsclient:nfs-scratch13 returned 2 (invalid argument(s))
Aug  6 05:50:10 s12n01 clurgmgrd[6943]: <alert> Marking scratch13 as 'disabled', but some resources may still be allocated!
Aug  6 05:50:10 s12n01 clurgmgrd[6943]: <notice> Service scratch13 is disabled
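
For reference, the disable and the check afterwards were nothing
exotic; clustat is only there to confirm the resulting state:

    # Disable the resource group, then confirm what rgmanager thinks of it.
    clusvcadm -d scratch13
    clustat | grep scratch13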

I've made it through passes 1, 1b, 1c, and 2 in gfs_fsck, and I
believe I'm in pass 3 right now.  For the last few hours I've been
getting pages and pages of the following messages... but then, I
always have, whenever I've needed to run gfs_fsck.  I'm not too
worried, since I've seen this before, but I'd still like to
understand it better.  Could someone enlighten me as to what they
mean, and whether I should be more concerned?

Converting 366 unused metadata blocks to free data blocks...
Converting 192 unused metadata blocks to free data blocks...
Converting 88 unused metadata blocks to free data blocks...
Converting 87 unused metadata blocks to free data blocks...
Converting 681 unused metadata blocks to free data blocks...
Converting 339 unused metadata blocks to free data blocks...
Converting 256 unused metadata blocks to free data blocks...
Converting 441 unused metadata blocks to free data blocks...
Converting 375 unused metadata blocks to free data blocks...
Converting 315 unused metadata blocks to free data blocks...
Converting 173 unused metadata blocks to free data blocks...
Converting 118 unused metadata blocks to free data blocks...
Converting 69 unused metadata blocks to free data blocks...
Converting 396 unused metadata blocks to free data blocks...
Converting 331 unused metadata blocks to free data blocks...
Converting 397 unused metadata blocks to free data blocks...
Converting 275 unused metadata blocks to free data blocks...
Converting 439 unused metadata blocks to free data blocks...
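
For what it's worth, I'm capturing the fsck output so I can at least
total up what it converts; the device path below is a placeholder for
the real logical volume:

    # gfs_fsck needs the file system unmounted on every node first.
    # /dev/vg12/scratch13 is a placeholder; substitute the actual device.
    gfs_fsck -v /dev/vg12/scratch13 2>&1 | tee /tmp/gfs_fsck.scratch13.log

    # Sum the metadata blocks reported as converted back to free data blocks.
    grep 'Converting' /tmp/gfs_fsck.scratch13.log | awk '{sum += $2} END {print sum}'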

Does anyone have any insight they could share?  I hadn't had major
problems with a GFS file system since last December, but it's now
happened three times this week.  This storage cluster is running
CentOS 4.6 and GFS1.

Thanks,

James



