[Linux-cluster] Resource group corruption

Thu May 19 18:15:32 UTC 2005

The oops should be fixed in CVS now.  I don't see how that would have
caused the RG problem, though.  Was GFS printing errors to the
console other than the oops?

On Wed, May 18, 2005 at 01:05:18PM +0200, Birger Wathne wrote:
> I have a problem (again)...
> 
> Running FC3 in a cluster with only one node operative (but 2 nodes 
> defined in the config).
> Kernel is 2.6.11-1.14_FC3smp
> 
> I created 4 gfs file systems, and tested NFS and samba setups for a 
> while. Now, I copied some larger amounts of data over to the file 
> systems using rsync. Everything seemed normal until I started moving 
> directories around using mv and cp. Suddenly I got:
> May 18 09:54:27 bacchus kernel: Unable to handle kernel NULL pointer 
> dereference at virtual address 00000004
> May 18 09:54:27 bacchus kernel:  printing eip:
> May 18 09:54:27 bacchus kernel: c018587d
> May 18 09:54:27 bacchus kernel: *pde = 237dc001
> May 18 09:54:27 bacchus kernel: Oops: 0000 [#1]
> May 18 09:54:27 bacchus kernel: SMP
> May 18 09:54:27 bacchus kernel: Modules linked in: lock_dlm(U) gfs(U) 
> lock_harness(U) nfsd exportfs nfs lockd parport_pc lp parport autofs4 
> dlm(U) cman(U) md5 ipv6 sunrpc ipt_LOG iptable_filter ip_tables video 
> button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core 
> snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer 
> snd soundcore snd_page_alloc 3c59x mii tg3 floppy ext3 jbd dm_mod 
> aic7xxx ata_piix libata sd_mod scsi_mod
> May 18 09:54:27 bacchus kernel: CPU:    0
> May 18 09:54:27 bacchus kernel: EIP:    0060:[<c018587d>]    Tainted: GF     VLI
> May 18 09:54:27 bacchus kernel: EFLAGS: 00010246   (2.6.11-1.14_FC3smp)
> May 18 09:54:27 bacchus kernel: EIP is at posix_acl_valid+0xf/0xac
> May 18 09:54:27 bacchus kernel: eax: 00000000   ebx: 00000000   ecx: 
> 00000008 edx: 00000001
> May 18 09:54:27 bacchus kernel: esi: 00000000   edi: 00000000   ebp: 
> e16a17e8 esp: ccf08d60
> May 18 09:54:27 bacchus kernel: ds: 007b   es: 007b   ss: 0068
> May 18 09:54:27 bacchus kernel: Process cp (pid: 17953, 
> threadinfo=ccf08000 task=d2af5020)
> May 18 09:54:27 bacchus kernel: Stack: 00000000 f911aebc 00000000 
> f90e9035 ccf08e0c f911aebc ccf08e64 f90f1d06
> May 18 09:54:27 bacchus kernel:        00000000 00000000 ccf08db0 
> ccf08e53 e16a17e8 ccf08db0 ffffffff ccf08e0c
> May 18 09:54:27 bacchus kernel:        e16a17e8 ccf08db0 f90f389d 
> ccf08db0 e2d4cf38 e2d4cf38 e2d4cf10 d2af5020
> May 18 09:54:27 bacchus kernel: Call Trace:
> May 18 09:54:27 bacchus kernel:  [<f90e9035>] 
> gfs_acl_validate_set+0x35/0xa2 [gfs]
> May 18 09:54:27 bacchus kernel:  [<f90f1d06>] system_eo_set+0xb6/0xdc [gfs]
> May 18 09:54:27 bacchus kernel:  [<f90f389d>] gfs_ea_set+0xbd/0xc1 [gfs]
> May 18 09:54:27 bacchus kernel:  [<f910eb5f>] gfs_setxattr+0x8a/0x96 [gfs]
> May 18 09:54:27 bacchus kernel:  [<c017a962>] setxattr+0x142/0x183
> May 18 09:54:27 bacchus kernel:  [<c0171cc4>] dput+0x7f/0x1d3
> May 18 09:54:27 bacchus kernel:  [<c016984b>] link_path_walk+0xbb5/0xe27
> May 18 09:54:27 bacchus kernel:  [<c016528a>] cp_new_stat64+0xf0/0x102
> May 18 09:54:27 bacchus kernel:  [<c0169d46>] path_lookup+0x96/0x196
> May 18 09:54:27 bacchus kernel:  [<c0169f98>] __user_walk+0x42/0x52
> May 18 09:54:27 bacchus kernel:  [<c017a9ea>] sys_setxattr+0x47/0x58
> May 18 09:54:27 bacchus kernel:  [<c0103f0f>] syscall_call+0x7/0xb
> May 18 09:54:27 bacchus kernel: Code: c1 e9 02 f3 a5 f6 c3 02 74 02 66 
> a5 f6 c3 01 74 01 a4 c7 00 01 00 00 00 5b 5e 5f c3 57 8d 48 08 31 ff 56 
> 31 f6 ba 01 00 00 00 53 <8b> 40 04 8d 1c c1 39 d9 73 32 0f b7 41 02 a9 
> f8 ff 00 00 75 27
> 
> 
> Rebooting brings the services back up, including the mounted file 
> systems. To unmount them I have to disable the services and reboot, 
> because I get "file system busy" errors if I reboot and then stop the 
> services. When I then run gfs_fsck, I get this on 3 of the 4 file systems:
> # gfs_fsck /dev/raid5/pakke
> Initializing fsck
> Buffer #4296077767 (5 of 134571750) is neither GFS_METATYPE_RB nor 
> GFS_METATYPE_RG.
> Resource group is corrupted.
> Unable to read in rgrp descriptor.
> Unable to fill in resource group information.
> 
> How can I fix this? Why is one file system unaffected? And most 
> importantly, why did this happen in the first place?
> 
> I am running the RHEL4 branch of the code.
> 
> The file systems were created with 2 journals, but there has never been 
> more than one node connected to the storage.
> 
> -- 
> birger

-- 
Ken Preslan <kpreslan at redhat.com>