
[Linux-cluster] failure during gfs2_grow caused node crash & data loss

I just had a serious problem with gfs2_grow which caused a loss of data and a 
cluster node reboot.

I was attempting to grow a gfs2 volume from 50GB => 145GB. The volume was
mounted on both cluster nodes when I started "gfs2_grow". When I umounted
the volume from _one_ node (not the one where gfs2_grow was running), the
machine running gfs2_grow rebooted, and the filesystem is now damaged.

The sequence of commands was as follows; each command completed successfully
until the final umount:

	gfs_fsck /dev/mapper/legacy
	gfs2_convert /dev/mapper/legacy
	fsck.gfs2 /dev/mapper/legacy
	mount /dev/mapper/legacy /shared/legacy
	lvextend vg /dev/mapper/pv_vol10
	mount /dev/mapper/legacy /shared/legacy

	gfs2_grow /shared/legacy

	umount /shared/legacy

Immediately after umounting the volume from node1, node2 hung, then rebooted
(probably fenced by node1).
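For comparison, this is roughly the grow procedure as I understand it is meant
to be done (the LV name and size below are illustrative, not the exact values
I used; gfs2_grow's -T flag is a dry run that calculates but writes nothing):

```shell
# Sketch of an online gfs2 grow; device/mount names follow this report,
# but the lvextend target and size are assumptions, not my actual command.
lvextend -L +95G /dev/vg/legacy      # extend the underlying LV first
gfs2_grow -T /shared/legacy          # test mode: do the math, write nothing
gfs2_grow /shared/legacy             # real grow, run on one node only
df -h /shared/legacy                 # confirm the new size before any umount
```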

Upon reboot, /dev/mapper/legacy appeared to be mounted (it is listed in
/etc/fstab), but "df" still showed the previous size. I did not examine the
contents of /shared/legacy; I umounted it and attempted to fsck.gfs2 the
volume.

Running fsck.gfs2 reports:
		Initializing fsck
		Recovering journals (this may take a while)...
		Validating Resource Group index.
		Level 1 RG check.
		(level 1 failed)
		Level 2 RG check.
		L2: number of rgs in the index = 85.
		WARNING: rindex file is corrupt.
		(level 2 failed)
		Level 3 RG check.

		  RG 1 at block 0x11 intact [length 0x3b333]
		  RG 2 at block 0x3B344 intact [length 0x3b32f]
		  RG 3 at block 0x76673 intact [length 0x3b32f]
		  RG 4 at block 0xB19A2 intact [length 0x3b32f]
		  RG 5 at block 0xECCD1 intact [length 0x3b32f]
		  RG 6 at block 0x128000 intact [length 0x3b32f]
		* RG 7 at block 0x16332F *** DAMAGED *** [length 0x3b32f]
		* RG 8 at block 0x19E65E *** DAMAGED *** [length 0x3b32f]
		* RG 9 at block 0x1D998D *** DAMAGED *** [length 0x3b32f]
		* RG 10 at block 0x214CBC *** DAMAGED *** [length 0x3b32f]
		Error: too many bad RGs.
		Error rebuilding rg list.
		(level 3 failed)
		RG recovery impossible; I can't fix this file system.

Attempting to fsck the filesystem with the "experimental" version
announced by Bob Peterson on Monday produces a segfault as soon as it
gets to the Level 3 RG check.

Running "gfs2_edit savemeta" seemed to work...at least it didn't exit with an 
error, and it produced about 540MB of file of type "GLS_BINARY_MSB_FIRST".
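In case it helps with recovery, the savemeta invocation was roughly as below
(the output path is mine; the commented restoremeta step is hypothetical and
destructive, so it should only ever target a scratch copy of the device):

```shell
# savemeta copies only the filesystem's metadata blocks out to a file;
# the device name matches this report, the output path is illustrative.
gfs2_edit savemeta /dev/mapper/legacy /root/legacy.meta
file /root/legacy.meta     # inspect the saved-metadata container
# gfs2_edit restoremeta /root/legacy.meta /dev/mapper/scratch  # hypothetical
```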

	CentOS 5.4 (2.6.18-164.11.1.el5)

Any suggestions or hope of data recovery before I reformat the volume?


