
[linux-lvm] LVM Recovery Help



Hi, I need some urgent help with recovering a system from disk/MD/LVM failure.

Background
=========
The system had 6 disks: 2x750GB, 2x1TB and 2x1.5TB. The 750GB drives were the original drives and were partitioned as
/boot (MD0), /tmp, /VolumeGroup
/boot(MD0), /swap, /VolumeGroup

VolGroup01 has its PV on /dev/md1, an MD RAID 1 array. Logical volumes for /root (LV_Root), /home (LV_Home), /var (LV_Var) and /home/data (LV_Data) were created.

The 1TB and 1.5TB pairs were added later as MD RAID 1 PVs in the same volume group, only to expand LV_Data.

Note: The 1TB drives were added to the system with a reformat/re-install because I screwed up the manual hot add, so anaconda did most of it. The 1.5TB drives, having learnt more by then, I added manually, and I made the mistake of somehow creating that array on /dev/md, without any node number.

Situation
======
For 4 months everything was fine and dandy until something hit the server, causing smb to freeze and a high WA% (0.5~0.6). I now guess it was one of the 1.5TB Seagates finally deciding to do the fall-out-of-the-RAID-array thing people were talking about. At that point, though, neither smartd nor /proc/mdstat indicated any disk problem.

Rebooting the machine caused it to freeze on boot until one of the 1TB drives was removed, upon which LVM choked because it could not find one of the physical devices in the volume group. This struck me as odd: all the PVs are RAID 1, so it shouldn't matter if I took out one drive.

Replacing the drive with a new, cloned replacement did not solve this problem.

Further investigation indicated that the missing physical device UUID pointed to the PV on /dev/md, aka the 1.5TB Seagates. So, thinking that maybe the array had totally crapped out with both drives dropping out, I tried to recreate the md device.

Attempts so far
===========
Restoring the MD arrays
-----------------------------------

Following instructions online, I booted the CentOS 5.1 DVD in linux rescue mode, used mdadm --examine --scan to determine the MD members, and recreated /etc/mdadm.conf from that. mdadm -A -s then correctly assembles the MD devices, each with only 1 member for now, to keep the other disk of each pair safe from my meddling.
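
For anyone repeating that step, here is a minimal Python sketch of turning the scan output into mdadm.conf contents. It assumes each array is reported on a single line starting with ARRAY (as older mdadm versions print without -v); the function name, the rename map and the sample UUIDs are hypothetical, made up for illustration.

```python
import re


def build_mdadm_conf(scan_output, rename_by_uuid=None):
    """Build mdadm.conf text from `mdadm --examine --scan` output.

    rename_by_uuid is an optional, hypothetical helper map of
    {array UUID: device node} for arrays that should be assembled
    under a different node than the one recorded in the superblock.
    """
    rename_by_uuid = rename_by_uuid or {}
    lines = ["DEVICE partitions"]  # let mdadm look at all partitions
    for raw in scan_output.splitlines():
        fields = raw.split()
        if len(fields) < 2 or fields[0] != "ARRAY":
            continue  # skip blank lines and anything that is not an ARRAY line
        match = re.search(r"UUID=(\S+)", raw)
        uuid = match.group(1) if match else None
        if uuid in rename_by_uuid:
            fields[1] = rename_by_uuid[uuid]  # swap in the wanted device node
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# Made-up sample input; real UUIDs come from the scan itself.
sample = (
    "ARRAY /dev/md1 level=raid1 num-devices=2 UUID=aaaa:bbbb:cccc:dddd\n"
    "ARRAY /dev/md level=raid1 num-devices=2 UUID=1111:2222:3333:4444\n"
)
print(build_mdadm_conf(sample, {"1111:2222:3333:4444": "/dev/md3"}))
```

Keying the rename on the array UUID rather than on the device name avoids touching the wrong array when two entries look alike.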

Note: prior to finding the correct instructions, I had changed the MD UUID of two of the arrays by mistakenly trying to recreate the md devices manually using mdadm --create. However, this appears to be fine: --create did not initialize the drives, and the data is still intact, as later events prove. I'm only concerned that the changed UUIDs might be a contributing factor later in the LVM situation.

Following that, I followed other instructions to determine the original/latest LVM definitions by extracting the metadata, using dd to dump the first 255 sectors of each PV. This was when I noticed that one of the PVs was recorded as using /dev/md, whereas this time round I had assembled the array correctly and the 1.5TB pair was on /dev/md3 instead.
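
A small Python sketch of how such a dump can be searched for the PV entries, in case it helps someone else. It assumes the usual LVM2 text metadata layout, where each pvN block lists id and then device; the function name and the sample bytes are hypothetical, and the device value is only the hint LVM recorded, not necessarily where the PV lives now.

```python
import re

# Matches e.g.:  pv1 {  id = "..."  device = "/dev/md"
PV_BLOCK = re.compile(
    rb'(pv\d+)\s*\{\s*id\s*=\s*"([^"]+)"\s*device\s*=\s*"([^"]+)"'
)


def list_pv_hints(dump):
    """Return unique (pv_name, pv_uuid, device_hint) tuples from raw bytes.

    The metadata area can hold several older copies of the same
    volume group description, so duplicates are dropped.
    """
    seen = []
    for m in PV_BLOCK.finditer(dump):
        entry = tuple(part.decode("ascii", "replace") for part in m.groups())
        if entry not in seen:
            seen.append(entry)
    return seen


# Made-up sample standing in for a real dump read with open(path, "rb").
sample_dump = (
    b"\x00" * 512
    + b'VolGroup01 {\nphysical_volumes {\n\n'
    + b'pv0 {\nid = "AAAA-1111"\ndevice = "/dev/md1"\t# Hint only\n}\n'
    + b'pv1 {\nid = "BBBB-2222"\ndevice = "/dev/md"\t# Hint only\n}\n}\n}\n'
)
for name, uuid, hint in list_pv_hints(sample_dump):
    print(name, uuid, hint)
```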

I tried to change the md config, manually creating the /dev/md node with mknod and editing mdadm.conf to use that for the 1.5TB array. That worked, and I could activate the volume group, mount LV_Root and access it.

However, after updating /etc/mdadm.conf on LV_Root, a reboot hit the same problem. LVM still could not find the physical device used by the PV.

So I thought maybe that's because md would not recreate the /dev/md device since it's an abnormality.

Hence I re-edited the md conf to use /dev/md3 and followed instructions online to update the LVM configuration, changing the PV with the problematic UUID to use /dev/md3 instead of /dev/md.

After all this was done, confirmed to work in rescue mode, I rebooted again... and got the same problem.

Since I cannot boot even into a command line normally, there was no way for me to determine whether LVM was still trying to find /dev/md, whether md had failed to load the new md configuration, or whether something else was going on.

After more probing, I realized that in rescue mode I could mount all my LVs except LV_Var, which mount complains is not a valid file or directory.

Since LV_Var was just /var, and perhaps mistakenly I thought it was not crucial, I dropped LV_Var and recreated the LV. I did not make any fs on it because I could not figure out how, since makefs was not available in rescue mode. Not sure if that will affect things.

But mount still won't mount the newly created LV, same error message.

At this point I have no idea what else I could try or do except plan for the worst, i.e. buy 2x 1.5TB drives, wait 10 hours to copy everything in rescue mode, and redo the whole system again.

However, before that, I would still like to know if anything can still be done to restore the system, and to learn what I did wrong or what went wrong, so that I can be prepared if it happens again.

Thanks for reading this long chunk of noob misadventure :)

