
Re: [linux-lvm] progress, but... - re. fixing LVM/md snafu


It seems like what's probably happened is that LVM detected the raw device instead of the MD device at some point early in the boot process.  This may be because the MD detection happened after LVM setup.  I'm unsure if it's possible for LVM to "steal" the device from MD.

The fix will look different depending on your distribution.  Stop worrying about downtime: if the data is important, downtime is secondary.  If downtime really is critical, build a second machine, get it working right, and transfer the data.  Being in a hurry and attempting to "optimize" the recovery process is a really good way to lose the data.

Assuming that you're going to try to fix this setup, I'd start out with a backup.  This is critical.  Everybody always says to do a backup.  Nobody ever does it.  Really, do one.  Get an S3 account, use an S3 backup utility.  There's just not an excuse these days.  Your data is one-MD-mistake away from oblivion.

So, right now MD should have sda/sdb but only has sda.  sdb is now newer than sda and may have important data if this server stores anything like that.  The challenge is that, according to MD, sda is newer.  Since MD isn't handling writes to sdb, it won't be updating its metadata to know that sdb is actually the current copy.  There are three options that I can think of, all ugly.  Pick one of:
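You can see which side MD thinks is newer by comparing the event counters in the member superblocks.  This is just a read-only sketch using the device names from your mail; it needs root and the real devices, so treat it as a transcript rather than something to paste blindly:

```shell
# Compare MD superblock event counters on the two legs.
# The member with the higher "Events" count is the one MD
# considers most up to date -- which, per the above, will
# be sda3 even though sdb3 has the newer filesystem data.
mdadm --examine /dev/sda3 | grep -E 'Events|Update Time'
mdadm --examine /dev/sdb3 | grep -E 'Events|Update Time'
```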

1.  Destroy the MD.  Create a new one with the same UUID and sdb3 as the source. (use the UUID you listed; getting it wrong will trip you up)
2.  Sync the updated data from sdb3 onto md2.  Wipe sdb3.  Add it back into md2. (might be less downtime depending on data size, doesn't nuke MD)
3.  Build another machine.  Get it working right.  Transfer data with Rsync. (least downtime, most expensive)
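A sketch of what option 1 looks like, done from a rescue environment with the filesystem unmounted and LVM deactivated.  Device names and the RAID level come from your mail; <UUID> is a placeholder for the UUID you listed.  This is destructive, so only after a verified backup:

```shell
# DESTRUCTIVE -- only with a verified backup and the VG deactivated.
vgchange -an                      # deactivate LVM so nothing holds the PVs
mdadm --stop /dev/md2             # tear down the old, degraded array

# Re-create the array degraded, with sdb3 (the newest data) as the
# only live member; "missing" holds the other two slots open.
mdadm --create /dev/md2 --level=1 --raid-devices=3 \
      --uuid=<UUID> /dev/sdb3 missing missing

# Add the stale members back; they resync FROM sdb3.
mdadm --add /dev/md2 /dev/sda3
mdadm --add /dev/md2 /dev/sdc3
```

Watch /proc/mdstat until the resync finishes before trusting the array again.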

In the first two cases, this only sets you up for it to break again.  The core problem is figuring out what happened during boot.  In a perfect world, you would just tell LVM to only consider MD devices.  That's not hard, but it's complicated by the fact that you have LVM on /.  This means that the configuration that's used is likely not the version on / but a copy of it that is made when you set up your boot ramdisk (a.k.a. initrd, or possibly an initramfs).  Even if we get LVM locked down to use just MDs and get that config used at boot-time, there's the possibility that the MD won't get assembled (since it already may not have been when LVM was first activated) and the system won't boot.  Again, fraught with peril.

If you want to fix the MD, first steps will be using a rescue LiveCD to boot up and do all of this.  With that LiveCD, you can also adjust the LVM configuration and update the initrd (or whatever is used for boot).  You may need to chroot into the system and/or trick the initrd into seeing the right devices.  I don't really think I can walk you through this via an e-mail.
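From the LiveCD, the usual dance looks roughly like this.  The mount points and the VG/LV names are assumptions -- substitute whatever lvscan shows you:

```shell
# Assemble the array and activate LVM from the rescue environment.
mdadm --assemble /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3
vgchange -ay

# Mount the root LV plus the pieces the initrd tools expect,
# then chroot in to edit lvm.conf and rebuild the initrd.
# /dev/<vg>/<rootlv> is a placeholder for your actual root LV.
mount /dev/<vg>/<rootlv> /mnt
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt /bin/bash
```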

The LVM part is pretty easy.  Just set a filter line (you only get one, so disable any other filter lines) in <root of system>/etc/lvm/lvm.conf to:

filter = [ "a|^/dev/md.*$|", "r/.*/" ]

That will prevent you from using anything but the MD.
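You can sanity-check what that accept pattern matches with plain grep.  LVM uses its own matcher, but for a pattern this simple the behavior is the same -- an assumption worth confirming with pvscan afterwards:

```shell
# Devices matching the accept pattern survive; everything else
# falls through to the final r/.*/ reject rule.
for dev in /dev/md2 /dev/sda3 /dev/sdb3; do
  if echo "$dev" | grep -qE '^/dev/md.*$'; then
    echo "$dev: accepted"
  else
    echo "$dev: rejected"
  fi
done
```

After editing lvm.conf, run pvscan and confirm the PV is found only on /dev/md2, with no "duplicate PV" warning.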

How to update the initrd with this information depends on distro (and distro version).  It's usually either some invocation of "mkinitrd" or some script that wraps it.  That makes the new LVM configuration available at boot-time.  This *MIGHT* sort out the MD problem.  It might not.  If it doesn't, I'm not sure where to tell you to start.  If mdadm is being used by your initrd, you'll need to tweak its configuration.  If it's relying on MD autodetection, you might have turned that off in your kernel.  If you have an IDE controller that takes too long to initialize, that can also cause this sort of thing (although that's REALLY unlikely these days).
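For reference, the usual incantations by distro family, run inside the chroot.  Which one applies (and whether a kernel-version argument is needed) depends on your system -- these are assumptions, not a recipe:

```shell
# Debian/Ubuntu: rebuild the initramfs for all installed kernels,
# pulling in the current /etc/lvm/lvm.conf and mdadm.conf.
update-initramfs -u -k all

# RHEL/CentOS: regenerate the initrd for the running kernel.
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
```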

I hope that some of this helps.  Although, it will be hard for anyone to give you really solid advice without a little more insight into why the MD isn't getting assembled prior to LVM's scan.

On Apr 5, 2009, at 10:05 AM, Miles Fidelman wrote:

Hello again Folks,

So.. I'm getting closer to fixing this messed up machine.

Where things stand:

I have root defined as an LVM2 LV, that should use /dev/md2 as its PV.
/dev/md2 in turn is a RAID1 array built from /dev/sda3, /dev/sdb3, and /dev/sdc3

Instead, LVM is reporting: "Found duplicate PV 2ppSS2q0kO3t0tuf8t6S19qY3ypWBOxF: using /dev/sdb3 not /dev/sda3"
and /dev/md2 is reporting itself as inactive (cat /proc/mdstat) and active, degraded (mdadm --detail)

I'm guessing that, during boot:

- the raid array failed to start
- LVM found both copies of the PV, and picked one (/dev/sdb3)
- everything then came up and my server is humming away

but: the md array can't rebuild because the most current device in it is already in use

so...  I'm looking for the right sequence of events, with the minimum downtime to:

1. stop changes to /dev/sdb3 (actually, to / - which complicates things)
2. rebuild the RAID1 array, making sure to use /dev/sdb3 as the starting point for current data
3. restart in such a way that LVM finds /dev/md2 as the right PV instead of one of its components

Each of these is just tricky enough that I'm sure there are lots of gotchas to watch out for.

So.. any suggestions?

Thanks very much,

Miles Fidelman

linux-lvm mailing list
linux-lvm redhat com
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

Jayson Vantuyl
Founder and Architect
1 866 518 9275 ext 204
IRC (freenode): kagato
