What's probably happened is that LVM detected the raw device instead of the MD device at some point early in the boot process, perhaps because MD detection happened after LVM setup. I'm not sure whether it's possible for LVM to "steal" the device from MD.
The fix depends on your distribution. Stop worrying about downtime. If the data is important, just don't worry about downtime. If downtime really matters, build a second machine, get it working right, and transfer the data. Being in a hurry and trying to "optimize" the recovery process is a really good way to lose the data.
Assuming that you're going to try to fix this setup, I'd start out with a backup. This is critical. Everybody always says to do a backup. Nobody ever does it. Really, do one. Get an S3 account, use an S3 backup utility. There's just not an excuse these days. Your data is one-MD-mistake away from oblivion.
So, right now MD should have sda and sdb but only has sda. Since writes have been going straight to sdb, sdb is now newer than sda and may hold data you care about. The catch is that, according to MD's metadata, sda is the current disk: MD isn't handling the writes to sdb, so it never updates its metadata to reflect that sdb is newer. There are three options that I can think of, all ugly. Pick one of:
1. Destroy the MD. Create a new one with the same UUID and sdb3 as the source. (you listed the UUID; getting it right is the part that can trip you up)
2. Sync the updated data from sdb3 onto md2. Wipe sdb3. Add it back into md2. (might be less downtime depending on data size, doesn't nuke MD)
3. Build another machine. Get it working right. Transfer the data with rsync. (least downtime, most expensive)
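For the first two options, the mdadm invocations might look roughly like the sketch below. This is only a sketch: the device names and the UUID placeholder come from your setup, the metadata version must match what the original array used, and --create on a disk with live data is exactly the kind of step to triple-check before running.

```shell
# Option 1 (sketch): recreate the array with sdb3 as the only member,
# reusing the UUID you listed. "missing" leaves the second slot empty so
# no resync happens yet. --metadata must match the original array's version.
mdadm --stop /dev/md2
mdadm --create /dev/md2 --level=1 --raid-devices=2 \
      --metadata=0.90 --uuid=<uuid-you-listed> /dev/sdb3 missing
# Then add sda3 back and let it resync from sdb3:
mdadm /dev/md2 --add /dev/sda3

# Option 2 (sketch): copy the newer data off sdb3 first (if the data lives
# inside LVM on sdb3, activate and mount the LV rather than sdb3 itself),
# then wipe sdb3's MD superblock and re-add it to the existing array.
rsync -a /mnt/from-sdb3/ /mnt/md2-filesystem/
mdadm --zero-superblock /dev/sdb3   # destructive: erases MD metadata
mdadm /dev/md2 --add /dev/sdb3
```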
In the first two cases, you're only set up for it to break again. The core problem is figuring out what happened during boot. In a perfect world, you would just tell LVM to only consider MD devices. That's not hard, but it's complicated by the fact that you have LVM on /. That means the configuration actually in use is probably not the version on / but a copy made when you set up your boot ramdisk (a.k.a. initrd, or possibly an initramfs). Even if we get LVM locked down to use just MDs and get that config used at boot time, there's still the possibility that the MD won't get assembled (it already may not have been when LVM was first activated) and the system won't boot. Again, fraught with peril.
If you want to fix the MD, the first step will be booting a rescue LiveCD and doing all of this from there. From the LiveCD you can also adjust the LVM configuration and update the initrd (or whatever is used for boot). You may need to chroot into the system and/or trick the initrd into seeing the right devices. I don't think I can walk you through all of it over e-mail.
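From the rescue environment, the chroot dance usually looks something like this (a sketch; the VG and LV names are placeholders for whatever your layout uses):

```shell
# Sketch: assemble the array, activate LVM, and chroot into the installed
# system from a rescue LiveCD. <vg> and <root-lv> are placeholders.
mdadm --assemble --scan          # or explicitly: mdadm --assemble /dev/md2 /dev/sda3
vgchange -ay                     # activate the volume group(s)
mount /dev/<vg>/<root-lv> /mnt
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt /bin/bash
# ...now edit the LVM config and rebuild the initrd from inside the chroot...
```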
The LVM part is pretty easy. Just set a filter line (you only get one, so disable any other filter lines) in <root of system>/etc/lvm/lvm.conf to:
filter = [ "a|^/dev/md.*$|", "r|.*|" ]
That will prevent LVM from using anything but the MD devices.
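You can sanity-check the filter without rebooting; with it active, only /dev/md* devices should show up as physical volumes:

```shell
# Re-scan for PVs and list them; anything other than /dev/md* appearing
# here means the filter isn't being applied.
pvscan
pvs -o pv_name,vg_name
```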
How to update the initrd with this information depends on the distro (and distro version). It's usually either some invocation of "mkinitrd" or a script that wraps it. That makes the new LVM configuration available at boot time. This *MIGHT* sort out the MD problem. It might not. If it doesn't, I'm not sure where to tell you to start. If mdadm is being used by your initrd, you'll need to tweak its configuration. If it's relying on MD autodetection, you might have turned that off in your kernel. If you have an IDE controller that takes too long to initialize, that can also cause this sort of thing (although that's REALLY unlikely these days).
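As a rough guide (run from inside the chroot; which command exists depends on your distro):

```shell
# Debian/Ubuntu-style: rebuild the initramfs for the current kernel.
update-initramfs -u
# Red Hat-style: regenerate the initrd image for the running kernel.
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
```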
I hope that some of this helps. It will be hard for anyone to give you really solid advice, though, without a little more insight into why the MD isn't getting assembled before LVM's scan.
On Apr 5, 2009, at 10:05 AM, Miles Fidelman wrote: