dmraid comments and a warning

Peter Jones pjones at redhat.com
Tue Feb 7 22:34:00 UTC 2006


On Tue, 2006-02-07 at 00:17 -0700, Dax Kelson wrote:
> On Mon, 2006-02-06 at 21:02 -0500, Peter Jones wrote:
> > On Mon, 2006-02-06 at 13:08 -0700, Dax Kelson wrote:
> > > The standard root=LABEL=/ was used on the kernel command line and what
> > > happened is that it booted up to one side of the mirror. All the updates
> > > and new packages (including a new kernel install which modified the
> > > grub.conf) activity just happened on that one side of the mirror.

Are you sure about this?  Your blkid.tab looks very much like you  used
the default layout on Jan 13...

> > This should be fixed in the current rawhide tree.
> 
> And now it uses root=/dev/mapper/$DEV ?

No, it still uses root=LABEL=/ (assuming no lvm), but the label
searching mechanism early in the boot process is now the same as that
used by mount, umount, swapon, etc.,  and it currently gives
device-mapper devices a higher "priority", which should guarantee that,
assuming it's possible to build the raid, all of those tools will use
the dm device instead of the normal disks.
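
(As a quick illustration -- not something the boot scripts actually
run -- once the raid set is assembled, something like

  findfs LABEL=/boot

using the findfs that ships with e2fsprogs should print the
device-mapper partition node rather than /dev/sda2 or /dev/sdb2,
precisely because of that priority.)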

So your blkid.tab says:

> <device DEVNO="0xfd01" TIME="1139069826" PRI="40" TYPE="swap">/dev/dm-1</device>
> <device DEVNO="0xfd05" TIME="1137182541" PRI="40" TYPE="swap">/dev/dm-5</device>
> <device DEVNO="0xfd02" TIME="1137182541" PRI="40" TYPE="ntfs">/dev/dm-2</device>
> <device DEVNO="0xfd04" TIME="1137182541" PRI="40" UUID="faffb8d3-2562-4489-a1f8-a7e0077e1e6c" SEC_TYPE="ext2" TYPE="ext3">/dev/dm-4</device>
> <device DEVNO="0x0801" TIME="1137182541" TYPE="ntfs">/dev/sda1</device>
> <device DEVNO="0x0802" TIME="1139162151" LABEL="/boot" UUID="f49b0225-bdd4-430a-a3b0-f0f7c20daaff" SEC_TYPE="ext2" TYPE="ext3">/dev/sda2</device>
> <device DEVNO="0x0811" TIME="1137182541" TYPE="ntfs">/dev/sdb1</device>
> <device DEVNO="0x0812" TIME="1137182541" LABEL="/boot" UUID="f49b0225-bdd4-430a-a3b0-f0f7c20daaff" SEC_TYPE="ext2" TYPE="ext3">/dev/sdb2</device>
> <device DEVNO="0x0813" TIME="1137182541" TYPE="swap">/dev/sdb3</device>
> <device DEVNO="0xfd03" TIME="1137182541" TYPE="swap">/dev/dm-3</device>
> <device DEVNO="0xfd01" TIME="1139162137" TYPE="swap">/dev/VolGroup00/LogVol01</device>

OK, archeology time.  On Jan 13, 2006, at about 8pm GMT you installed
with a disk layout something like:

/dev/sda  /dev/sdb  -> dm-1 (which would not have gotten an entry
                            in blkid.tab)
/dev/sda1 /dev/sdb1 -> dm-2 ntfs (PRI=40, whereas sda1 and sdb1 have
                            PRI=0)
/dev/sda2 + /dev/sdb2 -> VolGroup00 (no device node, thus no entry)
VolGroup00 ->
  dm-3 (LogVol01) -> swap
  dm-4 (LogVol00) -> /

(dm-3 vs dm-4 reflects the order they were activated, not necessarily
the order on disk)

*something happened here, no idea what*

Sometime around Feb 4, 2006, at 4pm GMT you rebooted, and the raid
didn't get started.  This looks like one of your disks wasn't connected
at all, and the other was doing weird things.  LVM brought up LogVol01,
but if both disks were there it would have been complaining about
inconsistent VG metadata for VolGroup00.  For whatever reason,
LogVol00 _didn't_ come back up.  /boot may or may not have been mounted;
we can't say.

25 hours later you walked back into the room and power cycled the box.
Then about 26 hours later you rebooted again.  This time, for some reason,
the blkid record for /boot on /dev/sda2 was updated.  That may indicate that
sda2 was missing the previous time the box booted far enough to get /
mounted read-write.
Once again VolGroup00/LogVol01 was activated correctly, but / was not.

The last 2 lines have no PRI= section, which is weird and might mean my
leaf-node test in libblkid is broken.  That shouldn't cause the other
failures we've seen, though.

From what you say below I'm assuming something went wrong making your
initrd on the 4th.

> GRUB always sees the "activated" RAID because of the BIOS RAID driver.
> When it reads the "grub.conf" it is interleaving pieces of the two (now
> different) grub.conf files and the result most likely has bogus syntax
> and content.

Well, yes and no.  It sees a disk as 0x80, and when it does int 13h, the
bios decides which disk it's going to send that to.  How it decides is
anybody's guess; I'm sure it varies wildly between bioses.

> Jan 14th 2006 rawhide for event one, and jan 14th 2006 initial install
> with yum updates every couple days for event two.

Looks like the 13th, but either should be sufficient.

> > > On bootup I noticed an error flash by something to the effect of "LVM
> > > ignoring duplicate PV".

This is the inconsistent metadata error I mentioned above, FWIW.

> I booted to the rescue environment with a Jan 14th boot.iso and NFS
> tree. The rescue environment properly activated the dmraid and
> "pvdisplay" showed "/dev/mapper/nvidia-foo"
> 
> I looked inside the two initrd files I had:
> 
> 2.6.15-1.1884 = dm commands inside "init"

OK, so that should work assuming you didn't move the disks around, etc.
(I'm working on making it safe to move the disks around, but it's a bit
complicated, so it might take a while.)

> 2.6.15-1.1889 - no dm commands inside "init" -- dated Feb 4th on my box

OK, so if you boot this you're going to get /dev/sda* accessed.  Any
idea what versions of e2fsprogs, lvm2, util-linux, device-mapper, and
mkinitrd were installed?  (I'll understand if you don't...)
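
If the installed system is still reachable from the rescue environment
(and assuming rescue mode mounted it at /mnt/sysimage as usual),
something like this would at least show what's installed there now:

  rpm --root /mnt/sysimage -q e2fsprogs lvm2 util-linux device-mapper mkinitrd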

So that means when you installed that, you were either already booted
without using raid, or mkinitrd (or one of the many tools it uses) was
broken.
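
If you want to double-check an initrd by hand, it's just a gzipped cpio
archive, so something along these lines (adjust the file name to
whatever the image is actually called on your box) will show whether
the dm setup commands made it in:

  mkdir /tmp/initrd && cd /tmp/initrd
  zcat /boot/initrd-2.6.15-1.1889.img | cpio -id
  grep dm init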

> > One interesting note is that given any of these you should be getting
> > the same disk mounted each time.  Which means there's a good chance that
> > sda and sdb are both fine, one of them just happens to represent your
> > machine 3 weeks ago.
> 
> It installed OK on Jan 14th, and has been successfully booting and using
> the dmraid until (I think) Feb 4th.

Looks like there was at least one problem before that, or mkinitrd
couldn't find the raid devices when you updated that day.

> > Do you still have this disk set, or have you wiped it and reinstalled
> > already?  If you've got it, I'd like to see /etc/blkid.tab from either
> > disk (both if possible).
> 
> Since the / filesystem is in a LVM LV sitting ontop of a dmraid
> partition PV, it seems non-trivial to force the PV for the LV to change
> back and both to access the separate files. If you know a way, let me
> know.

Export the metadata, use vim to rename the volume group, and reimport
the metadata.  I can't recall the exact commands off the top of my head
right now...
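
Probably something along these lines, though I haven't tested it, so
check the man pages before trusting it -- and do it with sda filtered
out (see below) so only one copy of the metadata gets touched:

  vgcfgbackup -f /tmp/VolGroup00.vg VolGroup00    # dump the metadata to a text file
  vi /tmp/VolGroup00.vg                           # change the VG name inside
  vgcfgrestore -f /tmp/VolGroup00.vg NEWNAME      # write it back under the new name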

Alternately, you can add this to the "devices" section
of /etc/lvm/lvm.conf:

filter = [ "r|sda|" ]

and it'll no longer look at anything with "sda" in the name.

> > > There needs to be more checks in place to prevent booting off of one
> > > half of the mirror, or at a minimum only allowing a read-only boot on
> > > one side of the mirror. Dead systems are no fun. Loosing your personal
> > > data is hell.
> > 
> > Well, we should have the appropriate checks there at this point -- so
> > I'd be curious to find out exactly which versions you installed with.
> > It could be that one of the checks was introduced after you installed,
> > and the "yum update" process caused it to believe it was *not* a raid
> > system.
> 
> As I noted above I discovered the initramfs for 1884 was OK and had dm
> activation commands but the 1889 initramfs did not. Why the change? I
> don't know. I've only run yum on the box and haven't touched the LVM or
> device mapper config myself.

Yeah, that appears to be the 64 kilobuck question.

> > (I haven't been extensively checking to make sure every daily rawhide
> > would work perfectly as an update from the previous one, just that
> > they'd install if possible...)
> > 
> > > This isn't purely a Linux problem. Any operating system using fake RAID1
> > > needs to be robust in this regard. I saw a Windows box using 'fake'
> > > motherboard RAID and the motherboard BIOS got flashed which reset the
> > > "Use RAID" setting to 'off'. Then Windows booted off of half the RAID.
> > 
> > That's interesting.  It means there's some way to query the BIOS to tell
> > if it's installed the int13 "raid" hook or not.  I wish I knew what that
> > magic is.
> 
> Are you sure that's what it means? The motherboard BIOS upgrade turned
> off RAID and Windows still booted. That wasn't surprising. The writes to
> one side of the mirror and the subsequent re-activation of the mirror
> without a proper re-sync in the RAID bios utility caused total foobage.

The surprising part is that Windows didn't come up exactly as it would
with the bios functionality turned on -- RAID included.  *nothing* on
the disk changes when you switch from "RAID on" to "RAID off" -- only if
you edit the volumes.  RAID on vs off is a question of whether they're
hooking into int 13h so that the bootloader's disk accesses go through
the BIOS's raid code.

> > > The rules are:
> > 
> > > 1. Don't boot off half of the RAID1 in read-write mode
> > 
> > Yeah, we definitely still need some fallback stuff here.
> 
> Excellent. We don't want users complaining that FC5 ate their data.

Indeed.  Hopefully I can even finish it by then ;)

In theory we should (eventually) also be able to read the sync state
from several vendors' metadata, and do live syncing + metadata updates.
That's still a ways down the road, though.

> > > 3. There is no way to recover from a violated rule 1 without
> > > reinstalling.
> > 
> > That's not the case -- you can go into the bios and sync from the
> > "newer" disk to the older one.  Or if your bios is total junk, you can
> > boot some other media and (carefully) re-sync each partition with "dd".
> 
> This should have occurred to me. Since my RAID bios utility is rather
> limiting (junk as you say) I overlooked this good suggestion (also noted
> by Reuben Farrelly)

Note however that it is important to sync the _partitions_.  If you do
"dd if=/dev/sda of=/dev/sdb", you'll copy the RAID metadata, which isn't
necessarily the same on each disk.
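
Something like this, one partition at a time (this assumes sda turned
out to be the up-to-date disk -- verify that first, and do it from
rescue mode with nothing on those disks mounted):

  dd if=/dev/sda1 of=/dev/sdb1 bs=1M
  dd if=/dev/sda2 of=/dev/sdb2 bs=1M

...and so on for each partition in the set.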

Also note that this method won't work for RAID0 (obviously ;), nor for
RAID5, which we don't support with dmraid at this time anyway.

-- 
  Peter



