
[linux-lvm] PV that's present marked as missing?

I have a fairly complex LVM2/mdadm setup that I'm in the middle of turning into a simpler setup. I made a mistake along the way, though, and have landed in a confusing place.

This is kind of long, and I apologize for that -- trying to describe completely how I got here. The complex setup I started with:

/dev/md5 is a RAID5 of /dev/sd{b,d,e,f}5
/dev/md6 is a RAID5 of /dev/sd{b,d,e,f}6
etc on up to /dev/md14
/dev/md99 is a RAID1 of /dev/sdg and /dev/sdh

/dev/md{5-14} plus /dev/md99 are all assembled into a volume group (creatively called vglinux), which has three logical volumes. Only one, lvstore, is relevant: the other two are getting destroyed as part of the simplification.

The goal is to end with a RAID6 of /dev/sd{b,d,e,f,g,h}, and no multiple-partition madness (it's there from the days of old, when mdadm couldn't reshape arrays). The next step was to free up /dev/sdf, starting with

    pvmove /dev/md5
    reshape md5 as a RAID5 of /dev/sd{b,d,e}5 (freeing /dev/sdf5)
    lather, rinse, and repeat for the other mds.

The VG has plenty of free space for this; it's slow, but that's OK.
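To make the per-array loop concrete, here's a dry-run helper (the function name is mine, not an LVM/mdadm tool) that just prints the commands so the plan can be eyeballed before running anything as root; <new-size> stands in for the shrunken array size:

```shell
# Dry-run sketch of the per-array step; prints the commands instead of
# running them, since every one of these needs root and is destructive.
plan_free_member() {
    md=$1    # e.g. "md9"
    printf 'pvmove /dev/%s\n' "$md"
    printf 'mdadm --grow /dev/%s --array-size <new-size>\n' "$md"
    printf 'pvresize /dev/%s\n' "$md"
    printf 'mdadm --grow /dev/%s --raid-devices 3 --backup-file ~/backup\n' "$md"
}

plan_free_member md9
```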

The problem: while md{5,6,7} went fine, I botched the pvmove for md8 and ended up starting to reshape the array _before the pvmove happened._ Specifically, I did all of these:

mdadm --grow /dev/md8 --array-size 292730880 # it was 439489920
pvresize /dev/md8
mdadm --grow /dev/md8 --raid-devices 3 --backup-file ~/backup

_without_ having moved data off. Once I figured out what was going on, I did

umount (all the filesystems in the VG)
vgchange -a n vglinux
mdadm --stop /dev/md8

which halted the reshape about 5% of the way done. Then (with some help from NeilBrown and a buncha experiments with loopback devices) I used the most recent mdadm snapshot to revert the reshape.

mdadm --assemble --update=revert-reshape /dev/md8 /dev/sd{b,d,e,f}8

NOTE WELL: I KNOW THAT THIS HAS DESTROYED SOME DATA. That's not the question. [ :) ] There will be damage, yes, I know that, and I should be able to detect that and correct it.

At this point /dev/md8 is back to 4 devices, array-size 439489920, and can be started. Next step is to fsck lvstore to get a handle on the damage before proceeding -- but vgchange -a y vglinux doesn't start lvstore:

# vgchange  -a y vglinux
  Incorrect metadata area header checksum
  Refusing activation of partial LV lvstore. Use --partial to override.
  2 logical volume(s) in volume group "vglinux" now active

(The two LVs that it did start are the irrelevant ones.)
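When lvstore does come up, the first fsck will be read-only (-n) to size up the damage; since fsck's exit status is a bitmask (per fsck(8)), here's a tiny decoder -- the function name is mine:

```shell
# Decode fsck's bitmask exit status (see fsck(8)):
# 0 = clean, 1 = errors corrected, 2 = system should be rebooted,
# 4 = errors left uncorrected, 8 = operational error.
decode_fsck_status() {
    s=$1
    [ "$s" -eq 0 ] && { echo "clean"; return 0; }
    [ $((s & 1)) -ne 0 ] && echo "errors corrected"
    [ $((s & 2)) -ne 0 ] && echo "system should be rebooted"
    [ $((s & 4)) -ne 0 ] && echo "errors left uncorrected"
    [ $((s & 8)) -ne 0 ] && echo "operational error"
    return 0
}

# Intended use: fsck -n /dev/vglinux/lvstore; decode_fsck_status $?
```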

So things are confusing:

First, it'd be awesome to know where exactly that "incorrect metadata area header checksum" is coming from. Maybe, y'know, a device to look at, or some further hint of where to start tracking things down? [ :) ]
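One blunt way to chase that down, assuming the complaint is about one PV's metadata area: point pvck at each PV in turn and see which one squawks. A sketch (guarded so it degrades to a no-op where pvck isn't installed; needs root on the real box):

```shell
# Run LVM's pvck against each PV and flag the ones it dislikes.
check_pvs() {
    command -v pvck >/dev/null 2>&1 || { echo "pvck not available"; return 0; }
    for pv in "$@"; do
        pvck "$pv" >/dev/null 2>&1 || echo "pvck unhappy with $pv"
    done
}

check_pvs /dev/md5 /dev/md6 /dev/md7 /dev/md8 /dev/md9 \
          /dev/md10 /dev/md11 /dev/md125 /dev/md13 /dev/md14 /dev/md99
```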

Second, if I look in /etc/lvm/archive for vglinux's latest, I find this bit buried in there:

    pv2 {
        id = "4F3rcV-sS8p-E6t2-hjGm-gLVB-C6wl-4McUhc"
        device = "/dev/md8"     # Hint only

        status = ["ALLOCATABLE"]
        flags = ["MISSING"]
        dev_size = 878979840    # 419.13 Gigabytes
        pe_start = 384
        pe_count = 107297       # 419.129 Gigabytes

which seems to be why it's complaining about the "partial LV lvstore". But, uh, 4F3rcV-sS8p-E6t2-hjGm-gLVB-C6wl-4McUhc _is_ the UUID of /dev/md8:

# pvs -o +uuid --unit=4m
  Incorrect metadata area header checksum
  Unable to find "/dev/sdb5" in volume group "vglinux"
  PV          VG       Fmt  Attr PSize       PFree       PV UUID
  /dev/md10   vglinux  lvm2 a-   107297.00U       0U     LO5KoK-1AjU-iXb0-fkLo-lUKR-Yo9P-wDZQPP
  /dev/md11   vglinux  lvm2 a-   107297.00U       0U     gBGcjz-DmIb-pAj9-CWnb-jopW-Wd19-iIs1ur
  /dev/md125  vglinux  lvm2 a-   107297.00U    8607.00U  5JlNTx-yT14-271r-NMAm-a17W-FKe4-pXoOW4
  /dev/md13   vglinux  lvm2 a-   107297.00U       0U     MJlTQO-lCyE-bP80-FlvE-m1nM-DD2x-qhlIQK
  /dev/md14   vglinux  lvm2 a-   107297.00U       0U     XDpA1D-kxbq-SEck-ozTl-rP4Y-bMws-MBwNNf
  /dev/md5             lvm2 a-    71467.50U   71467.50U  39oFQs-9tlf-ywT4-YgtX-nfcm-rAEq-pAPsdR
  /dev/md6    vglinux  lvm2 a-    71531.00U   35856.00U  ufKOpM-02YG-12rJ-mt1r-DbEm-xoJu-onzEtr
  /dev/md7    vglinux  lvm2 a-    71531.00U   71531.00U  NpAKLQ-4Irn-wDA4-0ZDI-ydW6-eY9n-rDp50e
  /dev/md8    vglinux  lvm2 a-   107297.00U       0U     4F3rcV-sS8p-E6t2-hjGm-gLVB-C6wl-4McUhc
  /dev/md9    vglinux  lvm2 a-   107297.00U       0U     hRmTMN-Mx17-uUEX-rF1Z-hQ1J-8iDd-S7S2t7
  /dev/md99   vglinux  lvm2 a-   357667.00U  178748.00U  jUgxoF-mvwR-6C8A-wzjP-K0Xu-MPf8-XewqUE

Finally, note that "Unable to find /dev/sdb5 in vglinux" complaint, and note that /dev/md5 is _not_ listed as part of vglinux. md5 shouldn't be part of vglinux right now, and sdb5 has never been a PV on its own (it's only ever been a part of the md5 PV). WTFO? As it happens, I didn't actually reshape /dev/md5: after the pvmove, I shredded the md and recreated it instead. I suppose it's possible that I forgot to vgreduce before doing that?

Googling and reading indicates that I need to clear that MISSING flag, and that vgcfgrestore is the only tool for that job -- but editing that archive file to remove the MISSING flag and trying vgcfgrestore with that doesn't work:

# vgcfgrestore --debug --verbose --test --file wtfvglinux vglinux
  Test mode: Metadata will NOT be updated.
  Incorrect metadata area header checksum
  Incorrect metadata area header checksum
  Restore failed.
    Test mode: Wiping internal cache
    Wiping internal VG cache
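(For concreteness, the edit itself was minimal: the wtfvglinux file is the archive copy with the MISSING flag dropped from pv2's flags line, i.e.

    pv2 {
        id = "4F3rcV-sS8p-E6t2-hjGm-gLVB-C6wl-4McUhc"
        device = "/dev/md8"     # Hint only

        status = ["ALLOCATABLE"]
        flags = []              # was ["MISSING"]
        dev_size = 878979840    # 419.13 Gigabytes
        pe_start = 384
        pe_count = 107297       # 419.129 Gigabytes
    }

with everything else untouched.)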

so, at this point, some guidance would be most welcome.

(Also note that before I did the revert-reshape, I dd'd /dev/sd{b,d,e,f}8 to spare partitions as a backup. It may be relevant that there are two copies of the metadata for md8's devices?)

Thanks very much,

The trick is to keep breathing.              (Garbage, from _Version 2.0_)
