[linux-lvm] HELP - Activating a VG kernel panicking the system

Christopher Smith csmith at nighthawkrad.net
Sun Feb 7 19:29:05 UTC 2010


I was pvmove-ing around data on a large (75TB), clustered (2-node) VG when the process hung:

[CHI (UTC+0000) root at dicombackup01 ~]# pvmove -b /dev/dm-10:0-2000
[CHI (UTC+0000) root at dicombackup01 ~]# pvmove -v -i600
    Checking progress every 600 seconds
    Finding all volume groups
    Finding volume group "vg_store"
    Finding volume group "vg00"
    Finding volume group "vg_dicomstore"
  /dev/dm-10: Moved: 0.1%
    Executing: /sbin/modprobe dm-log-clustered
    Updating volume group metadata
  Error locking on node dicombackup01-int.chi.nighthawkrad.net: Command timed out
  Failed to suspend lv_NRS_20090405


With no progress 8 hours later (and no IO to the relevant devices), I decided to kill the pvmove process and reboot the host.  On reboot, starting clvmd caused a kernel panic.  After messing around for some time - and setting locking_type = 0 in /etc/lvm/lvm.conf to avoid having to start up all the clustering infrastructure - I discovered that it was activating the VG that was causing the problem.
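
To be explicit: with locking_type = 0 set in /etc/lvm/lvm.conf the host boots fine as long as the VG is left inactive, and (paraphrasing the exact invocation from memory) something like

    vgchange -ay vg_dicomstore

is enough to panic the box, so reading/activating the on-disk metadata for that VG looks like the trigger.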

I assume that something has gotten corrupted in the metadata and is crashing the system when it is read.  I have "archive" copies of the metadata that were created during the pvmove process, i.e.:

[CHI (UTC+0000) root at dicombackup01 ~]# head -20 /etc/lvm/backup/vg_dicomstore
# Generated by LVM2 version 2.02.46-RHEL5 (2009-09-15): Sun Feb  7 08:18:09 2010
contents = "Text Format Volume Group"
version = 1
description = "Created *after* executing 'pvmove -b /dev/dm-10:0-2000'"
creation_host = "dicombackup01.chi.nighthawkrad.net"    # Linux dicombackup01.chi.nighthawkrad.net 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64
creation_time = 1265530689      # Sun Feb  7 08:18:09 2010
vg_dicomstore {
        id = "w7YvIp-bjYd-sNag-m0DD-t2fL-ShXd-dssXTY"
        seqno = 2047
        status = ["RESIZEABLE", "READ", "WRITE", "CLUSTERED"]
        flags = []
        extent_size = 2097152           # 1024 Megabytes
        max_lv = 0
        max_pv = 0
        physical_volumes {


[CHI (UTC+0000) root at dicombackup01 ~]# head -20 /etc/lvm/archive/vg_dicomstore_00820.vg
# Generated by LVM2 version 2.02.46-RHEL5 (2009-09-15): Sun Feb  7 08:18:02 2010
contents = "Text Format Volume Group"
version = 1
description = "Created *before* executing 'pvmove -b /dev/dm-10:0-2000'"
creation_host = "dicombackup01.chi.nighthawkrad.net"    # Linux dicombackup01.chi.nighthawkrad.net 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64
creation_time = 1265530682      # Sun Feb  7 08:18:02 2010
vg_dicomstore {
        id = "w7YvIp-bjYd-sNag-m0DD-t2fL-ShXd-dssXTY"
        seqno = 2046
        status = ["RESIZEABLE", "READ", "WRITE", "CLUSTERED"]
        flags = []
        extent_size = 2097152           # 1024 Megabytes
        max_lv = 0
        max_pv = 0
        physical_volumes {



Should I just try to manually overwrite the metadata on the PVs with /etc/lvm/archive/vg_dicomstore_00820.vg (the one created before running the last pvmove)?  Will that make the failed pvmove "disappear" (like a pvmove --abort would)?
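
If that is a reasonable thing to do, I assume the procedure would be something along the lines of the following (using the archive file and VG name shown above - please correct me if vgcfgrestore is the wrong tool for this, or if more is needed to clean up the half-finished pvmove):

    vgcfgrestore -f /etc/lvm/archive/vg_dicomstore_00820.vg vg_dicomstore
    vgchange -ay vg_dicomstore

but I'd rather check here before writing anything back to the PVs.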


Any help would be greatly appreciated.

-- 
Christopher Smith
 
UNIX Team Leader
NightHawk Radiology Services
Suite 600, 4900 N Scottsdale Road
Scottsdale, 85251, USA
http://www.nighthawkrad.net
USA Toll free:    866 241 6635
 
Email:          csmith at nighthawkrad.net
IP Extension:   4483
Sydney Mobile:  +61 4 0739 7563
Sydney Phone:   +61 2 8211 2363
US Mobile/Cell: +1 480 717 9562
US Phone:       +1 480 822 4483
US Fax:         +1 208 763 3643
 
All phones are forwarded to my current location; however, please consider the local time in Arizona before calling from abroad.



