
[linux-lvm] strange behavior with 1.0.5 on Linux 2.4.19?



I'm not sure exactly what I found, or why it's happening, but I managed
to exercise some bug or another in LVM 1.0.5...

We use home-rolled scripts for doing our system backups, and one of the
steps creates snapshots of our database filesystems, so that we can dump
the snapshots to tape and get a consistent backup image.  These scripts
were misconfigured, and attempted to create a snapshot of a volume on a
volume group that did not exist.
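For context, the snapshot step looks roughly like this. This is a
sketch, not our actual script; the VG/LV names, mount point, and tape
device are illustrative, and DRY_RUN just echoes each command so the
flow can be read without touching a real system:

```shell
# Sketch of the snapshot-dump-remove cycle the backup scripts perform.
# All device names here are illustrative.  With DRY_RUN set, commands
# are echoed instead of executed.
snapshot_dump() {
    vg=$1; lv=$2; size=$3
    run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }
    run lvcreate --size "$size" --snapshot --name "${lv}_snap" "/dev/$vg/$lv"
    run mount -o ro "/dev/$vg/${lv}_snap" /mnt/snap
    run tar -cf /dev/st0 /mnt/snap      # dump the frozen image to tape
    run umount /mnt/snap
    run lvremove -f "/dev/$vg/${lv}_snap"
}

DRY_RUN=1 snapshot_dump vg00 db1 8G
```

The misconfiguration amounted to passing a nonexistent VG name into the
lvcreate step.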

This machine is running Linux 2.4.19, patched with the Broadcom Gigabit
drivers and LVM 1.0.5 (linux-2.4.19-VFS-lock.patch and
lvm-1.0.5-2.4.19-1.burpr.patch, generated by running make in
/usr/src/LVM/1.0.5/PATCHES).  I then compiled and installed the LVM
userland tools from the sources.

This machine has one volume group, vg00, consisting of a single physical
volume, /dev/sda4, which is itself a partition of ~100GB on a hardware
RAID-10 array.

--->8--[ Cut Here ]--->8--
root burpr(pts/1):~ 34 # ls -al /dev/vg00
total 47
dr-xr-xr-x    2 root     root          232 Oct  2 02:55 ./
drwxr-xr-x   15 root     root        46926 Oct  2 02:55 ../
brw-rw----    1 root     disk      58,   5 Oct  2 02:55 dat
brw-rw----    1 root     disk      58,   6 Oct  2 02:55 db1
brw-rw----    1 root     disk      58,   7 Oct  2 02:55 db2
crw-r-----    1 root     disk     109,   0 Oct  2 02:55 group
brw-rw----    1 root     disk      58,   3 Oct  2 02:55 home
brw-rw----    1 root     disk      58,   0 Oct  2 02:55 root
brw-rw----    1 root     disk      58,   1 Oct  2 02:55 tmp
brw-rw----    1 root     disk      58,   4 Oct  2 02:55 u
brw-rw----    1 root     disk      58,   8 Oct  2 02:55 unifytmp
brw-rw----    1 root     disk      58,   2 Oct  2 02:55 var
--->8--[ Cut Here ]--->8--

The command which was errantly run was:

--->8--[ Cut Here ]--->8--
lvcreate --size 8G --snapshot --name db1_snap vg01
--->8--[ Cut Here ]--->8--

I got this output:

--->8--[ Cut Here ]--->8--
lvcreate -- "/etc/lvmtab.d/vg01" doesn't exist
lvcreate -- can't create logical volume: volume group "vg01" doesn't
exist
--->8--[ Cut Here ]--->8--

That's all well and good, and expected.  Well, I saw the backup scripts
trying to do this, so I killed them off as cleanly as possible, fixed
the configuration, and restarted them.  Only now, they got stuck on the
first vgscan they tried to run.

Running vgdisplay by hand now, I seem to have "lost" 8GB from my vg:
vgdisplay shows 8GB less free space than there should be if you add up
the allocations of all the existing lv's.  lvscan segfaults, and vgscan
hangs while trying to open /dev/lvm.  lvcreate hangs as well.  Running
strace:

--->8--[ Cut Here ]--->8--
root burpr(pts/1):~ 51 # strace lvcreate --size 256M --snapshot --name
unifytmp_snap /dev/vg00/unifytmp vg00
--->8--[ Cut Here ]--->8--

ends up hanging, and these are the last few lines of the trace:

--->8--[ Cut Here ]--->8--
open("/dev/vg00/group", O_RDONLY)       = 3
ioctl(3, 0xc004fe05, 0x80a40b8)         = 0
close(3)                                = 0
stat64("/dev/lvm", {st_mode=S_IFCHR|0640, st_rdev=makedev(109, 0), ...})
= 0
open("/dev/lvm", O_RDONLY)              = 3
ioctl(3, 0x8004fe98, 0xbfffec22)        = 0
close(3)                                = 0
stat64("/dev/lvm", {st_mode=S_IFCHR|0640, st_rdev=makedev(109, 0), ...})
= 0
open("/dev/lvm", O_RDONLY)              = 3
ioctl(3, 0xff00 <unfinished ...>
--->8--[ Cut Here ]--->8--

The <unfinished ...> is when I gave up after 5 minutes and hit
<control>-c.
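Those ioctl request numbers can be unpacked with the standard Linux
_IOC encoding (nr in bits 0-7, type in bits 8-15, size in bits 16-29,
direction in bits 30-31).  A quick sketch; the 0xfe "type" byte is, as
far as I can tell, LVM's ioctl magic:

```shell
# Unpack a Linux _IOC-encoded ioctl request number.
# Layout (asm-generic/ioctl.h): nr = bits 0-7, type = bits 8-15,
# size = bits 16-29, dir = bits 30-31 (1=write, 2=read, 3=read/write).
decode_ioctl() {
    cmd=$(( $1 ))
    printf 'dir=%d size=%d type=0x%02x nr=0x%02x\n' \
        $(( (cmd >> 30) & 0x3 )) \
        $(( (cmd >> 16) & 0x3fff )) \
        $(( (cmd >> 8) & 0xff )) \
        $(( cmd & 0xff ))
}

decode_ioctl 0xc004fe05   # dir=3 size=4 type=0xfe nr=0x05
decode_ioctl 0x8004fe98   # dir=2 size=4 type=0xfe nr=0x98
```

So the first two calls are ordinary 4-byte get/set-style ioctls on the
0xfe type.  The hanging 0xff00 doesn't follow the _IOC encoding at
all, which suggests it's one of the old-style ioctl numbers.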

I have complete straces available of vgscan, lvscan, and lvcreate, as
well as the output of lvdisplay for each of the lv's I've got.  I also
have a core file from lvscan, if that would help.

We are going to reboot the server over lunch today; hopefully that will
clear out whatever kernel structures are gorked.  But I'm really not
happy that this happened in the first place, and I hope someone here can
point me to an answer.

The hardware is a Dell PowerEdge 6600 with PERC3/DC RAID controller (LSI
MegaRAID), 6 15krpm 36GB disks in a RAID-10, 8GB memory, four 1.6GHz
Xeon CPUs.  Running SuSE Linux Enterprise Server 7 (essentially a
stripped-down SuSE 7.2), kernel.org's 2.4.19 + Broadcom and LVM patches,
and LVM 1.0.5.

I haven't had any problems yet on another server (PowerEdge 2450, 2x
P-III 1GHz, 2GB RAM, same kernel & LVM, different RAID controller).

I've tried to be thorough in my data collection; let me know if there's
something more needed to debug this.


TIA

--
Gregory K. Ade <gkade bigbrother net>
http://bigbrother.net/~gkade
OpenPGP Key ID: EAF4844B  keyserver: pgpkeys.mit.edu



