
Re: [linux-lvm] Linux LVM - half baked?

At 07:00 PM 10/12/2005, Michael Loftis wrote:
> Both of these sound more like RAID problems, not LVM.  What sort of RAID are you using?  MD?  If not MD what RAID controller are you using?

Both of these failures are on Redhat ES 4.1 systems using MD.  Both are testing prior to Oracle installs.  In the second system, the x86 system that loops through a reboot without giving me access to the problem file system to fsck:

/dev/VolGroup00/LogVol02 UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY (i.e. without -a or -p options)
Give root password for maintenance (or type Control-D to continue)
<and then reboots in any case>

it is quite a simple install with two 36GB SCSI disks.  Each has a RAID-1 mirror for /boot
and a RAID-1 mirror for VolGroup00 that includes LogVol02 for /, LogVol01 for swap,
and LogVol00 for /var.  In the kickstart syntax used to build the system:

clearpart --all --initlabel
part raid.10 --size=1024 --asprimary
part raid.11 --size=1024 --asprimary
part raid.20 --size=1024 --grow
part raid.21 --size=1024 --grow
raid /boot --fstype ext2 --level=RAID1 raid.10 raid.11
raid pv.1000 --level=RAID1 raid.20 raid.21
volgroup VolGroup00 --pesize=32768 pv.1000
logvol / --fstype ext3 --name=LogVol02 --vgname=VolGroup00 --size=10240 --grow
logvol /var --fstype ext3 --name=LogVol01 --vgname=VolGroup00 --size=8192
logvol swap --fstype swap --name=LogVol00 --vgname=VolGroup00 --size=2048

While it says to run fsck manually, when I bring up the
linux rescue system from the CD (Redhat ES 4.1) there is no
/dev/VolGroup00/LogVol02 file system to run fsck on.  Apparently the LVM
layer hasn't made it available?  What next?

While you suggest that these sound like RAID problems, I've been using MD RAID
on many systems (15+) for 3-4 years now, mostly on Redhat ES 2.1 systems, without
any problems of this sort.  During that time I've had numerous disk failures (12-15)
and even a controller failure and I've always been able to recover without even
taking systems out of production.  I've never had a problem like this where I couldn't
recover a file system or even boot a system - though I could deal with that.

These problems might be related to a combination of LVM over MD RAID, but that's
the way I need to run these systems.  If I have to give up either MD RAID or LVM, at
this time I choose to give up LVM - "half baked".  My problem could also be a result
of some ignorance on my part about LVM.  That's why I'm posting these messages.
I'd be delighted if somebody would say something to the effect that, "Didn't you know
that you can use xyz to make that logical volume visible and then run fsck on it?"
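For the record, the sort of recipe I was hoping to be pointed at would look something like the following.  These are standard mdadm/LVM2 commands, but I have not verified that they work in my rescue environment, and the device and volume group names are simply the ones from my kickstart layout above:

```shell
# From the rescue shell: assemble the MD mirrors first, since the
# LVM physical volume lives on an MD device, then activate the LVM layer.
mdadm --assemble --scan            # bring up the /dev/md* arrays
lvm vgscan                         # scan the assembled arrays for volume groups
lvm vgchange -ay VolGroup00        # activate LogVol00/01/02
fsck -f /dev/VolGroup00/LogVol02   # the device node should now exist
```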

In terms of replicating such a problem for testing and fixing, re:

At 04:18 AM 10/13/2005, Robin Green wrote:
> Did you file a bug about this? It's rather hard to fix bugs if people don't
> file reproducible test cases in the relevant bug database.

It is indeed 'hard' as you say.  In the above case there were two hard disk
failures (/dev/sdb) that precipitated the problem.  After the first disk failure
I put in a replacement disk and apparently synchronized the RAID-1 pair
successfully.  However, about half a day later (I often use rather
old disks for early development projects like this) that replacement
/dev/sdb also failed.  It was after that failure that I was unable to
boot the system off /dev/sda - an apparently still fully functional
disk (I tested it looking at it both from another system and of course
from the linux rescue system from the CD) and am stuck unable to
fsck the / file system that's nominally in /dev/VolGroup00/LogVol02
as above.
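For reference, the disk replacement and resync step I described above followed the usual MD RAID-1 recipe, roughly as below (the partition and md device names are assumed from the layout above, not copied from the actual session):

```shell
# Copy the partition table from the surviving disk, then re-add the
# new disk's partitions to the degraded mirrors.
sfdisk -d /dev/sda | sfdisk /dev/sdb   # clone partition table onto the new disk
mdadm /dev/md0 --add /dev/sdb1         # re-add the /boot mirror half
mdadm /dev/md1 --add /dev/sdb2         # re-add the LVM PV mirror half
cat /proc/mdstat                       # watch the resync complete
```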

Of course I have the disk, so I can replicate the problem at will, as evidenced
by the final state of that disk.  I could dd the 36GB of that disk
to a file that I could make available (e.g. on the Web) so that somebody
else could copy it to an equivalent disk and replicate the problem or I could even
send that disk to somebody who was serious about working on the problem -
assuming I could get security approval to do so.  There wasn't much relevant
on that disk when the problem occurred.  I'm about out of ideas for working on it.

Here's what fdisk (when viewed from an essentially identically configured
system which is of course working) says about that disk in case anybody
would like to know more details about its configuration when considering
the dd proposal above:

[root@helpb ~]# fdisk /dev/sdb
The number of cylinders for this disk is set to 4462.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): p
Disk /dev/sdb: 36.7 GB, 36703918080 bytes
255 heads, 63 sectors/track, 4462 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1         131     1052226   fd  Linux raid autodetect
/dev/sdb2             132        4462    34788757+  fd  Linux raid autodetect
Command (m for help): q
[root@helpb ~]#
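As for the dd proposal itself, the capture I have in mind would be something like this (I have not actually run it yet; conv=noerror,sync keeps reading past any bad blocks and pads them so offsets in the image stay aligned):

```shell
# Image the whole failed disk to a file for distribution.
dd if=/dev/sdb of=sdb.img bs=1M conv=noerror,sync
gzip sdb.img   # a mostly-empty 36GB disk should compress well
```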

When you refer to filing "reproducible test cases in the relevant bug database",
what do you suggest I do in this case?  My experience with such bug databases
has been poor.  To me they have mostly looked like black holes - though I admit
that's with a small number of experiences (~5) and there are some exceptions
(e.g. Sun Microsystems seems to follow up pretty well).  Still, I don't consider it
a good use of my time or particularly useful for LVM.

The basic problem is: how do I get at that logical volume's file system to work
on it, e.g. to recover it?

The other apparent LVM failure that I'm dealing with is a bit more problematic
to replicate.  It's similar in some ways (the base partitioning of /dev/sda and
/dev/sdb is similar, but on 146GB SCSI disks).  It also has a RAID-10 configuration
(mentioned earlier in the sysinit bug) on four other disks.  However, the RAID-10
just holds a data logical volume that the system can, in principle, come up
without.  I think I need to do a bit more work on that system trying to
recover it before I send even more email about it.  It's also a 64 bit system
that might complicate things a bit.  I was hoping to get lucky and perhaps have
somebody recognize its symptoms:
4 logical volume(s) in volume group "VolGroup00" now active
ERROR: failed in exec of defaults
ERROR: failed in exec of ext3
mount: error 2 mounting none
switchroot: mount failed: 23
ERROR: ext3 exited abnormally! (pid 284)
...  <three more similar to the above>
kernel panic - not syncing: Attempted to kill init!

and have some ideas on routes to pursue to try to recover that
system.  Again it is a test system, but if I can't recover problems
with test systems I certainly don't want to run the same LVM
software in our production systems.

I consider myself lucky to have run into such problems while
testing.  I sent my initial message partly in the hope somebody
might have ideas I could use to try to recover these systems and
partly to share my experiences so others might better be able to
evaluate whether they want to use LVM on their production systems.
--Jed http://www.nersc.gov/~jed/
