Rawhide does not boot since 2.6.27-0.398

Antonio Olivares olivares14031 at yahoo.com
Mon Oct 13 15:42:36 UTC 2008


--- On Mon, 10/13/08, Michael H. Warfield <mhw at WittsEnd.com> wrote:

> From: Michael H. Warfield <mhw at WittsEnd.com>
> Subject: Re: Rawhide does not boot since 2.6.27-0.398
> To: olivares14031 at yahoo.com, "For testers of Fedora Core development releases" <fedora-test-list at redhat.com>
> Cc: mhw at WittsEnd.com
> Date: Monday, October 13, 2008, 7:56 AM
> On Sat, 2008-10-11 at 12:48 -0700, Antonio Olivares wrote:
> > --- On Sat, 10/11/08, Bruno GARDIN <bgardin at gmail.com> wrote:
> 
> > > From: Bruno GARDIN <bgardin at gmail.com>
> > > Subject: Rawhide does not boot since 2.6.27-0.398
> > > To: "For testers of Fedora Core development releases" <fedora-test-list at redhat.com>
> > > Date: Saturday, October 11, 2008, 10:55 AM
> > > I have been testing rawhide for a few months now, but I have had a
> > > boot problem since kernel 2.6.27-0.398.  My rawhide is a virtual
> > > system on VMware Server, now in version 2.0.  Whenever I try to
> > > boot, I get the following errors at the end:
> > > Activating logical volumes
> > >     Volume group "VolGroup00" not found
> > > Unable to access resume device (/dev/VolGroup00/LogVol01)
> > > Creating root device
> > > Mounting root file system
> > > mount: error mounting /dev/root on /sysroot as ext3: No such file or directory
> > > Setting up other filesystems
> > > setuproot: moving /dev failed: No such file or directory
> > > setuproot: error mounting /proc: No such file or directory
> > > setuproot: error mounting /sys: No such file or directory
> > > Mount failed for selinuxfs on /selinux: No such file or directory
> > > Switching to new root and running init
> > > switchroot: mount failed: No such file or directory
> > > Booting has failed
>  
> > > Boot works fine with kernel 2.6.27-0.382 but also fails with
> > > 2.6.27-1.  I have looked at the thread related to ext4, but I am
> > > using ext3.  I have also tried a new mkinitrd on 2.6.27-1, but no
> > > change.  Any idea what the problem could be?
> 
> 	The real source of the problem was much earlier in the messages
> than what was originally provided.  I've been trying to track this
> down myself.  Here is the critical bit of information:
> 
> Kernel that boots:
> 
> Loading dm-mirror module
> scsi 2:0:0:0: Direct-Access
> scsi target2:0:0: Beginning Domain Validation
> scsi target2:0:0: Domain Validation skipping write tests
> scsi target2:0:0: Ending Domain Validation: 1204k
> scsi target2:0:0 FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
> Loading dm-zero module
> sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB)
> sd 2:0:0:0: [sda] Write Protect is off
> sd 2:0:0:0: [sda] Cache data unavailable
> sd 2:0:0:0: [sda] Assuming drive cache: write through
> Loading dm-snapshot module
> sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB)
> sd 2:0:0:0: [sda] Write Protect is off
> sd 2:0:0:0: [sda] Cache data unavailable
> sd 2:0:0:0: [sda] Assuming drive cache: write through
>  sda: sda1 sda2
> sd 2:0:0:0: [sda] Attached SCSI disk
> sd 2:0:0:0: Attached scsi generic sg1 type 0
> Making device-mapper control node
> Scanning logical volumes
>   Reading all physical volumes.  This may take a while...
>   Found volume group "VolGroup00" using metadata type lvm2
> 
> Kernel that fails:
> 
> Scanning logical volumes
> scsi 2:0:0:0: Direct-Access
> scsi target2:0:0: Beginning Domain Validation
> scsi target2:0:0: Domain Validation skipping write tests
> scsi target2:0:0: Ending Domain Validation: 1204k
> scsi target2:0:0 FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
>   Reading all physical volumes.  This may take a while...
> sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB)
> sd 2:0:0:0: [sda] Write Protect is off
> sd 2:0:0:0: [sda] Cache data unavailable
> sd 2:0:0:0: [sda] Assuming drive cache: write through
> sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB)
> sd 2:0:0:0: [sda] Write Protect is off
> sd 2:0:0:0: [sda] Cache data unavailable
> sd 2:0:0:0: [sda] Assuming drive cache: write through
>  sda: sda1 sda2
> sd 2:0:0:0: [sda] Attached SCSI disk
> sd 2:0:0:0: Attached scsi generic sg1 type 0
> Activating logical volumes
>   Volume group "VolGroup00" not found
> Unable to access resume device (/dev/VolGroup00/LogVol01)
> 
> 	Note that in the kernel that's failing, LVM starts to scan for
> logical volumes before the SCSI devices have stabilized!  It doesn't
> find any PVs and so it doesn't find "VolGroup00".  That's the killer.
> Now look closely at the one that booted above.  What's interspersed
> with the scsi startup messages?  Loading dm-mirror, loading dm-zero,
> loading dm-snapshot.  Then we see the "Scanning logical volumes".
> Well, guess what.  Those modules are not present in the latest kernel
> as modules.  They created enough of a time delay that lvm started
> after the scsi drivers had settled.  Now that they are not there, we
> have a race condition, and if lvm starts too early, it can't find the
> drives.  The "Scanning logical volumes" starts exactly where we see
> the "Loading dm-mirror" in the kernel which does boot.  I would call
> that a smoking gun.
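> 
> 	To make that concrete, here is a minimal sketch of the kind of
> settle wait the initrd's init script needs in front of the LVM scan.
> The wait function and its /sys polling are illustrative assumptions
> on my part, not the actual nash built-ins; the lvm invocations are
> the sort of thing the generated init script runs:
> 
>   # Poll until the SCSI layer stops announcing new block devices.
>   wait_for_scsi_settle() {
>       local prev="" cur=""
>       while true; do
>           cur=$(ls /sys/block 2>/dev/null)
>           # Stop once a full pass adds no new devices.
>           [ -n "$cur" ] && [ "$cur" = "$prev" ] && break
>           prev="$cur"
>           sleep 2   # give the driver time to finish probing
>       done
>   }
> 
>   wait_for_scsi_settle
>   # Only now is it safe to look for PVs and activate the VG.
>   lvm vgscan --ignorelockingfailure
>   lvm vgchange -ay --ignorelockingfailure VolGroup00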
> 
> 	What's really interesting (to me) is that if you look at the
> 2.6.26 kernels under F9 (and I'm testing this 2.6.27 kernel under F9
> as well as F10), you find ALL of the "Loading dm-*" messages AFTER
> the scsi drivers have settled.  Something has changed in the 2.6.27
> kernels: either the scsi drivers are not settling as fast (debugging
> messages and checks, perhaps) or the insmod is returning sooner
> (before the drivers have settled), creating this race condition which
> did not exist at all in 2.6.26.
> 
> 	I think this is a bug in mkinitrd: it's not emitting a wait when
> one is needed in this case.  Down in mkinitrd, around line 1483, is a
> check for conditions under which it wants to issue a wait for the
> scsi to settle.  I think that either needs to be made unconditional
> or at least expanded to include other scsi devices like the VMware
> ones.  I cheated.  Up at line 1411 I changed "wait_for_scsi="no"" to
> "wait_for_scsi="yes"" and rebuilt the initrd.  The problem goes away
> and lvm starts AFTER the scsi devices have settled.  Another way to
> do this might be to force the inclusion of usb-storage (mkinitrd
> --with-usb), which has its own delay both in loading and settling.
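> 
> 	For reference, the workaround is two commands.  A sketch, assuming
> the stock /sbin/mkinitrd path and that the assignment appears exactly
> as quoted above; substitute whichever 2.6.27 build is failing for
> you:
> 
>   # Flip the default (around line 1411 of the mkinitrd script):
>   sed -i 's/wait_for_scsi="no"/wait_for_scsi="yes"/' /sbin/mkinitrd
> 
>   # Rebuild the initrd for the affected kernel:
>   ver=2.6.27-1.fc10.i686   # example version string, adjust to yours
>   mkinitrd -f /boot/initrd-$ver.img $ver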
> 
> > > -- 
> > > BeGe
> > > 
> 
> > Try uninstalling the kernel and reinstalling it via yum.
> > I tried several times and succeeded :) 
> 
> 	No you didn't.  You got lucky and won the race that time.  That's
> the problem with race conditions.  Sometimes you win through just
> plain dumb luck.  You may well find it fails on a subsequent reboot
> and you're screwed.  Then again, you may well have changed something
> else that changes the timing and the probability, and it then works
> for you.
> 
> 	I'll be filing a bugzilla bug on this later if nobody else gets
> to it first.
> 
> > Regards,
> 
> > Antonio 
> 
> 	Mike
> -- 
> Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw at WittsEnd.com
>    /\/\|=mhw=|\/\/          | (678) 463-0932 | http://www.wittsend.com/mhw/
>    NIC whois: MHW9          | An optimist believes we live in the best of all
>  PGP Key: 0xDF1DD471        | possible worlds.  A pessimist is sure of it!

Michael,

You have a very valid point.  My problem was that the partition (not
the boot partition) was ext4 but was listed as ext4dev, and the
filesystem type was changed to plain ext4.  I had to change /etc/fstab
from ext4dev to ext4 and then reinstall the kernel, and it worked :)
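
A minimal illustration of the /etc/fstab change (the logical volume
and mount point are made up for the example; only the fs type column
changes):

  # before: the old development name, which the new kernel rejects
  /dev/VolGroup00/LogVol02  /home  ext4dev  defaults  1 2

  # after: the renamed filesystem type
  /dev/VolGroup00/LogVol02  /home  ext4     defaults  1 2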

But you are correct about the scsi/dm volume group (LVM) race.

Regards,

Antonio 
