[PATCH] mkinitrd rescue mode

Mon Jun 6 19:31:42 UTC 2005

On Mon, 2005-06-06 at 14:27 -0400, Jeffrey Layton wrote:
> On Mon, 2005-06-06 at 14:06 -0400, Peter Jones wrote:
> > How often is that a real problem?  I don't think I've seen any
> > significant number of bug reports on this in recent memory.
> 
> That's just a "for instance". Most of them aren't actual bugs, and so
> you won't see BZ tickets on them.

I don't buy that at all.  If, as you state below, we're seeing lots of
cases where machines don't boot after upgrades, then there's a bug in
something.  Period.

> Problems booting after kernel upgrades are common, however. We deal with
> them on a daily basis in production support. The problem for us is that
> there is virtually no simple way to troubleshoot the case where the
> rootfs isn't getting mounted for some reason, the switchroot fails, and
> the kernel panics.

And if we're getting those, then there *is* a real mkinitrd bug, or a
real module_upgrade bug, or a real new-kernel-pkg bug.  If Engineering
isn't seeing BZ tickets on them, then Engineering and GPS both lose.  No
new nash/mkinitrd features will help that at all.

> The most helpful messages telling us what the problem is usually have 
> scrolled off the screen. We generally have to fix this based on experience
> and guesswork.

One (much more acceptable) solution for that would be a much simpler
switchroot change -- make it look for a command line option "pause", and
if it finds it, wait for the user to hit "enter" before executing init.
For the vast majority of boxes, that'll give you enough scrollback
with shift+pageup to see everything since the kernel started.
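
A minimal sketch of that "pause" idea, not nash's actual code. It
assumes the kernel command line can be read from a file such as
/proc/cmdline (so /proc must already be mounted) and that stdin is the
console. The names cmdline_has_word() and pause_if_requested() are
hypothetical, made up for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `word` appears as a whole whitespace-separated token in
 * `cmdline`, 0 otherwise.  Matching whole tokens means "nopause" or
 * "pause=1" do not count as "pause". */
int cmdline_has_word(const char *cmdline, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = cmdline;

    while (*p) {
        const char *start;

        /* Skip separators (the kernel ends the line with '\n'). */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        /* Scan to the end of the current token. */
        start = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;

        if ((size_t)(p - start) == wlen && strncmp(start, word, wlen) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical hook, meant to be called just before init is executed:
 * if "pause" was given on the command line, block until the user hits
 * enter.  `cmdline_path` would normally be "/proc/cmdline" (assumed). */
void pause_if_requested(const char *cmdline_path)
{
    char buf[4096];
    size_t n;
    int c;
    FILE *f = fopen(cmdline_path, "r");

    if (!f)
        return;                 /* cannot read it: do not block the boot */
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    if (!cmdline_has_word(buf, "pause"))
        return;

    fputs("pause: press enter to continue booting\n", stderr);
    while ((c = getchar()) != EOF && c != '\n')
        ;                       /* discard input up to the newline */
}
```

The point of doing it this way is that nothing extra goes into the
image: the boot just stops long enough for someone to scroll back
through the console before init takes over.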

> You can have the user set up a serial console, but users who are savvy
> enough to figure out how to do that are generally able to troubleshoot
> their own booting problems.  Being able to tell a user to add "rescue" 
> to the command line and then to walk them through some commands (like 
> dmesg) to try to determine the problem would be very helpful for a
> number of different reasons.

This is something of a contradiction -- you've just said the user isn't
savvy enough to debug boot problems, and then suggested a step that'd
make debugging them only very marginally easier.

> It would also give people the ability to try to rescue corrupted root
> filesystems without needing special infrastructure (like a PXE server) 
> and without having to physically be near the machine (with a CD boot).

This is a strawman -- your scenario is that they've just installed or
upgraded, in which case they've already set up this infrastructure or
are already close to the box.

> Since we're discussing this, I posted a proposed patch this morning to
> nash to clean out the initramfs prior to the switchroot:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=159636

Thank you for doing this.

> If nash is able to clean out the initramfs before switching the root, is
> there any reason _not_ to have some useful tools in it?

Added complexity is bad.  Sometimes we have to add some, and that sucks.
When we don't have to, in general the answer is "no".

So if you *really* think this is worth doing, I'm more likely to take a
change to add a command line argument which causes nash to execute
something from a second initramfs cpio ball in lieu of switchroot/init,
plus an entirely separate image (unrelated to mkinitrd) to do your
rescue stuff.

-- 
        Peter