Trying to kill a process that just won't die...

Tom Mitchell mitch48 at sbcglobal.net
Wed Jan 28 23:58:27 UTC 2004


On Tue, Jan 20, 2004 at 08:07:31AM -0500, Dave Goldblatt wrote:
> Gregory Gulik wrote:
> 
> >
> >F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
> >1 D     0 11952     1  0  80   2    -  1341 wait_o ?        00:00:00 gtar
> 
> That's what I suspected - the app is in an I/O wait (technically, 
> "uninterruptible sleep").  Usually this is due to an NFS hang, although 
> it can be caused by other media which has wedged but not timed out.

What does "lsof" tell you?  What IO is the process waiting on?
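
If lsof is not at hand, much of what it reports for one process can be
read from /proc.  A minimal sketch, assuming a Linux /proc layout; the
function name and the proc_root parameter are made up here for
illustration (proc_root only exists so the code can be tried against a
test directory):

```python
import os

def open_files(pid, proc_root="/proc"):
    """Return {fd_number: target} read from <proc_root>/<pid>/fd.

    Hypothetical helper: each entry in the fd directory is a symlink
    to the file, pipe or socket behind that descriptor.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    result = {}
    for name in os.listdir(fd_dir):
        if not name.isdigit():
            continue
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            result[int(name)] = "(gone)"
    return result
```

Something like open_files(11952) would then show whether any descriptor
points at an NFS mount or at a device that may have wedged.  Reading
another user's process normally requires root.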

For NFS the mount option "intr" can help (next time).  A forced unmount
(umount -f) may also free it this time.

You may want to look at a back trace for the process and see what the last
IO request was.  There are two stacks of interest, the user space
side and the kernel space side of the system call.
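
For the kernel space side, /proc can give a first hint without a
debugger.  This is only a sketch under assumptions: the helper names
and the proc_root parameter are invented for illustration, the "wchan"
file depends on the kernel version, and the "stack" file exists only on
newer kernels built with stack trace support and is usually readable by
root only.  The user space side still needs a debugger such as gdb.

```python
import os

def process_state(stat_text):
    """Return the one-letter state from the text of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split after the LAST ')'.
    """
    return stat_text[stat_text.rindex(")") + 1:].split()[0]

def kernel_side(pid, proc_root="/proc"):
    """Return the contents of the wchan and stack files, or None each.

    Hypothetical helper; a missing or unreadable file yields None.
    """
    base = os.path.join(proc_root, str(pid))
    info = {}
    for name in ("wchan", "stack"):
        try:
            with open(os.path.join(base, name)) as f:
                info[name] = f.read().strip()
        except OSError:
            info[name] = None
    return info
```

A state of "D" from process_state() matches the "D" in the S column of
the ps listing quoted above.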

Depending on library support and the precise action inside the system
call, a read() of ten characters may not return until ten characters
are present.  A read of a line will not return until a newline marker
arrives.  And for some things the end of file is what matters.
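
The difference between the raw system call and the library layer can be
seen with a pipe.  A small sketch (the function name is made up for
illustration): the raw os.read() returns whatever is available, while a
buffered read(10) keeps waiting for ten bytes or end of file.

```python
import os

def short_read_demo():
    """Contrast a raw read() with a buffered library read on a pipe."""
    r, w = os.pipe()

    os.write(w, b"abcd")
    # Raw system call: asks for up to 10 bytes, returns the 4 available.
    raw = os.read(r, 10)

    os.write(w, b"efgh")
    # Closing the write end delivers end of file to the reader.
    # Without this close, the buffered read below would wait forever
    # for 6 more bytes, which is the kind of hang described above.
    os.close(w)

    with os.fdopen(r, "rb") as f:
        # Library layer: loops until 10 bytes arrive or EOF is seen.
        buffered = f.read(10)

    return raw, buffered
```

Here the buffered read only returns because the writer closed its end.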

Make sure you understand what "gtar" is being asked to do.  Its
standard input file descriptor may be hung (no newline/EOF).  It may
be hung on IO for a file read or write or scratch file.  Has anyone
attempted to back up /dev/random, /dev/zero or a named pipe?  /dev/zero
took weeks on my old infinity computer and checksums on /dev/random
never matched.

Once a system call dives deep into the OS, the handling of user
space signals (kill -N) gets murky.  As I noted above, NFS has a flag
"intr" that permits an interrupt from user space to end the file IO.  For
NFS this may be possible at multiple levels because timers for network
IO trigger and wake NFS up.  For other IO devices there may not be a
failsafe timer to wake up the driver and return an error.
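
The idea of a failsafe timer has a rough user space analogue: wait for
the descriptor with a timeout before reading, so the caller gets an
answer either way.  This is a sketch only; the function name is
invented, it helps for pipes, sockets and terminals (select() always
reports regular files as ready), and it does nothing for a process that
is already in uninterruptible sleep inside a driver.

```python
import os
import select

def read_with_timeout(fd, nbytes, timeout_s):
    """Read up to nbytes from fd, or return None if nothing arrives
    within timeout_s seconds.

    The select() timeout plays the part of the failsafe timer: the
    wait ends with an answer even if the other side never sends data.
    """
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        return None
    return os.read(fd, nbytes)
```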

Because "stuck in IO" often translates into stuck in a hardware driver,
it is important to report the specifics of the device involved and the
specifics of the driver.  Note that RAID devices are layered.  The
pseudo hardware that is the RAID depends on real hardware.  Thus there
are multiple places where a 'hang' could be generated.
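
On kernels that expose sysfs, the layering can be walked from the
"slaves" directory of each block device.  A minimal sketch, assuming
the /sys/block/<dev>/slaves layout; the function name and the sys_block
parameter are made up for illustration:

```python
import os

def device_layers(dev, sys_block="/sys/block"):
    """Return a nested dict mapping each device to what sits under it.

    Hypothetical helper.  A device with no readable slaves directory is
    treated as a leaf.  Partitions such as sda1 have no entry of their
    own directly under /sys/block, so they also end up as leaves.
    """
    slaves_dir = os.path.join(sys_block, dev, "slaves")
    try:
        names = sorted(os.listdir(slaves_dir))
    except OSError:
        names = []
    return {name: device_layers(name, sys_block) for name in names}
```

Each level in the result is a separate driver, and so a separate place
to look when reporting where the hang might be.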

For some IO, make sure things have not gotten so slow that no useful
progress is being made.  Error recovery, RAID recovery or swap IO can
make things look broken.  The machine is working, but progress is so
slow that each time you "look" the process is stuck in IO.  You
may be missing the baby steps of IO in between (but in that case you
should be able to kill it).
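
One way to tell "slow" from "stuck" is to sample the IO counters of the
process twice and compare.  A sketch, assuming a kernel new enough to
provide /proc/<pid>/io with "name: value" lines; both function names
are invented for illustration:

```python
def parse_proc_io(text):
    """Turn the text of /proc/<pid>/io into {counter_name: int}."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters

def progress(before, after):
    """Return only the counters that changed, with their increase."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Read the file, wait a while, read it again, and pass both parsed
samples to progress().  An empty result over a long interval suggests a
real hang; small increases suggest baby step IO.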

-- 
	T o m  M i t c h e l l 
	mitch48-at-sbcglobal-dot-net




