[Crash-utility] crash CPU bound waiting for user response

Thu Jul 5 13:48:34 UTC 2007

D. Hugh Redelmeier wrote:
> | From: Dave Anderson <anderson at redhat.com>
> 
> | D. Hugh Redelmeier wrote:
> 
> | > ==> Worse: while it is awaiting my RETURN, it is burning 100% of the CPU!
> | > 
> | > Here is what "ps laxgwf" says about the crash process and its child.
> | > 
> | > F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
> | > 4     0  4426  4406  25   0 416812 332764 -     R+   pts/5     80:36
> | > |               |           \_ crash --readnow
> | > /usr/lib/debug/lib/modules/2.6.21-1.3228.fc7/vmlinux
> | > /var/crash/2007-07-02-13:42/vmcore
> | > 0     0  4989  4426  18   0  73976   740 -      S+   pts/5      0:00
> | > |               |               \_ /usr/bin/less -E -X -Ps -- MORE --
> | > forward\: <SPACE>, <ENTER> or j  backward\: b or k  quit\: q
> | > 
> | > strace of the crash process shows an infinite sequence of:
> | >     wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
> | >     wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
> | >     wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
> | >     wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
> | > 
> | > This is very wasteful.
> | > 
> | > There are other ways to get into this state.  Other places less is
> | > being used and is waiting.  Probably wherever less is used even if it
> | > isn't waiting.
> | > 
> | > I just tested: this problem exists when using a normal xterm.
> | 

Again, what exactly do you do to reproduce it?  I cannot get crash to
consume 100% of the CPU while it waits on the "less" child process.

> | Yeah, I have seen this on occasions, but I have never been able
> | to reproduce it on demand.  There was a patch suggestion a while ago,
> | but I deferred it until I could reliably reproduce it for testing
> | before taking it in.
> 
> I've put gdb on the case.  The CPU burning that I'm currently experiencing is
> in cmdline.c:restore_sanity.  The actual code in question is:
>     while (!waitpid(pc->stdpipe_pid, &waitstatus, WNOHANG))
>         ;
> That sure looks like a busy-wait.
> 
> If you execute this code, you should get a busy-wait too.
> 
> If you replaced WNOHANG with 0, I think that the wait would have the
> same result but not be busy.  You would then want to loop in the case
> where waitpid returns -1 with errno == EINTR.
> 
> Here's what I'd try (UNTESTED!):
>     do ; while (waitpid(pc->stdpipe_pid, &waitstatus, 0) == -1 && errno == EINTR);
> 
> All the uses of WNOHANG in that function look suspicious.

I understand.  I also remember that the WNOHANG's were originally added
there on purpose because of hangs I was seeing.  But that's not to say
it's the best way of doing things.
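
For reference, the blocking alternative suggested above can be sketched
as a small standalone helper.  This is only an illustration, not the code
in cmdline.c: wait_for_child() and spawn_child() are hypothetical names,
and the sketch says nothing about the hangs that led to WNOHANG being
used in the first place.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until child `pid` changes state, retrying
 * only when a signal interrupts the call (EINTR).  The process sleeps in
 * the kernel instead of spinning on WNOHANG.
 * Returns `pid` on success, or -1 with errno set on a real error.
 */
static pid_t wait_for_child(pid_t pid, int *status)
{
    pid_t r;

    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);

    return r;
}

/*
 * Hypothetical stand-in for the pager: fork a child that sleeps for
 * `msec` milliseconds and then exits with `code`.
 * Returns the child's pid in the parent, or -1 if fork() fails.
 */
static pid_t spawn_child(long msec, int code)
{
    pid_t pid = fork();

    if (pid == 0) {
        struct timespec ts = { msec / 1000, (msec % 1000) * 1000000L };
        nanosleep(&ts, NULL);
        _exit(code);
    }
    return pid;
}
```

Calling spawn_child(50, 7) and then wait_for_child() on the returned pid
should leave the parent idle until the child exits, after which
WEXITSTATUS(status) reports the child's exit code.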

As I mentioned before, there was a patch posted by someone (who, as I
recall, preferred using gdb and gdb scripts with kdump vmcores), but going
back a year and a half into the archives, I can't find it.

Anyway, I'm going to have to be able to reproduce it and test any
changes thoroughly before potentially re-introducing the hangs I
used to see.
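
One possible middle ground, if a fully blocking waitpid() turns out to
bring the hangs back, would be to keep WNOHANG but sleep between polls
and give up after a bound.  The sketch below is an assumption for
discussion only, not something proposed in this thread or present in
crash: wait_child_polled() and its interval/timeout parameters are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll for the exit of child `pid` with WNOHANG,
 * sleeping `interval_ms` between polls so the CPU stays mostly idle.
 * The timeout is approximate (it counts intended sleep time only).
 * Returns `pid` once the child is reaped, 0 if roughly `timeout_ms`
 * elapsed first, or -1 with errno set on a real error.
 */
static pid_t wait_child_polled(pid_t pid, int *status,
                               long interval_ms, long timeout_ms)
{
    struct timespec ts;
    long waited = 0;

    if (interval_ms <= 0)
        interval_ms = 1;        /* never degrade into a busy-wait */
    ts.tv_sec = interval_ms / 1000;
    ts.tv_nsec = (interval_ms % 1000) * 1000000L;

    for (;;) {
        pid_t r = waitpid(pid, status, WNOHANG);

        if (r == pid)
            return r;           /* child reaped */
        if (r == -1 && errno != EINTR)
            return -1;          /* real error, e.g. ECHILD */
        if (waited >= timeout_ms)
            return 0;           /* gave up; child still running */

        nanosleep(&ts, NULL);
        waited += interval_ms;
    }
}
```

The trade-off is a small delay (up to one interval) before the exit is
noticed, in exchange for a wait that neither spins nor blocks forever.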

Thanks,
   Dave