[Crash-utility] Missing PID 1 is crash problem with losing tasks

Dave Anderson anderson at redhat.com
Thu Aug 26 13:31:58 UTC 2010


----- "Bob Montgomery" <bob.montgomery at hp.com> wrote:

> Well, I've been picking at this some more.  PID 1 is in the system, but
> crash misses it when it's building its table of tasks in
> refresh_hlist_task_table_v2().  In fact, on my particular dump, it loses
> track of at least 3 processes. 
> 
> The attached patch changes that behavior.  It has to do with collisions
> on the pid_hash table where an early item on the chain has a NULL task
> pointer which causes the code to ignore subsequent items on that
> collision chain.  I'm not sure what it means when the tasks[0].first
> pointer in the struct pid is NULL, but that's what triggers the problem
> and keeps crash from following the pid_chain pointer to the next struct
> pid.  I am not confident that this whole area is correct yet, just
> closer to correct than it was. 
> 
> These now appear in the ps output:
> 
> crash-5.0.6-fix2> ps 1 8144 998
>    PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>       1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
>    8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
>     998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]
> 
> where before:
> 
> crash-5.0.6-fix> ps 1 8144 998
> ps: invalid task or pid value: 1
> 
> ps: invalid task or pid value: 8144
> 
> ps: invalid task or pid value: 998
> 
> This might have been some transition behavior of the pid hash design in
> the kernel, because I've got two dumps based on 2.6.18 kernels that show
> missing processes (this one had 3 out of 532, the other had 1 out of
> 146), but my new patched crash doesn't reveal any missing processes in
> 2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
> from 362 to 926).  Only my recent 2.6.18 dump was lucky enough to be
> missing PID 1, with me being lucky enough to try crash's mount command,
> or we'd still not know about it :-)

Yeah, I agree that it must be catching a kernel transition.

And it's probably not being seen in your 2.6.29-and-newer dumps because
crash uses refresh_hlist_task_table_v3() for 2.6.24-and-later kernels,
so the v2 code path is never exercised on them.
 
> The patch is simple, but has lots of lines because I moved the indent.

The patch looks reasonable and safe.  I'll run it against my stable of
sample dumpfiles to see if I can find one...
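
For readers following along, the collision-chain behavior Bob describes
can be illustrated with a minimal sketch. This is not the code from
refresh_hlist_task_table_v2() or from the patch: struct fake_pid, its
fields, and both walk functions are hypothetical stand-ins, and the
sketch assumes only what the report states, namely that an entry with a
NULL task pointer early on a chain caused the later entries on that
chain to be ignored.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one entry on a pid_hash chain.
 * This is NOT the kernel's struct pid layout or crash's internal data. */
struct fake_pid {
    int nr;                       /* PID number */
    const void *task;             /* stand-in for tasks[0].first; may be NULL */
    struct fake_pid *chain_next;  /* stand-in for the pid_chain link */
};

/* Walk that abandons the whole chain at the first NULL task pointer.
 * Every entry behind that one is lost, which matches the symptom in
 * the report. Returns the number of PIDs stored in pids[]. */
size_t walk_stop_on_null(const struct fake_pid *head, int *pids, size_t max)
{
    size_t n = 0;

    for (const struct fake_pid *p = head; p != NULL; p = p->chain_next) {
        if (p->task == NULL)
            break;                /* later entries are never visited */
        if (n < max)
            pids[n++] = p->nr;
    }
    return n;
}

/* Walk that skips an entry with a NULL task pointer but still follows
 * the chain link to the next entry. */
size_t walk_skip_null(const struct fake_pid *head, int *pids, size_t max)
{
    size_t n = 0;

    for (const struct fake_pid *p = head; p != NULL; p = p->chain_next) {
        if (p->task == NULL)
            continue;             /* ignore this entry only */
        if (n < max)
            pids[n++] = p->nr;
    }
    return n;
}
```

With a three-entry chain whose middle entry has a NULL task pointer, the
first walk reports only the entry ahead of it, while the second also
reports the entry behind it.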

Anyway, nice catch Bob -- and thanks again for tracking down yet another
gnarly issue,
  Dave



