[Crash-utility] Missing PID 1 is crash problem with losing tasks
Dave Anderson
anderson at redhat.com
Thu Aug 26 13:31:58 UTC 2010
----- "Bob Montgomery" <bob.montgomery at hp.com> wrote:
> Well, I've been picking at this some more. PID 1 is in the system, but
> crash misses it when it's building its table of tasks in
> refresh_hlist_task_table_v2(). In fact, on my particular dump, it loses
> track of at least 3 processes.
>
> The attached patch changes that behavior. It has to do with collisions
> on the pid_hash table where an early item on the chain has a NULL task
> pointer which causes the code to ignore subsequent items on that
> collision chain. I'm not sure what it means when the tasks[0].first
> pointer in the struct pid is NULL, but that's what triggers the problem
> and keeps crash from following the pid_chain pointer to the next struct
> pid. I am not confident that this whole area is correct yet, just
> closer to correct than it was.
>
> These now appear in the ps output:
>
> crash-5.0.6-fix2> ps 1 8144 998
>    PID    PPID  CPU        TASK        ST  %MEM     VSZ    RSS  COMM
>      1       0    1  ffff81012bd3c780  IN   0.0    6124    688  init
>   8144    6257    0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
>    998      11    0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]
>
> where before:
>
> crash-5.0.6-fix> ps 1 8144 998
> ps: invalid task or pid value: 1
>
> ps: invalid task or pid value: 8144
>
> ps: invalid task or pid value: 998
>
> This might have been some transition behavior of the pid hash design in
> the kernel, because I've got two dumps based on 2.6.18 kernels that show
> missing processes (this one had 3 out of 532, the other had 1 out of
> 146), but my new patched crash doesn't reveal any missing processes in
> 2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
> from 362 to 926). Only my recent 2.6.18 dump was lucky enough to be
> missing PID 1, with me being lucky enough to try crash's mount command,
> or we'd still not know about it :-)

Yeah, I agree that it must be catching a kernel transition.
And it's probably not being seen in your 2.6.29-and-newer dumps because
2.6.24-and-later kernels use refresh_hlist_task_table_v3().

> The patch is simple, but has lots of lines because I moved the indent.

The patch looks reasonable and safe. I'll run it against my stable of
sample dumpfiles to see if I can find one...
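
For anyone following along, here is a rough sketch of the failure mode Bob
describes: an entry early on a pid_hash collision chain has a NULL task
pointer, and the walk gives up on the rest of the chain instead of moving on
to the next entry. This is not the code in refresh_hlist_task_table_v2() and
not the kernel's struct pid. The struct pid_entry type, its field names, the
function names, and PID 4242 are all made up for illustration; only the
"NULL task pointer early on the chain" scenario comes from the description
above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one pid hash entry. */
struct pid_entry {
    int nr;                        /* PID number */
    const void *task_first;        /* models tasks[0].first; NULL = no task */
    struct pid_entry *chain_next;  /* models pid_chain: next entry in bucket */
};

/* Walk that abandons the chain at the first entry with no task.
 * Everything linked after that entry is lost. */
size_t count_tasks_stop_on_null(const struct pid_entry *head)
{
    size_t found = 0;
    for (const struct pid_entry *p = head; p != NULL; p = p->chain_next) {
        if (p->task_first == NULL)
            break;                 /* stops following chain_next */
        found++;
    }
    return found;
}

/* Walk that skips an entry with no task but keeps following the chain. */
size_t count_tasks_skip_null(const struct pid_entry *head)
{
    size_t found = 0;
    for (const struct pid_entry *p = head; p != NULL; p = p->chain_next) {
        if (p->task_first == NULL)
            continue;              /* ignore this entry only */
        found++;
    }
    return found;
}

/* Builds a three-entry bucket whose first entry has no task attached,
 * then counts tasks with one of the two walks above. */
size_t demo_bucket_count(int skip_null)
{
    static const int task_a = 0, task_b = 0;  /* dummy objects to point at */
    struct pid_entry third  = { 1,    &task_b, NULL };
    struct pid_entry second = { 998,  &task_a, &third };
    struct pid_entry first  = { 4242, NULL,    &second };

    return skip_null ? count_tasks_skip_null(&first)
                     : count_tasks_stop_on_null(&first);
}
```

With the made-up bucket in demo_bucket_count(), the first walk finds 0 tasks
and the second finds 2, which is the same shape as PIDs 1 and 998 vanishing
behind an entry whose task pointer is NULL.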

Anyway, nice catch Bob -- and thanks again for tracking down yet another
gnarly issue,

Dave