[Crash-utility] Crash faults when determining panic task

Thu Sep 29 14:52:50 UTC 2011

Dave,

Adding --no_elf_notes to the crash invocation does indeed allow crash to
start without issue.  Do you think that I am dealing with a
corrupted/incomplete vmcore (as evidenced by that extremely large n_descsz
value), or is this a bug that crash could handle more gracefully?
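
To make "more gracefully" concrete, here is a rough sketch of the kind of
bounds check I have in mind.  note_fits() and note_align4() are
hypothetical names, not existing crash functions, and I am assuming the
caller knows the size of the buffer that holds the notes.  It is only an
illustration, not a proposed patch:

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Round a note field length up to the 4-byte alignment used in ELF notes. */
static uint64_t note_align4(uint64_t len)
{
	return (len + 3) & ~(uint64_t)3;
}

/*
 * Hypothetical helper (not a crash function): return 1 if the note that
 * starts at `offset` lies entirely inside a note buffer of `buf_size`
 * bytes, 0 otherwise.  The sizes are summed in 64 bits so that a huge
 * n_descsz cannot wrap around.
 */
static int note_fits(const Elf64_Nhdr *note, size_t offset, size_t buf_size)
{
	uint64_t need;

	/* The header itself must fit before its fields can be trusted. */
	if (offset > buf_size || buf_size - offset < sizeof(*note))
		return 0;

	need = sizeof(*note) + note_align4(note->n_namesz) +
	       note_align4(note->n_descsz);

	return need <= (uint64_t)(buf_size - offset);
}
```

With the header from this vmcore (n_namesz = 8, n_descsz = 3438804992)
and a note buffer of a few KB, note_fits() returns 0, so the caller could
skip the note instead of reading past the end of the buffer.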

As far as the kernel is concerned,
2.6.32-131.0.15.el6.exp10.bz16586.x86_64 is a stock RH 2.6.32-131.0.15
with an added patch for handling an MD RAID bug (RHBZ-707268).  Stratus
does load a driver to track dirty VM pages for harvesting purposes, but
it does not change general VM behavior.

FWIW, this is the only vmcore in which I've seen ELF note faults or
invalid section numbers.
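
For a sense of scale on that section number, here is a quick sketch of the
range check as I understand it.  The two constants are my assumptions for
x86_64 (I have not verified them against this kernel's source), and the
function names are made up for illustration:

```c
#include <stdint.h>

/*
 * Assumed x86_64 values; check the kernel's
 * arch/x86/include/asm/sparsemem.h before relying on them.
 */
#define SECTION_SIZE_BITS  27
#define MAX_PHYSMEM_BITS   46

/* Number of sparsemem sections the kernel can describe. */
static uint64_t nr_mem_sections(void)
{
	return (uint64_t)1 << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS);
}

/* Return 1 if `nr` is a section number the kernel could have used. */
static int section_nr_in_range(uint64_t nr)
{
	return nr < nr_mem_sections();
}
```

Under those assumptions there are 524288 possible sections, so
137438888923 is nowhere near a valid section number.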

Thanks,

-- Joe

-----Original Message-----
From: crash-utility-bounces at redhat.com
[mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
Sent: Wednesday, September 28, 2011 5:15 PM
To: Discussion list for crash utility usage,maintenance and development
Subject: Re: [Crash-utility] Crash faults when determining panic task

Hi Joe,

It's pretty clear it's due to this change in 5.1.5:

         - Implemented the capability of using the NT_PRSTATUS ELF note data
           that is saved in version 4 compressed kdump headers to determine the
           starting stack and instruction pointer hooks for x86 and x86_64
           backtraces when they cannot be determined in the traditional manners.
           (wang.chao at cn.fujitsu.com, wency at cn.fujitsu.com)

What happens if you run it like so:

  $ crash --no_elf_notes vmlinux vmcore

As far as this message:

  WARNING: sparsemem: invalid section number: 137438888923

That should be outside the realm of Fujitsu's ELF notes patch.  Does this
kernel have some kind of Stratus VM modification?

Dave

----- Original Message -----
> 
> Crash faults when determining panic task
> 
> I have a vmcore generated on RHEL6.1 that newer versions of crash
> have trouble analyzing (5.1.1-2.el6 seems to work OK).
> 
> 
> 
> I can provide additional binary files if needed, just let me know
> what convention best suits the list (ftp, private email attachment,
> etc.)
> 
> 
> 
> Crash Version:        OS:              Result:
> crash 5.1.8           Debian wheezy    faults
> crash 5.1.7-1.el6     RHEL6.2 Alpha    faults
> crash 5.1.1-2.el6     RHEL6.1          ok
> 
> 
> Kernel:
> 
> 2.6.32-131.0.15.el6.exp10.bz16586.x86_64 ( 2.6.32-131.0.15 + a fix
> for Red Hat bz - 707268)
> 
> 
> Interesting warnings when starting crash:
> 
> WARNING: sparsemem: invalid section number: 137438888923
> 
> WARNING: sparsemem: invalid section number: 137438888923
> 
> 
> First fault, null pointer dereference:
> 
> please wait... (determining panic task)
> 
> Program received signal SIGSEGV, Segmentation fault.
> x86_64_get_dumpfile_stack_frame (rsp=0x7fffffffcc58, rip=0x7fffffffcc50,
>     bt_in=0x7fffffffcce0) at x86_64.c:4183
> 4183 ur_rip = ULONG(user_regs +
> (gdb) p user_regs
> $1 = 0x0
> 
> 
> Workaround: check that bt->machdep is not NULL:
> 
> diff -Nupr crash-5.1.8/x86_64.c crash-5.1.8.new/x86_64.c
> --- crash-5.1.8/x86_64.c 2011-09-16 15:01:12.000000000 -0400
> +++ crash-5.1.8.new/x86_64.c 2011-09-28 14:12:45.347188571 -0400
> @@ -4178,7 +4178,7 @@ x86_64_get_dumpfile_stack_frame(struct b
> goto skip_stage;
> }
> }
> - } else if (ELF_NOTES_VALID()) {
> + } else if (ELF_NOTES_VALID() && bt->machdep) {
> user_regs = bt->machdep;
> ur_rip = ULONG(user_regs +
> OFFSET(user_regs_struct_rip));
> 
> 
> Second fault, a curiously large n_descsz in the ELF note header:
> 
> please wait... (determining panic task)
> 
> Program received signal SIGSEGV, Segmentation fault.
> get_regs_from_note (note=0xd26472 "\b", ip=0x7fffffffc4e0, sp=0x7fffffffc4e8)
>     at netdump.c:2221
> 2221 *sp = ULONG(user_regs + offset_sp);
> (gdb) p *(Elf64_Nhdr *)note
> $1 = {n_namesz = 8, n_descsz = 3438804992, n_type = 8}
> 
> 
> Workaround: do not attempt to read registers from ELF notes (this
> chunk of code was not present in crash 5.1.1):
> 
> diff -Nupr crash-5.1.8/netdump.c crash-5.1.8.new/netdump.c
> --- crash-5.1.8/netdump.c 2011-09-16 15:01:12.000000000 -0400
> +++ crash-5.1.8.new/netdump.c 2011-09-28 14:14:43.687183734 -0400
> @@ -2286,7 +2286,7 @@ get_netdump_regs_x86_64(struct bt_info *
> 
> bt->machdep = (void *)user_regs;
> }
> -
> +#if 0
> if (ELF_NOTES_VALID() &&
> (bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
> (note = (Elf64_Nhdr *)
> @@ -2305,7 +2305,7 @@ get_netdump_regs_x86_64(struct bt_info *
> 
> bt->machdep = (void *)user_regs;
> }
> -
> +#endif
> machdep->get_stack_frame(bt, ripp, rspp);
> }
> 
> 
> Given the warning messages at the beginning of the process, I'm not
> sure if I'm dealing with a corrupted or incomplete vmcore image. Let me
> know what additional info could be useful if this seems worth
> debugging further.
> 
> 
> 
> Thanks,
> 
> -- Joe Lawrence
> --
> Crash-utility mailing list
> Crash-utility at redhat.com
> https://www.redhat.com/mailman/listinfo/crash-utility
> 
