[Crash-utility] crash 4.0-2.8 fails on 2.6.14-rc5 (EM64T)
Badari Pulavarty
pbadari at us.ibm.com
Fri Oct 28 22:19:27 UTC 2005
On Fri, 2005-10-28 at 17:55 -0400, Dave Anderson wrote:
> Badari Pulavarty wrote:
> > On Thu, 2005-10-27 at 14:36 -0400, Dave Anderson wrote:
> > >
> > >
> > > #ifdef X86_64
> > > #define _64BIT_
> > > #define MACHINE_TYPE "X86_64"
> > >
> > > #define USERSPACE_TOP 0x0000008000000000
> > > #define __START_KERNEL_map 0xffffffff80000000
> > > #define PAGE_OFFSET 0x0000010000000000
> > >
> > > #define VMALLOC_START 0xffffff0000000000
> > > #define VMALLOC_END 0xffffff7fffffffff
> > > #define MODULES_VADDR 0xffffffffa0000000
> > > #define MODULES_END 0xffffffffafffffff
> > > #define MODULES_LEN (MODULES_END - MODULES_VADDR)
> > >
> > > So I believe the place to start would be to make these
> > > values into x86_64-specific variables that get initialized
> > > early on based upon the symbol values gathered during
> > > symtab_init(), which is called by main(). After it
> > > completes, machdep_init(PRE_GDB) is called, i.e. x86_64_init():
> > >
> > > /*
> > > * Initialize various subsystems.
> > > */
> > > fd_init();
> > > buf_init();
> > > cmdline_init();
> > > mem_init();
> > > machdep_init(PRE_SYMTAB);
> > > symtab_init();
> > > machdep_init(PRE_GDB);
> > > kernel_init(PRE_GDB);
> > > verify_version();
> > > datatype_init();
> > >
> > > In x86_64_init(PRE_GDB), the former hardwired #defines would need
> > > to be variables, initialized properly based upon clues in the
> > > symbol list.
> > >
> > > Interested in taking a look into this?
> > >
> > > Dave
> >
> > Well, I took a stab at it. Here are the changes I made to "defs.h"
> > after looking at Documentation/x86_64/mm.txt. We need to somehow put
> > this under "#if THIS_KERNEL_VERSION > 2.6.10".
> >
> >
> First off -- thanks very much for all you've done so far. I
> really appreciate the effort.
>
> Anyway, what I meant was that -- for x86_64 specifically -- things
> like USERSPACE_TOP, PAGE_OFFSET, VMALLOC_START, etc. should no longer
> be hardwired #defines, but instead, they should be references to
> x86_64 data variables defined in x86_64.c. So, for example,
> USERSPACE_TOP would be defined something like:
>
> #define USERSPACE_TOP (x86_64_userspace_top)
>
> and there would be one x86_64_xxx variable per virtual address
> item. And each of the variables would be initialized in
> machdep_init(PRE_GDB), which is called just after symtab_init().
> The fact that symtab_init() has already run is important, because
> the variables behind "THIS_KERNEL_VERSION" haven't even been
> initialized yet at that point. So instead, I would look at the
> symbol_value() of "_stext", or some other known kernel text symbol,
> and based upon its value, it would be obvious whether to use the
> "old" or "new" virtual address values to then set up each of the
> x86_64_xxxx virtual address values.
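A minimal sketch of the "variables instead of #defines" idea described above, assuming a single structure holds every layout boundary. The structure, the two tables, and x86_64_init_vm_layout() are hypothetical names, not crash's actual code; the numeric values are the ones already in defs.h and in the patch posted in this thread. How the caller decides which layout applies is deliberately left open (Dave suggests checking the symbol_value() of a known kernel symbol after symtab_init()).

```c
#include <stdint.h>

/* Hypothetical: one structure holds every x86_64 virtual-address
 * boundary, so the old hardwired #defines become variable references. */
struct x86_64_vm_layout {
    uint64_t userspace_top;
    uint64_t page_offset;
    uint64_t vmalloc_start;
    uint64_t vmalloc_end;
    uint64_t modules_vaddr;
    uint64_t modules_end;
};

/* Values currently hardwired in defs.h (the "old" layout). */
static const struct x86_64_vm_layout x86_64_layout_orig = {
    .userspace_top = 0x0000008000000000ULL,
    .page_offset   = 0x0000010000000000ULL,
    .vmalloc_start = 0xffffff0000000000ULL,
    .vmalloc_end   = 0xffffff7fffffffffULL,
    .modules_vaddr = 0xffffffffa0000000ULL,
    .modules_end   = 0xffffffffafffffffULL,
};

/* Values from the defs.h patch posted in this thread (the "new" layout). */
static const struct x86_64_vm_layout x86_64_layout_new = {
    .userspace_top = 0x0000800000000000ULL,
    .page_offset   = 0xffff810000000000ULL,
    .vmalloc_start = 0xffffc20000000000ULL,
    .vmalloc_end   = 0xffffe1ffffffffffULL,
    .modules_vaddr = 0xffffffff88000000ULL,
    .modules_end   = 0xfffffffffff00000ULL,
};

/* The live copy that the macros read. */
static struct x86_64_vm_layout x86_64_vm;

#define USERSPACE_TOP  (x86_64_vm.userspace_top)
#define PAGE_OFFSET    (x86_64_vm.page_offset)
#define VMALLOC_START  (x86_64_vm.vmalloc_start)
#define VMALLOC_END    (x86_64_vm.vmalloc_end)
#define MODULES_VADDR  (x86_64_vm.modules_vaddr)
#define MODULES_END    (x86_64_vm.modules_end)
#define MODULES_LEN    (MODULES_END - MODULES_VADDR)

/* Hypothetical hook for machdep_init(PRE_GDB): the caller decides, from
 * symbol-table clues gathered by symtab_init(), which layout applies. */
static void x86_64_init_vm_layout(int uses_new_layout)
{
    x86_64_vm = uses_new_layout ? x86_64_layout_new : x86_64_layout_orig;
}
```

Because every existing use of USERSPACE_TOP, PAGE_OFFSET, and so on still compiles unchanged, only the one initialization call has to know about kernel versions.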
>
> But for testing the new addresses, what you've done below should
> suffice.
> >
Is there no simple way to add "#if KERNEL_VERSION > 2.6.10"
in the header file and leave the hardcoded values there?
> >
> > --- defs.h.org 2005-10-28 13:43:11.000000000 -0700
> > +++ defs.h 2005-10-28 13:53:58.000000000 -0700
> > @@ -1740,14 +1740,14 @@ struct load_module {
> > #define _64BIT_
> > #define MACHINE_TYPE "X86_64"
> >
> > -#define USERSPACE_TOP 0x0000008000000000
> > +#define USERSPACE_TOP 0x0000800000000000
> > #define __START_KERNEL_map 0xffffffff80000000
> > -#define PAGE_OFFSET 0x0000010000000000
> > +#define PAGE_OFFSET 0xffff810000000000
> >
> > -#define VMALLOC_START 0xffffff0000000000
> > -#define VMALLOC_END 0xffffff7fffffffff
> > -#define MODULES_VADDR 0xffffffffa0000000
> > -#define MODULES_END 0xffffffffafffffff
> > +#define VMALLOC_START 0xffffc20000000000
> > +#define VMALLOC_END 0xffffe1ffffffffff
> > +#define MODULES_VADDR 0xffffffff88000000
> > +#define MODULES_END 0xfffffffffff00000
> > #define MODULES_LEN (MODULES_END - MODULES_VADDR)
> >
> > #define PTOV(X) ((unsigned long)(X)+(machdep->kvbase))
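To make the new boundaries concrete, here is a hedged sketch that classifies a kernel virtual address using only the values from the patch above, plus a translation for the two linearly mapped regions. The function and enum names are hypothetical; the assumption that everything from PAGE_OFFSET up to VMALLOC_START is direct-mapped memory is a simplification made for this sketch, and the kernel-text translation assumes the text mapping is simply offset by __START_KERNEL_map.

```c
#include <stdint.h>

/* Boundaries taken from the defs.h patch above (2.6.11+ layout). */
#define NEW_USERSPACE_TOP     0x0000800000000000ULL
#define NEW_PAGE_OFFSET       0xffff810000000000ULL
#define NEW_VMALLOC_START     0xffffc20000000000ULL
#define NEW_VMALLOC_END       0xffffe1ffffffffffULL
#define NEW_START_KERNEL_map  0xffffffff80000000ULL
#define NEW_MODULES_VADDR     0xffffffff88000000ULL
#define NEW_MODULES_END       0xfffffffffff00000ULL

enum vaddr_kind { VA_USER, VA_DIRECT, VA_VMALLOC, VA_KTEXT, VA_MODULE, VA_UNKNOWN };

/* Hypothetical classifier for the new layout. */
static enum vaddr_kind classify_vaddr(uint64_t va)
{
    if (va < NEW_USERSPACE_TOP)
        return VA_USER;
    /* Assumption: PAGE_OFFSET up to VMALLOC_START is the direct map. */
    if (va >= NEW_PAGE_OFFSET && va < NEW_VMALLOC_START)
        return VA_DIRECT;
    if (va >= NEW_VMALLOC_START && va <= NEW_VMALLOC_END)
        return VA_VMALLOC;
    /* The kernel text mapping ends where module space begins. */
    if (va >= NEW_START_KERNEL_map && va < NEW_MODULES_VADDR)
        return VA_KTEXT;
    if (va >= NEW_MODULES_VADDR && va <= NEW_MODULES_END)
        return VA_MODULE;
    return VA_UNKNOWN;
}

/* Translate linearly mapped addresses; returns 0 when a page-table
 * walk would be needed instead (vmalloc, modules, user space). */
static int vtop_linear(uint64_t va, uint64_t *pa)
{
    switch (classify_vaddr(va)) {
    case VA_DIRECT:
        *pa = va - NEW_PAGE_OFFSET;
        return 1;
    case VA_KTEXT:
        *pa = va - NEW_START_KERNEL_map;
        return 1;
    default:
        return 0;
    }
}
```

With these values, the task address ffff810122c9f0c0, the module address ffffffff88011f80, and the schedule() address ffffffff803b12b3 that appear later in this thread land in the direct map, module space, and kernel text respectively. By contrast, under the old USERSPACE_TOP of 0x0000008000000000, a user RSP such as 00007ffffff402d8 would not even count as a user-space address.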
> >
> > Even with these changes, I am not sure that crash is running
> > correctly. It doesn't seem to show any useful stacks, and there is a
> > warning at startup (about exception stacks).
> >
> >
> I'm wondering whether the per-cpu calculations are being
> done correctly. The exception stack addresses come from the
> same per-cpu tss_struct code that started this whole mess,
> and if the per-cpu address calculations needed to find those data
> structures were incorrect, they would lead to the exception stack
> error message that you're seeing. This is the old code; if
> the readmem() of the 7 ebase addresses below read from the
> wrong place, the error message you're seeing would result:
>
>         } else if (symbol_exists("per_cpu__init_tss")) {
>                 for (c = 0; c < NR_CPUS; c++) {
>                         if ((kt->flags & SMP) && (kt->flags & PER_CPU_OFF)) {
>                                 if (kt->__per_cpu_offset[c] == 0)
>                                         break;
>                                 vaddr = symbol_value("per_cpu__init_tss") +
>                                         kt->__per_cpu_offset[c];
>                         } else
>                                 vaddr = symbol_value("per_cpu__init_tss");
>
>                         vaddr += OFFSET(tss_struct_ist);
>
>                         readmem(vaddr, KVADDR, &ms->stkinfo.ebase[c][0],
>                                 sizeof(ulong) * 7, "tss_struct ist array",
>                                 FAULT_ON_ERROR);
>
>                         if (ms->stkinfo.ebase[c][0] == 0)
>                                 break;
>                 }
>         }
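The address arithmetic in the loop above can be isolated into a small pure function, which makes it easier to check by hand whether a wrong per-cpu offset would send readmem() to the wrong place. This is a hedged restatement, not crash code: tss_ist_vaddr() and count_per_cpu_copies() are hypothetical names, and no memory is actually read.

```c
#include <stdint.h>

/* Hypothetical restatement of the loop's address math: the per-cpu copy
 * of init_tss lives at the symbol's link-time address plus that cpu's
 * per-cpu offset, and ist[] sits OFFSET(tss_struct_ist) bytes into it. */
static uint64_t tss_ist_vaddr(uint64_t init_tss_sym, uint64_t ist_offset,
                              const uint64_t *per_cpu_offset, int cpu,
                              int smp_with_per_cpu_off)
{
    uint64_t vaddr = init_tss_sym;

    /* A wrong offset here means readmem() reads some other memory. */
    if (smp_with_per_cpu_off)
        vaddr += per_cpu_offset[cpu];
    return vaddr + ist_offset;
}

/* How many cpus the loop visits before it stops at a zero offset. */
static int count_per_cpu_copies(const uint64_t *per_cpu_offset, int nr_cpus)
{
    int c;

    for (c = 0; c < nr_cpus; c++)
        if (per_cpu_offset[c] == 0)
            break;
    return c;
}
```

Note that the loop treats a zero offset as "no more cpus", so if the offsets themselves were read from the wrong place, both the count and every computed address would be off.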
>
> The check behind that message only looks at the contents of cpu 0's
> array of exception stack addresses, the first of which should be a
> pointer to the "boot_exception_stacks" array in the kernel.
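A hedged sketch of that kind of sanity check: accept cpu 0's first exception stack address only if it lies at or a little above boot_exception_stacks. The function name is hypothetical, and the window size is an assumed parameter, since the exact extent of the boot_exception_stacks array is not given here.

```c
#include <stdint.h>

/* Hypothetical check: cpu 0's first ist[] entry should point into
 * boot_exception_stacks; "window" is an assumed upper bound on how far
 * into the array that pointer may land. */
static int first_exception_stack_plausible(uint64_t cpu0_ist0,
                                           uint64_t boot_exception_stacks,
                                           uint64_t window)
{
    return cpu0_ist0 >= boot_exception_stacks &&
           cpu0_ist0 - boot_exception_stacks <= window;
}
```

Whatever window is chosen, the value reported in the warning (cccccccccccccccc against boot_exception_stacks at ffffffff8052ce80) fails the check, which is consistent with the ist[] array having been read from the wrong address.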
I will take a closer look.
>
> >
> >
> > [root at localhost crash-4.0-2.8]# ./crash
> >
> > crash 4.0-2.8
> > Copyright (C) 2002, 2003, 2004, 2005 Red Hat, Inc.
> > Copyright (C) 2004, 2005 IBM Corporation
> > Copyright (C) 1999-2005 Hewlett-Packard Co
> > Copyright (C) 1999, 2002 Silicon Graphics, Inc.
> > Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> > This program is free software, covered by the GNU General Public License,
> > and you are welcome to change it and/or distribute copies of it under
> > certain conditions. Enter "help copying" to see the conditions.
> > This program has absolutely no warranty. Enter "help warranty" for details.
> >
> > GNU gdb 6.1
> > Copyright 2004 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and you are
> > welcome to change it and/or distribute copies of it under certain conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB. Type "show warranty" for details.
> > This GDB was configured as "x86_64-unknown-linux-gnu"...
> >
> > WARNING: cpu 0 first exception stack: cccccccccccccccc
> > boot_exception_stacks: ffffffff8052ce80
> >
> > KERNEL: /usr/src/linux-2.6.14-rc5-madv/vmlinux
> > DUMPFILE: /dev/mem
> > CPUS: 2
> > DATE: Fri Oct 28 13:58:50 2005
> > UPTIME: 06:32:12
> > LOAD AVERAGE: 0.11, 0.10, 0.06
> > TASKS: 66
> > NODENAME: localhost.localdomain
> > RELEASE: 2.6.14-rc5
> > VERSION: #10 SMP Wed Oct 26 15:58:51 PDT 2005
> > MACHINE: x86_64 (3000 Mhz)
> > MEMORY: 4.6 GB
> > PID: 1460
> > COMMAND: "crash"
> > TASK: ffff810122c9f0c0 [THREAD_INFO: ffff810113442000]
> > CPU: 0
> > STATE: TASK_RUNNING (ACTIVE)
> >
> > crash>
> > crash> bt 13939
> > PID: 13939 TASK: ffff810119123740 CPU: 0 COMMAND: "vi"
> > #0 [ffff810114535c78] schedule at ffffffff803b12b3
> > RIP: 000000377c7beb95 RSP: 00007ffffff402d8 RFLAGS: 00010246
> > RAX: 0000000000000017 RBX: ffffffff8010dc26 RCX: 00007ffffff40388
> > RDX: 0000000000000000 RSI: 00007ffffff400a0 RDI: 0000000000000001
> > RBP: 0000000000000000 R8: 0000000000000000 R9: 00007ffffff40020
> > R10: 00007ffffff40020 R11: 0000000000000246 R12: 000000000058b0e0
> > R13: 000000000058b0e0 R14: 0000000000000058 R15: 0000000000000001
> > ORIG_RAX: 0000000000000017 CS: 0033 SS: 002b
> >
> > It shows only "schedule" for all processes and doesn't seem to show
> > any deeper stack traces.
> >
> >
> I don't really have any suggestions here, other than to determine
> why the x86_64_low_budget_back_trace_cmd() section that walks the
> process stack is only finding/printing the schedule() line.
> Does "bt -t" work?
>
> I note that this one doesn't show the "cannot access vmalloc space"
> message. Can you read vmalloc and user space addresses? Does "mod"
> work? How about "runq", which is one of the places that depends
> upon being able to read per-cpu data?
bt -t seems to work better.
crash> bt 3144
PID: 3144 TASK: ffff81011dd1e100 CPU: 0 COMMAND: "mingetty"
#0 [ffff81011d6b9c68] schedule at ffffffff803b12b3
RIP: 000000377c7b85b2 RSP: 00007fffff87a110 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff8010dc26 RCX: 00007fffff87a7b0
RDX: 0000000000000001 RSI: 00007fffff87a8c7 RDI: 0000000000000000
RBP: 00007fffff87aca0 R8: 00002aaaaaac9b00 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffff87a900
R13: 0000000000502d20 R14: 0000000000000000 R15: 000000007c92d8c0
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
crash> bt -t 3144
PID: 3144 TASK: ffff81011dd1e100 CPU: 0 COMMAND: "mingetty"
START: thread_return (schedule) at ffffffff803b12b3
[ffff81011d6b9d10] do_con_write at ffffffff802689da
[ffff81011d6b9d80] schedule_timeout at ffffffff803b1e4e
[ffff81011d6b9db0] _spin_lock_irqsave at ffffffff803b28ce
[ffff81011d6b9dc0] add_wait_queue at ffffffff8014cf5c
[ffff81011d6b9de0] read_chan at ffffffff8025d1f7
[ffff81011d6b9e48] default_wake_function at ffffffff80130c90
[ffff81011d6b9e78] default_wake_function at ffffffff80130c90
[ffff81011d6b9e90] tty_ldisc_deref at ffffffff802571c4
[ffff81011d6b9ed0] tty_read at ffffffff802575ee
[ffff81011d6b9f10] vfs_read at ffffffff80183a46
[ffff81011d6b9f40] sys_read at ffffffff80183e03
[ffff81011d6b9f80] system_call at ffffffff8010dc26
RIP: 000000377c7b85b2 RSP: 00007fffff87a110 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff8010dc26 RCX: 00007fffff87a7b0
RDX: 0000000000000001 RSI: 00007fffff87a8c7 RDI: 0000000000000000
RBP: 00007fffff87aca0 R8: 00002aaaaaac9b00 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffff87a900
R13: 0000000000502d20 R14: 0000000000000000 R15: 000000007c92d8c0
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
crash>
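One plausible reason "bt -t" gets further than plain "bt" is that a text-symbol scan does not need to unwind frames at all. A hedged sketch of such a scan follows: walk every 8-byte stack word and keep those inside a kernel text range. The function name and the text bounds passed to it are hypothetical; this is not crash's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw stack scan: report every word that falls inside
 * [text_start, text_end). No frame unwinding is attempted, so stale
 * return addresses left on the stack are reported as well. */
static size_t scan_stack_for_text(const uint64_t *stack_words, size_t nwords,
                                  uint64_t text_start, uint64_t text_end,
                                  uint64_t *hits, size_t max_hits)
{
    size_t i, n = 0;

    for (i = 0; i < nwords && n < max_hits; i++) {
        uint64_t w = stack_words[i];

        if (w >= text_start && w < text_end)
            hits[n++] = w;
    }
    return n;
}
```

The duplicated default_wake_function entries in the output above are the kind of stale value such a scan would pick up, which is why its output is a hint for debugging the unwinder rather than a true backtrace.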
vmalloc space seems ok:
crash> mod
MODULE NAME SIZE OBJECT FILE
ffffffff88011f80 floppy 77896 (not loaded) [CONFIG_KALLSYMS]
ffffffff8801db80 i2c_core 29056 (not loaded) [CONFIG_KALLSYMS]
ffffffff88022800 i2c_i801 11796 (not loaded) [CONFIG_KALLSYMS]
ffffffff88025900 hw_random 7968 (not loaded) [CONFIG_KALLSYMS]
ffffffff88030500 ehci_hcd 39688 (not loaded) [CONFIG_KALLSYMS]
ffffffff8803ae80 uhci_hcd 38048 (not loaded) [CONFIG_KALLSYMS]
ffffffff88086380 ipv6 309760 (not loaded) [CONFIG_KALLSYMS]
ffffffff8809a380 dm_mod 70232 (not loaded) [CONFIG_KALLSYMS]
ffffffff880a3100 dm_mirror 26504 (not loaded) [CONFIG_KALLSYMS]
ffffffff880cc300 sunrpc 177096 (not loaded) [CONFIG_KALLSYMS]
ffffffff880d8100 autofs4 26376 (not loaded) [CONFIG_KALLSYMS]
ffffffff880e5100 parport 46988 (not loaded) [CONFIG_KALLSYMS]
ffffffff880ea800 lp 17616 (not loaded) [CONFIG_KALLSYMS]
ffffffff880f6c80 parport_pc 33768 (not loaded) [CONFIG_KALLSYMS]
crash> runq
RUNQUEUES[0]: ffff8100050ee6e0
ACTIVE PRIO_ARRAY: ffff8100050ee760
[115] PID: 30383 TASK: ffff810110c720c0 CPU: 0 COMMAND: "crash"
PID: 30227 TASK: ffff810115f1f8c0 CPU: 0 COMMAND: "sshd"
EXPIRED PRIO_ARRAY: ffff8100050ef040
RUNQUEUES[1]: ffff8100050f66e0
ACTIVE PRIO_ARRAY: ffff8100050f6760
[117] PID: 3505 TASK: ffff81011cec4780 CPU: 1 COMMAND: "crash"
EXPIRED PRIO_ARRAY: ffff8100050f7040
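As a quick consistency check on the per-cpu reads, the runq addresses above can be compared across cpus. Assuming both RUNQUEUES entries are per-cpu copies of the same runqueue variable, the prio arrays should sit at identical offsets within each copy, and the gap between the two copies reflects the difference between the cpus' per-cpu offsets. The helper names are hypothetical; the addresses are copied verbatim from the output.

```c
#include <stdint.h>

/* Addresses copied from the runq output above. */
struct runq_addrs {
    uint64_t runqueue;
    uint64_t active;
    uint64_t expired;
};

static const struct runq_addrs runq_cpu[2] = {
    { 0xffff8100050ee6e0ULL, 0xffff8100050ee760ULL, 0xffff8100050ef040ULL },
    { 0xffff8100050f66e0ULL, 0xffff8100050f6760ULL, 0xffff8100050f7040ULL },
};

/* Offset of a field from the start of its runqueue. */
static uint64_t field_offset(uint64_t field, uint64_t base)
{
    return field - base;
}

/* Nonzero when both prio arrays sit at the same offset on every cpu. */
static int runq_layout_consistent(const struct runq_addrs *rq, int ncpus)
{
    for (int c = 1; c < ncpus; c++) {
        if (field_offset(rq[c].active, rq[c].runqueue) !=
            field_offset(rq[0].active, rq[0].runqueue))
            return 0;
        if (field_offset(rq[c].expired, rq[c].runqueue) !=
            field_offset(rq[0].expired, rq[0].runqueue))
            return 0;
    }
    return 1;
}
```

For this output the active array is 0x80 bytes and the expired array 0x960 bytes into each runqueue on both cpus, and the two runqueues are 0x8000 bytes apart, so the per-cpu runqueue reads at least look self-consistent.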
Have a nice weekend; we can take a look at it on Monday.
Thanks,
Badari