[Crash-utility] Question on online/present/possible CPUS

Dave Anderson anderson at redhat.com
Thu Sep 23 20:55:11 UTC 2010


----- "Jeffrey Hagen" <Jeffrey.Hagen at teradata.com> wrote:

> Hi Dave,
> 
> 	Attached is our suggested patch for the issue with CPU count in
> an NMI switch induced coredump.  Basically the change uses the
> cpu_present_mask instead of the cpu_online_mask in x86_64_per_cpu_init
> and x86_64_get_smp_cpus.

I understand why you need to do it that way, but to make a change like
this makes me a little nervous because nobody's ever reported this
situation before, and I'm somewhat paranoid it may lead to unexpected
behavior.  Plus there are old kernels that don't even have a cpu_present_map.
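
To make the difference concrete, here is a minimal sketch (not crash code) of how the CPU count changes depending on which mask is consulted. The masks are modeled as plain Python integers where bit N set means CPU N is in the mask. The helper names cpus_in_mask, count_cpus and highest_cpu_plus_one are hypothetical, and passing present_mask=None is only an assumed stand-in for an old kernel that has no cpu_present_map.

```python
def cpus_in_mask(mask):
    """Return the CPU numbers whose bits are set in the mask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def count_cpus(online_mask, present_mask=None):
    """Count CPUs from the present mask if one exists, else from the online mask.

    present_mask=None models (as an assumption) a kernel without cpu_present_map.
    """
    mask = online_mask if present_mask is None else present_mask
    return len(cpus_in_mask(mask))


def highest_cpu_plus_one(mask):
    """The 'highest-cpu-number-plus-one' style of count."""
    return mask.bit_length()


# Illustrative 4-CPU system where only CPU 0 is still marked online.
online = 0b0001
present = 0b1111

print(count_cpus(online))           # counts from the online mask
print(count_cpus(online, present))  # counts from the present mask
print(cpus_in_mask(0b0101))         # CPUs 1 and 3 offline
```

With only CPU 0 online, counting from the online mask reports one CPU, while counting from the present mask still reports all four. On the kernel without a present mask, the sketch simply falls back to the online mask.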

> 	In answer to your question below: "Are you saying that the NMI
> switch shutdown handler takes the other cpus offline?" --- Yes!!

Where exactly?  Can you point me to the kernel code that does that?

Dave


> 
> Thanks,
> 
> Jeff
> 
> 
> -----Original Message-----
> From: crash-utility-bounces at redhat.com
> [mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
> Sent: Thursday, August 12, 2010 6:22 AM
> To: Discussion list for crash utility usage, maintenance and development
> Subject: Re: [Crash-utility] Question on online/present/possible CPUS
> 
> 
> ----- "Jeffrey Hagen" <Jeffrey.Hagen at teradata.com> wrote:
> 
> > Hi Petr and Dave,
> > 
> > I have a couple of comments on Petr's email regarding CPU count.
> > 
> > When the dump is the result of an NMI (nmi switch pressed) due to a hung
> > system, one often needs to analyze the state and backtrace for all the
> > CPU's.  Since the kernel halts all but CPU0, the crash utility cannot
> > see the other "offline" CPU's.
> 
> I've never seen that behavior before.  Probably because I've never seen
> an x86_64 dumpfile that was created as a result of the NMI switch being
> pressed?  Anyway, are you saying that the NMI switch shutdown handler
> takes the other cpus offline?
>  
> > This behavior has changed for the x86 architecture somewhere between
> > 2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the x8664_pda
> > structure.
> > The function x86_64_init (in x86_64.c) now calls x86_64_per_cpu_init
> > which doesn't count the offline CPUS when calculating the number of
> > CPU's.  Previously, x86_64_cpu_pda_init (called if x8664_pda exists),
> > didn't check for online/offline status.
> 
> Again -- I've never seen this behaviour before.
> 
> In any case, I'll look at any patch suggestions you guys have in mind.
> 
> Thanks,
>   Dave
> 
>  
> > Regarding #3 in Petr's email.  It appears that the set command won't
> > accept a value >= kt_cpus (number of CPUS).  It doesn't check if the CPU
> > is offline or not.
> > 
> > Thanks,
> > 
> > Jeff Hagen
> > 
> > 
> > 
> > >
> > > Hi all,
> > >
> > > before making a larger cleanup, I want to ask here for your opinion. It
> > > seems that there is quite a bit of confusion about the meaning of CPU
> > > count printed out by the crash utility.
> > >
> > > 1. Number of CPUs
> > >
> > > Some people think that crash should always output the number of CPUs in
> > > the system (ie. a quad-core server should always output 'CPUS: 4'),
> > > while other people think that only online CPUs should be counted.
> > >
> > > 2. CPU numbering
> > >
> > > For example, if there are 4 CPUs in the system, but some of them are
> > > taken offline (e.g. CPU 1 and CPU 3), _and_ crash output the number of
> > > online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
> > > that valid CPU numbers are 0 and 2 in this case.
> > 
> > Hi Petr,
> > 
> > For all but ppc64, the number shown by the initial banner and the
> > "sys" command is essentially "the-highest-cpu-number-plus-one".
> > For ppc64 (as requested and implemented by the IBM/ppc64 maintainers),
> > it shows the number of online cpus.  There's reasons for doing it
> > either of the two ways, but I'm on vacation now, and you can research
> > the list archives for the various arguments for-and-against doing it
> > either way.  Check the changelog.html for when it was changed for
> > ppc64, and then cross-reference the revision date with the list
> > archives.
> > 
> > > 3. Examining offline CPU
> > >
> > > Sometimes, it may be useful to examine the state of an offline CPU. Now,
> > > I know that the saved state is most likely stale, but it can be useful
> > > in some cases (e.g. a crash after dropping to kdb). The crash utility
> > > currently refuses to select an offline CPU with 'set -c #'. Are there
> > > any concerns about allowing it?
> > 
> > I tend to agree with you, but the only thing that's useful and
> > available from an offline cpu is the swapper task for that cpu
> > and the runqueue for that cpu.  And both of those entities are
> > readily accessible if you really need them.  Although I don't know
> > anything about kdb status, so maybe there's something of per-cpu
> > interest, but I don't know why it would be necessary to "set"
> > that cpu?
> > 
> > In any case, like I said before, I'm just temporarily online while
> > on vacation, and will be back to work on the 9th.
> > 
> > Thanks,
> >   Dave
> > 
> > --
> > Crash-utility mailing list
> > Crash-utility at redhat.com
> > https://www.redhat.com/mailman/listinfo/crash-utility
> 



