[Crash-utility] infinite loop in crash due to double-NMI on x86_64 system

Mon Jun 28 21:26:45 UTC 2010

The dumpsw_notify function is part of a driver that was added to our
systems to trigger a kernel panic when an NMI occurs. In the kernel
version we are using (SLES 10 SP1) this was necessary to get an actual
panic and a saved dump when an NMI occurred (in particular when a dump
switch is pressed, hence the name).

That driver registers a callback (dumpsw_notify) on the die_chain and
calls panic() if the die code is DIE_NMI.
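
To illustrate the mechanism, here is a minimal userspace sketch of that
logic. It is not the driver source: only dumpsw_notify, die_chain,
DIE_NMI and panic() come from the description above. The model_*
helpers, the constant values and the panic_requested flag are made up
so the example can be compiled and run outside the kernel.

```c
/* Userspace model only: NOT the real driver or kernel source. */
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel definitions; the values are illustrative. */
enum die_val { DIE_OOPS = 1, DIE_NMI = 2 };
#define NOTIFY_DONE 0
#define NOTIFY_STOP 1

struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb,
                         unsigned long val, void *data);
    struct notifier_block *next;
};

static struct notifier_block *die_chain; /* head of the model chain */
static int panic_requested;              /* set instead of panicking */

/* Hypothetical replacement for panic(): records the request and
 * returns, whereas the real panic() never returns. */
static void model_panic(const char *msg)
{
    (void)msg;
    panic_requested = 1;
}

/* Hypothetical helper: push a notifier onto the head of the chain. */
static void model_register_die_notifier(struct notifier_block *nb)
{
    assert(nb != NULL && nb->notifier_call != NULL);
    nb->next = die_chain;
    die_chain = nb;
}

/* Hypothetical helper: walk the chain until a callback asks to stop. */
static int model_notify_die(unsigned long val, void *data)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb;

    for (nb = die_chain; nb != NULL; nb = nb->next) {
        ret = nb->notifier_call(nb, val, data);
        if (ret == NOTIFY_STOP)
            break;
    }
    return ret;
}

/* The callback described above: panic only when the die code is NMI. */
static int dumpsw_notify(struct notifier_block *nb,
                         unsigned long val, void *data)
{
    (void)nb;
    (void)data;
    if (val == DIE_NMI)
        model_panic("NMI received");
    return NOTIFY_DONE;
}

static struct notifier_block dumpsw_nb = {
    .notifier_call = dumpsw_notify,
};
```

Registering dumpsw_nb with model_register_die_notifier() and then
calling model_notify_die(DIE_NMI, NULL) sets panic_requested; any other
die code leaves it untouched. In the real driver panic() does not
return, so the callback's return value only matters for non-NMI events.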

-Lucas

> -----Original Message-----
> From: crash-utility-bounces at redhat.com 
> [mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
> Sent: Monday, June 28, 2010 2:15 PM
> To: Discussion list for crash utility usage,maintenance and development
> Subject: Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system
> 
> 
> ----- "Lucas Silacci" <Lucas.Silacci at teradata.com> wrote:
> 
>  
> > Sorry, guess I wasn't clear. Nobody hit the dump switch on these
> > systems. They simply had multiple hardware errors that apparently
> > triggered the NMI more than once. That's what I was trying to show
> > with the SEL records, that the multiple NMIs were straight from
> > hardware with no human intervention.
> > 
> > The systems went through a panic (due to multiple NMIs), 
> 
> That's what I'm trying to figure out -- when and how was it decided
> that the machine should panic instead of continuing to handle the
> stream of NMIs?
> 
> In other words, this "dumpsw_notify" function -- why was it called?
> 
> > > PID: 0      TASK: ffffffff8038c340  CPU: 0   COMMAND: "swapper"
> > >  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> > >  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> > >  #2 [ffffffff8046dde0] panic at ffffffff801327fa
> > >  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> > >  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> > >  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> > >  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> > >  #7 [ffffffff8046df50] nmi at ffffffff8032268f
> > >     [exception RIP: smp_send_stop+84]
> > >     RIP: ffffffff80116e44  RSP: ffffffff8046ddd8  RFLAGS: 00000246
> > >     RAX: 00000000000000ff  RBX: ffffffff8831c1f8  RCX: 000041049c7256e8
> > >     RDX: 0000000000000005  RSI: 000000005238a938  RDI: 00000000002896a0
> > >     RBP: ffffffff8046df08   R8: 00000000000040fb   R9: 000000005238a7e8
> > >     R10: 0000000000000002  R11: 0000ffff0000ffff  R12: 000000000000000c
> > >     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > > --- <NMI exception stack> ---
> > >  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> 
> From what you're implying, there is no physical "dump switch".
> So I'm trying to figure out where that "dumpsw_notify()" function
> comes from.  Whose module is that and what is its purpose?
> 
> Dave
>  
> 
> > a reboot, and
> > then crash was run on the resulting dump. In fact crash was
> > automatically run via a startup script and there was no human
> > intervention until after it was noticed that crash was filling up
> > the root file system with a temporary file due to the infinite loop.