Re: [libvirt] [Qemu-devel] [PATCH] vl: allow "cont" from panicked state

one question below

On 08/21/13 14:01, Paolo Bonzini wrote:
> After reporting the GUEST_PANICKED monitor event, QEMU stops the VM.
> The reason for this is that events are edge-triggered, and can be lost if
> management dies at the wrong time.  Stopping a panicked VM lets management
> know of a panic even if it has crashed; management can learn about the
> panic when it restarts and queries running QEMU processes.  The downside
> is of course that the VM will be paused while management is not running,
> but that is acceptable if it only happens with explicit "-device pvpanic".
> Upon learning of a panic, management (if configured to do so) can pick a
> variety of behaviors: leave the VM paused, reset it, destroy it.  In
> addition to all of these behaviors, it is possible to dump the VM core
> from the host.
> However, right now, the panicked state is irreversible, and can only be
> exited by resetting the machine.  This means that any policy decision
> is entirely in the hands of the host.  In particular there is no way to
> use the "reboot on panic" option together with pvpanic.
> This patch makes the panicked state reversible (and removes various
> workarounds that were there because of the state being irreversible).
> With this change, management has a wider set of possible policies: it
> can just log the crash and leave policy to the guest, it can leave the
> VM paused.  In particular, the "log the crash and continue" is implemented
> simply by sending a "cont" as soon as management learns about the panic.
> Management could also implement the "irreversible paused state" itself.
> And again, all such actions can be coupled with dumping the VM core.
> Unfortunately we cannot change the behavior of 1.6.0.  Thus, even if
> it uses "-device pvpanic", management should check for "cont" failures.
> If "cont" fails, management can then log that the VM remained paused
> and urge the administrator to update QEMU.
> I suggest that this patch be included in a 1.6.1 release as soon as
> possible, and perhaps in the 1.5 branch too.
> Cc: qemu-stable nongnu org
> Signed-off-by: Paolo Bonzini <pbonzini redhat com>
> ---
>  gdbstub.c | 3 ---
>  vl.c      | 6 ++----
>  2 files changed, 2 insertions(+), 7 deletions(-)
> diff --git a/gdbstub.c b/gdbstub.c
> index 35ca7c2..747e67d 100644
> --- a/gdbstub.c
> +++ b/gdbstub.c
> @@ -372,9 +372,6 @@ static inline void gdb_continue(GDBState *s)
>      s->running_state = 1;
>  #else
> -    if (runstate_check(RUN_STATE_GUEST_PANICKED)) {
> -        runstate_set(RUN_STATE_DEBUG);
> -    }

This undoes bc7d0e66. Makes sense -- what we're allowing now is more
permissive (it covers the case above).

>      if (!runstate_needs_reset()) {
>          vm_start();
>      }
> diff --git a/vl.c b/vl.c
> index 25b8f2f..818d99e 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -637,9 +637,8 @@ static const RunStateTransition runstate_transitions_def[] = {

I don't understand why this used to be here.

So, why? (*)


This is the "cont" we care about -- it should allow the guest to kexec
or reboot (ie. the panic notifier will return).


Undoes bc7d0e66. OK.

>  };
> @@ -685,8 +684,7 @@ int runstate_is_running(void)
>  bool runstate_needs_reset(void)
>  {
>      return runstate_check(RUN_STATE_INTERNAL_ERROR) ||
> -        runstate_check(RUN_STATE_SHUTDOWN) ||
> -        runstate_check(RUN_STATE_GUEST_PANICKED);
> +        runstate_check(RUN_STATE_SHUTDOWN);
>  }
>  StatusInfo *qmp_query_status(Error **errp)

This last hunk in effect reverts the runstate_needs_reset() changes of
the initial pvpanic commit ede085b3.

(*) Hm, I think I understand why. main_loop_should_exit(), when a reset
was requested *and* runstate_needs_reset() evaluated to true, used to
set the runstate to PAUSED -- I guess temporarily.

Since PANICKED was included in runstate_needs_reset(), this generic code
could request a transition from PANICKED to PAUSED (**). As PANICKED is
being removed from runstate_needs_reset(), the PANICKED->PAUSED
transition is not required any longer.

(**) I don't know why the generic code moves to PAUSED temporarily (from
INTERNAL_ERROR and SHUTDOWN), but I'll just accept that as status quo.
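
To make the reasoning in (*) concrete, here is a minimal, self-contained
model of the interaction as I understand it. It is not QEMU code: every
model_* name is hypothetical, the transition table lists only the handful
of entries needed for the argument, and the PANICKED->RUNNING entry is my
assumption about what the vl.c hunk adds, based on the patch subject.

```c
/* Simplified model of the runstate logic discussed above.
 * NOT QEMU source: all model_* identifiers are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    MODEL_RUNNING,
    MODEL_PAUSED,
    MODEL_SHUTDOWN,
    MODEL_INTERNAL_ERROR,
    MODEL_GUEST_PANICKED,
    MODEL_STATE_MAX
} ModelRunState;

typedef struct {
    ModelRunState from;
    ModelRunState to;
} ModelTransition;

/* Only the transitions needed for the illustration. */
static const ModelTransition model_transitions[] = {
    { MODEL_RUNNING,        MODEL_PAUSED },
    { MODEL_RUNNING,        MODEL_SHUTDOWN },
    { MODEL_RUNNING,        MODEL_INTERNAL_ERROR },
    { MODEL_RUNNING,        MODEL_GUEST_PANICKED },
    { MODEL_PAUSED,         MODEL_RUNNING },
    { MODEL_SHUTDOWN,       MODEL_PAUSED },
    { MODEL_INTERNAL_ERROR, MODEL_PAUSED },
    /* Assumed post-patch entry: "cont" from the panicked state.
     * Note there is no PANICKED -> PAUSED entry any more. */
    { MODEL_GUEST_PANICKED, MODEL_RUNNING },
};

static ModelRunState model_state = MODEL_RUNNING;

static bool model_transition_valid(ModelRunState from, ModelRunState to)
{
    size_t n = sizeof(model_transitions) / sizeof(model_transitions[0]);
    for (size_t i = 0; i < n; i++) {
        if (model_transitions[i].from == from &&
            model_transitions[i].to == to) {
            return true;
        }
    }
    return false;
}

/* Returns false instead of aborting on an invalid transition. */
static bool model_runstate_set(ModelRunState to)
{
    assert(to < MODEL_STATE_MAX);
    if (!model_transition_valid(model_state, to)) {
        return false;
    }
    model_state = to;
    return true;
}

/* Post-patch shape: PANICKED is no longer listed here. */
static bool model_needs_reset(void)
{
    return model_state == MODEL_INTERNAL_ERROR ||
           model_state == MODEL_SHUTDOWN;
}

/* Generic reset path: moves to PAUSED only from states that need a
 * reset, so PANICKED -> PAUSED is never requested. */
static bool model_handle_reset_request(void)
{
    if (model_needs_reset()) {
        return model_runstate_set(MODEL_PAUSED);
    }
    return true;
}

/* "cont": refused while a reset is still required. */
static bool model_cont(void)
{
    if (model_needs_reset()) {
        return false;
    }
    if (model_state == MODEL_RUNNING) {
        return true;
    }
    return model_runstate_set(MODEL_RUNNING);
}
```

In this model a reset request from PANICKED never asks for PAUSED, so the
table does not need that entry, while "cont" from PANICKED succeeds and
"cont" from SHUTDOWN is still refused until the reset path has run.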

Reviewed-by: Laszlo Ersek <lersek redhat com>

(Note that my R-b is mostly worthless: similarly to the ACPI table move,
I've been happily acking patches with opposite goals here, and that
seriously calls into question whether my review adds any value (beyond
the lowest technical level).)


