[libvirt] [PATCH] libvirt: add daemon itself as shutdown reason

Thu Oct 18 15:30:10 UTC 2018

On 10/17/18 4:16 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 16.10.2018 15:48, John Ferlan wrote:
>>
>>
>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>> Let's introduce shutdown reason "daemon" which means we have to
>>> kill running domain ourselves as the best action we can take at
>>> that moment. Failure to pick up domain on daemon restart is
>>> one example of such case. Using reason "crashed" is a bit misleading
>>> as it is used when qemu is actually crashed or qemu destroyed on
>>> guest panic as result of user choice that is the problem was
>>> in qemu/guest itself. So I propose to use "crashed" only for
>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>> for good.
>>
>> How about (although reading below may shed some more context):
>>
>> This patch introduces a new shutdown reason "daemon" in order
>> to indicate that the daemon needed to force shutdown the domain
>> as the best course of action to take at the moment.
>>
>> This action would occur during Reconnect processing when it's
>> determined that either the QEMU monitor no longer exists for
>> some unknown reason or creation of a thread to reconnect the
>> domain failed.
> 
> Mostly I'm fine with this one except "monitor no longer exists"
> is condition for reason "crashed" or "unknown" (Martin's contribution)
> 
>>
>>>
>>> There is another example where "daemon" will be useful. If we can
>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>> like good candidate for "daemon" reason.
>>
>> If you feel this way, then a followup patch could/should be posted;
>> otherwise, this'll be lost to commit message heaven where all good ideas
>> go to die ;-).
> 
> Sure!
> 
>>
>>>
>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy at virtuozzo.com>
>>> ---
>>>  include/libvirt/libvirt-domain.h |  1 +
>>>  src/conf/domain_conf.c           |  3 ++-
>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>>> index fdd2d6b..11fdab5 100644
>>> --- a/include/libvirt/libvirt-domain.h
>>> +++ b/include/libvirt/libvirt-domain.h
>>> @@ -145,6 +145,7 @@ typedef enum {
>>>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>>>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>>>                                             * taken while domain was shutoff */
>>> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
>>
>> s/have to/decides to/
>>
>>>  # ifdef VIR_ENUM_SENTINELS
>>>      VIR_DOMAIN_SHUTOFF_LAST
>>>  # endif
>>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>>> index 9911d56..e441723 100644
>>> --- a/src/conf/domain_conf.c
>>> +++ b/src/conf/domain_conf.c
>>> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>>>                "migrated",
>>>                "saved",
>>>                "failed",
>>> -              "from snapshot")
>>> +              "from snapshot",
>>> +              "daemon")
>>>  
>>>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>>>                "unknown",
>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>> index e9c7618..c4bc9ca 100644
>>> --- a/src/qemu/qemu_process.c
>>> +++ b/src/qemu/qemu_process.c
>>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>>          /* We can't get the monitor back, so must kill the VM
>>>           * to remove danger of it ending up running twice if
>>>           * user tries to start it again later
>>
>> Since we're changing anyway, let's put the stop there, e.g.
>> "s/later/later./"
>>
>>> -         * If we couldn't get the monitor since QEMU supports
>>> -         * no-shutdown, we can safely say that the domain
>>> -         * crashed ... */
>>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>>           * thread didn't have a chance to start playing with the domain yet
>>>           * (it's all we can do anyway).
>>>           */
>>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>>      }
>>>      goto cleanup;
>>>  }
>>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>>           * is no thread that could be doing anything else with the same domain
>>>           * object.
>>>           */
>>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>                          QEMU_ASYNC_JOB_NONE, 0);
>>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>>  
>>
>> When qemuProcessReconnectHelper was introduced (commit d38897a5d)
>> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
>> changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
>> VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
>> removed in commit fe35b1ad6 the conditional state was just left at
>> VIR_DOMAIN_SHUTOFF_CRASHED.
>>
>> Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
>> on in the code leading to the latter revert commit losing that unknown
>> reason. The reason for the "-no-shutdown" *did* change and that probably
>> should have been reflected in the qemuProcessReconnect logic.
>>
>> So I think perhaps we should have an initial patch which references or
>> uses that first paragraph in order to change the code to:
>>
>>         /* We cannot get the monitor back, so kill the VM to
>>          * remove possibility of it ending up running twice if
>>          * user tries to start it again later.
>>          *
>>          * If we cannot get to the monitor when the QEMU command
>>          * line used -no-shutdown, then we can safely say that the
>>          * domain crashed; otherwise we don't really know. */
>>         if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>>             state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>         else
>>             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>>
>> and uses the short commit msg of "qemu: Restore lost shutdown reason".

I've posted a patch for the above, see:

https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html

while we work through the rest.

>>
>> Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
>> CRASHED otherwise.
>>
> 
> Then we will get "crashed" reason always for modern qemu and domains
> that can reboot which is not true.
> 
> However I see now that the issue is more complicated then I thought. I agree
> with Martin too but we have to additionally check that connect(2) for domain
> monitor socket was called and error was ECONNREFUSED. So finally state is
> resolved like this:
> 
> if (connect(2) was not called) {
>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;> else if (connect(2) result is NOT ECONNREFUSED) {
>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> } else if (connect(2) result is ECONNREFUSED) {
>     if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>         state = VIR_DOMAIN_SHUTOFF_CRASHED;
>     else
>         state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> } else {
>     state = VIR_DOMAIN_SHUTOFF_DAEMON;
> }

Do we really want to dig into why qemuConnectMonitor failed? Or is it
good enough to do something because it did fail. IOW: What condition of
using DAEMON as the reason is most important?

Failure before we try to open monitor?
Failure to open the monitor?
Failure after opening the monitor?

Once we open the monitor, priv->mon is set and we could use that.

> 
> Moreover we should not later kill process by remembered pid in all cases except
> last one in the above code because we can kill some other process.

Not quite sure I follow. This is the reconnect path to connect so some
existing/running domain. The {vm|obj}->pid would be something filled via
/var/run/libvirt/qemu processing for {domain}.{xml|pid}. Different hunk
of code. Maybe I'm misreading your comment.

John

> 
> Looks like a bit complicated, not sure should we pursuit this one or just left
> VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
> VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).
> 
> Nikolay
> 
>>
>>> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
>>> index 3a26363..f0ad558 100644
>>> --- a/tools/virsh-domain-monitor.c
>>> +++ b/tools/virsh-domain-monitor.c
>>> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>>>                N_("migrated"),
>>>                N_("saved"),
>>>                N_("failed"),
>>> -              N_("from snapshot"))
>>> +              N_("from snapshot"),
>>> +              N_("daemon"))
>>>  
>>>  VIR_ENUM_DECL(virshDomainCrashedReason)
>>>  VIR_ENUM_IMPL(virshDomainCrashedReason,
>>>