[libvirt] [PATCH 01/10] qemu: agent: unset error flag in EOF handler

Thu Oct 27 10:04:02 UTC 2016

On 27.10.2016 11:21, Nikolay Shirokovskiy wrote:
> 
> 
> On 26.10.2016 22:48, John Ferlan wrote:
>>
>>
>> On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
>>> Usually on domain shutdown event comes first from qemu
>>> and the flag is unset in processSerialChangedEvent. However
>>> there is a race between this function and qemuProcessHandleAgentEOF
>>> because the former is executed in a thread pool and the latter
>>> executed synchronously. qemuProcessHandleAgentEOF do not unset
>>> the flag thus if it was triggered before it remains set after domain
>>> shutdown/start and thus qemu agent becomes unavailable.
>>
>>
>> What error occurred that would be avoided by clearing the agentError
>> flag unconditionally? Considering that agentError being set is primarily
> 
> In short qemu agent becomes unavailable after shutdown. 
> Long error case description is next:
> 
> 1. agent error occurred and error flag is set.
> 2. domain is shutted down. Shutdown follows the path on which error flag stays set.
> (Actually error flag can be set in the process of shutdown,
> which is I guess the case that I observed.)
> 
> 3. domain is started, on any attempt to use agent you got 
> "QEMU guest agent is not available due to an error".
> 
>   As nobody clears error flag on start up it stays set. Obviously
> we don't need to clear on the first domain start after daemon start
> because qemuDomainObjPrivate is allocated and cleared to zeros.
> But restarts rely upon assumption that this flag is cleared on shutdown.
> And this is not true. There are shutdown paths where flags is not cleared.
> This patch addresses one of this paths:
> 
> 
> 1. shutdown started, in the process VSERPORT_CHANGE is delivered in
> IO loop thread. However it is handled in a different thread (
> from driver->workerPool).
> 2. in the same IO loop thread we got EOF from agent and synchronously
> execute qemuProcessHandleAgentEOF. It closes agent, set priv->agent
> to NULL but forgets to clear agentError.
> 3. Now we execute VSERPORT_CHANGE handler(processSerialChangedEvent).
> It is delayed because it is asynchronous to VSERPORT_CHANGE itself. 
> As priv->agent is NULL handler does not touch agentError flag.
> 
> I guess usually time gap between VSERPORT_CHANGE and EOF is enough
> and step 3 is executed before step 2 and agentError is cleared
> upon shutdown.
> 
> Basically it is enough to clear the flag on start up (patch 3) but I
> thought just for the case we would want to clear the flag on agent cleanup
> too. 
> 
>> used by "new connection" request processing (e.g. callers to
>> qemuDomainAgentAvailable) in order to ascertain whether they should
>> actually attempt the connection or fail because of some condition that
>> would eventually lead to failure.  I would think EOF from qemu_monitor
>> for the qemu_agent would be one of those conditions *before* the domain
>> state is not RUNNING or the priv->agent has been fully removed.
> 
> Don't really understand the part 'f 'before state in not RUNNING'.
>>From what I've seen in code agentError has 2 meanings, they described
> in patch 5.
> 
>>
>> I think what I'm missing is what sequence of events led to this
>> condition? How was it reproduced.
>>
>>> ---
>>>  src/qemu/qemu_process.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>
>> So I understand "eventually" your goal is to remove agentError in patch
>> 6, but you need to lead down the path to get there. So let's see if I
>> can follow along...
> 
> My original intention was to fix the bug that I tried to 
> explain above. But in the process I found the other path how
> similar error can occur again (patch 2). Then I realized that
> we can protect ourselfs from similar bugs just in one place (patch 3).
> At this point I look into the code long enough to realize
> one don't need the flag altogether and decided that detail explanation
> of patch 1 and 2 helps me convince others to get rid of the flag.
> 
>>
>> The issue you chased is related to domain shutdown processing "order".
>>
>> The purpose up to this point in time for 'agentError' is to ensure
>> callers of qemuDomainAgentAvailable() will fail before trying to use the
>> agent. The qemuDomainAgentAvailable checks domain active, agentError,
>> and whether the priv->agent is available.
>>
>> So the contention is during shutdown the EOF event firing at about the
>> same time as the processSerialChangedEvent has caused an issue, but it's
>> not 100% clear just from the description the sequence of events and the
>> issue. So let's consider why these things happen:
>>
>> 1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if
>> the guest is shutting down
>>
>> 2. The 'processSerialChangedEvent' seems to occur for multiple reasons -
>> startup, shutdown, some sort of channel failure (unix socket or pty). If
>> I'm reading things correctly, this path is only involved when VSERPORT
>> is being used.
>>
>> So, is this a case of guest startup where the initial agent
>> connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is
>> this purely a shutdown ordering thing?
> 
> Purely shutdown. And error in io loop from agent before shutdown 
> or in the process.
> 
>>
>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>> index 31c8453..a581f96 100644
>>> --- a/src/qemu/qemu_process.c
>>> +++ b/src/qemu/qemu_process.c
>>> @@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>>  
>>>      qemuAgentClose(agent);
>>>      priv->agent = NULL;
>>
>> Because this is cleared, then qemuDomainAgentAvailable will fail because
>> the agent is not present instead of failing with there was some error.
>> That would seem to make the following clearing "safe" since we know
>> we'll have a failure any way, but...
>>
>>> +    priv->agentError = false;
>>
>> How does this alone solve the problem you saw.  AFAICT it just changes
>> *where* the error occurs and the error message displayed.
> 
> The place that is affected is the following domain start up and agent use.
> 
>>
>> Conversely, if for some reason for either
>>
>> 1. qemuConnectAgent failed and is being "flagged" in
>>   1a. processSerialChangedEvent (for initial CONNECTED failure when
>> QEMU_CAPS_VSERPORT_CHANGE)
>>   1b. qemuMigrationFinish (for migration target)
>>   1c. qemuProcessReconnect (for libvirtd restart)
>>   1d. qemuProcessLaunch (initial domain startup before
>> QEMU_CAPS_VSERPORT_CHANGE)
>>   1e. qemuProcessAttach (for qemu-attach to existing qemu process)
>>
>> or
>>
>> 2. qemuProcessHandleAgentError - some sort of error and we're stopping
>> any further agent communication.
>>
>> then you're potentially clearing agentError and could conceivably cause
>> "extra" debugging steps to be taken only to figure out the reason it was
>> because of some error rather than the "next" connection failing and
>> indicating the failure was because of some previous error (it's a
>> startup race for some other thread desiring to use the agent).
> 
> As to case 1. AFAIU you talking about next case:
> 
> 1. agent started (qemuAgentOpen is ok)
> 2. for some reason we have to fail qemuConnectAgent (
> virSecurityManagerClearSocketLabel failed for example)
> and set error flag.
> 3. EOF is triggered and clears the error flag (in new version).
> 
> This is not possible. qemuConnectAgent will eventually remove
> agent handle from the io loop before setting the flag on failure and
> EOF can not be delivered.

I was not correct. io loop event dispatching drops io loop lock
on event delivering thus EOF can be delivered.

So we have 2 cases for the race when
agent is started and immediately gets EOF.

A. qemuConnectAgent wins

1. virSecurityManagerClearSocketLabel fails, priv->agent stays NULL,
error is set.
2. qemuProcessHandleAgentEOF exits early on priv->agent check.

B. qemuProcessHandleAgentEOF wins

1. qemuProcessHandleAgentEOF exits early on priv->agent check.
(Looks like debug message will be misleading in this case)
2. qemuConnectAgent proceeds as usual.

In both cases clearing the error flag in qemuProcessHandleAgentEOF
does not changes anything.

> 
> This subtle case is another reason why we should throw away this flag.
> It adds too much complexity to analysis. Patch 5 unbinds the flag from this case,
> case of agent failed to start up.
> 
> As to case 2.
> 
> This is one of the patch targets. If there is an error set by qemuProcessHandleAgentError
> we need to clean it up when agent is cleaned up.
> 
>>
>>>  
>>>      virObjectUnlock(vm);
>>>      return;
>>>
>>
>> So given all that - I'm not 100% clear how this really fixes the issue
>> you say.  Although, I have to wonder if the bug you're chasing is
>> perhaps that qemuConnectAgent doesn't clear agentError at the right time
>> as patch 3 attempts to fix.
> 
> Yep. The flag is not cleared. And it can be addressed both ways:
> clear on initialization to protect from "unclean" clean up,
> second - make clean up "clean".
> 
>>
>> After reading through patch 4 - I think Patch 4 and a slightly modified
>> patch 3 will fix what you're seeing, but I'd only know for sure given
>> more details about what was seen.
> 
> Patch 4 is pure refactoring IIRC and doesn't address the issue.
> 
> Nikolay
> 
> --
> libvir-list mailing list
> libvir-list at redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
>