
[libvirt] Silently ignored virDomainRestore failures

Howdy, all.

I maintain a test infrastructure which makes heavy use of virDomainSave and virDomainRestore, and have been seeing occasional cases where my saved images are for some reason not restored correctly -- and, indeed, the incoming migration streams are not even read in their entirety.

While this generally appears to be caused by issues outside of libvirt's purview, one unfortunate issue is that libvirt can report success performing a restore even when the operation is effectively an abject failure.

Consider the following snippet, taken from one of my /var/log/libvirt/qemu/<domain>.log files:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin USER=root LOGNAME=root /usr/bin/qemu-kvm -S -M pc-0.11 -m 512 -smp 1 <...lots of arguments here...> -incoming exec:cat
cat: write error: Broken pipe

This leaves a running qemu hosting a catatonic guest -- but the libvirt client (connecting through the Python bindings) received a status of success for the operation given here.

libvirt's mechanism for validating a successful restore consists of running a "cont" command on the guest and then checking virGetLastError(). AIUI, the "cont" is not expected to run until the restore has completed, since the monitor should not be responsive until then. Browsing through qemudMonitorSendCont (and qemudMonitorCommandWithHandler, which it calls), I don't see anything that looks at the log file holding qemu's stderr output to determine whether an error actually occurred. (As an aside, "info history" on the guest's monitor socket indicates that this "cont" was indeed issued.)
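To make the gap concrete, here is a rough Python sketch of the kind of log check that does not appear to happen today. It is not libvirt code: the function name and the pattern list are hypothetical, and the only pattern taken from a real log is the "write error: Broken pipe" line quoted above. It assumes the log is a plain text file in which each qemu start is recorded as a command line containing "-incoming ", followed by that process's stderr output.

```python
import re

# Hypothetical pattern list. The only entry grounded in a real log is the
# "write error: Broken pipe" message from cat shown earlier in this mail.
SUSPECT_PATTERNS = [re.compile(r"write error: Broken pipe")]


def incoming_stream_errors(log_text):
    """Return suspicious stderr lines logged after the most recent
    qemu command line that contains "-incoming ".

    Assumption: everything after that command line belongs to the
    restore attempt we care about.
    """
    lines = log_text.splitlines()
    start = 0
    for index, line in enumerate(lines):
        if "-incoming " in line:
            # Only keep output that follows the latest start.
            start = index + 1
    return [
        line
        for line in lines[start:]
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS)
    ]
```

A caller could run this against the domain's log right after the restore call returns and treat a non-empty result as a failure. Matching on error text is fragile, of course, since it only catches messages someone thought to list.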

Should the existing cont+virGetLastError() approach be sufficient to handle this class of error? If not, is there any guidance on what would constitute a better approach? (I suppose we could add something to the exec: command to affirmatively indicate on stderr that the decompressor [or cat, if not using one] exited successfully, and check for that marker in the log file... but that seems like quite a dirty hack.)
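For the sake of discussion, the marker idea might look roughly like the following Python sketch. Everything here is hypothetical: the marker string, both function names, and the assumption that qemu hands the exec: command to a shell so that "&&" and ">&2" behave as they would in sh. The intent is that the marker only reaches stderr (and hence the log) if the decompressor exits with status 0, so a cat killed by a broken pipe would leave no marker behind.

```python
# Hypothetical marker text; any string unlikely to appear in qemu's own
# stderr output would do.
MARKER = "restore-stream-complete"


def incoming_command(decompressor="cat", marker=MARKER):
    """Build an -incoming argument that echoes a marker to stderr only
    if the decompressor succeeds.

    Assumption: the exec: command is interpreted by a POSIX shell.
    """
    return "exec:%s && echo %s >&2" % (decompressor, marker)


def stream_completed(log_text, marker=MARKER):
    """Report whether the marker appears on a line of its own after the
    most recent command line containing "-incoming ".

    The exact-match test keeps the logged command line itself (which
    also contains the marker text) from counting as success.
    """
    lines = log_text.splitlines()
    last_start = max(
        (i for i, line in enumerate(lines) if "-incoming " in line),
        default=-1,
    )
    return any(line.strip() == marker for line in lines[last_start + 1:])
```

With this, a restore would be treated as failed whenever stream_completed() returns False for the domain's log, regardless of what the "cont" reported.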

