[libvirt] Problems preserving lock state across suspend/resume

Fri Dec 6 00:06:25 UTC 2013

Hi folks,

I'm looking into a problem discussed back in January 2013
wherein lock/lease state isn't properly preserved across suspend/resume.

(This situation can lead to corruption if the guest's block storage is
modified elsewhere while the original guest is paused.)

For details see:

	https://www.redhat.com/archives/libvirt-users/2013-January/msg00109.html
	https://bugzilla.redhat.com/show_bug.cgi?id=906590

I'm using libvirt-1.2.0 with explicit Sanlock leases defined in the domain XML.

It appears the problematic behavior is due to virDomainLockProcessPause()
and virDomainLockProcessResume() being called twice during each
suspend/resume: once by the RPC worker thread running the suspend/resume
command, and once by the main thread in response to the QEMU events
triggered by the RPC worker's actions.

In libvirt-1.2.0, the two call paths for suspend are as follows:

qemuDomainObjBeginJob(suspend) -> 
	qemuDomainSuspend() -> 
		qemuProcessStopCPUs() -> 
			virDomainLockProcessPause()

qemuMonitorJSONIOProcessEvent:143 : handle STOP ->
	qemuProcessHandleStop -> 
		virDomainLockProcessPause()

Whichever call runs first (usually the one out of qemuProcessHandleStop,
though the ordering may be racy) properly saves state and releases locks.

However, the second call queries lock status after the locks have already
been released, so it finds that no locks are held.  This results in a
null/blank lockState being saved in the domain object.
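
To make the sequence concrete, here is a toy C model of the behavior
described above. It is only a sketch, not libvirt code: struct toy_domain,
toy_pause(), toy_pause_guarded() and toy_resume() are hypothetical names
invented for illustration, and the "released" flag is just one possible
guard, not a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a domain object; NOT libvirt's real struct. */
struct toy_domain {
    char held[64];        /* leases currently held; "" means none */
    char lock_state[64];  /* state saved at pause, for reacquiring on resume */
    bool released;        /* only consulted by the guarded variant */
};

/* Models the reported behavior: query held leases, save them, release. */
static void toy_pause(struct toy_domain *dom)
{
    assert(dom != NULL);
    /* "Inquire": record whatever is held right now, even if that is nothing. */
    snprintf(dom->lock_state, sizeof(dom->lock_state), "%s", dom->held);
    /* "Release": drop all leases. */
    dom->held[0] = '\0';
}

/* One possible guard (illustrative only): do nothing if already released. */
static void toy_pause_guarded(struct toy_domain *dom)
{
    assert(dom != NULL);
    if (dom->released)
        return;
    toy_pause(dom);
    dom->released = true;
}

/* Reacquire whatever was saved at pause time. */
static void toy_resume(struct toy_domain *dom)
{
    assert(dom != NULL);
    snprintf(dom->held, sizeof(dom->held), "%s", dom->lock_state);
    dom->released = false;
}
```

In this model, calling toy_pause() twice and then toy_resume() leaves
"held" empty, because the second pause overwrites the saved state with an
empty string. With toy_pause_guarded() the second call is a no-op and
toy_resume() restores the original lease.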

Before I start working on a solution, are these multiple invocations
of virDomainLockProcessPause()/virDomainLockProcessResume() intentional?

Thanks,
Adam Tilghman
UC San Diego