
Re: [libvirt] [PATCH] Avoid a race when restoring a qemu domain.



On 04/07/2010 03:43 PM, Chris Lalancette wrote:

> Hm, this really doesn't seem like it's the way to fix this.

You are correct that it isn't what should be done in the long term. In the short term, though, it does fix bad behavior that I wouldn't want to see in an official release: on my hardware, restores basically always fail unless the guest was paused prior to saving.

> We really should investigate what is going on in qemu, and see if it's
> a bug in qemu itself (in which case we should fix qemu), or if it's a
> bug in the way we communicate with qemu (in which case we should fix
> that).

I'm operating on information I learned in an IRC chat. Perhaps Dan Berrange can pipe up here to repeat or expand on what he said, but basically it sounds like the problem is that qemu will happily start the CPUs for us before the restore operation has begun, and there's no way for us to verify whether or not it has begun. For that, qemu will need to make 'info migrate' work on the incoming side, and that's not likely to happen very quickly. (Of course it will take even longer if I don't whine about it; I just haven't gotten there yet ;-)
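
To make the missing piece concrete, here is a rough sketch (in Python, for brevity) of what the wait could look like if the incoming side could be queried. Everything in it is an assumption for illustration: `query_status` stands in for a status query that, as described above, qemu does not yet offer on the incoming side, and the status strings and timeout values are made up.

```python
import time
from typing import Callable


def wait_for_incoming_restore(query_status: Callable[[], str],
                              timeout_s: float = 5.0,
                              poll_interval_s: float = 0.01) -> bool:
    """Poll a status query until the restore is reported as under way.

    `query_status` is a hypothetical callable; the strings "active" and
    "completed" are assumed values, not confirmed qemu output.
    Returns True once the restore has begun, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if query_status() in ("active", "completed"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

With something like this, the CPUs would only be started after the function returns True, and a False return could be reported as an error instead of silently resuming a half-restored guest.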

> A sleep is just hiding the problem

Yes, I dislike this solution, and I'd love it if someone could tell me of an alternative. If there is no other way to fix it entirely within libvirt, though, I don't think we should just report the problem to qemu and let users suffer until it gets fixed there. If that fix requires a new interface in qemu that must then be supported by libvirt, the path to reliably working domain restores could be very long indeed. In the meantime we'd be left with delivered code that may fail in a rather bad way for someone. That is especially true for a managed save, where the image is deleted as soon as the domain is started: if it fails once, you've lost the image, so you can't even try again.

> (which means it can still pop up on machines slower, or more busy,
> than yours!).

I'm doubtful that slower VT-capable machines exist (although I haven't checked; possibly this same problem exists when doing software emulation too). I hadn't considered whether this would pop up on faster hardware that was also busier, which is a very good point.

(I did just do some more testing, and found that even 50 msec is enough to make things work. 10 msec isn't enough, though...)
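
For reference, the shape of the workaround can be sketched as follows (in Python, for brevity). This is not the patch itself: `start_cpus` is a hypothetical callback standing in for whatever tells qemu to resume the guest, and the 50 msec default is only the smallest value that worked in the test above, on one machine.

```python
import time
from typing import Callable

# Assumed default: 50 msec worked in one test, 10 msec did not.
# A slower or busier host may need more, which is the objection raised above.
RESTORE_SETTLE_DELAY_S = 0.05


def resume_after_delay(start_cpus: Callable[[], None],
                       delay_s: float = RESTORE_SETTLE_DELAY_S,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Wait a fixed interval, then resume the guest.

    `start_cpus` is a hypothetical callback; `sleep` is injectable so the
    ordering can be checked without actually waiting.
    """
    sleep(delay_s)
    start_cpus()
```

The delay is a guess rather than a guarantee: nothing here checks that the restore has actually begun before the CPUs are started.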

