[libvirt] An implementation bug in qemuMigrationWaitForCompletion that introduces an unnecessary sleep of 50 ms when a migration completes

Xing Lin linxingnku at gmail.com
Thu Apr 9 22:32:09 UTC 2015


Hi,

It seems I have found a bug in qemuMigrationWaitForCompletion(): it sleeps
for 50 ms even when qemuMigrationUpdateJobStatus() has already detected
that the migration is complete.


Here is the original implementation of the while loop.


   while (jobInfo->type == VIR_DOMAIN_JOB_UNBOUNDED) {
        /* Poll every 50ms for progress & to allow cancellation */
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000ull };

        if (qemuMigrationUpdateJobStatus(driver, vm, job, asyncJob) == -1)
            break;

        /* cancel migration if disk I/O error is emitted while migrating */
        if (abort_on_error &&
            virDomainObjGetState(vm, &pauseReason) == VIR_DOMAIN_PAUSED &&
            pauseReason == VIR_DOMAIN_PAUSED_IOERROR) {
            virReportError(VIR_ERR_OPERATION_FAILED,
                           _("%s: %s"), job, _("failed due to I/O error"));
            break;
        }

        if (dconn && virConnectIsAlive(dconn) <= 0) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("Lost connection to destination host"));
            break;
        }

        virObjectUnlock(vm);

        nanosleep(&ts, NULL);

        virObjectLock(vm);
    }


Even when qemuMigrationUpdateJobStatus() detects that the migration has
completed, nanosleep() is still called before the while condition is
re-checked and the loop exits. This introduces an unnecessary 50 ms delay,
which can be avoided by sleeping first. The loop should likely be
structured as follows.

   while (jobInfo->type == VIR_DOMAIN_JOB_UNBOUNDED) {
        /* Poll every 50ms for progress & to allow cancellation */
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000ull };

        /* sleep first, then poll for status */
        virObjectUnlock(vm);

        nanosleep(&ts, NULL);

        virObjectLock(vm);

        if (qemuMigrationUpdateJobStatus(driver, vm, job, asyncJob) == -1)
            break;

        /* cancel migration if disk I/O error is emitted while migrating */
        if (abort_on_error &&
            virDomainObjGetState(vm, &pauseReason) == VIR_DOMAIN_PAUSED &&
            pauseReason == VIR_DOMAIN_PAUSED_IOERROR) {
            virReportError(VIR_ERR_OPERATION_FAILED,
                           _("%s: %s"), job, _("failed due to I/O error"));
            break;
        }

        if (dconn && virConnectIsAlive(dconn) <= 0) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("Lost connection to destination host"));
            break;
        }
    }


I have filed a bug report in Red Hat Bugzilla as bug 1210502:
https://bugzilla.redhat.com/show_bug.cgi?id=1210502

I have also prepared a patch that fixes it; it is attached.

Any comments?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-qemu_migration.c-sleep-first-before-checking-for-mig.patch
Type: application/octet-stream
Size: 1475 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20150409/f41c47cf/attachment-0001.obj>
