[libvirt] (Dropping) OOM Handling in libvirt

Mon May 13 10:17:45 UTC 2019

This is a long mail about ENOMEM (OOM) handling in libvirt. The executive
summary is that it is not worth the maint cost because:

  - this code will almost never run on Linux hosts

  - if it does run it will likely behave badly, silently dropping
    data or crashing the process

  - apps using libvirt often do so via a non-C language that aborts/exits
    the app on OOM regardless, or use other C libraries that abort

  - we can build a system that is more reliable when OOM happens by
    not catching OOM, instead simply letting apps exit, restart and
    carry on where they left off

The long answer follows...

The background
==============

Since the first commit libvirt has attempted to handle out of memory (OOM)
errors in the same way it does for any other problem. Namely, a virError will
be raised, and the method will jump to its cleanup/error label. The error
bubbles up the stack and in theory someone or something will catch this and do
something sensible/useful. This has long been considered "best practice" for
most C libraries, especially those used for system services. This mail makes
the argument that it is in fact /not/ "best practice" to try to handle OOM.

OOM handling is very difficult to get correct because it is something that
developers almost never encounter and thus the code paths are rarely run. We
designed our memory allocation APIs such that we get compile time errors if
code forgets to check the return value for failure. This is good as it
eliminates an entire class of bugs. Our error handling goto label pattern
tries to align OOM handling with other general error handling which is more
commonly tested.
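
As a minimal sketch of the pattern described above (this is not libvirt's
actual code): demo_alloc and demo_build_greeting are hypothetical names, and
the GCC/Clang warn_unused_result attribute is an assumption standing in for
whatever produces the compile time check. It only gives a warning unless the
build promotes warnings to errors.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical checked allocator: returns 0 on success, -1 on failure.
 * The attribute makes GCC/Clang warn when a caller ignores the result. */
__attribute__((warn_unused_result))
static int demo_alloc(void **ptr, size_t size)
{
    *ptr = calloc(1, size);
    return *ptr ? 0 : -1;
}

/* Builds "hello, <name>" in a new buffer; returns NULL on any failure.
 * Every failure, OOM included, funnels through the same cleanup label. */
static char *demo_build_greeting(const char *name)
{
    char *ret = NULL;
    void *tmp = NULL;
    void *buf = NULL;
    size_t len = strlen(name) + sizeof("hello, ");

    if (demo_alloc(&tmp, len) < 0)   /* OOM path: rarely exercised */
        goto cleanup;
    if (demo_alloc(&buf, len) < 0)   /* must not leak tmp if this fails */
        goto cleanup;

    snprintf(tmp, len, "hello, %s", name);
    memcpy(buf, tmp, len);

    ret = buf;                       /* transfer ownership to the caller */
    buf = NULL;

 cleanup:
    free(tmp);
    free(buf);
    return ret;
}
```

A caller checks demo_build_greeting("virt") for NULL and frees the result when
done. Note that each allocation adds one more branch into the cleanup label
which only runs when memory is exhausted.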

We have code in the allocators which lets us run unit tests simulating the
failure of any allocation that is made during the test. Executing this is
extraordinarily time consuming as some of our unit tests make thousands or
even tens of thousands of allocations. The running time to simulate OOM for each
allocation is O(N^2) which does not scale well. As a result we've probably
only run these OOM tests a handful of times over the years.
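
To illustrate where the quadratic cost comes from, here is a rough sketch of
allocation failure injection. All names (demo_test_alloc, demo_oom_sweep, etc)
are hypothetical and this is not the code in our allocators. The test is
re-run once per allocation, failing a different allocation each time, so a
test making N allocations attempts 1 + 2 + ... + N = N(N+1)/2 allocations in
total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fault-injection state: fail the allocation whose
 * 1-based index equals fail_at (0 disables injection). */
static size_t alloc_count;
static size_t fail_at;
static size_t total_allocs;   /* allocations attempted across all runs */

static void *demo_test_alloc(size_t size)
{
    alloc_count++;
    total_allocs++;
    if (fail_at != 0 && alloc_count == fail_at)
        return NULL;          /* simulated OOM */
    return malloc(size);
}

/* Code under test: makes n allocations, returns -1 on the first failure. */
static int demo_code_under_test(size_t n)
{
    void **ptrs = calloc(n, sizeof(*ptrs));  /* bookkeeping, not injected */
    int ret = -1;
    size_t i;

    assert(ptrs != NULL);
    for (i = 0; i < n; i++) {
        if (!(ptrs[i] = demo_test_alloc(16)))
            goto cleanup;
    }
    ret = 0;

 cleanup:
    for (i = 0; i < n; i++)
        free(ptrs[i]);        /* free(NULL) is a no-op for unused slots */
    free(ptrs);
    return ret;
}

/* Re-run the test once per allocation, failing a different one each time.
 * Returns the total number of allocations attempted over all runs. */
static size_t demo_oom_sweep(size_t n)
{
    size_t i;

    total_allocs = 0;
    for (i = 1; i <= n; i++) {
        alloc_count = 0;
        fail_at = i;
        if (demo_code_under_test(n) == 0)
            abort();          /* the injected failure went unreported */
    }
    fail_at = 0;
    return total_allocs;
}
```

With n = 100 the sweep attempts 5050 allocations; with tens of thousands of
allocations per test the total runs into the hundreds of millions.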

The tests show we generally do remarkably well at OOM handling, but nonetheless
we have *always* found bugs in our handling where we fail to report the OOM
(as in, continue to run but with missing data), or we crash when cleaning up
from OOM. Our unit tests don't cover anywhere near all of our codebase though.
We have great coverage of XML parsing/formatting, and sporadic coverage of
everything else.

IOW, despite our clever API designs & coding patterns designed to improve our
OOM handling, we cannot have confidence that we will actually operate correctly
in an OOM scenario.

Of course this applies to all error conditions that may arise, but OOM should
be considered special. With other error conditions from syscalls or API calls,
the effects are largely isolated to the site of the call. With OOM, the OOM
condition may well persist and so during cleanup we will hit further OOM
problems. We may well fail to even allocate the memory needed to raise a
virErrorPtr. All threads can see the OOM concurrently so the effect spreads
across the whole process very quickly.

Who benefits from OOM handling
==============================

Let's pretend for a minute though that our OOM handling is perfect and instead
ask who/what is benefitting from it?

Libvirt's own processes. aka libvirtd, virtlockd, virtlogd
----------------------------------------------------------

For libvirtd we have essentially zero confidence that it will handle OOM
at all. It runs all the virt driver code, for which our unit test coverage is
far too thin to give any confidence that OOM will be handled well. Even if OOM
is handled in a worker thread, every iteration of the event loop does an
allocation to hold the poll FD array, so we can easily see the event loop
hitting OOM, which will make the entire process shut down, or worse, hang
during cleanup as worker threads block waiting for the event loop to do
something even though it is no longer running.

We already expect that bugs will cause libvirtd to crash and so have designed
the drivers to be robust such that it can restart and attempt to carry on as
normal afterwards. So arguably it would be fine to handle OOM by simply doing
an abort and letting systemd restart the daemon.

For virtlockd and virtlogd we again have little confidence that they will
handle OOM correctly. They are, however, more critical processes that we badly
need to stay running at all times. We go to great effort to make it possible
to re-exec on upgrades keeping state open.

For virtlogd we could potentially change the way we deal with stdout/err for
QEMU. Instead of using an anonymous pipe, we could create a named fifo on disk
for each QEMU process. stdout/err would be connected to one end, and virtlogd
to the other end. This would enable us to have virtlogd restarted and re-open
the stdout/err for QEMU. This would mean we no longer need the re-exec logic
either, which is good as that's another thing that's rarely tested. This would
make abort on OOM acceptable.
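
A rough sketch of the idea, not a design: a FIFO that exists on disk can be
re-opened by a freshly started reader while the writer keeps its end open.
The function names and the FIFO path are hypothetical, and both ends live in
one process purely to keep the demo small. One caveat the sketch sidesteps:
a write made while no reader is attached fails with EPIPE (or SIGPIPE), so a
real implementation would have to deal with that window.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the read side of a named FIFO, as a (re)started log daemon might.
 * O_NONBLOCK lets open() return even when no writer is attached yet. */
static int demo_open_log_reader(const char *path)
{
    return open(path, O_RDONLY | O_NONBLOCK);
}

/* Simulates: reader starts, writer logs, reader dies, reader restarts,
 * writer logs again. Returns 0 if both messages were received. */
static int demo_fifo_restart(const char *path)
{
    char buf[32];
    int rfd = -1, wfd = -1, ret = -1;
    ssize_t n;

    unlink(path);
    if (mkfifo(path, 0600) < 0)
        return -1;

    if ((rfd = demo_open_log_reader(path)) < 0)
        goto cleanup;
    if ((wfd = open(path, O_WRONLY)) < 0)       /* stands in for QEMU stderr */
        goto cleanup;

    if (write(wfd, "first", 5) != 5)
        goto cleanup;
    n = read(rfd, buf, sizeof(buf));
    if (n != 5 || memcmp(buf, "first", 5) != 0)
        goto cleanup;

    close(rfd);                                 /* the reader "crashes" */
    if ((rfd = demo_open_log_reader(path)) < 0) /* ...and is restarted */
        goto cleanup;

    if (write(wfd, "second", 6) != 6)           /* writer fd never changed */
        goto cleanup;
    n = read(rfd, buf, sizeof(buf));
    if (n != 6 || memcmp(buf, "second", 6) != 0)
        goto cleanup;

    ret = 0;

 cleanup:
    if (rfd >= 0)
        close(rfd);
    if (wfd >= 0)
        close(wfd);
    unlink(path);
    return ret;
}
```

With an anonymous pipe the second open would be impossible, since the read end
has no name to re-open once the original reader is gone.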

For virtlockd we have the hardest problem. It fundamentally has to keep open
the FDs in order to maintain active locks. If it does crash / abort, then all
open locks are lost. It would have to manually re-acquire locks when it starts
up again, provided it had a record of which locks it was supposed to have. In
theory nothing bad should happen in this window where virtlockd is dead, as if
libvirtd tries to start another VM it will trigger auto-start of virtlockd via
its systemd socket unit, which could then cause it to reacquire previous locks
it had. IOW, it should be possible to make virtlockd robust enough that doing
abort on OOM is tolerable. We really need this robustness regardless, because
virtlockd can of course already crash due to code bugs. If we make it robust
to all crashes, then by implication it will be robust enough for abort on OOM.

Clients of libvirt. aka oVirt, OpenStack, KubeVirt, virt-manager, cockpit
-------------------------------------------------------------------------

Clients of libvirt can consume it via a number of different APIs, all
eventually calling into libvirt.so

 - Core C API aka libvirt.so

   Catches and propagates OOM to callers.

   virsh just reports it as it would any other error. In single shot
   mode it will be exiting regardless with an error code. In interactive
   mode, in theory it can carry on with processing the next command.
   In practice its CLI parsing will probably then hit OOM anyway,
   causing it to exit.

   virt-viewer links to GTK and this already has abort-on-OOM behaviour
   via GLib. So in all likelihood, even if it catches the OOM from libvirt,
   it will nonetheless abort on OOM in a GTK/GLib call.

   libvirt-dbus is written using GLib, so again even if it catches the OOM
   from libvirt it will nonetheless abort on OOM in a GLib call.

 - Python binding

   Python bindings will raise a MemoryError exception which propagates
   back up to the Python code. In theory the app can catch this and
   take some action. This mostly only works though if the cause was
   a single huge allocation, such that most other "normal" sized allocations
   continue working. In a true OOM scenario where arbitrary allocs start
   failing, the Python interpreter will fail too.

 - Perl binding

   It is hard to find clear info about Perl behaviour during ENOMEM. The
   libvirt bindings assume the Perl allocation functions won't ever fail,
   so if they do we're going to dereference a NULL pointer. If normal Perl
   code gets OOM, the interpreter will raise an error. In theory this can
   be caught, but in practice few, if any, apps will try so the process
   will likely quit.

 - Go binding

   Errors from libvirt are all turned into Golang errors and propagated
   to the caller. If the Go runtime gets an OOM from an allocation it will
   panic the process, which can't be caught, so it will exit with a stack
   trace.

 - Java binding

   The JVM tends to allocate a fixed size heap for its own use. So Java
   code can see OOM exceptions even if the OS has plenty of memory. If
   the OS does run out of memory, assuming the JVM heap was already
   fully allocated it shouldn't immediately be affected, but might suffer
   collateral damage. But in theory libvirt OOM could get turned into a
   Java exception that an app can catch & handle nicely.

There are other bindings, but the above captures the most important usage
of libvirt.

The complication of Linux
=========================

Note that all of the above discussion had the implicit assumption that malloc
will actually fail with ENOMEM when memory is exhausted.

On Linux at least this is not the case in a default install. Linux tends
to enable memory overcommit, causing the kernel to satisfy malloc requests
even if they exceed available RAM + swap. The allocated memory won't even be
paged in to physical RAM until the page is written to. If there is insufficient
memory to make a page resident in RAM, then the OOM killer is set free. Some
poor victim will get killed off.

It is possible to disable RAM overcommit and also disable or deprioritize the
OOM killer. This might make it possible to actually see ENOMEM from an
allocation, but few if any developers or applications run in this scenario
so it should be considered untested in reality.

With cgroups it is possible to set RAM and swap usage limits. In theory you
can disable the OOM killer per-cgroup and this will cause the app to see
ENOMEM. In my testing with a trivial C program that simply mallocs a massive
array and then memsets it, it is hard to get ENOMEM to happen reliably.
Sometimes the app ends up just hanging when testing this.
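
For reference, a test of that kind can be sketched roughly as below. This is
a reconstruction under assumptions, not the exact program used, and
demo_alloc_and_touch is a hypothetical name. The important part is the memset:
with overcommit enabled the malloc of a huge size can succeed, and it is only
when the pages are written to that the kernel has to find real memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate 'bytes' and write to every byte so the kernel must back the
 * pages with real memory. Returns 0 on success, or -1 if malloc itself
 * failed. With overcommit a huge request may succeed in malloc and the
 * process may instead be killed (or stall) during the memset. */
static int demo_alloc_and_touch(size_t bytes)
{
    unsigned char *p;
    volatile unsigned char *vp;

    if (bytes == 0)
        return 0;

    p = malloc(bytes);
    if (!p)
        return -1;              /* the ENOMEM case we are hoping to see */

    memset(p, 0xA5, bytes);     /* write every byte to fault the pages in */

    /* Read back through a volatile pointer to discourage the compiler
     * from eliding the stores. */
    vp = p;
    assert(vp[0] == 0xA5 && vp[bytes - 1] == 0xA5);

    free(p);
    return 0;
}
```

Calling this with a size larger than the cgroup limit is what exercises the
behaviour described above. With a small size, such as 1 MiB, it simply
returns 0.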

Other operating systems have different behaviour and so are more likely to
really report ENOMEM to application code, but even if you can get ENOMEM
on another OS, the earlier points illustrate that it's not worth worrying about.

The conclusion
==============

To repeat the top of this document, attempts to handle OOM are not worth
the maint cost they incur:

  - this code will almost never run on Linux hosts

  - if it does run it will likely behave badly, silently dropping
    data or crashing the process

  - apps using libvirt often do so via a non-C language that aborts/exits
    the app on OOM regardless, or use other C libraries that abort

  - we can build a system that is more reliable when OOM happens by
    not catching OOM, instead simply letting apps exit, restart and
    carry on where they left off.

The proposal is thus to change all libvirt code (ie both client and server
side), to abort on OOM.

This will allow us to simplify the control flow in many methods eliminating
~1500 checks for memory failure, their associated goto jumps, and many of
the cleanup/error labels. This makes the code easier to follow & maintain
and opens up new avenues for improving libvirt's future development.

The main blocking prerequisite to making the change is to address the needs for
reliable restart of virtlockd and virtlogd daemons.

As a point of reference libguestfs has had abort-on-oom behaviour forever
and no one has complained about this behaviour. Apps using libvirt often
also use libguestfs in the same process.

The implementation
==================

Assuming a decision to abort on OOM, libvirt can now follow QEMU's lead and
make use of the GLib library. The initial impl would thus be to just link to
GLib and switch VIR_ALLOC related APIs to use g_new/g_malloc & friends
internally. Over time code can be changed to use g_new/g_malloc directly thus
removing the error handling incrementally.
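
The two steps could look roughly like this. To keep the sketch free of
external dependencies it uses a plain C stand-in, demo_xmalloc0, rather than
GLib itself; like g_malloc0 it aborts the process instead of returning on
allocation failure (unlike g_malloc0 it never returns NULL, even for a zero
size). All the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical abort-on-OOM allocator: returns zeroed memory or
 * terminates the process. It never returns NULL. */
static void *demo_xmalloc0(size_t size)
{
    void *p = calloc(1, size ? size : 1);

    if (!p) {
        fputs("out of memory\n", stderr);
        abort();
    }
    return p;
}

/* Step 1: the legacy checked API keeps its signature, so existing callers
 * and their error labels still compile, but it can no longer fail. */
static int demo_alloc_compat(void **ptr, size_t size)
{
    *ptr = demo_xmalloc0(size);
    return 0;
}

/* Step 2: a caller converted to the direct style needs no return value
 * check, no goto and no cleanup label for the allocation. */
static char *demo_greeting_direct(const char *name)
{
    size_t len = strlen(name) + sizeof("hello, ");
    char *buf = demo_xmalloc0(len);

    snprintf(buf, len, "hello, %s", name);
    return buf;
}
```

Step 1 is a purely internal change, which is what allows the conversion of
callers in step 2 to happen gradually, one function at a time.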

Use of GLib brings a number of other significant opportunities / advantages
to ongoing development:

 - Access to many standard data structures (hash tables, linked lists,
   queues, growable arrays, growable strings).

 - Access to platform portability APIs for threads, event loops, file I/O,
   filename parsing and more.

 - Access to GObject to replace our own virObject clone with something that is
   far more feature rich. This is especially true of its callback facility
   via signal handlers.

 - Access to GIO to replace much of our socket I/O portability wrappers, and
   DBus client APIs.

 - Opens the possibility of calling to code in a non-C language via the GObject
   introspection bindings for GObject & other helper APIs.

 - Ability to drop use of gnulib once the critical set of functionality we
   currently rely on it for is ported to the GLib APIs. Most important is
   the sockets / event loop stuff due to Windows portability.

 - Ability to drop use of autoconf/automake in favour of a more maintainable
   option like meson/ninja, once we eliminate use of gnulib.

Ultimately the big picture benefit is that we can spend less time working on
low level generic code related to platform portability, data structures, or
system services.

Note this is not claiming that all GLib's APIs are better than stuff we have
implemented in libvirt already. The belief is that even if some of the GLib
APIs are worse, it is still a benefit to the project to use them, as it will
reduce the code we maintain ourselves. This in turn lets us spend more time
working on high level code directly related to virtualization features and
be more productive when doing so.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|