[libvirt] Configuring pflash devices for OVMF firmware

Fri Jan 25 15:03:33 UTC 2019

We configure OVMF firmware for PC machine types with -drive if=pflash.
This is pretty much the last remaining use of -drive in libvirt we can't
yet replace by -blockdev.  Such a replacement is desirable, because
-blockdev + -device is more flexible than -drive if=pflash.  Also, once
we don't need -drive with new QEMU anymore, the path for deleting all
-drive code in libvirt some day is open.  As with all desirables, the
benefit needs to exceed the cost.

I'm going to describe the status quo, how we got there (briefly and much
simplified), then sketch how to replace -drive if=pflash.  I'm afraid
this is fairly long; sorry.  Please correct misunderstandings.  Beware,
my libvirt and OVMF fu is much weaker than my QEMU fu.

In the beginning, board code read the BIOS from a fixed file and mapped
it into the guest's address space.  Life was simple.

On physical hardware, the BIOS can persist a bit of state across (cold)
reboots by storing it in (non-volatile) CMOS RAM.  We didn't bother.
Simple.

Fast forward several years, and The Law of OS Envy (every program wants
to grow into a full-blown operating system) has asserted itself: PC
Firmware has grown from an 8KiB ROM using a few bytes of volatile and
non-volatile RAM into a multi-megabyte beast with much more complex
storage needs.

On today's physical PC hardware, firmware is stored in flash memory.
There's code, and there's persistent data.  For obvious reasons, the
code should be write-protected except when doing an upgrade.  "Secure
boot" additionally needs to restrict data writes to system management
mode (SMM).

Here's our first iteration of OVMF support, at QEMU level:

    -drive if=pflash,format=raw,file=/where/ever/OVMF.fd

Generic code creates a block backend for it.  Magic board code picks up
the backend, creates a frontend (a cfi.pflash01 device), and maps it
into the guest's address space.

At libvirt level:

  <loader type="pflash">/where/ever/OVMF.fd</loader>

Problem: while the flash device model provides read-only capability,
it's all-or-nothing.  You can't tell it to write-protect just the part
holding code.  The examples above don't write-protect anything.
/where/ever/OVMF.fd better be writable exclusively.

The flash device model could be enhanced, but we went down a different
path: we split the single OVMF image OVMF.fd ("unified build") into a
code image OVMF_CODE.fd and a data image OVMF_VARS.fd ("split build").
At QEMU level:

    -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd
    -drive if=pflash,format=raw,file=/where/ever/OVMF_VARS.fd

OVMF_CODE.fd must be unit 0, and OVMF_VARS.fd must be unit 1.

Generic code creates two block backends.  Magic board code picks them
up, creates a frontend (a cfi.pflash01 device) for each, and maps them
into the guest's address space.

Note there are *two* virtual flash devices now, whereas physical
hardware commonly has just one.

At libvirt level:

  <loader type="pflash" readonly="yes">/usr/share/OVMF/OVMF_CODE.fd</loader>
  <nvram template="/usr/share/OVMF/OVMF_VARS.fd">/var/libvirt/nvram/${guest}_VARS.fd</nvram>

This treats OVMF_VARS.fd as a read-only template, and gives each guest
its own writable copy, which is nice.

The flash device model supports restricting writes to SMM (remember,
that's required for secure boot).  It's controlled by cfi.pflash01
property secure, off by default.  If we created the device model with
-device, we'd simply pass secure=on.  But since we create it with -drive
if=pflash, we can't.  Instead we have to use

    -global driver=cfi.pflash01,property=secure,value=on

This flips the global default value.  Awkward, but works out okay,
because (1) the flash device holding OVMF_VARS.fd wants this value, and
(2) the flash device holding OVMF_CODE.fd doesn't care (it's read-only),
and (3) there is no way to create additional flash devices.

At the libvirt level, we add secure='yes' to the loader element.

We also have to enable SMM emulation.  At QEMU level:

    -machine smm=on

At libvirt level:

    <features>
      <smm state='on'/>
    </features>

Note that the above configuration examples involve selecting OVMF
images.  A bit of an inconvenience compared to BIOS, where the default
"use the BIOS shipped with QEMU" pretty much just works.

To add annoyance to inconvenience, different distributions have
different ideas on where to install OVMF images.  And because that's not
complicated enough, we also have to pair code with data images.  And
because that's still not complicated enough, any specific machine type
may work only with a subset of the available firmwares.

The proposed way to deal with all that works as follows.

Each set of firmware images comes with a descriptor file.  These are
JSON and conform to the QAPI schema docs/interop/firmware.json.

Among the descriptors that declare support for the kind of machine we
want, we pick (really: the management application picks) the one with
the highest priority.  The distribution provides default priorities,
which system administrator and user can override.  firmware.json
documents this in much more detail.

I wrote "proposed", because as far as I can tell, neither distributions
nor libvirt are there, yet.

After all this text, I'm finally ready to curve towards -blockdev.
Going from -drive if=T, T!=none to -blockdev involves two steps.  The
first step replaces if=T with if=none and -device.  The second step
replaces -drive if=none with -blockdev.  That step is "obvious" (it took
us a few years to get to obvious, but I digress).  The difficulty is in
the first step.  Two issues:

(1) cfi.pflash01 isn't available with -device.

(2) "Magic board code picks up the backend [created for -drive
    if=pflash], creates a frontend (a cfi.pflash01 device), and maps it
    into the guest's address space."  When we replace if=pflash by
    if=none, we get to replicate that magic on top of -device.

Issue (1) isn't too hard: we add the device to the dynamic sysbus device
white-list, move a sysbus_mmio_map() from pflash_cfi01_realize() into
pflash_cfi01_realize().  The latter requires a new device property to
configure the base address.  I got a working prototype.  Since this
makes the device model's name and properties ABI, review would be
advisable.

To solve (2), we first have to understand the magic.  Device
cfi.pflash01 has the following properties:

    num-blocks                  Size of the device in blocks
    sector-length               Size of a block
                                (admire the choice of names)
    width                       Bank width
    big-endian                  Endianess (d'oh)
    id0, id1, id2, id3          Some kind of device ID, guest-visible,
                                default to zero, few boards change it
    name                        Memory region name
                                (why is this even configurable?)
    phys-addr                   Physical base address
                                (this is the new device property
                                mentioned above)
    secure                      For restricting access to firmware,
                                default off
    device-width                you don't want to know,
                                there is a default, but it's documented
                                as "bad, do not use", yet pretty much
                                all boards use it
    max-device-width            defaults to device-width
                                not actually set anywhere
    old-multiple-chip-handling  back-compat gunk for
                                machine types 2.8 and older

The magic board code in hw/i386/pc_sysfw.c configures as follows:

    num-blocks                  computed from backend size
    sector-length               4096
    width                       1
    big-endian                  0
    id0, id1, id2, id3          all 0
    name                        system.pflash<U>, where U is -drive's
                                unit number
    phys-addr                   computed so
                                unit 0 ends right below 0x100000000,
                                unit n+1 ends at right below unit n

"secure", "device-width", "max-device-width",
"old-multiple-chip-handling" are left at the default.

One additional bit of magic is actually in libvirt: it configures
"secure" by flipping its default with
-global driver=cfi.pflash01,property=secure,value=on.

Now let's consider how to replicate this magic on top of device.

Perhaps machine-type specific defaults could take care of sector-length,
width, big-endian, id0, id1, id2, id3.  Leaves num-blocks, name, and
phys-addr.

Perhaps the realize() method could default num-blocks to size of
backend.  But that doesn't really help the management application,
because it needs to mess with the size anyway to compute phys-addr.  So
scratch that idea.

Moving the magic code to compute num-blocks, phys-addr and name to the
management application is certainly possible, but ugly.

Note that the values computed are fixed when the firmware gets deployed.
If we record them in the firmware descriptor, the management application
doesn't need magic, it can simply pass on the values obtained from the
descriptor.

We'd want to include sector-length in the descriptor then, to ensure
num-block has a defined meaning.

Same technique could take care of width, big-endian, ... in case
machine-type specific defaults turn out to be inadequate for them.

Opinions?

One more problem: the magic board code does a bit more than just
configure the cfi.pflash01 device.  That additional magic needs to be
generalized to work regardless of whether the device gets configured
with -drive if=pflash or with -device.  I got a working prototype.