[libvirt] Matching the type of mediated devices in the migration

Tian, Kevin kevin.tian at intel.com
Thu Aug 23 04:02:43 UTC 2018


> From: Alex Williamson [mailto:alex.williamson at redhat.com]
> Sent: Thursday, August 23, 2018 11:47 AM
> 
> On Wed, 22 Aug 2018 02:30:12 +0000
> "Tian, Kevin" <kevin.tian at intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson at redhat.com]
> > > Sent: Wednesday, August 22, 2018 10:08 AM
> > >
> > > On Wed, 22 Aug 2018 01:27:05 +0000
> > > "Tian, Kevin" <kevin.tian at intel.com> wrote:
> > >
> > > > > From: Wang, Zhi A
> > > > > Sent: Wednesday, August 22, 2018 2:43 AM
> > > > > >
> > > > > > Are there any suggestions how we can deal with security issues?
> > > > > > Allowing userspace to provide a data stream representing the
> > > > > > internal state of a virtual device model living within the kernel
> > > > > > seems troublesome.  If we need to trust the data stream, do we
> > > > > > need to somehow make the operation more privileged than what a
> > > > > > vfio user might have otherwise?  Does the data stream need to be
> > > > > > somehow signed and how might we do that?  How can we build in
> > > > > > protection against an untrusted restore image?  Thanks,
> > > >
> > > > imo it is not necessary. restoring mdev state should be handled as if
> > > > the guest were programming the mdev.
> > >
> > > To me this suggests that a state save/restore is just an algorithm
> > > executed by userspace using the existing vfio device accesses.  This is
> > > not at all what we've been discussing for migration.  I believe the
> >
> > not an algorithm by userspace. It's the kernel driver that applies the
> > audit when receiving the opaque state data.
> 
> And a kernel driver receiving and processing opaque state data from a
> user doesn't raise security concerns for you?

It is opaque only from the userspace p.o.v.; the kernel driver understands
the actual format and thus can audit the state while restoring it.
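
To make the idea concrete, below is a minimal, self-contained sketch in
plain C of a vendor driver validating an incoming blob and replaying it
through its normal emulation handlers. All structure and function names
here are invented for illustration; this is not any real vfio or GVT-g
interface.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are invented for illustration. */
#define MDEV_STATE_MAGIC   0x4d444556u   /* "MDEV" */
#define MDEV_STATE_VERSION 1u
#define MDEV_MAX_REGS      256u

struct mdev_state_blob {
	uint32_t magic;
	uint32_t version;
	uint32_t num_regs;
	uint32_t regs[];          /* guest-visible register values */
};

/* Stand-in for the driver's normal, already-audited register-write path. */
static int emulate_reg_write(unsigned int reg, uint32_t val)
{
	if (reg >= MDEV_MAX_REGS)
		return -EINVAL;
	(void)val;                /* range checks, translation audits, ... */
	return 0;
}

/*
 * Audit an opaque state blob received from userspace, then replay it
 * through the same path the guest would use to program the device.
 */
static int restore_mdev_state(const struct mdev_state_blob *blob, size_t len)
{
	uint32_t i;

	if (len < sizeof(*blob) ||
	    blob->magic != MDEV_STATE_MAGIC ||
	    blob->version != MDEV_STATE_VERSION ||
	    blob->num_regs > MDEV_MAX_REGS ||
	    len < sizeof(*blob) + blob->num_regs * sizeof(uint32_t))
		return -EINVAL;

	for (i = 0; i < blob->num_regs; i++) {
		int ret = emulate_reg_write(i, blob->regs[i]);

		if (ret)
			return ret;
	}
	return 0;
}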

> 
> > > interface we've been hashing out exposes opaque device state through a
> > > vfio region.  We therefore must assume that that opaque data contains
> > > not only device state, but also emulation state, similar to what we see
> > > for any QEMU device.  Not only is there internal emulation state, but
> > > we have no guarantee that the device state goes through the same
> > > auditing as it does through the vfio interface.  Since this device and
> > > emulation state live inside the kernel and not just within the user's
> > > own process, a malicious user can do far more than shoot themselves.  It
> > > would be one thing if devices were IOMMU isolated, but they're not,
> > > they're isolated through vendor and device specific mechanisms, and for
> > > all we know the parameters of that isolation are included in the
> > > restore state.  I don't see how we can say this is not an issue.
> >
> > I didn't quite get this. My understanding is that the isolation
> > configuration is completed when an mdev is created on the DEST machine
> > given a type definition. The state image contains just runtime data
> > reflecting what the guest driver did on the SRC machine. Restoring such
> > state shouldn't change the isolation policy.
> 
> Let's invent an example where the mdev vendor driver has a set of
> pinned pages which are the current working set for the device at the
> time of migration.  Information about that pinning might be included in
> the opaque migration state.  If a malicious user discovers this, they
> can potentially also craft a modified state which can exploit the host
> kernel isolation.

pinned pages may not be a good example. The pinning knowledge can be
reconstructed when restoring the state (e.g. in GVT-g, pinning is triggered
by shadowing the GPU page table, which has to be recreated on DEST).
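
As a rough, self-contained sketch of what I mean (all names are invented,
this is not the actual GVT-g code): the image would carry only the
guest-visible GTT entries, and the shadow table, the host addresses and
the pins are all re-derived on DEST through the host's own translation
path, so a forged image cannot point the device at arbitrary host memory.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are invented for illustration. */
#define GTT_ENTRY_VALID (UINT64_C(1) << 0)
#define PAGE_SHIFT_4K   12

/* Stand-in for the host's own translate-and-pin path
   (vfio_pin_pages()-style lookup would go here in a real driver). */
static int host_translate_and_pin(uint64_t gfn, uint64_t *hpa_out)
{
	*hpa_out = gfn << PAGE_SHIFT_4K;   /* placeholder mapping */
	return 0;
}

/* Rebuild the shadow table from the restored *guest* entries only. */
static int rebuild_shadow_gtt(const uint64_t *guest_entries,
			      uint64_t *shadow_entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t gfn, hpa;
		int ret;

		if (!(guest_entries[i] & GTT_ENTRY_VALID)) {
			shadow_entries[i] = 0;
			continue;
		}

		gfn = guest_entries[i] >> PAGE_SHIFT_4K;
		ret = host_translate_and_pin(gfn, &hpa);
		if (ret)
			return ret;        /* invalid guest mapping: reject */

		shadow_entries[i] = hpa | GTT_ENTRY_VALID;
	}
	return 0;
}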

> 
> > > > Then all the audits/security checks
> > > > enforced in the normal emulation path should still apply. The vendor
> > > > driver may choose to audit every state restore operation one-by-one,
> > > > or do it altogether at a synchronization point (e.g. when the mdev is
> > > > re-scheduled, similar to what we did before VMENTRY).
> > >
> > > Giving the vendor driver the choice of whether to be secure or not is
> > > exactly what I'm trying to propose we spend some time thinking about.
> > > For instance, what if instead of allowing the user to load device state
> > > through a region, the kernel could side-load it using something similar
> > > to the firmware loading path.  The user could be provided with a file
> > > name token that they push through the vfio interface to trigger the
> > > state loading from a location with proper file level ACLs such that the
> > > image can be considered trusted.  Unfortunately the collateral is that
> > > libvirt would need to become the secure delivery entity, somehow
> > > stripping this section of the migration stream into a file and
> > > providing a token for the user to ask the kernel to load it.  What are
> > > some other options?  Could save/restore be done simply as an
> > > algorithmic script matched to a stack of data, as I read into your first
> > > statement above?  I have doubts that we can achieve the internal state
> > > we need, or maybe even the performance we need using such a process.
> > > Thanks,
> > >
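
If I understand the side-load idea, the kernel side could look roughly like
the sketch below. Only request_firmware()/release_firmware() are the real
kernel interfaces here; the vfio plumbing (how the token arrives) and the
mdev_* names are invented for illustration.

/*
 * Userspace passes only a file-name token through vfio; the vendor driver
 * pulls the actual state image from a root-controlled location via the
 * firmware loader, so the image never flows through the untrusted user.
 */
#include <linux/device.h>
#include <linux/firmware.h>

static int mdev_restore_state(struct device *dev, const void *data,
			      size_t len);	/* driver's own audit path */

static int mdev_sideload_state(struct device *dev, const char *token)
{
	const struct firmware *fw;
	int ret;

	/* Loads e.g. /lib/firmware/<token>, subject to normal file ACLs. */
	ret = request_firmware(&fw, token, dev);
	if (ret)
		return ret;

	ret = mdev_restore_state(dev, fw->data, fw->size);

	release_firmware(fw);
	return ret;
}
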
> >
> > for GVT-g I think we invoke the same common functions as used in the
> > emulation path to recover the vGPU state, e.g. the gtt r/w handlers, etc.
> > Zhi can correct me if I'm wrong.
> 
> One example of migration state being restored in a secure manner does
> not prove that such an interface is universally secure or a good idea.
> 
> > Can you elaborate on the difference between device state and emulation
> > state which you mentioned earlier? We may need to look at a concrete
> > example to understand the actual problem here.
> 
> See my example above, or imagine that the migration state information
> includes any sort of index field where a user might be able to modify
> the index and trick the driver into inserting malicious code elsewhere
> in the host kernel stack.  It's a security nightmare waiting to
> happen.  Thanks,
> 

one thing I'm not sure about is why this becomes a new concern here when
the same concern is not raised for the existing VM state maintained by the
kernel, e.g. the various vCPU states...
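
For comparison, this is roughly how vCPU state already flows from userspace
into the kernel on the DEST side today, through the real KVM ioctl
interface; the wrapper below is only illustrative, not QEMU's actual code.
KVM applies its own checks on those paths (e.g. KVM_SET_SREGS/KVM_SET_MSRS
can reject bad values) rather than trusting the image.

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative wrapper: hand restored register contents to the kernel. */
static int restore_vcpu_regs(int vcpu_fd, const struct kvm_regs *regs)
{
	if (ioctl(vcpu_fd, KVM_SET_REGS, regs) < 0) {
		perror("KVM_SET_REGS");
		return -1;
	}
	return 0;
}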

Thanks
Kevin



