[libvirt] AMD SEV's /dev/sev permissions and probing QEMU for capabilities

Fri Jan 18 11:11:50 UTC 2019

On Fri, Jan 18, 2019 at 10:16:38AM +0000, Daniel P. Berrangé wrote:
>On Fri, Jan 18, 2019 at 10:39:35AM +0100, Erik Skultety wrote:
>> Hi,
>> this is a summary of a private discussion I've had with guys CC'd on this email
>> about finding a solution to [1] - basically, the default permissions on
>> /dev/sev (below) make it impossible to query for SEV platform capabilities,
>> since by default we run QEMU as qemu:qemu when probing for capabilities. It's
>> worth noting that this is only relevant to probing, since for a proper QEMU
>> VM we create a mount namespace for the process and chown all the nodes (needs a
>> SEV fix though).
>>
>> # ll /dev/sev
>> crw-------. 1 root root
>>
>> I suggested either force running QEMU as root for probing (despite the obvious
>> security implications) or using namespaces for probing too. Dan argued that
>> this would have a significant perf impact and suggested we ask systemd to add a
>> global udev rule.
>

If creating namespaces has a performance impact, then why don't we
special-case the probing: create one namespace for probing, once, and
probe all QEMU binaries in that one namespace?
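A minimal sketch of what I mean, just to make the shape concrete. It only
builds the command line; it assumes util-linux `unshare --mount` as the way to
get the namespace, and the binary paths and the `--version` stand-in for the
real capability probe are placeholders, not what libvirt actually runs.

```python
def single_namespace_probe_argv(binaries, probe_args="--version"):
    """Build ONE argv that enters a single mount namespace and then
    runs a probe for every QEMU binary inside it."""
    # Shell loop executed inside the namespace; "$@" expands to the
    # binaries appended after the "probe" placeholder ($0).
    script = 'for qemu in "$@"; do "$qemu" ' + probe_args + '; done'
    # `unshare --mount` appears once, so the namespace setup cost is
    # paid once regardless of how many binaries are probed.
    return ["unshare", "--mount", "sh", "-c", script, "probe"] + list(binaries)


# Hypothetical install paths for the two KVM-capable binaries on x86_64.
argv = single_namespace_probe_argv(
    ["/usr/bin/qemu-system-x86_64", "/usr/bin/qemu-system-i386"]
)
print(argv)
```

Any one-time setup inside the namespace (creating and chowning the /dev/sev
node, dropping to qemu:qemu) would go at the start of that script and is left
out here.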

>I've just realized there is a potential 3rd solution. Remember there is
>actually nothing inherently special about the 'root' user as an account
>ID. 'root' gains its powers from the fact that it has many capabilities
>by default.  'qemu' can't access /dev/sev because it is owned by a
>different user (happens to be root) and 'qemu' does not have capabilities.
>
>So we can make probing work by using our capabilities code to grant
>CAP_DAC_OVERRIDE to the qemu process we spawn. So probing still runs
>as 'qemu', but can nonetheless access /dev/sev while it is owned
>by root.  We were not using 'qemu' for the sake of security, as the probing
>process is not executing any untrustworthy code, so we don't lose any
>security protection by granting CAP_DAC_OVERRIDE.
>

IMHO CAP_DAC_OVERRIDE is a lot, especially on systems without SELinux.
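
To illustrate why, here is a deliberately simplified model of the
owner/group/other check for opening a file read-write. It ignores ACLs, LSMs
and everything else the kernel really does, and the qemu UID/GID is made up.
The point is only that the capability is not scoped to /dev/sev.

```python
import stat


def may_open_rw(uid, gids, file_uid, file_gid, mode, dac_override=False):
    """Toy DAC check for read+write access (no ACLs, no LSM, no root special-casing)."""
    if dac_override:
        # CAP_DAC_OVERRIDE bypasses the read/write permission check entirely.
        return True
    if uid == file_uid:
        want = stat.S_IRUSR | stat.S_IWUSR
    elif file_gid in gids:
        want = stat.S_IRGRP | stat.S_IWGRP
    else:
        want = stat.S_IROTH | stat.S_IWOTH
    return (mode & want) == want


QEMU_UID, QEMU_GID = 107, 107  # hypothetical IDs for qemu:qemu
files = {
    "/dev/sev": (0, 0, 0o600),              # root:root crw------- as quoted above
    "other root-only file": (0, 0, 0o600),  # stand-in for anything else owned by root
}

for name, (fuid, fgid, mode) in files.items():
    plain = may_open_rw(QEMU_UID, {QEMU_GID}, fuid, fgid, mode)
    with_cap = may_open_rw(QEMU_UID, {QEMU_GID}, fuid, fgid, mode, dac_override=True)
    print(f"{name}: without cap={plain}, with CAP_DAC_OVERRIDE={with_cap}")
```

Both files flip from denied to allowed, which is what I mean by "a lot".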

>> I proceeded with cloning [1] to systemd and creating a udev rule that I planned
>> on submitting to systemd upstream - the initial idea was to mimic /dev/kvm and
>> make it world accessible to which Brijesh from AMD expressed a concern that
>> regular users might deplete the resources (limit on the number of guests
>> allowed by the platform). But since the limit is claimed to be around 4, Dan
>> discouraged me from continuing with restricting the udev rule to only the 'kvm'
>> group, which Laszlo suggested earlier, as the limit is so small that a malicious
>> QEMU could easily deplete this during probing. This fact also ruled out any
>> kind of ACL we could create dynamically. Instead, he suggested that we filter
>> out the kvm-capable QEMU and put only that one in the namespace without a
>> significant perf impact.
>
>Yes, my suggestion to mimic /dev/kvm was based on the mistaken understanding
>that there was not a finite resource limit. Given that there are one or more
>finite resource limits, we need access control on which unprivileged users,
>and/or which individual QEMU instances, are permitted access. This means /dev/sev
>must remain with restrictive user/group/permissions that prevent any unprivileged
>account from having access. This means either root:root 0770/0700, or possibly
>having an 'sev' group and using root:sev 0770, so that users can be granted
>access via 'sev' group membership which (might?) allow unprivileged libvirtd to
>use 'sev' if the user was added.
>
>>     - my take on this is that there could potentially be more than a single
>>       kvm-enabled QEMU and therefore we'd need to create more than just a
>>       single namespace.
>
>True, I guess qemu-system-x86_64 and qemu-system-i386 both get KVM
>on an x86_64 host, and likewise for many other 64-bit archs supporting
>32-bit apps.
>
>>     - I also argued that I can imagine that the same kind of DoS attack might be
>>       possible from within the namespace, even if we created the /dev/sev node
>>       only in SEV-enabled guests (which we currently don't). All of us have
>>       agreed that allowing /dev/sev in the namespace for only SEV-enabled
>>       guests is worth doing nonetheless.
>
>There's never any perfect level of protection. We're just striving to
>minimize the attack surface by only exposing it where there's a genuine
>need to use it.
>
>> In the meantime, Christophe went through the kernel code to verify how the SEV
>> resources are managed and what protection is currently in place to mitigate the
>> chance of a process easily depleting the limit on SEV guests. He found that
>> ASID, which determines the encryption key, is allocated from a single ASID
>> bitmap and essentially guarded by a single 'sev->active' flag.
>>
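
For anyone following along, a toy model of the depletion concern: a single
fixed-size pool handed out first-free, with nothing limiting how many slots one
caller takes. This is not the kernel code; the class and method names are made
up, and the limit of 4 is just the figure claimed earlier in the thread.

```python
class AsidPool:
    """Toy first-free allocator over a fixed-size bitmap (illustrative only)."""

    def __init__(self, max_asids):
        self.in_use = [False] * max_asids

    def alloc(self):
        """Return the lowest free ASID (1-based), or None when the pool is exhausted."""
        for index, used in enumerate(self.in_use):
            if not used:
                self.in_use[index] = True
                return index + 1
        return None

    def free(self, asid):
        """Release a previously allocated ASID."""
        self.in_use[asid - 1] = False


pool = AsidPool(max_asids=4)                 # assumed limit, per the claim above
greedy = [pool.alloc() for _ in range(4)]    # one process takes every slot
next_guest = pool.alloc()                    # nothing left for anyone else
print(greedy, next_guest)
```

Once the four slots are gone, every other guest is refused until something is
freed, which is why who gets to open /dev/sev matters.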
>> So, in conclusion, we absolutely need input from Brijesh (AMD) whether there
>> was something more than the low limit on number of guests behind the default
>> permissions. Also, we'd like to get some details on how the limit is managed,
>> helping to assess the approaches mentioned above.
>
>Regardless of this problem, I think it is important to have some docs
>in either libvirt or QEMU that describe the resource usage constraints
>so that management apps can decide how to best take advantage of SEV.
>
>>
>> Thanks and please do share your ideas,
>> Erik
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665400
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1561113
>
>Regards,
>Daniel
>-- 
>|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|