[libvirt] Notes from the KVM Forum relevant to libvirt

Thu Aug 25 13:58:27 UTC 2011

Quoting Stefan Hajnoczi (stefanha at gmail.com):
> On Thu, Aug 25, 2011 at 11:03 AM, Daniel P. Berrange
> <berrange at redhat.com> wrote:
> > On Thu, Aug 25, 2011 at 10:10:27AM +0100, Stefan Hajnoczi wrote:
> >> On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> >> > On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
> >> >> On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> >> >> > On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
> >> >> >> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
> >> >> >> <berrange at redhat.com> wrote:
> >> >> >> > I was at the KVM Forum / LinuxCon last week and there were many
> >> >> >> > interesting things discussed which are relevant to ongoing libvirt
> >> >> >> > development. Here was the list that caught my attention. If I have
> >> >> >> > missed any, fill in the gaps....
> >> >> >> >
> >> >> >> >  - Sandbox/container KVM.  The Solaris port of KVM puts QEMU inside
> >> >> >> >   a zone so that an exploit of QEMU can't escape into the full OS.
> >> >> >> >   Containers are Linux's parallel of Zones, and while not nearly as
> >> >> >> >   secure yet, it would still be worth using more containers support
> >> >> >> >   to confine QEMU.
> >> >> >>
> >> >> >> Can you elaborate on why Linux containers are "not nearly as secure"
> >> >> >> [as Solaris Zones]?
> >> >> >
> >> >> > Mostly because the Linux namespace functionality is far from complete,
> >> >> > notably lacking proper UID/GID/capability separation, and UID/GID
> >> >> > virtualization wrt filesystems. The longer answer is here:
> >> >> >
> >> >> >   https://wiki.ubuntu.com/UserNamespace
> >> >> >
> >> >> > So at this time you can't build a secure container on Linux, relying
> >> >> > on DAC alone. You have to add a MAC layer on top of the container
> >> >> > to get full security benefits, which obviously defeats the point of
> >> >> > using the container as a backup for failure in the MAC layer.
> >> >>
> >> >> Thanks, that is interesting.  I still don't understand why that is a
> >> >> problem.  Linux containers (lxc) use a different pid namespace (no
> >> >> ptrace worries), file system root (restricted to a subdirectory tree),
> >> >> forbids most device nodes, etc.  Why does the user namespace matter
> >> >> for security in this case?
> >> >
> >> > A number of reasons really...
> >> >
> >> > If user ID '0' on the host starts a container, and a process inside
> >> > the container does 'setuid(500)', then any user outside the container
> >> > with UID 500 will be able to kill that process. Only user ID '0' should
> >> > have been allowed to do that.
> >> >
> >> > It will also let non-root user IDs on the host OS start containers
> >> > and have root uid=0 inside the container.
> >> >
> >> > Finally, any files created inside the container with, say, uid 500
> >> > will be accessible by any other process with UID 500, in either the
> >> > host or any other container.
> >>
> >> These points mean that the host can peek inside containers and has
> >> access to their processes/files.  But from the point of view of a libvirt
> >> running inside a container there is no security problem.
> >>
> >> This is kind of like saying that root on the host can modify KVM guest
> >> disk images.  That is true but I don't see it as a security problem
> >> because the root on the host is the trusted part of the system.
> >>
> >> >> I think it matters when giving multiple containers access to the same
> >> >> file system.  Is that what you'd like to do for libvirt?
> >> >
> >> > Each container would have to share a (readonly) view onto the host
> >> > filesystem so it can see the QEMU emulator install / libraries. There
> >> > would also have to be some writable areas per QEMU container.  QEMU
> >> > inside the container would be set to run as some non-root UID (from
> >> > the container's POV). So both problem 1 & 3 above would impact the
> >> > security of this confinement.
> >>
> >> But is there a way to escape confinement?  If not, then this is secure.
> >
> > The filesystem UID/GID ownership is the most likely way you can escape
> > the confinement. You would have to be very careful to ensure that each
> > container's view of the filesystem did not include any directories
> > with files that are assigned to another container, since the UID
> > separation would not prevent access to another container's resources.
> >
> > This is rather tedious but could be just about doable, but it gets
> > harder when you throw in things like sysfs and PCI device assignment.
> > e.g. a guest with a PCI device assigned gets given ownership of the files
> > in /sys/bus/pci/devices/0000:00:XX:XX/ and since there is no UID
> > namespacing, this will be accessible to any other container with the
> > same UID. To hack around this when starting up a container you would
> > probably have to bind mount an empty tmpfs over the top of all the
> > PCI device paths you wanted to block in sysfs.

Which of course is easily undoable by root in the container :)
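
To make that concrete, here is a rough sketch in Python of the commands
involved. It only builds and prints the argv lists and does not run
anything. The helper names are made up for illustration, and the device
path is the placeholder quoted above, not a real device. It assumes root
in the container still holds the privilege needed to unmount.

```python
import shlex


def mask_commands(paths):
    """Build argv lists that would mount an empty tmpfs over each path.

    Hypothetical helper: it only constructs the commands, it runs nothing.
    """
    return [["mount", "-t", "tmpfs", "tmpfs", path] for path in paths]


def unmask_commands(paths):
    """Build the argv lists root inside the container could use to undo it."""
    return [["umount", path] for path in paths]


# Placeholder path copied from the discussion above, not a real device.
PCI_PATHS = ["/sys/bus/pci/devices/0000:00:XX:XX/"]

for argv in mask_commands(PCI_PATHS) + unmask_commands(PCI_PATHS):
    # Print each command in a form that could be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The point is simply that hiding the path takes one mount and exposing it
again takes one umount, so the mask by itself is not a security boundary.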

> Ah, I hadn't thought of /sys/bus/pci or /sys/bus/usb!
> 
> Thanks for the explanation and it does seem like the design would get messy.

And plenty more, e.g.  http://blog.bofh.it/debian/id_413

See http://sourceforge.net/mailarchive/message.php?msg_id=27878921 for
someone actively using Smack to help mitigate this (which could also be
done with SELinux).  But yes, this is exactly what user namespace is
designed to address.  The week before last we got a proof of concept of
a filesystem being assigned to a user namespace, which would just about
allow user namespaces to be useful in a container.  It's up at
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-userns-devel.git
When I return from vacation I need to continue work on pushing at least the
first part of that patchset.
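
For anyone following along, the UID collision discussed earlier in the
thread can be modelled in a few lines of Python. This is a simplified
sketch of the kill(2) permission rule (sender's real or effective UID
must match the target's real or saved UID), not kernel code. The Proc
class and may_signal() are names invented for the example, and treating
euid 0 as always allowed is an assumption standing in for CAP_KILL.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    ruid: int   # real UID
    euid: int   # effective UID
    suid: int   # saved set-user-ID
    where: str  # "host" or a container label; the check below ignores it


def may_signal(sender, target):
    """Simplified kill(2) permission check without user namespaces."""
    if sender.euid == 0:
        # Assumption: euid 0 is treated as holding CAP_KILL.
        return True
    allowed = (target.ruid, target.suid)
    return sender.ruid in allowed or sender.euid in allowed


# QEMU inside a container after setuid(500), and two ordinary host users.
qemu = Proc("qemu", 500, 500, 500, "container-A")
host_user = Proc("alice", 500, 500, 500, "host")
other_user = Proc("bob", 501, 501, 501, "host")

print(may_signal(host_user, qemu))   # same numeric UID on the host
print(may_signal(other_user, qemu))  # different numeric UID
```

Nothing in the check looks at where the process lives, only at the
numeric UIDs. That is the gap a per-container UID mapping would close.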

-serge