[libvirt] [PATCH 1/3] lxc: allow to keep or drop capabilities

Mon Jun 23 15:59:32 UTC 2014

On Thu, Jun 12, 2014 at 08:48:25AM +0200, Cédric Bosdonnat wrote:
> Added <capabilities> in the <features> section of LXC domains
> configuration. This section can contain elements named after the
> capabilities like:
> 
>   <mknod state="on"/>, keep CAP_MKNOD capability
>   <sys_chroot state="off"/> drop CAP_SYS_CHROOT capability
> 
> Users can restrict or give more capabilities than the default using
> this mechanism.
> ---
>  docs/schemas/domaincommon.rng                   | 196 ++++++++++++++++++++++++
>  src/conf/domain_conf.c                          |  93 ++++++++++-
>  src/conf/domain_conf.h                          |  47 ++++++
>  src/libvirt_private.syms                        |   1 +
>  src/lxc/lxc_cgroup.c                            |   5 +
>  src/lxc/lxc_container.c                         |  90 +++++++++--
>  tests/domainschemadata/domain-caps-features.xml |  28 ++++
>  7 files changed, 442 insertions(+), 18 deletions(-)
>  create mode 100644 tests/domainschemadata/domain-caps-features.xml

> diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
> index 8dfdc60..71a0d61 100644
> --- a/src/lxc/lxc_cgroup.c
> +++ b/src/lxc/lxc_cgroup.c
> @@ -357,6 +357,11 @@ static int virLXCCgroupSetupDeviceACL(virDomainDefPtr def,
>          {'c', LXC_DEV_MAJ_FUSE, LXC_DEV_MIN_FUSE},
>          {0,   0, 0}};
>  
> +    /* No white list if CAP_MKNOD has to be kept */
> +    int capMknod = def->caps_features[VIR_DOMAIN_CAPS_FEATURE_MKNOD];
> +    if (capMknod == VIR_DOMAIN_FEATURE_STATE_ON)
> +        return 0;
> +
>      if (virCgroupDenyAllDevices(cgroup) < 0)
>          goto cleanup;

So wrt device nodes we have two layers of defence

 - Blocking CAP_MKNOD - this means user can only have access to
   device nodes that are present in the /dev that we populate.
 - CGroups ACL blocking mkdir, read and write for everything
   except the device nodes that are implied by (or explicitly
   requested in) the XML

I can see the value in allowing CAP_MKNOD because if we have
granted the user '/dev/foo' access, there's no harm in letting
them "mknod /dev/foo" too. If we have granted CAP_MKNOD though
I'm not convinced that this implies we should allow access to
any possible device ever.

So at most I think we should allow 'mknod' in the cgroups, but
still keep read+write blocked. We definitely shouldn't allow
read+write just because they requested CAP_MKNOD. We already
let users request access to arbitrary devices in the XML config
to deal with the latter.

> -    if ((ret = capng_updatev(CAPNG_DROP,
> -                             CAPNG_EFFECTIVE | CAPNG_PERMITTED |
> -                             CAPNG_INHERITABLE | CAPNG_BOUNDING_SET,
> -                             CAP_SYS_MODULE, /* No kernel module loading */
> -                             CAP_SYS_TIME, /* No changing the clock */
> -                             CAP_MKNOD, /* No creating device nodes */
> -                             CAP_AUDIT_CONTROL, /* No messing with auditing status */
> -                             CAP_MAC_ADMIN, /* No messing with LSM config */
> -                             keepReboot ? -1 : CAP_SYS_BOOT, /* No use of reboot */
> -                             -1)) < 0) {
> -        virReportError(VIR_ERR_INTERNAL_ERROR,
> -                       _("Failed to remove capabilities: %d"), ret);
> -        return -1;
> +    for (i = 0; i < VIR_DOMAIN_CAPS_FEATURE_LAST; i++) {
> +        bool toDrop = false;
> +        int state = def->caps_features[i];
> +
> +        switch ((virDomainCapsFeature) i) {
> +        case VIR_DOMAIN_CAPS_FEATURE_SYS_BOOT: /* No use of reboot */
> +            toDrop = !keepReboot && (state != VIR_DOMAIN_FEATURE_STATE_ON);
> +            break;
> +        case VIR_DOMAIN_CAPS_FEATURE_SYS_MODULE: /* No kernel module loading */
> +        case VIR_DOMAIN_CAPS_FEATURE_SYS_TIME: /* No changing the clock */
> +        case VIR_DOMAIN_CAPS_FEATURE_MKNOD: /* No creating device nodes */
> +        case VIR_DOMAIN_CAPS_FEATURE_AUDIT_CONTROL: /* No messing with auditing status */
> +        case VIR_DOMAIN_CAPS_FEATURE_MAC_ADMIN: /* No messing with LSM config */
> +            toDrop = (state != VIR_DOMAIN_FEATURE_STATE_ON);
> +            break;
> +        default: /* User specified capabilities to drop */
> +            toDrop = (state == VIR_DOMAIN_FEATURE_STATE_OFF);
> +        }
> +
> +        if (toDrop && (ret = capng_update(CAPNG_DROP,
> +                                          CAPNG_EFFECTIVE | CAPNG_PERMITTED |
> +                                          CAPNG_INHERITABLE | CAPNG_BOUNDING_SET,
> +                                          capsMapping[i])) < 0) {
> +            virReportError(VIR_ERR_INTERNAL_ERROR,
> +                           _("Failed to remove capability %s: %d"),
> +                           virDomainCapsFeatureTypeToString(i), ret);
> +            return -1;
> +        }

So what I don't like about the current code is that the set of 5 capabilities
we drop is essentially arbitrary and has no inherant value.

 - If user namespaces are not active, then the container is insecure
   even if we block those 5 caps, so they add no security benefit.

 - If user namespaces are active, then the container is secure even if
   we allow these 5 caps, so again they add no security benefit.

If this were day 1, I'd just allow all possible capabilities by default
but that would be a semantic change - particularly changing from block
CAP_MKNOD to allowing CAP_MKNOD would cause a regression in any containers
using systemd.

My concern, however, is that a user who wants to run a container with
absolutely everything blocked, has to list them all in the XML, and
the kernel adds new capabilities over time. So they might have a config
which is secure today where they've listed everything to drop, then the
kernel adds a new capabilty and they become potentially less secure
since their existing config doesn't know about the new capability name.

Looking at the XML you've proposed below:

> +<domain type='lxc'>
> +    <name>demo</name>
> +    <uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid>
> +    <os>
> +        <type>exe</type>
> +        <init>/sh</init>
> +    </os>
> +    <features>
> +        <capabilities>
> +            <mknod state="on"/>
> +        </capabilities>
> +    </features>

I think what's missing here is some notion of the initial policy.
I think the <capabilities> element could usefully grow a new
'policy' attribute eg

     <features>
         <capabilities policy="default|deny|allow">
             <mknod state="on"/>
         </capabilities>
     </features>

Now this would mean:

 - default == the current historical hypervisor specific default
              behaviour. ie what libvirt LXC driver does today.
              can allow/deny any caps to change this default
 - deny == all CAPS blocked by default. Must whitelist any caps
           to be allowed in container
 - allow == all CAPS allowed by default. Must whitelist any caps
            to be blocked from container

And the default policy would be 'default' for sake of back compat.

The final thing that concerns me is that this list of caps names
is pretty much Linux specific. eg Solaris caps will potentially
have different names. I don't see a good way to deal with that
other than to say if we ever need to support this on Solaris, we'll
extend the list of caps names to be the union of all names used by
Linux and Solaris, and only process the caps relevant to each OS.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|