What are Kubernetes Operators?
A Kubernetes Operator is a method of packaging, deploying and managing a Kubernetes application. A Kubernetes application is both deployed on Kubernetes and managed using the Kubernetes API. Operators automate application and service life-cycle management on behalf of a human operator, and they can do so at every level of the stack, from the parts that make up the platform all the way to applications that are provided as a managed service.
Engineering teams can use the power of Operators, which offer autonomous management by exposing configuration natively through Kubernetes objects, for quicker installation and more frequent, robust updates. In addition to the automation advantages of Operators for managing the platform, Red Hat OpenShift makes it easier to find, install and manage Operators running on clusters.
Using Operators in Red Hat OpenShift
Included in Red Hat OpenShift is the embedded OperatorHub, a registry of operators from various software vendors and open source projects. Within the OperatorHub you can browse and install a library of operators that have been verified to work with Red Hat OpenShift and that have been packaged for easier lifecycle management.
OpenShift offers two different systems to manage operators depending on their purpose:
Platform Operators, which are managed by the Cluster Version Operator (CVO), are installed by default to perform cluster functions.
Add-on Operators, which are managed by Operator Lifecycle Manager (OLM), can be made accessible for users to run in their applications.
Additionally, suitably privileged users can manage operators through other means, such as YAML files or Helm charts.
Why is security important to Operators?
More engineering teams are moving toward using Operators to deploy in their production environments. While the full potential of Operators hasn’t yet been reached, it’s important to not lose sight of the benefits of building in more secure code and practices as early as possible in the development process.
Good security practices for Operators
Minimize cluster-scope and namespace-scope permissions
Operators fall into two classes:
Namespace-scoped: the Operator watches and manages resources within one or more namespaces and requires permissions within those namespaces to do so
There are subtypes within this:
in a single, prenamed namespace determined by the developer
in a single namespace provided at installation time
in multiple namespaces (e.g. using the MultiNamespace install mode type in the OperatorGroup)
Cluster-scoped: the Operator watches and manages resources across all namespaces within a cluster and requires cluster-scoped permissions to do so
Generally speaking, following the Principle of Least Privilege (PoLP), you should restrict access as much as possible while still allowing your Operator to function. Permissions can be granted by creating role bindings and cluster role bindings that bind the Operator’s service account to the required roles and cluster roles. This can be achieved using the Operator Lifecycle Manager (OLM) bundle deployment architecture.
An Operator bundle is an OLM-prescribed format for holding metadata about an Operator, separate from the Operator image itself. The metadata contains everything that Kubernetes needs to know in order to use the Operator: its custom resource definitions (CRDs), the role-based access control (RBAC) roles and bindings it requires, its dependency tree and other information as outlined in Deploying Operators with OLM bundles.
A benefit of using OLM is that it manages the permissions needed to install and run the Operator. OLM performs the installation using the cluster-admin role and keeps the install-time requirements separate from the Operator’s own permissions. For example, the APIService and CustomResourceDefinition resources are always created by OLM using the cluster-admin role, so the Operator does not need those permissions itself, which reduces the overall permissions surface.
Using OLM, cluster administrators can choose to specify a service account for an Operator group so that all Operators associated with the group are deployed and run with the privileges granted to that service account. A service account associated with an Operator group should never be granted privileges to write the APIService and CustomResourceDefinition resources. Any Operator tied to this Operator group is then confined to the permissions granted to the specified service account. If the Operator asks for permissions that are outside the scope of the service account, the install fails with appropriate errors.
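As an illustration of the Operator group described above, the following is a minimal sketch of an OperatorGroup that names a service account and a single target namespace. The names scoped-operators, scoped-operator-sa and example-scoped are hypothetical placeholders, and the service account is assumed to already exist in that namespace with suitably narrow role bindings.

```yaml
# Sketch only: all names below are hypothetical placeholders.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: scoped-operators
  namespace: example-scoped
spec:
  # Operators installed through this group are deployed and run
  # with the privileges of this service account.
  serviceAccountName: scoped-operator-sa
  # Limit the group to a single namespace.
  targetNamespaces:
    - example-scoped
```

With a group like this in place, an Operator whose bundle requests permissions beyond those granted to the service account should fail to install rather than silently gain extra rights.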
Reduce the usage of cluster-scope permissions
The use of cluster-scoped Operators should be justified. Where cluster scope is not necessary, the recommendation is to run namespace-scoped Operators with the minimal permissions required.
Cluster-scoped Operators require access to resources across the entire cluster, including the control plane, using permissions obtained by using cluster roles and cluster role bindings.
Namespace-scoped Operators require access only to resources in specific namespaces, and these permissions can be granted using roles and role bindings. The exception is when a namespace-scoped Operator must create a CustomResourceDefinition (CRD), which is a cluster-scoped resource.
If there are static cluster-scoped resources whose definition won’t change based on the inputs given to the Operators, you can move the creation of those resources to the Operator Lifecycle Manager (OLM) catalog. For example, you can move CRD creation from your Operator to OLM since it doesn’t change throughout the Operator’s lifecycle.
Both Kubernetes and OpenShift platforms offer authorization through role-based access control (RBAC). The security context is an essential element of pod and container definitions in Kubernetes. Note that this is different to the OpenShift security feature called security context constraint (SCC).
Kubernetes Operators also define the permissions granted to the Operator, generally in a YAML definition called role.yaml. Roles are assigned at the namespace level, so any escalation of privilege is inherently limited by the namespace itself. ClusterRoles, however, need to be checked more carefully because they apply cluster-wide.
One way that privileges could be escalated is if a non-privileged user (any member of the system:authenticated group) gains access to the service account token used by the Operator. A common way of reducing this risk is to deploy the Operator in a separate namespace from its Operands, one in which non-privileged users cannot read secrets. If the Operator is deployed in a namespace shared with non-privileged users, those users should not be able to read secrets in that namespace. We recommend that Operators are never deployed in a shared namespace, especially one that non-privileged users can access.
We recommend that code reviews should include looking for RBAC roles that can leverage themselves to gain extra privileges.
Examples to watch out for:
The bind verb can be applied to Roles or ClusterRoles and allows a principal to bypass the general restriction on (cluster) role binding creation. That restriction stops users who can create role bindings from escalating their privileges by binding to high-privilege roles such as cluster-admin. It is described in the Kubernetes documentation: Restrictions on role binding creation or update.
Escalate rights on cluster roles: the escalate verb bypasses the Kubernetes RBAC check that prevents users who can create roles or cluster roles from creating (or editing) these objects so that they grant more rights than the user already holds.
Multiple Roles should be defined to reduce the scope of any actions needed for containers that the Operator may run on the cluster. For example, if you have a component that generates a TLS Secret upon startup, a Role that allows create but not list on Secrets is more secure than using a single all-powerful service account.
If you grant cluster-admin, cluster-admin can itself update/alter SCCs.
If you are allowed to use a given SCC and can also create pods, you can create a new pod that takes advantage of everything the SCC permits.
If you have RBAC that permits editing of RBAC, you can edit your own limits. For example, a rule such as:

    resources:
      - roles
      - rolebindings
    verbs:
      - patch
      - create

would potentially allow the Operator to give unprivileged users permission to access privileged namespaces by granting them the roles when requested. (Note: by default, an Operator can only grant others the permissions that it holds itself.)
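The recommendation above to define multiple, narrowly scoped Roles can be sketched for the TLS Secret case as follows. This is an illustrative fragment, not a prescribed layout: the names tls-bootstrap, tls-secret-creator and example-operands are hypothetical, and the component is assumed to run under its own dedicated service account.

```yaml
# Sketch only: all names below are hypothetical placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tls-bootstrap
  namespace: example-operands
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tls-secret-creator
  namespace: example-operands
rules:
  # Only "create" is granted: the component can write its TLS Secret
  # but cannot list or read other Secrets in the namespace.
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tls-secret-creator
  namespace: example-operands
subjects:
  - kind: ServiceAccount
    name: tls-bootstrap
    namespace: example-operands
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tls-secret-creator
```

Binding the Role to a dedicated service account, rather than to the Operator's own, keeps the component's rights separate from whatever the Operator itself needs.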
Instead of using the wildcard character ("*") in RBAC definitions, it is good practice to explicitly list out each verb or resource. Each item in the list can then more easily be examined to confirm where and how the permissions are needed, or whether they were accidentally grabbed as a convenience during earlier development.
For example, instead of using "*" in the verbs section, list the verbs out in full, such as get, list and watch. If the Operator knows the name of the resource it will edit, its access can be limited to get and update (or patch) on that named resource, and it will frequently not need list.
Being explicit also future-proofs the permissions: if the set of items matched by "*" later grows to include verbs or resources that do not exist today, an explicit list does not silently pick them up.
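A minimal sketch of what an explicit rule set might look like is shown below. The Role name, namespace, ConfigMap name and the choice of resources are hypothetical; the point is only that every verb and resource is spelled out, and that access to a known object is pinned with resourceNames.

```yaml
# Sketch only: names and resource choices are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-operator
  namespace: example-operator-system
rules:
  # The Operator knows the name of the ConfigMap it edits,
  # so it needs neither "list" nor access to other ConfigMaps.
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["example-operator-config"]
    verbs: ["get", "update", "patch"]
  # Explicit verbs instead of "*" for the workloads it manages.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
```

Note that resourceNames cannot restrict list, watch or create requests made without a name, which is one more reason to check whether those verbs are needed at all.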
Using RBAC to define and apply permissions
The relationships between cluster roles, roles, cluster role bindings, role bindings, users, groups and service accounts are illustrated below.
Figure 1: Default cluster roles
Since operators run with a service account in a namespace, anyone with the ability to create workloads in that namespace can escalate to the permissions of the operator. To address these concerns, a notion of scoping operators was introduced via the OperatorGroup object. An OperatorGroup specifies a set of namespaces within a cluster in which all installed operators share the same scope. The Operator Lifecycle Manager (OLM) ensures that only one operator within a namespace owns a particular CRD, to avoid collisions.
The problem is that APIs in a cluster are cluster-scoped. They are visible via discovery to any user who wishes to see them. Even Operators that agree on a particular Group, Version, Kind (GVK) may have differences of opinion on how those objects should be admitted to a cluster, or how conversion between API versions should happen. This increases the likelihood that more than one “opinion” about an API exists in the cluster.
The article Operator Descoping Plan describes this further.
Pod and container securityContext and Security Context Constraints (SCCs)
When containerizing third-party applications, it may sometimes be necessary to bend to the expectations of those applications and run as specific UIDs, perhaps even as root. Operators that are created to be container-native should never assume a particular UID, and should accept the customary “billion+” high UID that the OpenShift cluster assigns to the namespace the Operator runs in.
Set a numeric USER in the Containerfile, so that the image neither defaults to uid=0 nor leaves it unclear whether a named user maps to uid=0.
Use group id permissions to manage shared file permissions instead of user id.
Similarly, the usage of hostPath volumes allows files on the host node to be accessible from the container. If a container is insecurely configured and it is compromised, the attacker could try to attack the host and other containers running on the host.
For hostPath specifically:
Operators should never require host paths unless they comprise part of the control plane itself.
Other deployment recommendations:
readOnlyRootFilesystem -- set to true
Avoid writing any local files to the root filesystem. Write to a mounted volume instead, such as an emptyDir mounted at /tmp. Pay attention to PID files, as well as any log output that isn’t going to STDOUT.
runAsNonRoot -- set to true
This can be set in either the podSpec or containerSpec’s security context. When it is enabled, the container will refuse to run if other circumstances indicate it might run with uid=0.
automountServiceAccountToken -- set to false
By default, the service account token is mounted as a file within the container. Operators generally need their service account token in order to function, so for the Operator’s own pod this stays true. However, any pods that the Operator creates may benefit from the added protection of setting it to false.
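Taken together, the deployment recommendations above might look like the following pod definition for a workload that an Operator creates. This is a sketch under assumptions: the pod name, namespace and image reference are hypothetical, and the workload is assumed not to need the Kubernetes API.

```yaml
# Sketch only: name, namespace and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-operand
  namespace: example-operands
spec:
  # The workload does not call the Kubernetes API, so no token is mounted.
  automountServiceAccountToken: false
  securityContext:
    # Refuse to start if the container would run with uid=0.
    runAsNonRoot: true
  containers:
    - name: app
      image: registry.example.com/example/operand:1.0
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        # Writable scratch space, since the root filesystem is read-only.
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}
```

No runAsUser is set, so on OpenShift the pod accepts the high UID assigned to its namespace.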
OpenShift Security context constraints (SCC) are gatekeepers that will limit which pods can be admitted to the cluster. Since the Operator process also runs as a pod in a cluster, you can use the same concepts to enhance the security posture of your Operator container as well.
The Udica tool was created to simplify the creation of custom SELinux policies that can then be tied to custom SCCs.
Continuous security scans
Continuous scanning helps to identify vulnerabilities and pick up the latest security bug fixes in Go, Kubernetes and the Operator container’s base image.
For the container images that are running in OpenShift and are pulled from Red Hat Quay registries, you can use an Operator to list the vulnerabilities of those images. The Container Security Operator can be added to OpenShift to provide vulnerability reporting for images added to selected namespaces.
Container image scanning for Red Hat Quay is performed by the Clair security scanner. In Red Hat Quay, Clair can search for and report vulnerabilities in images built from RHEL, CentOS, Oracle, Alpine, Debian and Ubuntu operating system software.
Where should an Operator run?
An Operator should run in a location appropriate to what it does. An Operator that comprises part of the control plane can be scheduled to run on control plane nodes using tolerations.
Even if Operators and Operands are split between namespaces, because the Operator itself may have a highly privileged service account to perform its kube API interaction, if it runs on a worker node any compromise of that worker node may reveal the Service Account credentials. Separation of workloads by node, as well as by namespace, is therefore beneficial.
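For a control plane Operator, node separation could be sketched with the following fragment of a Deployment spec. A toleration only permits scheduling onto tainted nodes, so a node selector is paired with it here to actually place the pod there. The label and taint keys shown are assumptions that vary between cluster versions and should be checked against the nodes in your cluster.

```yaml
# Sketch only: fragment of a Deployment spec, not a complete manifest.
# The node-role keys are assumptions; verify them on your cluster.
spec:
  template:
    spec:
      # Place the Operator pod on control plane nodes only.
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      # Allow scheduling despite the control plane NoSchedule taints.
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
```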
About the authors
Dave Baker has been with Red Hat since 2017. He's currently working as a Design Architect in the Secure Engineering team within Product Security, and has spent the last years in various security related roles helping to protect Red Hat OpenShift Container Platform and many other products.
Florencio has had cybersecurity in his veins since he was a kid. He started in cybersecurity around 1998 (time flies!) first as a hobby and then professionally. His first job required him to develop a host-based intrusion detection system in Python and for Linux for a research group in his university. Between 2008 and 2015 he had his own startup, which offered cybersecurity consulting services. He was CISO and head of security of a big retail company in Spain (more than 100k RHEL devices, including POS systems). Since 2020, he has worked at Red Hat as a Product Security Engineer and Architect.