Towards the end of 2019, OpenShift Dedicated site reliability engineers (SREs) on the SRE-Platform (SREP) team had a problem ahead of a new feature release: Kubernetes role-based access control (RBAC) wasn't working. Or, rather, it wasn't working for us. RBAC wants to work in broad situations: A user with this role can do these types of actions on those kinds of objects. There's no easy way to say "except on these specific objects," and we needed a way to do exactly that. Why is RBAC good enough for most people but not us?
The SREP team operates the platform for Red Hat's OpenShift Dedicated product. That includes placing guardrails around core OpenShift components so that users can't accidentally interfere with the cluster's operations. While our users can expect to create their own namespaces, we don't want core namespaces (like kube-system) interfered with. RBAC rules can say that users may create and delete namespaces, but granting that access would also allow them to delete kube-system, which would be problematic.
The problem extends beyond namespaces to other objects, like the users, groups, and identity provider configurations used by engineers to respond to issues on a cluster. It's essential to keep this trio of objects secured to prevent tampering. But we also want users to manage their own users, groups, and identity provider configurations. Kubernetes RBAC leaves us a large gap to fill. We need to support a product that has guardrails around sensitive in-cluster objects while also enabling our users to manage their own objects.
Thankfully, Kubernetes lets us have it both ways: RBAC for the wide strokes and a mechanism called dynamic admission controllers for the finer controls that are difficult or cumbersome to express with traditional RBAC rules. These admission controllers are handled by a webhook that decides what to do with the request.
Creating a stopgap
I was tasked with creating a webhook to block customer access to SRE-managed, in-cluster groups. We anticipated that this stopgap solution would give us time to evaluate a more permanent solution.
Thankfully Kubernetes allows administrators to provide their own custom dynamic admission controllers for this purpose. Once RBAC permits the user to perform the request, the API server makes a call to the specified webhooks to decide if the request is allowable. The webhook call includes a payload containing the user's information (user and group memberships) and a representation of the object in question. Of the two types of available dynamic admission controllers, we rely heavily on the ValidatingWebhookConfiguration type.
After a brief development period, I created a framework with Python and Flask to support the group's use case and other webhooks. However, we soon had many more webhooks to add, which took this simple framework from easy to use to a painful, error-prone endeavor to maintain. Early design choices that worked for one or two webhooks didn't scale.
While the Python framework worked well, interacting with it was painful. Creating a new webhook meant the developer had to create several YAML files, modify an existing Python file, and add the new webhook Python file. The initial library of "common helper functions" wasn't common and had to be changed.
On the face of it, those are not difficult tasks, but human nature is to use other, similar works as a basis for new work. Engineers copied and pasted existing YAML files and source code, forgetting to validate the important changes. The information that described webhooks—their name, business logic, and configuration—was split between the YAML and the Python code. Forgetting to update the list of active webhooks was a source of error.
Webhooks must also execute as quickly as possible because they stand in the way of the API server accepting the user's changes. Even slight delays will be felt—especially with frequent, automated changes. This is because changes to common types of objects (like namespaces) trigger the webhook each time, and any slowdown results in severe API server performance problems.
Testing was not in the initial design. We later decided to add tests to gain confidence in our webhooks. Testing was difficult because of the poor initial design patterns that all subsequent webhooks inherited.
In late spring 2020, my frustration with the Python codebase and tangled mess of YAML boiled over, and I ported the entire thing into Golang, determined to make better choices this time. Going into the rewrite, I didn't want anyone to have to write or modify YAML files. I wanted to make a framework to solve the general problem of how SREP writes and manages ValidatingWebhookConfiguration webhooks by automating as much as possible. For these reasons I designed a Golang interface to provide an easy way for webhook authors to write and register their webhooks into the framework.
Registering a webhook is done with a single file per webhook (this Namespace webhook is an example), leveraging Golang's init function behavior.
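The registration pattern can be sketched roughly as follows. This is an illustration of the init-based approach rather than our actual code: the Webhook interface is trimmed to one method, and the names Register, WebhookFactory, and namespaceWebhook are assumptions made for the example.

```go
package main

import "fmt"

// Webhook is a hypothetical, trimmed-down interface for this sketch.
type Webhook interface {
	Name() string
}

// WebhookFactory builds a fresh webhook instance on demand.
type WebhookFactory func() Webhook

// registry holds every factory added by an init function. Package-level
// variables are initialized before any init function runs.
var registry = map[string]WebhookFactory{}

// Register adds a factory to the registry under the given name.
func Register(name string, factory WebhookFactory) {
	registry[name] = factory
}

// namespaceWebhook is a stand-in for a real webhook implementation.
type namespaceWebhook struct{}

func (namespaceWebhook) Name() string { return "namespace-validation" }

// In a real layout this init would sit alone in the webhook's registration
// file. Go runs every init function before main, so including the file in
// the build is enough to register the webhook.
func init() {
	Register("namespace-validation", func() Webhook { return namespaceWebhook{} })
}

func main() {
	for name, factory := range registry {
		fmt.Println("registered:", name, "->", factory().Name())
	}
}
```

Because registration happens as a side effect of compiling the file in, there is no central list of active webhooks to forget to update.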
The initial porting was fairly literal, copying each Python file more or less directly. Because the Python webhooks handled HTTP requests, so did the new Golang webhooks. The initial "entrypoint" into each webhook was an unfortunate method, HandleRequest(http.ResponseWriter, *http.Request), that coupled each webhook to HTTP request handling. Still, it was a conscious decision to defer refactoring the Python baggage to gain the advantage of being in Golang.
Now the "entrypoint" allows webhooks to be much more focused on what it means to handle the API server's request. In this way, the webhooks don't need to care about net/http so long as they're given an admissionctl.Request object. After refactoring to remove the HandleRequest:
import admissionctl "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
Authorized(request admissionctl.Request) admissionctl.Response
The Golang interface primarily allows the webhooks to register themselves so that framework code can run a webserver and accept incoming requests from the API server. However, that's not the only value the interface provides.
My programming background is rooted in Ruby, and one of Ruby's central concepts is that you can "ask" a question of an object (for example, obj.nil? to return true or false depending on whether obj has a nil value) by sending "messages" to objects. I find it very natural to think in those same patterns. So when it came time to construct an interface, I thought about what questions a framework might want to "ask" the webhooks. In other words, webhooks ought to have the capability to answer questions about themselves so that other components of the framework can interact in different and meaningful ways.
Saying goodbye to manual YAML
The new Golang interface, which is central to the Golang rewrite, supports the goal of removing the need for engineers to write YAML by hand. We still need YAML, but all the better if the framework can write it for us.
In Kubernetes and OpenShift, each webhook needs a ValidatingWebhookConfiguration object to instruct the API server how and when to access the webhook.
The Golang interface requires each webhook to know its name, the Uniform Resource Identifier (URI) to which requests should be sent, what it should handle (like namespaces and groups), what should happen if the API server cannot access the webhook, and other useful facts.
I wrote a small program that ties into the same webhook registration to "ask" for "answers" needed to create the ValidatingWebhookConfiguration object to use in the cluster. No more manual YAML editing, just run a program.
Increasing interface value
The same interface we use to prevent writing YAML serves another purpose within the framework: duplication checking. Suppose someone accidentally adds two webhooks that want the same request URI. In that case, the framework will panic and disallow it. These issues still come up; human nature doesn't avoid copy and paste just because we're writing Golang instead of Python.
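A duplicate check of this kind can be as small as the sketch below. The Register function, the byURI map, and the hook type are hypothetical names, and tryRegister exists only so the demonstration can observe the panic without crashing.

```go
package main

import "fmt"

// Webhook is trimmed to the single question the duplicate check asks.
type Webhook interface {
	URI() string
}

// hook is a stand-in webhook that answers with a fixed URI.
type hook struct{ uri string }

func (h hook) URI() string { return h.uri }

// byURI tracks which webhook has claimed each request URI.
var byURI = map[string]Webhook{}

// Register panics when a second webhook claims an already registered URI,
// so the mistake surfaces at startup instead of as misrouted requests.
func Register(w Webhook) {
	if _, taken := byURI[w.URI()]; taken {
		panic(fmt.Sprintf("duplicate webhook URI %q", w.URI()))
	}
	byURI[w.URI()] = w
}

// tryRegister reports whether Register panicked. It is a demo helper only.
func tryRegister(w Webhook) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	Register(w)
	return false
}

func main() {
	fmt.Println("first registration panicked:", tryRegister(hook{uri: "/namespace-validation"}))
	fmt.Println("copy-pasted duplicate panicked:", tryRegister(hook{uri: "/namespace-validation"}))
}
```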
As we learned more about the new Golang implementation, we discovered that there was more we wanted to ask of our webhooks. What if we could help with documentation around each webhook's purpose? In consultation with the docs team, I expanded the interface to include a Doc() string method, which is in turn used by another small program to write documentation. The documentation is written in JSON format so that other processes can consume it and produce more human-readable formats.
The SREP team is primarily a Golang-centric team, and we have the majority of our testing experience within the realm of Golang. For that reason, the port of the Python-based webhooks to Golang represented a leap forward in testing options. The current state of unit testing within the framework is functional with good coverage, but there's room for a better developer experience to clean up some repetitive patterns.
Cleaning up the webhooks
Forcing each webhook to be self-contained makes deleting obsolete webhooks easy. SREP used to have three extra webhooks that safeguarded specific User, Group, and Identity SRE-related objects so that SREs could reliably access and manage user clusters. Each SRE had a User on the clusters with Group memberships associated with specific RBAC roles. SREs logged in to user clusters using an SRE-only identity provider. These in-cluster objects needed to be secured to protect the login chain.
Long after creating the webhooks and porting them to Golang, the way SREs managed clusters fundamentally changed by moving away from in-cluster users, groups, and identity providers. The early design pattern to have webhooks as self-contained as possible meant removing the obsolete trio of webhooks was no more challenging than removing the webhook directories and their registration files. Regenerating the YAML happens with our utility, and once regeneration finishes, the webhooks are no longer present, thanks to our cluster management method.
Recognizing options and limitations
Our new methods aren't perfect; there are still some limitations.
SREP webhooks are all of the ValidatingWebhookConfiguration type, and they are the only type of webhook supported by the framework described in this article. MutatingWebhookConfiguration is not supported. Mutating webhooks will change the object representation submitted to the API server, and that is not a workflow the team needs.
There are no authentication controls in the webhook framework, as there might be with other controllers deployed in a cluster. When we designed this, there did not seem to be a need to authenticate the API server. So, for example, anyone who can reach the service may submit an AdmissionReview object to it. However, communications between the API server and the webhook service are encrypted to help protect sensitive material in the payload.
This webhook framework is certainly not the only way to accomplish the goal of dynamic admission control. We looked at other available options.
- Open Policy Agent: One of the other options we considered was Open Policy Agent, which can perform this kind of admission control. Ultimately, we did not adopt it for reasons we don't recall.
- Other frameworks: There are other frameworks similar to this one. Unfortunately, they weren't suitable for us because they required much more work to integrate, and some required the authentication piece.
It's been more than two years since we initially conceived of this framework and nearing two years since we converted to Golang. This means there are undoubtedly other options now. None are bad, and none are better than what we devised.
Breathing in new life
Near the end of 2019, SREP had a challenge and we needed a rapid solution. Python lived up to the promise of being a rapid development language, letting us quickly address the challenges while giving us time to investigate more permanent solutions.
Moving to Golang breathed new life into the hard-to-manage Python and YAML tangle. Leaning into Golang's native features allowed engineers to focus on the business of writing webhooks instead of worrying about YAML and processing HTTP requests.