Why use operators?
At a high level, the Kubernetes Operator pattern makes it easy to run complex software at scale. With the DevOps movement, we learned to manage and monitor complex applications and infrastructure from centralized platforms (Chef Server, Puppet Enterprise, Ansible Tower, Nagios, etc.). This centralized monitoring and automated remediation work well with relatively stable infrastructure components like bare metal servers, virtual machines, and network devices. However, containers change much more quickly than traditional infrastructure and traditional applications. One might say this speed is a tenet of Cloud Native behavior.
The ephemeral nature of containerized applications drives an immense amount of state change. In turn, this rapid state change challenges the ability of centralized monitoring and automation to keep up. Kubernetes can literally change the state of containers every few seconds. The solution is to bring the automation close to the applications - to deploy it within Kubernetes so that it has direct access to the application’s state at all times. The Kubernetes Operator pattern allows us to do just this - deploy automation side by side with the containerized application.
I often describe the Operator pattern as deploying a robot sysadmin next to the containerized application. To truly understand how Operators work, though, we need to dive a bit deeper into the history of running services and the art of making these services resilient. From here on, we will refer to this as operational knowledge, or operational excellence.
Traditionally, when new software was deployed, we also deployed a real, human Sysadmin to care for and feed the application. This care and feeding included tasks like installation, upgrades, backups, restores, troubleshooting, and return to service. If a service failed, we paged this Sysadmin, who logged into a server, troubleshot the application, and fixed what was broken. To track this work, they documented their progress in a ticketing system.
As we moved into the world of containers (about six years ago at the time of this writing), we updated the packaging format of the application, but we continued to deploy a real human Sysadmin to care for and feed the application. This simplified installation and upgrades, but did little for data backups/restores and break/fix work.
With the advent of the Kubernetes Operator pattern, we deploy the software application in a container and we also deploy a robot Sysadmin in the same environment to care for and feed the application. We refer to these robot Sysadmins as Operators. Not only can they perform installation, upgrades, backups, and restores, but they can also perform more complex tasks like recovering a database when the tables become corrupted. This takes us into another world of automation and reliability.
But why does this work philosophically? You might still struggle to understand how this is different from traditional clustering, or from having a monitoring system participate in “self-healing” (an old buzzword now). Let’s explain...
A brief history of operational knowledge
Operators embed operational knowledge, or operational excellence, in the infrastructure, close to the service. To better explain this, let’s walk through the history of operating servers and services.
- Operational Excellence 1.0 - One computer, multiple administrators. In years past, before most users of containers remember, there were large computers cared for, managed, and operated by multiple systems administrators. This remained true all the way into the late 1990s with large Mainframes and Unix systems managed by multiple systems administrators. That’s right, there were multiple human beings assigned to one computer. These systems administrators automated operations on these systems, fixed services if they were broken, added/removed users, and even wrote device drivers themselves. The administrators were highly technical, and would be considered software engineers by modern definitions. They performed all software-related tasks, maintaining uptime, reliability, and operational excellence. The cost was high, but the quality was also very high.
- Operational Excellence 2.0 - One administrator, multiple computers. With the advent of Linux, and perhaps more so Windows, the number of servers outgrew the number of administrators. There were fewer and fewer large, multi-user systems. At the same time, there were more and more single-service systems - DNS Servers, Web Servers, Mail Servers, etc. A single administrator could manage multiple servers remotely, and we often measured productivity by the number of servers per administrator. Administrators still retained intimate knowledge of each service they managed, sometimes using Runbooks to document common tasks. If a service or server failed, an administrator would work on it remotely, thereby maintaining a high level of quality while being responsible for a higher total number of servers.
- Operational Excellence 3.0 - One service, multiple computers. As a sort of operational dead end, there was a renewed focus on the quality of the service by leveraging clustering and automatic recovery on cheaper hardware. Databases, web servers, NFS, DNS, Samba servers, and more were clustered for resilience. Tools like Veritas Clustering, Veritas File System, Pacemaker, GFS, and GPFS became popular. For this to work properly, operational knowledge of how to start and recover a service had to be configured in the clustering software. This required a solid understanding of how the service worked (detection of failures, how to restart, etc.). With clustering software and configuration, nodes could be treated as capacity (N+1 means one extra server, N+2 means two extra servers). Bringing operational knowledge close to the service allowed for automated recovery, but building new services or clusters, much less decommissioning them, could take days or even weeks because each service had to be designed and maintained separately.
- Operational Excellence 4.0 - With this iteration, we moved the logic for recovering services back away from the service itself, and put it in the monitoring systems and load balancers. We embedded configuration in DNS, and started to use configuration management to maintain things. This created a fundamental tension in a lot of IT organizations. Server administrators would embed the logic for recovering services in the monitoring system. For example, if the monitoring system saw a problem with a web server it could ssh into the server and restart Apache. There were several major challenges with this paradigm. First, having configuration in many different places created a lot of complexity (see also: Why you don't have to be afraid of Kubernetes). Second, storage, network, and virtualization administrators didn’t want automation logging in and provisioning/deprovisioning services, so achieving truly cloud-native architectures was difficult.
- Operational Excellence 5.0 - While centralized monitoring and automation can be abused with containers to achieve a Desired State/Actual State model similar to Kubernetes, it’s not nearly as elegant. Using a manifest (YAML or JSON), Kubernetes enables a powerful desired state model. For example, once an administrator defines that they want three copies of a web server running, Kubernetes will keep exactly three copies running. If one of the containerized web servers dies, Kubernetes will start another one because it sees that the actual state doesn’t match the desired state. This gives you application definition and recovery in one file format. But how do you manage more complex tasks like scanning corrupted database tables, upgrades, schema changes, or rebalancing data between volumes? That’s what Operators do. They move this logic close to the service, bringing the best of Operational Excellence 3.0 and 4.0 together in one place. This same logic can be applied to the Kubernetes platform itself (see also: Red Hat OpenShift Container Platform 4 now defaults to CRI-O as underlying container engine).
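To make the Desired State/Actual State idea a bit more concrete, here is a minimal sketch in Python. It is not how Kubernetes is implemented and it does not talk to a real cluster: the `Cluster` class is a hypothetical in-memory stand-in for the actual state, and `reconcile` is an illustrative name for the compare-and-correct step. The point is only to show the shape of the loop: look at what is running, compare it to what was asked for, and act on the difference.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for the actual state of a cluster."""
    running: List[str] = field(default_factory=list)
    _counter: int = 0

    def start(self, app: str) -> str:
        # Give every new copy a unique name, e.g. web-1, web-2, ...
        self._counter += 1
        name = f"{app}-{self._counter}"
        self.running.append(name)
        return name

    def stop(self, name: str) -> None:
        self.running.remove(name)


def reconcile(cluster: Cluster, app: str, desired_replicas: int) -> List[str]:
    """Compare desired state to actual state and act to close the gap."""
    actions = []
    # Too few copies running: start more.
    while len(cluster.running) < desired_replicas:
        actions.append("started " + cluster.start(app))
    # Too many copies running: stop the most recently started ones.
    while len(cluster.running) > desired_replicas:
        victim = cluster.running[-1]
        cluster.stop(victim)
        actions.append("stopped " + victim)
    return actions


if __name__ == "__main__":
    cluster = Cluster()
    print(reconcile(cluster, "web", 3))  # initial rollout of three copies
    cluster.stop("web-2")                # simulate one web server dying
    print(reconcile(cluster, "web", 3))  # the loop notices and replaces it
```

An Operator can be thought of as running the same kind of loop, with application-specific steps (a backup, a schema change, a table repair) in place of the simple start and stop actions used here.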
Who should build operators
This leads us to an important question: who should be building Operators? The short answer is that it depends.
If you are a developer building Java web services, Python web services, or really any web services, then you shouldn’t have to write an Operator. Your web services should be relatively stateless, and as long as you focus on learning to use readiness and liveness checks, Kubernetes will manage the desired state for you. That said, you might use pre-built Operators to manage a PostgreSQL, MongoDB, or MariaDB instance. Check out all of the services you can consume from Operator Hub. Welcome to the Kubernetes ecosystem.
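For a sense of how small these checks can be, here is a rough sketch of a web service exposing a liveness and a readiness endpoint, using only the Python standard library. The paths `/healthz` and `/readyz`, the port 8080, and the `READY` flag are assumptions chosen for illustration; the actual paths and port are whatever you configure in the probes of your own Pod spec. Kubernetes HTTP probes treat any status code from 200 to 399 as success, so the 503 below reads as "not ready yet".

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag: set to True once startup work (warm-up, connections) is done.
READY = {"value": False}


def probe_response(path: str, ready: bool):
    """Return (status, body) for a probe request on the given path."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200, b"ok"
    if path == "/readyz":
        # Readiness: only accept traffic once startup has finished.
        return (200, b"ready") if ready else (503, b"not ready")
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, READY["value"])
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the logs


if __name__ == "__main__":
    READY["value"] = True  # pretend startup has completed
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Keeping the decision in a plain function like `probe_response` is a design choice for this sketch: it makes the checks easy to test without starting a server.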
If you are an administrator practicing DevOps, and you are building complex services for others to consume, then you may very well need to think about writing Operators. You will almost certainly be consuming Operators and upgrading them, so check out the Operator Lifecycle Manager (built into OpenShift 3.X and 4.X) and the Operator SDK.
If you’re a developer working on a data management, networking, monitoring, storage, security, or GPU solution for Kubernetes/OpenShift, then you are the most likely to need to write an Operator. I would suggest looking at the Operator Framework, Operator SDK, Operator Lifecycle Manager, and Operator Metering. You might also like to look at the Red Hat container certification program called Partner Connect.
As I have said before, Kubernetes isn’t complex, your business problem is. Kubernetes not only redefines the operating system (see: OpenShift is the New Enterprise Linux, and Sorry, Linux. Kubernetes is now the OS that matters), it redefines the entire operational paradigm. Surely, the history of operational excellence could be divided into any number of paradigm shifts, but I have attempted to break it down with a model that is cognitively digestible, to help operations teams, software vendors (ISVs), and even developers better understand why Operators are important.
This fifth generation of operational excellence, using the Kubernetes Operator pattern, brings the automation close to the application, giving it access to state in near real-time. With Operators deployed side by side with applications within Kubernetes, response, management, and recovery can happen at a speed that just isn’t possible with human beings and ticket systems. An added benefit is the ability to provision multiple copies of an application with a single command or, more importantly, to deprovision them. This ability to provision, deprovision, recover, and upgrade is the fundamental difference between cloud-native and traditional applications.
I want to give special thanks to Daniel Riek who presented this concept of Operational Excellence at FOSDEM 20 in Brussels, Belgium last week. If you didn’t have an opportunity to attend his talk, I recommend you watch it when the video goes live. Until then, see this interview with him: How Containers and Kubernetes re-defined the GNU/Linux Operating System. A Greybeard's Worst Nightmare