
How to maintain Kubernetes at scale: 4 strategies

Red Hat's OpenShift site reliability engineers explain how they maintain a large fleet of clusters while meeting users' needs.

With the recent shift from monolithic applications to microservices and containers, more and more companies and engineers are adopting Kubernetes (or Kubernetes-based products) as their application platforms. However, Kubernetes has a steep learning curve, so it's no surprise that technical stakeholders from various roles and backgrounds are eager for guidance on how to manage Kubernetes successfully at scale.

During DevConf US 2021, a team of site reliability engineers (SREs) who manage Red Hat's hosted OpenShift offerings formed an "Ask the Experts" panel to talk about how they maintain a large fleet of clusters running OpenShift, Red Hat's Kubernetes platform. Four strategies came up repeatedly during the panelists' discussion: standardization, cross-team collaboration, user enablement, and security.


Standardization

When it comes to managing large fleets of Kubernetes clusters, standardization is arguably the most powerful tool in an SRE's arsenal. Just as system administrators need a base level of similarity across the systems in their server rooms, the SREs managing Red Hat's hosted OpenShift fleet hold our clusters to a standardized template so that we know when something is amiss and how to bring clusters back into alignment.

For Red Hat's OpenShift SREs, this starts with an opinionated installer and a suite of operators that we build, deploy, and maintain on top of the base OpenShift product to keep each cluster in constant alignment with the standard. Although the clusters aren't necessarily all on the same version of Kubernetes or OpenShift, we use standardized operators across the fleet to watch for changes to certain resources and to alert us when resources or clusters need to be brought back into alignment. To ensure that each of these operators is the same version across the fleet, we use the Operator Lifecycle Manager (OLM) shipped with OpenShift to manage operator versions. We even have an operator that manages cluster upgrades in an automated way, so each cluster is held to the standardized template throughout its entire lifecycle.
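
As a sketch of what this looks like in practice, an OLM Subscription pins an operator to an update channel and lets OLM keep its version current. The operator name and catalog source below are hypothetical illustrations, not Red Hat's actual configuration:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: fleet-alignment-operator      # hypothetical operator name
  namespace: openshift-operators
spec:
  channel: stable                     # track the "stable" update channel
  name: fleet-alignment-operator
  source: redhat-operators            # catalog the operator is published to
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic      # let OLM apply updates without manual approval
```

With `installPlanApproval: Automatic`, OLM upgrades the operator as new versions land in the channel, which is one way to keep every cluster on the same operator version without manual intervention.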


This level of standardization involves many stakeholders and support from various teams, which brings me to the next tool we use to manage Kubernetes at scale.

Cross-team collaboration

Because deploying a complicated product like Kubernetes or OpenShift on such a large scale is an intimidating feat, it is imperative that our teams be highly collaborative, communicate constantly, and work continuously toward the common goal of managing Kubernetes at scale.

At Red Hat, the SRE teams are just one piece of the puzzle. While the SREs are on the front line of managing and maintaining the clusters, our success would be impossible without the dozens of teams that work on things including:

  • Developing the base product
  • Maintaining the user interfaces (UIs), their backends, and the command-line interfaces (CLIs) that users rely on to interact with their clusters
  • Building the opinionated installer that deploys the clusters
  • Creating tooling to see and manage various aspects of the fleet in one place
  • Answering the questions and resolving the issues that our users bring to us

In addition to the numerous engineers working on those tasks, other Red Hat teams (including our own) help by consuming the product. Many services that our SRE teams (and other Red Hat teams) provide run on clusters we manage. This "dog-fooding" of the product is vital to our success, as it enables us to find and resolve issues long before they make it into the product that our external users and customers see and use.


Having such a large number of internal teams and external companies using our product every day brings me to our next tip: user enablement.

User enablement

Although the most important strategy for Red Hat's SREs is standardization, at the end of the day we are providing a service for various internal and external users, each with different needs, and we want to ensure we're meeting those needs. We receive numerous requests to build new features or vet new operators that users want to deploy to their clusters, and we accommodate those requests whenever possible. We also enable our users to choose when they want to upgrade their clusters and to which version.

We also work persistently to broaden our users' permissions within their clusters. Internal teams get access constrained to their use cases, and external users can have up to full administrative privileges on their clusters so they can use them to meet their needs. While giving users full administrative privileges might appear to make standardization difficult, we try to enable our users in a way that preserves constant cluster standardization. To accomplish this, we talk regularly with our users about their use cases and how we can enable them without introducing problems into our fleet.
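
As an illustration of constrained access for an internal team, a namespace-scoped Role and RoleBinding like the following grant edit rights on common workload resources without handing out cluster-wide privileges. The role, namespace, and group names here are hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-app-editor            # hypothetical role name
  namespace: team-apps             # hypothetical team namespace
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-app-editor-binding
  namespace: team-apps
subjects:
  - kind: Group
    name: internal-team-a          # hypothetical identity-provider group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-app-editor
  apiGroup: rbac.authorization.k8s.io
```

Because the role is scoped to a single namespace, the team can manage its own workloads while the cluster-wide resources that define the standardized template stay under SRE control.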

Of course, no conversation about user enablement could be complete without a discussion of one of the most critical topics for all software deployments: security.


Security

Our users require that our clusters implement various security measures, and those measures help inform each of the strategies listed above. We have built thoughtful internal processes around vulnerability mitigation. In the spirit of cross-team collaboration, we maintain constant, open communication with Red Hat's Product Security (ProdSec) teams, and we have a subteam within the Red Hat SRE team that oversees security concerns from engineers and users, maintains relationships with the ProdSec teams, and keeps up our security compliance certifications.

Additionally, each SRE accesses each cluster using the same method, providing standardization and security around SRE access. Furthermore, access to clusters and to various resources on the clusters is audited, and audit logs are shipped off-cluster in real time. Auditing is one of our tools for bringing clusters back into alignment with our standard, and it also helps us start conversations with our users about enabling their use cases while maintaining that standard. In addition to the standard security measures deployed across all clusters, we also help users customize their clusters with their own security measures, such as options to keep clusters private or hidden behind their company's VPN.
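
As a hedged sketch of the kind of audit configuration this implies (the exact rules we use aren't spelled out here), a Kubernetes audit policy can record full request and response bodies for sensitive RBAC changes while logging only metadata for access to secrets:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Capture full request and response for RBAC changes,
  # which affect cluster standardization and access control.
  - level: RequestResponse
    resources:
      - group: rbac.authorization.k8s.io
        resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
  # Record only metadata (who, what, when) for secret access,
  # so secret contents never land in the audit log.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
```

Logs written under a policy like this can then be forwarded off-cluster in real time by a log-forwarding component, keeping the audit trail intact even if a cluster is compromised.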

Learn more

These four strategies (standardization, cross-team collaboration, user enablement, and security) enable Red Hat's SREs to overcome many of the challenges of managing Kubernetes at scale. Each strategy must work in tandem with the others to provide a robust solution that serves a large number of users and meets their differing needs.

For more discussion on these topics, please see the recording of the DevConf panel, Managing Kubernetes at Scale.

DevConf is a free, Red Hat-sponsored technology conference for community projects and professional contributors to free and open source technologies. Check out the conference schedule to find other presentations that interest you, and access the YouTube playlist to watch them on demand.


Candace Sheremeta

Candace Sheremeta is a teacher-turned-software-engineer who works as a Senior Site Reliability Engineer for the managed OpenShift offerings at Red Hat. She has spent the last five years working in the Kubernetes space and aspires to move into people management.
