In the previous blog posts, we looked at the evolution of our automation and tooling, how we use it to evaluate performance and push the limits of OpenShift at large scale, and highlights from the recent scale test runs (OpenShift Scale-CI, part 1: Evolution; OpenShift Scale-CI, part 2: Deep Dive; and OpenShift Scale-CI, part 3: OCP 4.1 and 4.2 Scale Run Highlights). In this blog post, we will look at potential problems with that approach and our journey towards making the automation and tooling intelligent enough to overcome them.
Problem with Current Automation and Tooling
The automation and tooling are not intelligent enough to handle failures that lead to system degradation. When using them to run tests, especially ones focused on performance, scalability, and chaos against distributed systems like Kubernetes/OpenShift clusters, the system or application components might start degrading: nodes can fail, and system components such as the API server, Etcd, and SDN can break. When this happens, the CI pipeline/automation orchestrating the workloads/test cases cannot stop the execution, signal the cluster admin, or skip the remaining test cases. That ability is needed even if the cluster is still functioning, which it often is because of its high-availability and self-healing design. This leads to:
- Inaccurate results.
- Loss of time (for both clusters and engineers), which gets very expensive for large-scale clusters ranging from 250 to 2,000 nodes.
- Clusters ending up in an unrecoverable state because the automation did not stop applying the test load when cluster health started to degrade.
Today, human monitoring is necessary to understand the situation and stop the test automation/CI pipeline so the issue can be fixed, which is not really feasible when running tests against multiple clusters with different parameters. Our team took shifts monitoring the clusters to make sure everything was in order during the scale test runs, which put a lot of stress on the control plane. One of the goals behind building the CI pipeline/automation was to let engineers focus on writing new tools and test cases while the pipeline keeps the hardware busy, continuously churning out data from performance and scalability test runs. So, how can we solve the problem of automation and tooling not taking system degradation into account?
Cerberus to the Rescue
We built a tool called Cerberus to address the problem of automation and tooling not being able to react to system degradation. Cerberus watches Kubernetes/OpenShift clusters for dead nodes and system component failures and exposes a go/no-go signal, which can be consumed by workload generators like Ripsaw, CI pipelines/automation like the Scale-CI pipeline, or any application in the cluster, so that they can act accordingly.
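To give a sense of how a workload generator or pipeline might consume the signal, here is a minimal sketch in Python. It assumes Cerberus publishes its go/no-go signal over a simple HTTP endpoint on port 8080 and returns "True" or "False" in the response body; the address, response format, and helper names are illustrative assumptions, so check the Cerberus documentation for how the signal is exposed in your setup.

```python
import sys
import time

import requests

# Assumption: Cerberus serves the go/no-go signal over HTTP on port 8080.
# Treat this as an illustrative sketch rather than a reference client.
CERBERUS_URL = "http://0.0.0.0:8080"


def cluster_is_healthy() -> bool:
    """Return True if Cerberus currently reports a 'go' signal."""
    try:
        signal = requests.get(CERBERUS_URL, timeout=5).text.strip()
    except requests.RequestException:
        # If Cerberus itself is unreachable, play it safe and report no-go.
        return False
    return signal == "True"


def run_workload_step(step: int) -> None:
    # Placeholder for a real workload/test case (e.g. a Ripsaw benchmark).
    print(f"running workload step {step}")
    time.sleep(1)


if __name__ == "__main__":
    for step in range(10):
        if not cluster_is_healthy():
            print("Cerberus reported no-go: stopping the run for investigation")
            sys.exit(1)
        run_workload_step(step)
```

The key point is that the workload checks the signal before each step and bails out instead of continuing to hammer a degrading cluster.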
What Components Can Cerberus Monitor?
It supports watching/monitoring:
- Nodes' health
- System components and pods deployed in any namespace specified in the config
System components are watched by default, as they are critical for keeping Kubernetes/OpenShift clusters operational. Cerberus can be used to monitor application pods as well.
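For intuition, here is a simplified sketch of the kind of checks this involves, written with the kubernetes Python client. It is purely illustrative and not Cerberus's actual implementation; the watched namespaces below are just examples of what might be listed in the config.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (works the same against OpenShift).
config.load_kube_config()
v1 = client.CoreV1Api()

# Illustrative namespaces; in Cerberus these come from its config file.
WATCH_NAMESPACES = ["openshift-etcd", "openshift-apiserver"]


def nodes_healthy() -> bool:
    """Every node must report a Ready condition with status 'True'."""
    for node in v1.list_node().items:
        ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
        if ready is None or ready.status != "True":
            return False
    return True


def namespace_healthy(namespace: str) -> bool:
    """All pods in the namespace must be Running or Succeeded."""
    for pod in v1.list_namespaced_pod(namespace).items:
        if pod.status.phase not in ("Running", "Succeeded"):
            return False
    return True


go_signal = nodes_healthy() and all(namespace_healthy(ns) for ns in WATCH_NAMESPACES)
print("go" if go_signal else "no-go")
```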
Daemon Mode vs Iterations
Cerberus can be run in two modes: daemon and iterations. When running in daemon mode, which is the default, it keeps monitoring the cluster until the user interrupts it. It has a tuning option where the wait duration before starting each watch/iteration can be specified. This is key, as setting it to a low value increases the number of requests to the API server and can overload it, so it is important to tune it appropriately, especially on a large-scale cluster.
In iterations mode, it runs for the specified number of iterations and exits when done.
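To make the two modes concrete, here is a hypothetical sketch of the control loop. The option and function names are made up for illustration and do not correspond to Cerberus's actual config keys or code.

```python
import time

# Illustrative settings; in Cerberus these come from its config file
# (the names here are hypothetical, not the tool's actual keys).
DAEMON_MODE = True      # run forever vs. a fixed number of iterations
ITERATIONS = 5          # only used when DAEMON_MODE is False
SLEEP_SECONDS = 60      # wait between watches; too low a value hammers the API server


def run_watch_cycle() -> None:
    """Placeholder for one watch: check nodes and watched namespaces,
    then publish the aggregated go/no-go signal."""
    print("checking node and component health...")


def main() -> None:
    iteration = 0
    while DAEMON_MODE or iteration < ITERATIONS:
        run_watch_cycle()
        iteration += 1
        # Sleeping between iterations keeps the extra load on the API server
        # low, which matters on large clusters.
        time.sleep(SLEEP_SECONDS)


if __name__ == "__main__":
    main()
```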
Demo
Cerberus can be run using Python or as a container on a host with access to the Kubernetes/OpenShift cluster, as documented. Here is a short demo showing the functionality:
Notifications on Failures
Automation and tools consuming the signal exposed by Cerberus and acting accordingly is just one side of the story. It is also important to notify the team/cluster admin to take a look at the cluster, analyze it, and fix the issue when something goes wrong. Cerberus has support for Slack integration; when enabled, it posts a Slack message with the cluster's API endpoint, to identify which cluster is affected, and the failures it found.
Tagging everyone, or sending a message without tagging anyone in particular, makes it unclear who should take charge of fixing the issue and re-kicking the workload. The Cerberus cop feature addresses this: a cop can be assigned over the week in the config, and Cerberus tags only the person currently on cop duty in the channel.
Report/Diagnosis
We have looked at how Cerberus aggregates failures and exposes a go/no-go signal, as well as how it notifies the user. Does it also generate a report or collect data on the failures? Yes, it does: it generates a report with details about each watch in every iteration. It can also provide more information on a failure by inspecting the failed component, collecting its logs, events, and so on, when the inspect-component mode is enabled in the config.
Use Cases
There are a number of potential use cases. Here are a couple of them for which we are already using Cerberus:
- We run tools that push the limits of Kubernetes/OpenShift to evaluate performance and scalability. There have been a number of instances where system components or nodes started to degrade, which invalidated the results while the workload generator kept pushing the cluster until it became unrecoverable. The go/no-go signal exposed by Cerberus is used here to stop the test run and alert us about the system degradation.
- When running chaos experiments on a Kubernetes/OpenShift cluster, they can potentially break components unrelated to the targeted component, and the chaos experiment itself won't detect that. The go/no-go signal is used here to decide whether the cluster recovered from the failure injection and whether to continue with the next chaos scenario, as sketched below.
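As an example of the second use case, a chaos driver could gate each scenario on the Cerberus signal. This is a rough sketch that reuses the same assumed HTTP endpoint as the earlier example; the scenario names and the injection function are hypothetical placeholders, not part of any real chaos tool.

```python
import time

import requests

CERBERUS_URL = "http://0.0.0.0:8080"  # assumed default; adjust to your deployment

# Hypothetical chaos scenarios; a real run would invoke actual failure injections.
SCENARIOS = ["kill-etcd-pod", "drain-worker-node", "block-sdn-traffic"]
RECOVERY_WAIT_SECONDS = 120


def cerberus_says_go() -> bool:
    """True when Cerberus reports the cluster is healthy."""
    try:
        return requests.get(CERBERUS_URL, timeout=5).text.strip() == "True"
    except requests.RequestException:
        return False


def inject_failure(scenario: str) -> None:
    print(f"injecting failure: {scenario}")  # placeholder for a real injection


for scenario in SCENARIOS:
    inject_failure(scenario)
    # Give the cluster time to self-heal, then ask Cerberus whether it recovered.
    time.sleep(RECOVERY_WAIT_SECONDS)
    if not cerberus_says_go():
        print(f"cluster did not recover after '{scenario}', halting the chaos run")
        break
```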
We, in the OpenShift group at Red Hat, are planning to enable Cerberus in the upcoming scale test runs on clusters ranging from 250 to 2,000 nodes. It's going to be interesting to see how well it scales.
Stay tuned for updates on more tooling and automation enhancements, as well as highlights from OpenShift 4.x large-scale test runs. Any feedback is appreciated, and as always, feel free to create issues and enhancement requests on GitHub or reach out to us in the sig-scalability channel on the Kubernetes Slack.
About the author
Naga Ravi Chaitanya Elluri leads the Chaos Engineering efforts at Red Hat with a focus on improving the resilience, performance and scalability of Kubernetes and making sure the platform and the applications running on it perform well under turbulent conditions. His interest lies in the cloud and distributed computing space and he has contributed to various open source projects.