In the previous blogs, we looked at the evolution of our automation and tooling, how we use them to evaluate performance and push the limits of OpenShift at large scale, and highlights from recent scale test runs (OpenShift Scale-CI, part 1: Evolution, OpenShift Scale-CI, part 2: Deep Dive, and OpenShift Scale-CI, part 3: OCP 4.1 and 4.2 Scale Run Highlights). In this blog post, we will look at the problems with the current approach and our journey toward making the automation and tooling intelligent enough to overcome them.
Problem with Current Automation and Tooling
The automation and tooling are not intelligent enough to handle failures that lead to system degradation. When they run tests, especially performance, scalability, and chaos tests against distributed systems like Kubernetes/OpenShift clusters, system or application components might start degrading: nodes can fail, and system components such as the API server, Etcd, and SDN can break. The CI pipeline/automation orchestrating the workloads/test cases has no way to stop the execution, signal the cluster admin, or skip the remaining test cases when this happens. That ability is needed even when the cluster is still functioning, because its high availability and self-healing design can mask the degradation. This leads to:
- Inaccurate results.
- Loss of time (for both clusters and engineers), which gets very expensive for large-scale clusters ranging from 250 to 2,000 nodes.
- Clusters ending up in an unrecoverable state under the test load, because the automation did not stop when cluster health started to degrade.
Today, human monitoring is necessary to understand the situation, stop the test automation/CI pipeline, and fix the issue, which is not really feasible when running tests against multiple clusters with different parameters. Our team took shifts monitoring the cluster to make sure all was in order during the scale test runs, which put a lot of stress on the control plane. One of the goals behind building the CI pipeline/automation was to let engineers focus on writing new tools and new test cases while the pipeline keeps the hardware busy churning out data from performance and scalability test runs. So, how can we solve this problem of automation and tooling not taking system degradation into account?
Cerberus to the Rescue
We built a tool called Cerberus to address the problem of automation and tooling not being able to react to system degradation. Cerberus watches Kubernetes/OpenShift clusters for dead nodes and system component failures and exposes a go or no-go signal that workload generators like Ripsaw and Scale-CI, CI pipelines/automation like the Scale-CI Pipeline, or any application in the cluster can consume and act on accordingly.
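For example, here is a minimal sketch of how a workload wrapper could consume that signal before each test step, assuming Cerberus publishes it on a local HTTP endpoint; the URL and the step names are placeholders, so adjust them to match your setup:

```python
# Minimal sketch of a workload wrapper that checks the Cerberus go/no-go
# signal before each test step. The endpoint URL is an assumption based on
# a default local setup; adjust it to match your Cerberus configuration.
import sys
import requests

CERBERUS_URL = "http://localhost:8080"  # assumed default; set to your Cerberus endpoint


def cluster_is_healthy() -> bool:
    """Return True only if Cerberus currently reports a 'go' signal."""
    try:
        response = requests.get(CERBERUS_URL, timeout=5)
        return response.text.strip() == "True"
    except requests.RequestException:
        # Treat an unreachable Cerberus as a no-go to stay on the safe side.
        return False


def run_test_step(name: str) -> None:
    """Placeholder for kicking off an actual workload step."""
    print(f"running workload step: {name}")


for step in ["namespace-churn", "pod-density", "networking"]:
    if not cluster_is_healthy():
        print("Cerberus reported no-go; stopping the test run for investigation.")
        sys.exit(1)
    run_test_step(step)
```

The same check can be dropped into any CI stage: if the signal is a no-go, the pipeline stops instead of continuing to load a degraded cluster.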
What Components Can Cerberus Monitor?
It supports watching/monitoring:
- Nodes' health
- System components and pods deployed in any namespace specified in the config
System components are watched by default, as they are critical to the operation of Kubernetes/OpenShift clusters. Cerberus can be used to monitor application pods as well.
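To illustrate the kind of checks involved (this is not Cerberus's actual implementation), here is a minimal sketch using the Kubernetes Python client that flags not-Ready nodes and unhealthy pods in a set of watched namespaces; the namespace list is just an example:

```python
# Minimal sketch (not Cerberus itself) of the kind of checks it performs:
# node readiness plus pod health in a set of watched namespaces.
from kubernetes import client, config

WATCH_NAMESPACES = ["openshift-etcd", "openshift-apiserver", "openshift-sdn"]  # example namespaces

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()


def unhealthy_nodes():
    """Return the names of nodes whose Ready condition is not True."""
    bad = []
    for node in v1.list_node().items:
        ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
        if ready is None or ready.status != "True":
            bad.append(node.metadata.name)
    return bad


def unhealthy_pods(namespace):
    """Return pods in the namespace that are not Running or Succeeded."""
    return [
        pod.metadata.name
        for pod in v1.list_namespaced_pod(namespace).items
        if pod.status.phase not in ("Running", "Succeeded")
    ]


failures = {"nodes": unhealthy_nodes()}
for ns in WATCH_NAMESPACES:
    failures[ns] = unhealthy_pods(ns)

# A simple go/no-go decision: healthy only if nothing failed anywhere.
go_signal = all(len(names) == 0 for names in failures.values())
print(failures, "go" if go_signal else "no-go")
```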
Daemon Mode vs Iterations
Cerberus can be run in two modes: 1) daemon and 2) iterations. In daemon mode, which is the default, it keeps monitoring the cluster until the user interrupts it. A tuning option sets the wait duration before each watch/iteration; this is key, as setting it too low increases the number of requests to the API server and can overload it, so it is important to tweak it appropriately, especially on a large-scale cluster.
In iterations mode, Cerberus runs for the specified number of iterations and exits when done.
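Below is a rough, illustrative sketch of how the two modes and the wait tuning fit together conceptually; the function and parameter names are placeholders, not Cerberus internals:

```python
# Illustrative sketch of the two run modes and the wait tuning discussed above;
# names here are placeholders, not Cerberus's actual options.
import itertools
import time


def run_watch_loop(check_cluster, daemon_mode=True, iterations=5, sleep_seconds=60):
    """Run cluster checks either forever (daemon) or a fixed number of times.

    sleep_seconds is the wait between watches: too low a value hammers the
    API server, which matters especially on large clusters.
    """
    counter = itertools.count() if daemon_mode else range(iterations)
    for i in counter:
        go = check_cluster()
        print(f"iteration {i}: {'go' if go else 'no-go'}")
        time.sleep(sleep_seconds)


# Example: five iterations with a 30-second wait, using a dummy health check.
run_watch_loop(lambda: True, daemon_mode=False, iterations=5, sleep_seconds=30)
```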
Demo
Cerberus can be run using Python or as a container on a host with access to the Kubernetes/OpenShift cluster, as documented. Here is a short demo showing the functionality:
Notifications on Failures
Automation and tools consuming the signal exposed by Cerberus and acting on it is just one side of the story. It is also important to notify the team/cluster admin to take a look at the cluster, analyze it, and fix the issue when something goes wrong. Cerberus has support for Slack integration; when enabled, it pings the Slack channel with the cluster API address, to identify the cluster, and the failures found.
Tagging everyone, or sending a message without tagging anyone, makes it unclear who should take charge of fixing the issue and re-kicking the workload. The Cerberus cop feature addresses this: a cop can be assigned for each week in the config, and Cerberus tags only the person assigned as cop for the current week in the channel.
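As a rough illustration of the idea (not Cerberus's actual Slack integration or config format), here is a sketch that posts a failure message through an incoming webhook and tags only the cop assigned for the current week; the webhook URL, member IDs, and rotation table are placeholders:

```python
# Sketch of the notification idea: ping a Slack channel about a failure and
# tag only this week's assigned "cop". The webhook URL, user IDs, and the
# weekly rotation structure are illustrative, not Cerberus's actual config.
import datetime
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
COP_ROTATION = {  # ISO week number -> Slack member ID (illustrative)
    21: "U012ABCDEF",
    22: "U034GHIJKL",
}


def notify_failure(cluster_api: str, failures: dict) -> None:
    """Post a failure summary to Slack, mentioning the cop on duty this week."""
    week = datetime.date.today().isocalendar()[1]
    cop = COP_ROTATION.get(week)
    mention = f"<@{cop}> " if cop else ""
    text = f"{mention}Cerberus found failures on cluster {cluster_api}: {failures}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
```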
Report/Diagnosis
We have looked at how Cerberus aggregates failures and exposes a go/no-go signal, as well as how it notifies the user. Does it also generate a report and collect data on failures? Yes, it does. It generates a report with details about each watch in every iteration. When inspect component mode is enabled in the config, it also provides more information on a failure by inspecting the failed component and collecting its logs, events, et cetera.
Use Cases
There are a number of potential use cases. Here are a couple of them for which we are using Cerberus:
- We run tools that push the limits of Kubernetes/OpenShift to evaluate performance and scalability. There have been a number of instances where system components or nodes start to degrade, which invalidates the results while the workload generator keeps pushing the cluster until it is unrecoverable. Here, the go/no-go signal exposed by Cerberus is used to stop the test run and alert us to the system degradation.
- Chaos experiments on a Kubernetes/OpenShift cluster can break components unrelated to the targeted component, failures that the chaos experiment itself will not detect. Here, the go/no-go signal is used to decide whether the cluster recovered from the failure injection and whether to continue with the next chaos scenario, as sketched below.
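Here is a minimal sketch of that gating pattern, under the same assumptions as the earlier example: the Cerberus endpoint URL, the timeouts, the scenario names, and the inject_failure() helper are placeholders standing in for a real chaos tool:

```python
# Sketch of gating chaos scenarios on the Cerberus signal: after each failure
# injection, wait for the cluster to report healthy again before moving on.
# The endpoint, timeouts, and inject_failure() are placeholders for illustration.
import time
import requests

CERBERUS_URL = "http://localhost:8080"  # assumed default; match your Cerberus setup


def inject_failure(scenario: str) -> None:
    """Placeholder for the chaos tool's actual failure injection step."""
    print(f"injecting failure scenario: {scenario}")


def wait_for_recovery(timeout_seconds: int = 600, poll_seconds: int = 30) -> bool:
    """Poll Cerberus until it reports 'go' or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            if requests.get(CERBERUS_URL, timeout=5).text.strip() == "True":
                return True
        except requests.RequestException:
            pass  # Cerberus unreachable; keep waiting until the timeout
        time.sleep(poll_seconds)
    return False


for scenario in ["kill-etcd-pod", "drain-worker-node"]:
    inject_failure(scenario)
    if not wait_for_recovery():
        print(f"cluster did not recover after {scenario}; stopping the chaos run")
        break
```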
We in the OpenShift group at Red Hat are planning to enable Cerberus in the upcoming scale test runs on clusters ranging from 250 to 2,000 nodes. It’s going to be interesting to see how well it scales.
Stay tuned for updates on more tooling and automation enhancements as well as highlights from OpenShift 4.x large-scale test runs. Any feedback is appreciated, and as always, feel free to create issues and enhancement requests on GitHub or reach out to us on the sig-scalability channel on the Kubernetes Slack.
About the author
Naga Ravi Chaitanya Elluri leads the Chaos Engineering efforts at Red Hat with a focus on improving the resilience, performance and scalability of Kubernetes and making sure the platform and the applications running on it perform well under turbulent conditions. His interest lies in the cloud and distributed computing space and he has contributed to various open source projects.