Cerberus monitors OpenShift/Kubernetes cluster health and exposes a go/no-go signal consumable by other benchmark tools. Since introducing Cerberus in the first blog post (OpenShift Scale-CI: Part 4: Introduction to Cerberus - Guardian of Kubernetes/OpenShift Clouds), we have brought in a tremendous amount of enhancements to strengthen the guardian of the OpenShift/Kubernetes clusters. Cerberus has now evolved into a highly scalable and reliable tool with the potential to monitor diverse cluster components. In this blog post, we will be taking a look at all of the enhancements, the reasons we added them, and how the user will be able to use them. 

Recent Additions to Monitoring

1: Schedulable Masters

It is essential to have dedicated resources for the master nodes in order to ensure high availability of key components such as API Server, Scheduler, Controller Manager, and etcd that always run on master nodes. Therefore, we need to avoid other noncritical workloads from interfering with the functioning of the master services. Taints enable a node to repel a set of pods from getting associated with it. Cerberus checks for masters without NoSchedule taint and warns the user regarding schedulable master nodes.

2: Cluster Operators

Operators are pieces of software that run in a pod and are good indicators of unhealthy OpenShift clusters. These operators show multiple status, and we can look specifically at the degraded status to determine its overall health. With the OpenShift distribution set, Cerberus takes a look at all of the cluster operators and verifies that they aren’t degraded.

3: Routes

OpenShift should be able to have little or no downtime for a variety of different routes that control the cluster and expose the end user’s applications to the outside world. Being able to monitor how long these routes are down is an important component of a cluster's health. Now, the user has the ability to monitor multiple routes/urls to spot if they are down by providing authorization to access each route.

4: User-Defined Checks

The user can bring in additional checks to monitor the components that are not being directly monitored by Cerberus such as checks to monitor status of application pods, database connectivity, and application server status and thus stepping up the significance and functionality of Cerberus.

5: Distributions supported

Cerberus can run on Kubernetes and OpenShift distributions with a few tweaks to the configuration file. The user can select an appropriate configuration file based on the distribution. Using the OpenShift distribution runs both Kubernetes and OpenShift checks while the Kubernetes distribution examines the components specific to Kubernetes.

Improved Diagnostics

1: Cerberus Alerts on High Kube API Server Latencies

Cerberus queries Prometheus if deployed for KubeAPILatencyHigh alert at the end of each iteration and warns the user if 99th percentile latency for given requests to the kube-apiserver is above 1 second. Higher latency is an indicator of problems in the cluster.

2: Cerberus History API

A single go/no-signal could miss flaky nodes, pods, and operators. History API is designed to provide adequate data enabling the user to decide if the cluster is healthy or not. Failures seen in a specific time window are provided to the user in the json format. It consists of information on the type of failure, name, and type of failed component and includes a timestamp when the failure occurred.

  • Users can retrieve the node/component failures seen in the past one hour by visiting the history url. Users can pass an optional loopback parameter to retrieve failures seen in the past loopback minutes.
  • Users can retrieve the node/component failures between two time timestamps, failures of specific types and failures related to specific components in json format by visiting the analyze url. The user gets the privilege to apply filters to scrape the failures satisfying certain criteria.

History API is available as a Python package to ease the usage in other tools.

Cerberus Scalability

Cerberus is a highly scalable and reliable tool with the ability to monitor the health of 500-plus node clusters in a matter of seconds. Following enhancements were added to boost the scalability:

1: Multiprocessing

Cerberus uses multiple cores to reduce the cluster monitoring time.

    1. It performs all cluster health checks in parallel.
    2. It monitors the system critical and application components in all namespaces in parallel.
    3. It tracks pod crash/restart during wait time between iterations in all namespaces in parallel.
    4. It collects detailed logs, metrics, and events of failed components in parallel.

The graph below shows the time comparisons of health monitoring time by a single process/serial and multiprocessing versions of Cerberus on clusters of different size.


2: Reduction in the Number of API Calls

Cerberus makes a single API call to get the status of all the pods in a namespace instead of iterating through the list of pods and making an API call to get the status of each pod. Similarly, a single API call is used to get the status of all the nodes. Cerberus handles bulk requests in chunks to prevent overloading the API Server by consuming a large amount of resources when there are thousands of objects running in a cluster.

Cerberus Reliability

Cerberus is a highly reliable tool with ability to generate go/no-go signals under scenarios like cluster connection loss, cluster shut down, etcd leader change, rpc error, etc. It has the potential to monitor cluster health till the user terminates Cerberus monitoring. 

New Ways on How to Get Started 

1: Cerberus python package

Cerberus is available as a Python package to ease the installation and usage. 

2: Cerberus as a Kubenetes Deployment

Cerberus can be run internally to the cluster as a Kubernetes deployment. However, it is not a recommended option as the pod which is running Cerberus might get disrupted. It is advised to run Cerberus external to the cluster to ensure the receipt of a go/no-go signal is not affected when the cluster is down, Kube API stops serving requests, etc.



We are always looking for new ideas and enhancements to this project. Feel free to create issues on github or open your own pull request for future improvements you would like to see. Also, reach out to us with any questions, concerns or need help on slack