Introducing Cluster Observability Operator

November 24, 2023 | 4-minute read | Observability

Daniel Mohr, Manager, Software Engineering - OpenShift Observability

Roger Florén, Principal Product Manager - Observability

Today we're unveiling the Cluster Observability Operator (COO), a new Red Hat OpenShift Operator designed to manage observability stacks on your clusters. Its upstream variant can be used on vanilla Kubernetes. This is more than an Operator; it’s also a testament to our commitment to delivering tightly integrated observability solutions that evolve with our customers' and users' needs.

COO is now available as a technology preview for all OpenShift users. Its initial feature set introduces the MonitoringStack custom resource definition (CRD), which lets you run highly available monitoring stacks consisting of Prometheus, Alertmanager and Thanos Querier. Additional observability components may be added in a future release (see the “Looking forward” section below for more details).

COO complements the built-in monitoring capabilities of OpenShift and can be run in parallel with the default platform monitoring and user workload monitoring stacks managed by the Cluster Monitoring Operator (CMO).

From a product perspective, COO is a strategic enhancement reflecting our deep understanding of the evolving Kubernetes/OpenShift landscape. By incorporating the latest technological advancements, COO is designed to closely integrate with existing systems, so our customers can better stay ahead in the rapidly advancing world of cloud-native technologies.

Background

OpenShift ships with built-in monitoring capabilities by default. On an OpenShift cluster, the CMO manages two monitoring stacks:

The platform monitoring stack, which monitors the cluster infrastructure and all OpenShift components and acts as a data source for the OpenShift Console.
The optional user workload monitoring stack, which can be used to monitor custom workloads.

With its opinionated configuration tuned for reliability and easy operation, curated alerting rules, accessible dashboards, and simple but reliable tenancy model, the default OpenShift monitoring stack has set an industry standard and played a key role in the success of OpenShift in enterprises worldwide.

OpenShift monitoring’s design decisions and tradeoffs between supportability and flexibility in configuration all fit neatly with the most common enterprise use cases, in which small- to mid-sized clusters are deployed with ownership shared between two roles: administrators and developers.

Typically, a single OpenShift cluster in this environment is managed and used by one site reliability engineering (SRE) team that is responsible for operating the cluster infrastructure, and by multiple development teams that use the cluster and own one or more namespaces in which they run their workloads.

The SRE team can base its service level objectives (SLOs) on the built-in platform metrics, alerts and dashboards. The development teams can use the user workload monitoring stack to monitor their custom workloads, restricting access to metrics and controlling their visibility at the namespace level.

Recently, however, we've been seeing an ever-increasing number of customer needs that don’t fall into the standard use case described above. A few examples include:

Very small clusters and resource-constrained environments (for example, edge use cases)
Very large clusters with hundreds or thousands of nodes
More complex ownership models with multiple levels of responsibility for different parts of a cluster
More complex requirements regarding tenancy
A large number of clusters with the requirement to observe them in a more centralized way

Additionally, we’ve begun thinking of monitoring as only one part of a complete observability story. In OpenShift, metrics, logs and traces have traditionally been set up and dealt with separately, with logs and traces being optional components in a default OpenShift installation.

If we take a more holistic approach to observability that includes all of these different signals and then correlate and present them in a unified way, we can work toward a solution for customers that will make observing large platform operations easier and help reduce complexity in both setup and use.

We have created the Cluster Observability Operator as part of this holistic approach toward addressing these customer needs and use cases.

In creating COO, our product vision was to develop a tool that not only addresses current user requirements but also anticipates future trends in cluster management and observability. This forward-thinking approach means that COO is a solution for today and a strategic asset that will continue to deliver value as customer needs and industry standards evolve.

Cluster Observability Operator explained

Cluster Observability Operator can be installed and managed on OpenShift using the Operator Lifecycle Manager from the official Red Hat channels. For other Kubernetes distributions, please refer to the upstream documentation.

After installing COO, you can create a MonitoringStack custom resource in your namespace to spin up a monitoring stack with the default configuration:

apiVersion: monitoring.rhobs/v1alpha1
kind: MonitoringStack
metadata:
  labels:
    coo: example
  name: sample-monitoring-stack
  namespace: coo-demo
spec:
  logLevel: debug
  retention: 1d
  resourceSelector:
    matchLabels:
      app: demo

Under the hood, COO runs Prometheus Operator, creating a highly available Prometheus instance paired with Thanos Querier and Alertmanager instances.

Using COO, you can run any number of monitoring stacks on your cluster with this approach, enabling many use cases that haven't previously been possible using default OpenShift monitoring.

Additionally, COO leverages Server-Side Apply to enable fine-grained control of the underlying configuration (for example, of the Prometheus object) without moving full ownership of the resource to the user.
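As a rough sketch of what this could look like, the partial manifest below sets a single field on the Prometheus object that backs the sample stack. It assumes the generated object lives in the monitoring.rhobs API group and shares the MonitoringStack's name and namespace; the enforcedSampleLimit value is purely illustrative.

```yaml
# Partial manifest: lists only the field the user wants to own.
# Assumption: COO creates a Prometheus object named after the MonitoringStack.
apiVersion: monitoring.rhobs/v1
kind: Prometheus
metadata:
  name: sample-monitoring-stack
  namespace: coo-demo
spec:
  # Illustrative value: cap the number of samples accepted per scrape.
  enforcedSampleLimit: 1000
```

Submitted with kubectl apply --server-side, a manifest like this should claim ownership of only the listed field, leaving the rest of the object under COO's control.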

With these two basic concepts, COO enables:

Scalability: The stack can be configured to fit both the smallest environments (for example, only scrape-and-forward with remote write) and the largest environments (for example, through manual sharding by running multiple stacks on one cluster).
Multitenancy: COO-managed stacks can fit into any ownership model. For example, additional SRE teams can operate shared services on the cluster for other teams.
Flexibility: Any number of scrape targets and alerting rules can be added to a COO-managed stack by leveraging the Prometheus Operator CRDs.
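
To illustrate the flexibility point, a scrape target might be attached to the sample stack with a ServiceMonitor like the one below. This is a minimal sketch: the monitoring.rhobs/v1 API version for ServiceMonitor, the Service label and the port name web are assumptions to adapt to your workload. The app: demo label on the ServiceMonitor itself is what the stack's resourceSelector would match.

```yaml
apiVersion: monitoring.rhobs/v1   # assumption: COO's own API group, not monitoring.coreos.com
kind: ServiceMonitor
metadata:
  name: demo-app                  # hypothetical name
  namespace: coo-demo
  labels:
    app: demo                     # matched by the MonitoringStack's resourceSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: demo-app   # hypothetical label on the target Service
  endpoints:
    - port: web                   # hypothetical named port exposing /metrics
      interval: 30s
```

Alerting rules could be added in the same way with a PrometheusRule object that carries the same app: demo label.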

Looking forward

As mentioned, deploying and managing monitoring stacks using COO is expected to be only an initial feature set. In future releases, we plan to add capabilities for managing logging and distributed tracing stacks, all with the benefits described above.

As we look to the future, our product roadmap for COO is ambitious and aligns with our goal of continuous innovation. By expanding its capabilities to encompass logging and distributed tracing, we are not just enhancing a product; we're evolving an ecosystem. This holistic approach to observability underlines our commitment to delivering comprehensive, industry-leading solutions that are in tune with the needs of our users and the direction of the market.

Additionally, creating an ObservabilityStack CRD and managing other observability signals under COO will add another abstraction layer that further simplifies the configuration of observability components and enables us to add functionality that works across all observability signals.

Introducing the Cluster Observability Operator is a new milestone in the OpenShift ecosystem. It reflects our commitment to innovation, adaptability and customer-centric development. COO enhances our current offerings and sets the stage for future developments in observability. We would highly value your feedback, additional ideas and any community contributions to the upstream project as we evolve and refine this tool.

About the authors

Daniel Mohr

Manager, Software Engineering - OpenShift Observability

Daniel Mohr joined Red Hat in 2021 with a background in embedded Linux software development, site reliability engineering for large scale web applications and leading SRE teams. In his role as an engineering manager he works with topics like the Red Hat OpenShift monitoring stack, Multicluster Observability and Power Monitoring for OpenShift as part of the Red Hat Observability group.

Roger Florén

Principal Product Manager - Observability

Roger Florén, a dynamic and forward-thinking leader, currently serves as the Principal Product Manager at Red Hat, specializing in Observability. His journey in the tech industry is marked by high performance and ambition, transitioning from a senior developer role to a principal product manager. With a strong foundation in technical skills, Roger is constantly driven by curiosity and innovation. At Red Hat, Roger leads the Observability platform team, working closely with in-cluster monitoring teams and contributing to the development of products like Prometheus, Alertmanager, Thanos and Observatorium. His expertise extends to coaching, product strategy, interpersonal skills, technical design, IT strategy and agile project management.
