
When an operator fails, critical functionality can be lost. When an admin sees that an operator is at risk, the problem must be fixed before it becomes critical. As an operator developer, you want to get this valuable information to your users quickly. Here's how the Red Hat OpenShift Virtualization operator team created an operator health metric that can help.

Screenshot of a cluster overview with OpenShift Virtualization alerts highlighted

Currently, operators have different metrics and alerts that are unique to their needs and reflect their conditions. Yet there is no aggregation that can tell the user, through one single pane of glass, what the operator's general health is. Moreover, it’s not clear to users which alerts impact the operator's health and which are related to workloads.

In the OpenShift Virtualization operator, we created a metric that performs this aggregation and enables us to clearly define the operator's health.

To achieve this, we added three labels to each of the operator alerts:

1. kubernetes_operator_part_of – operator name

2. kubernetes_operator_component – operator component (can be the same as the above)

3. operator_health_impact – indicates the impact of the issue on the operator's health. This label will get one of the following values:

  • critical indicates an issue that negatively impacts the operator's health, and the user should take action immediately to solve it.
  • warning indicates an issue that might soon negatively impact the operator's health, and the user should solve the issue.
  • none indicates an issue that doesn't affect the operator's health. In most cases, this is related to the workload rather than the operator itself.
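The three labels above can be sketched as a small helper that builds the label set for an alert rule and rejects unknown health-impact values. This is a minimal illustration, not the operator's actual code: the helper name, the alert name, the expression, and the label values are all hypothetical.

```python
# Allowed values for operator_health_impact, as listed above.
HEALTH_IMPACT_VALUES = ("critical", "warning", "none")


def operator_alert_labels(part_of, component, health_impact):
    """Return the three operator labels to attach to an alert rule."""
    if health_impact not in HEALTH_IMPACT_VALUES:
        raise ValueError(f"unknown operator_health_impact: {health_impact!r}")
    return {
        "kubernetes_operator_part_of": part_of,
        "kubernetes_operator_component": component,
        "operator_health_impact": health_impact,
    }


# Hypothetical alert rule: the alert name, expression, and label values are
# illustrative only and are not taken from the operator's real rules.
example_rule = {
    "alert": "ExampleOperatorDown",
    "expr": "up{job='example-operator'} == 0",
    "for": "5m",
    "labels": {
        "severity": "critical",
        **operator_alert_labels("example-operator", "example-component", "critical"),
    },
}
```

Keeping the allowed values in one place means a typo such as "crit" fails at rule-generation time instead of silently producing an alert that the health aggregation ignores.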

We also added a new metric, kubevirt_hco_system_health_status, that aggregates the operator's conditions.

Its value is calculated as follows:

  • Degraded = True or Available = False -> kubevirt_hco_system_health_status = 2 (Error)
  • Progressing = True or ReconcileComplete = False -> kubevirt_hco_system_health_status = 1 (Warning)
  • Otherwise (Available = True and Degraded = False and Progressing = False and ReconcileComplete = True) -> kubevirt_hco_system_health_status = 0 (Healthy)
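The mapping from conditions to a status value can be written as a short function. This is a sketch of the logic described above, not the operator's implementation; the function name and the boolean parameters are assumptions made for illustration.

```python
def system_health_status(available, degraded, progressing, reconcile_complete):
    """Map the four operator conditions to 0 (Healthy), 1 (Warning), or 2 (Error).

    The checks run in order of severity, so an Error condition wins over a
    Warning condition when both are present.
    """
    if degraded or not available:
        return 2  # Error
    if progressing or not reconcile_complete:
        return 1  # Warning
    return 0  # Healthy


# A fully reconciled, available operator is healthy.
healthy = system_health_status(
    available=True, degraded=False, progressing=False, reconcile_complete=True
)
```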

We created a recording rule for the health metric, kubevirt_hyperconverged_operator_health_status, that indicates the overall health of our operator. Its values are healthy (0), warning (1), or critical (2).
It is based on both alerts and operator conditions. Its value is the worse of the two statuses described above, that is, the higher of the two numbers.
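To illustrate how the two signals might be combined, the sketch below assumes the overall status is the worse of the two, that is, the higher number on the 0 to 2 scale, so that a firing critical alert cannot be masked by healthy conditions. The function names and the list-of-strings input are hypothetical; the real metric is produced by a Prometheus recording rule, not by Python code.

```python
# Numeric status for each operator_health_impact label value.
IMPACT_TO_STATUS = {"none": 0, "warning": 1, "critical": 2}
STATUS_NAMES = {0: "healthy", 1: "warning", 2: "critical"}


def alerts_health_status(firing_impacts):
    """Return the worst status among the firing alerts' health-impact labels."""
    return max((IMPACT_TO_STATUS[impact] for impact in firing_impacts), default=0)


def operator_health_status(conditions_status, firing_impacts):
    """Combine the conditions-based status with the alerts-based status.

    Assumption: the overall status is the worse (numerically higher) of the two.
    """
    return max(conditions_status, alerts_health_status(firing_impacts))


# Conditions look healthy (0), but one firing alert has a warning impact.
overall = operator_health_status(0, ["none", "warning"])
overall_name = STATUS_NAMES[overall]
```

Alerts labeled with an impact of none contribute 0, so workload-only alerts never change the operator's reported health.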

This metric provides a clear operator health status and enables you to easily detect the causes of the issue by checking the firing alerts and their health impact.

Health metrics based on Conditions vs. Alerts

| Conditions-based health metric | Alerts-based health metric |
| --- | --- |
| Doesn't necessarily indicate a real issue, since issues can be resolved by Kubernetes. | Usually indicates a real issue, since there is an evaluation time prior to each alert. |
| Harder to track the issue and the sub-operator, since the data in the conditions is aggregated. | Easy to track what the issue is, since we can see the alerts. |
| Precision depends on the code coverage. | Precision depends on the alerts coverage. |
| Doesn't require additional code changes. | Requires adding the labels operator_health_impact (to differentiate between operator and workload alerts and determine whether they impact the operator's health) and kubernetes_operator_part_of (to identify the alerts related to the specific operator). |
| It is the convention for reporting health status in OCP operators. | Not the convention. The community is not sure why this approach would be better. |

 


About the author

Shirly has been with Red Hat since January 2014 and currently serves as the OpenShift Virtualization Observability Team Lead. In this role, Shirly concentrates on improving the operator observability within the Kubernetes and OpenShift ecosystem, to ensure it effectively addresses user needs. Over the years, Shirly has been involved in numerous projects, driven by a genuine passion for enhancing system efficiency and user experience. She is also proactive in sharing the knowledge and best practices garnered with the broader operators community.
