When an operator fails, critical functionality can be lost. When an admin sees that an operator is at risk, the problem must be fixed before it becomes critical. As an operator developer, you want to help your users get this valuable information fast. Here's how the Red Hat OpenShift Virtualization operator team created an operator health metric that can help.

Currently, operators have different metrics and alerts that are unique to their needs and reflect their conditions. Yet there is no aggregation that can tell the user, through one single pane of glass, what the operator's general health is. Moreover, it’s not clear to users which alerts impact the operator's health and which are related to workloads.
In the OpenShift Virtualization operator, we created a metric that performs this aggregation and enables us to clearly define the operator's health.
To achieve this, we added three labels to each of the operator alerts:
1. kubernetes_operator_part_of: the operator name
2. kubernetes_operator_component: the operator component (can be the same as the above)
3. operator_health_impact: indicates the impact of the issue on the operator's health. This label will get one of the following values:
   - critical: indicates an issue that negatively impacts the operator's health, and the user should take action immediately to solve it.
   - warning: indicates an issue that might soon negatively impact the operator's health, and the user should solve the issue.
   - none: indicates an issue that doesn't affect the operator's health. In most cases, this is related to the workload rather than the operator itself.
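To show how these labels make alerts easy to filter, here is a minimal Python sketch. It is only a model of the idea: in a real cluster this filtering would normally be done with a query over the alert series, not in Python. The function name `worst_operator_impact`, the alert names, and the label values `kubevirt`, `example-component`, and `other-operator` are assumptions made up for the example.

```python
# Rank the three operator_health_impact values from least to most severe.
IMPACT_RANK = {"none": 0, "warning": 1, "critical": 2}


def worst_operator_impact(alerts, operator_name):
    """Return the most severe operator_health_impact among the alerts that
    belong to operator_name, or "none" if no such alert is present."""
    worst = "none"
    for labels in alerts:
        # Skip alerts that belong to a different operator.
        if labels.get("kubernetes_operator_part_of") != operator_name:
            continue
        impact = labels.get("operator_health_impact", "none")
        if IMPACT_RANK.get(impact, 0) > IMPACT_RANK[worst]:
            worst = impact
    return worst


# Hypothetical firing alerts, each represented by its label set.
firing = [
    {"alertname": "ExampleWorkloadAlert",
     "kubernetes_operator_part_of": "kubevirt",
     "kubernetes_operator_component": "example-component",
     "operator_health_impact": "none"},
    {"alertname": "ExampleOperatorAlert",
     "kubernetes_operator_part_of": "kubevirt",
     "kubernetes_operator_component": "example-component",
     "operator_health_impact": "warning"},
    {"alertname": "OtherOperatorAlert",
     "kubernetes_operator_part_of": "other-operator",
     "kubernetes_operator_component": "other-operator",
     "operator_health_impact": "critical"},
]

print(worst_operator_impact(firing, "kubevirt"))  # prints "warning"
```

The critical alert from the other operator is ignored because its kubernetes_operator_part_of label does not match, and the workload alert does not count because its impact is none.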
We also created a new metric, kubevirt_hco_system_health_status, that aggregates the operator conditions.
Its value is calculated by the following:
- Degraded=True or !Available=True -> kubevirt_hco_system_health_status = 2 (Error)
- Progressing=True or !ReconcileComplete=True -> kubevirt_hco_system_health_status = 1 (Warning)
- Else (Available=True and Degraded=False and Progressing=False and ReconcileComplete=True) -> kubevirt_hco_system_health_status = 0 (Healthy)
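The mapping above can be sketched in a few lines of Python. This is an illustration of the decision order, not the operator's actual code: the function name `system_health_status` and the dictionary representation of conditions are assumptions, and a condition that is missing or has a status other than "True" is treated here as not true.

```python
def system_health_status(conditions):
    """Map operator conditions to 0 (Healthy), 1 (Warning) or 2 (Error).

    conditions maps a condition type to its status string, for example
    {"Available": "True", "Degraded": "False"}.
    """
    def is_true(name):
        # Missing or "Unknown" conditions count as not true (an assumption).
        return conditions.get(name) == "True"

    # Error takes precedence over Warning, so it is checked first.
    if is_true("Degraded") or not is_true("Available"):
        return 2
    if is_true("Progressing") or not is_true("ReconcileComplete"):
        return 1
    return 0


healthy = {"Available": "True", "Degraded": "False",
           "Progressing": "False", "ReconcileComplete": "True"}
upgrading = dict(healthy, Progressing="True")
degraded = dict(healthy, Degraded="True")

print(system_health_status(healthy))    # prints 0
print(system_health_status(upgrading))  # prints 1
print(system_health_status(degraded))   # prints 2
```

Checking the error case first means that an operator that is both degraded and progressing is reported as 2, which matches the order of the rules listed above.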
We created a recording rule for the health metric called kubevirt_hyperconverged_operator_health_status that indicates the overall health of our operator. Its values are healthy (0), warning (1) or critical (2).
It is based on both alerts and operator conditions. Its value reflects the less healthy of the two inputs mentioned above.
This metric provides a clear operator health status and enables you to easily detect the causes of the issue by checking the firing alerts and their health impact.
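As a rough model of how the two inputs could be combined, the following Python sketch assumes that the overall status is the less healthy of the conditions-based value and the worst firing alert impact, which under the 0/1/2 encoding is the larger number. The real metric is produced by a recording rule, so the function name `operator_health_status` and its arguments are hypothetical and exist only for this example.

```python
# Numeric encoding used by the health metric: healthy, warning, critical.
IMPACT_VALUE = {"none": 0, "warning": 1, "critical": 2}


def operator_health_status(conditions_status, alert_impacts):
    """Combine the conditions-based status (0, 1 or 2) with the
    operator_health_impact values of the operator's firing alerts."""
    # With no firing alerts, the alerts side contributes 0 (healthy).
    alerts_status = max((IMPACT_VALUE[i] for i in alert_impacts), default=0)
    # Assumption: the overall status is the less healthy of the two inputs.
    return max(conditions_status, alerts_status)


print(operator_health_status(0, []))                   # prints 0
print(operator_health_status(0, ["none", "warning"]))  # prints 1
print(operator_health_status(1, ["critical"]))         # prints 2
print(operator_health_status(2, ["none"]))             # prints 2
```

In this model, alerts with impact none never change the result, which is how workload-related alerts stay out of the operator's health status.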
Health metrics based on Conditions vs. Alerts
| Conditions-based health metric | Alerts-based health metric |
| --- | --- |
| Doesn't necessarily indicate a real issue, since issues can be resolved by Kubernetes. | Usually indicates a real issue, since there is an evaluation time prior to each alert. |
| Harder to track the issue and the sub-operator, since the data in the conditions is aggregated. | Easy to track what the issue is, since we can see the alerts. |
| Precision depends on the code coverage. | Precision depends on the alerts coverage. |
| Doesn't require additional code changes. | Requires adding the labels: operator_health_impact, to differentiate between operator and workload alerts and determine whether they impact the health of the operator; kubernetes_operator_part_of, to identify the alerts related to the specific operator. |
| It is the convention for reporting health status in OCP operators. | Not the convention. The community is not sure why this approach would be better. |
About the author
Shirly has been with Red Hat since January 2014 and currently serves as the OpenShift Virtualization Observability Team Lead. In this role, Shirly concentrates on improving the operator observability within the Kubernetes and OpenShift ecosystem, to ensure it effectively addresses user needs. Over the years, Shirly has been involved in numerous projects, driven by a genuine passion for enhancing system efficiency and user experience. She is also proactive in sharing the knowledge and best practices garnered with the broader operators community.