In this blog post, I present the Node Health Check (NHC) Operator and how it advances automatic remediation of nodes.
It is well understood that hardware fails and software contains bugs, and that deployments are at risk when they do. In some situations, a cluster may have enough compute capacity to fail over an application but cannot safely do so, for example when persistent storage is in use and the state of the node is unknown, or in some network partitioning scenarios involving StatefulSets. In those cases, a failover risks creating divergent data sets or corrupting data.
Even for stateless workloads, a node going into an unknown state means loss of compute capacity. In some situations, reprovisioning may solve this problem (say, correct some config values), and in other cases, a reboot will do.
Node Health Check Operator is an operator that monitors the node's conditions using a set of criteria, makes a health determination, and delegates any required remediation to the configured mechanism using a remediation template. Node Health Check Operator installs Poison Pill Operator, and both components are ready to automatically remediate worker nodes, out of the box, with no further configuration needed.
Prior to the release of the NHC Operator, the Poison Pill Operator was driven by the Machine Health Check (MHC) controller, which requires a functioning Machine API (most commonly associated with IPI deployments). Customers who could not or would not use the Machine API had no supported means of making workloads highly available in the presence of node-level failures.
Building upon the experience gained in using Poison Pill and the Machine Health Check controller, the team decided to generalize the solution and pushed an upstream enhancement for the External Remediation API [4], with NHC as the detection and delegation mechanism.
mediK8S is the umbrella project upstream for the development of the remediation design and implementation of Poison Pill, NHC, and more in-progress remediator providers like Machine Deletion (a provider that is Machine API based).
Installation
NHC Operator is available now as a tech preview in the marketplace for OpenShift 4.9.
Without any configuration or customization, the installation pulls and installs Poison Pill Operator, and both are configured to remediate worker nodes. Here is the auto-created CR:
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-worker-default
spec:
  # mandatory
  remediationTemplate:
    kind: PoisonPillRemediationTemplate
    apiVersion: medik8s.io/v1alpha1
    name: poison-pill-default-template
    namespace: poison-pill
  # see k8s doc on selectors https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#resources-that-support-set-based-requirements
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
  minHealthy: "51%"
  unhealthyConditions:
  - type: Ready
    status: "False"
    duration: 300s
  - type: Ready
    status: Unknown
    duration: 300s
  pauseRequests: []
Let's break this down step by step. For each CR, there is a set of nodes selected by the selector (here, picking all worker nodes). All actions and calculations are performed on that set.
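To make the selection step concrete, here is a minimal Python sketch of how set-based `matchExpressions` could be evaluated against node labels. It is an illustration of the Kubernetes label selector semantics, not the operator's actual code; the function name `matches` and the sample node names are hypothetical.

```python
def matches(labels, match_expressions):
    """Return True if the labels satisfy every expression (expressions are ANDed)."""
    for expr in match_expressions:
        key = expr["key"]
        operator = expr["operator"]
        values = expr.get("values", [])
        if operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        elif operator == "In":
            ok = key in labels and labels[key] in values
        elif operator == "NotIn":
            # A missing key also satisfies NotIn.
            ok = key not in labels or labels[key] not in values
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


# Hypothetical nodes; the selector is the one from the default CR.
nodes = {
    "worker-0": {"node-role.kubernetes.io/worker": ""},
    "master-0": {"node-role.kubernetes.io/master": ""},
}
selector = [{"key": "node-role.kubernetes.io/worker", "operator": "Exists"}]
selected = [name for name, labels in nodes.items() if matches(labels, selector)]
```

With the default selector, only nodes carrying the worker role label end up in the set that NHC evaluates.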
Next is the set of node conditions that define an 'unhealthy' state.
The node condition "Ready" is set to "Unknown" when the control plane has not received a heartbeat from the node for more than 40 seconds (the default).
The condition is set to "False" when initialization steps or validations fail (for example, because of network plug-in issues).
unhealthyConditions:
- type: Ready
  status: "False"
  duration: 300s
- type: Ready
  status: Unknown
  duration: 300s
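The health determination can be sketched roughly as follows. This is a simplified Python illustration, not the operator's implementation: it assumes the duration is measured from each condition's `lastTransitionTime`, and the names `UNHEALTHY_CONDITIONS` and `is_unhealthy` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the unhealthyConditions of the default CR (300s each).
UNHEALTHY_CONDITIONS = [
    {"type": "Ready", "status": "False", "duration": timedelta(seconds=300)},
    {"type": "Ready", "status": "Unknown", "duration": timedelta(seconds=300)},
]


def is_unhealthy(node_conditions, now, rules=UNHEALTHY_CONDITIONS):
    """Return True if any node condition has matched a rule for at least its duration.

    Assumption: the elapsed time is counted from the condition's lastTransitionTime.
    """
    for cond in node_conditions:
        for rule in rules:
            if (
                cond["type"] == rule["type"]
                and cond["status"] == rule["status"]
                and now - cond["lastTransitionTime"] >= rule["duration"]
            ):
                return True
    return False


now = datetime(2021, 11, 1, 12, 0, 0, tzinfo=timezone.utc)
# A node that lost contact one minute ago is not yet considered unhealthy.
recent = [{"type": "Ready", "status": "Unknown",
           "lastTransitionTime": now - timedelta(seconds=60)}]
# A node that has been Unknown for ten minutes is.
stale = [{"type": "Ready", "status": "Unknown",
          "lastTransitionTime": now - timedelta(seconds=600)}]
```

The duration acts as a grace period, so a short network blip or a quick kubelet restart does not trigger a remediation.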
When a node meets the criteria above, it is marked as unhealthy, and NHC must decide whether to attempt remediation. There may be several reasons not to. For example, to avoid causing further disruption and potentially fencing most of the cluster's capacity, NHC defines a minimum number of healthy nodes (again, counted within the selected set).
spec:
  minHealthy: "51%"
In a cluster of six workers, the default NHC will remediate a node only if at least four out of six are healthy. The percentage calculation always rounds up the outcome.
One can specify a fixed number instead.
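The arithmetic behind the six-worker example can be sketched in a few lines of Python. This is only an illustration of the rounding rule described above; the function names `min_healthy_required` and `may_remediate` are hypothetical and not part of the operator.

```python
def min_healthy_required(min_healthy, total_nodes):
    """Resolve a minHealthy value (fixed int or percentage string) to a node count."""
    if isinstance(min_healthy, int):
        return min_healthy
    percent = int(min_healthy.rstrip("%"))
    # Integer ceiling division: always round up, without floating-point error.
    return -(-percent * total_nodes // 100)


def may_remediate(healthy_nodes, total_nodes, min_healthy="51%"):
    """Remediation is considered only while enough selected nodes are healthy."""
    return healthy_nodes >= min_healthy_required(min_healthy, total_nodes)


# 51% of 6 is 3.06, which rounds up to 4.
required = min_healthy_required("51%", 6)
```

With six workers, `may_remediate(4, 6)` holds while `may_remediate(3, 6)` does not, which matches the behavior described above.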
One special feature, exclusive to OpenShift at the moment, is that in case the cluster is being updated (with an upgrade or downgrade), then remediation will be skipped until the ClusterVersion reports a non-progressing state again.
Another way to pause remediation for a certain CR is to add an entry to its 'pauseRequests' list. New remediations are started only when that list is empty, but pausing does not stop or cancel any in-flight remediations:
pauseRequests:
- "paused by some process"
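Taken together, the upgrade check and the pause list act as gates in front of any new remediation. The Python sketch below illustrates that idea only; the function name `remediation_gate`, the order of the checks, and the reason strings are assumptions, not the operator's behavior.

```python
def remediation_gate(pause_requests, cluster_upgrading):
    """Return (allowed, reason) for starting a NEW remediation.

    In-flight remediations are not affected by either gate.
    """
    if cluster_upgrading:
        return False, "cluster update in progress"
    if pause_requests:
        return False, "paused: " + "; ".join(pause_requests)
    return True, ""


allowed, reason = remediation_gate(["paused by some process"], cluster_upgrading=False)
```

Here `allowed` is `False` and `reason` carries the text of the pause request, which is a convenient place for whoever paused remediation to leave an explanation.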
Remediation
NHC Operator uses the External Remediation API, which was established through a joint effort with Ericsson in the k8s community. Essentially, each NHC has an object template from which it derives the specific remediation resource to create. When all checks pass, the controller creates a new custom resource of the configured type:
spec:
  # mandatory
  remediationTemplate:
    kind: PoisonPillRemediationTemplate
    apiVersion: medik8s.io/v1alpha1
    name: poison-pill-default-template
    namespace: openshift-operators
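As a rough illustration of how a remediation resource could be derived from a template, consider the Python sketch below. It rests on assumptions that the text above does not spell out: that the resource kind is the template kind without the `Template` suffix, that the resource is named after the unhealthy node, that it lives in the template's namespace, and that its spec is copied from `spec.template.spec` of the template. The function name `build_remediation_cr` and the node name are hypothetical.

```python
import copy


def build_remediation_cr(template, node_name):
    """Derive a remediation custom resource (as a dict) from a template object.

    Assumptions: kind = template kind minus "Template", name = node name,
    namespace = template namespace, spec = template's spec.template.spec.
    """
    kind = template["kind"]
    suffix = "Template"
    if not kind.endswith(suffix):
        raise ValueError(f"unexpected template kind: {kind}")
    inner_spec = template.get("spec", {}).get("template", {}).get("spec", {})
    return {
        "apiVersion": template["apiVersion"],
        "kind": kind[: -len(suffix)],
        "metadata": {
            "name": node_name,
            "namespace": template["metadata"]["namespace"],
        },
        "spec": copy.deepcopy(inner_spec),
    }


template = {
    "apiVersion": "medik8s.io/v1alpha1",
    "kind": "PoisonPillRemediationTemplate",
    "metadata": {
        "name": "poison-pill-default-template",
        "namespace": "openshift-operators",
    },
    "spec": {"template": {"spec": {}}},
}
cr = build_remediation_cr(template, "worker-1")
```

Because NHC only creates the resource, whichever remediator owns that kind decides what to do with the node. This is what makes the mechanism pluggable.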
It is worth mentioning that the node-problem-detector can be used to create more refined node conditions and help trigger remediation in more elaborate situations. Please refer to the node-problem-detector project page for more information on how to define custom NodeConditions.
In the future, we expect to see more specialized remediators that handle certain use cases and/or special infrastructure. One example is the Machine Deletion remediator being developed upstream.
Conclusions
OpenShift 4.9 offers the Node Health Check Operator as a tech-preview operator in the marketplace. It remediates nodes with an out-of-the-box behavior that automatically reboots a node and brings stateful and stateless workloads safely back up in minutes.
References:
- Poison Pill blog post https://cloud.redhat.com/blog/kubernetes-self-remediation-aka-poison-pill
- NHC Operator docs https://docs.openshift.com/container-platform/4.9/nodes/nodes/eco-node-health-check-operator.html
- Machine Health Check docs https://docs.openshift.com/container-platform/4.1/machine_management/deploying-machine-health-checks.html
- External Remediation API enhancement PR https://github.com/kubernetes-sigs/cluster-api/pull/3190
- MediK8S home page https://www.medik8s.io