Node Health Check Operator

February 9, 2022 | Roy Golan | 3-minute read

In this blog post I present the Node Health Check (NHC) Operator and how it advances automatic remediation for nodes.

It is well understood that hardware fails and software contains bugs, and that deployments are at risk when they do. In some situations, a cluster may have enough compute capacity to fail over an application but still not do so, either because persistent storage is in use and the state of a node is unknown, or because of a network partition involving StatefulSets. In these cases a failover risks creating divergent data sets or corrupting data.

Even for stateless workloads, a node going into an unknown state means loss of compute capacity. In some situations, reprovisioning may solve this problem (say, correct some config values), and in other cases, a reboot will do.

Node Health Check Operator is an operator that monitors the node's conditions using a set of criteria, makes a health determination, and delegates any required remediation to the configured mechanism using a remediation template. Node Health Check Operator installs Poison Pill Operator, and both components are ready to automatically remediate worker nodes, out of the box, with no further configuration needed.

Prior to the release of the NHC Operator, the Poison Pill Operator was driven by the Machine Health Check (MHC) controller, which requires a functioning Machine API (most commonly associated with IPI deployments). Customers who could not or would not use the Machine API had no supported means of making workloads highly available in the presence of node level failures.

Building upon the experience gained in using Poison Pill and the Machine Health Check controller, the team decided to generalize the solution and pushed an upstream enhancement for the External Remediation API[4], with NHC as the detection and delegation mechanism.

mediK8S is the umbrella project upstream for the development of the remediation design and implementation of Poison Pill, NHC, and more in-progress remediator providers like Machine Deletion (a provider that is Machine API based).

Installation

NHC Operator is available now as a tech preview in the marketplace for OpenShift 4.9.

Without any configuration and customization, the installation pulls and installs Poison Pill Operator, and both are configured to remediate worker nodes. Here is the auto-created CR:

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-worker-default
spec:
  # mandatory
  remediationTemplate:
    kind: PoisonPillRemediationTemplate
    apiVersion: medik8s.io/v1alpha1
    name: poison-pill-default-template
    namespace: poison-pill
  # see k8s doc on selectors https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#resources-that-support-set-based-requirements
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  minHealthy: "51%"
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
  pauseRequests: []

Let's break this down step by step. For each CR, there is a set of nodes selected by the selector (here, all worker nodes). All actions and calculations are performed on that set.
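Because the selector is a standard Kubernetes label selector, the selection set can be narrowed by adding further expressions, all of which must match. The fragment below is a minimal sketch; the label `example.com/skip-remediation` is hypothetical and is not part of the default CR.

```yaml
spec:
  selector:
    matchExpressions:
      # same expression as the default CR: select worker nodes
      - key: node-role.kubernetes.io/worker
        operator: Exists
      # hypothetical label used only for illustration:
      # nodes carrying it are left out of the selection set
      - key: example.com/skip-remediation
        operator: DoesNotExist
```

With a selector like this, a worker labeled `example.com/skip-remediation` would no longer count toward the set that NHC evaluates.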

Next is the set of node conditions that define an 'unhealthy' state.

The node condition "Ready" is marked "Unknown" when the control plane has not received a heartbeat from the node for more than 40 seconds (the default).

The "Ready" condition is marked "False" when initialization steps or validations fail (for example, because of network plug-in issues).

unhealthyConditions:
  - type: Ready
    status: "False"
    duration: 300s
  - type: Ready
    status: Unknown
    duration: 300s

When a node meets the criteria above, it is marked as unhealthy, and we then need to decide whether to attempt remediation. There may be several reasons not to. For example, to avoid causing further disruption and potentially fencing most of the cluster's capacity, NHC defines a minimum number of healthy nodes (again, within the selection set).

spec:
  minHealthy: "51%"

In a cluster of six workers, the default NHC will remediate a node only if at least four out of six are healthy. The percentage calculation always rounds up: 51% of 6 is 3.06, which rounds up to 4.

One can specify a fixed number instead.
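As a minimal sketch of the fixed-number form, the fragment below sets the threshold to an absolute node count instead of a percentage. The value 4 is only illustrative; it happens to match what "51%" resolves to in the six-worker example.

```yaml
spec:
  # absolute number of healthy nodes required before remediation is attempted
  # (illustrative value, not a recommended default)
  minHealthy: 4
```

A fixed number does not scale with the cluster, so it may need revisiting when nodes are added to or removed from the selection set.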

One special feature, exclusive to OpenShift at the moment, is that while the cluster is being updated (upgraded or downgraded), remediation is skipped until the ClusterVersion reports a non-progressing state again.

Another way to pause remediation for a certain CR is to add an entry to its 'pauseRequests' list. Remediation happens only when that list is empty, but adding an entry does not stop or cancel any in-flight remediations:

pauseRequests:
  - "paused by some process"

Remediation

NHC Operator uses the External Remediation API, which was established in a joint effort with Ericsson in the Kubernetes community. Essentially, each NHC has an object template from which it derives the specific remediation resource to create. When all checks pass, the controller creates a new custom resource using the configured type:

spec:
  # mandatory
  remediationTemplate:
    kind: PoisonPillRemediationTemplate
    apiVersion: medik8s.io/v1alpha1
    name: poison-pill-default-template
    namespace: openshift-operators
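To make the template-to-resource step more concrete, here is a rough sketch of what a remediation resource derived from the template above might look like. Everything in it is an assumption made for illustration: the kind (the template kind without the "Template" suffix), the node name `worker-1`, the choice of namespace, and the empty spec. The actual object created by the controller may differ.

```yaml
# Illustrative sketch only; field values are assumptions, not controller output
apiVersion: medik8s.io/v1alpha1     # mirrors the apiVersion used for the template above
kind: PoisonPillRemediation         # assumed: template kind minus the "Template" suffix
metadata:
  name: worker-1                    # hypothetical node name identifying the unhealthy node
  namespace: openshift-operators    # assumed: same namespace as the template
spec: {}                            # assumed: filled in from the template's contents
```

The remediation provider (here, Poison Pill) would then act on this resource, which is what keeps NHC itself independent of any particular remediation mechanism.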

It is worth mentioning that the node-problem-detector can be used to create more refined node conditions and help trigger remediation in more elaborate situations. Please refer to the node-problem-detector project page for more information on how to define custom NodeConditions.
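As a sketch of how such a condition could feed into NHC, the fragment below adds a third entry to the default unhealthyConditions. It assumes a node condition of type `KernelDeadlock` is being reported on the nodes (for example, by a node-problem-detector deployment); the condition type and the 120s duration are illustrative choices, not defaults of the operator.

```yaml
spec:
  unhealthyConditions:
    # defaults from the auto-created CR
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
    # assumed custom condition, e.g. set by node-problem-detector;
    # the duration is an arbitrary illustrative value
    - type: KernelDeadlock
      status: "True"
      duration: 120s
```

Note that for a problem-style condition like this one, the unhealthy status is "True", the opposite of the "Ready" condition.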

In the future, we expect to see more specialized remediators that handle certain use cases and/or special infrastructure. One example is the Machine Deletion remediator being developed upstream.

Conclusions

OpenShift 4.9 offers the Node Health Check Operator as a tech-preview operator in the marketplace. Out of the box, it automatically reboots an unhealthy node and brings stateful and stateless workloads safely back up in minutes.

References: