OCP Disaster Recovery Part 3: Recovering an OpenShift 4 IPI cluster With the Loss of Two Master Nodes

July 8, 2021 | Zanoni Maciel | 22-minute read

Continuing the OpenShift disaster recovery series, in this article I describe how to recover from the loss of two master nodes.

As mentioned in part 2 of this series covering single master node recovery, remember that in any disaster, a backup of the platform is critical for recovery.

So, before proceeding, please review part 1 of the disaster recovery series, where I explain the configuration and automatic procedure for generating ETCD backups.

With the backup of ETCD done, the next steps will be essential for a successful recovery.

NOTE: An OpenShift cluster can only be recovered if at least one intact master node remains. If you have lost all master nodes, the following steps will not work.

This solution has been tested on version 4.7 and later.

When you lose more than one master node, the OpenShift API will be completely offline. The following steps will be used for this recovery.
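
The API goes offline because ETCD needs a majority of its members, floor(n/2) + 1, to keep working. That is general ETCD behavior, not something measured on this cluster. A minimal sketch of the arithmetic follows; the function names are illustrative and not part of any OpenShift tooling:

```shell
#!/usr/bin/env bash
# Illustrative helpers; the names are hypothetical, not part of any OpenShift tooling.

# Majority needed for an ETCD cluster of N members: floor(N/2) + 1.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Succeeds (exit 0) when the surviving members still form a majority.
has_quorum() {
  local members=$1 healthy=$2
  [ "$healthy" -ge "$(quorum_size "$members")" ]
}

if has_quorum 3 1; then
  echo "3 members, 1 healthy: quorum kept"
else
  echo "3 members, 1 healthy: quorum lost"
fi
```

With three masters the quorum is two, so losing one master is survivable, while losing two leaves the remaining member unable to serve the API.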

Before we begin this example scenario, we will validate that the cluster is functional with all the machines in the deployment:

NOTE: For this article, we used an OpenShift 4.7.13 IPI cluster on vSphere.

$ oc get nodes
NAME                         STATUS   ROLES    AGE    VERSION
zmaciel-btqbk-master-0       Ready    master   82m   v1.20.0+df9c838
zmaciel-btqbk-master-1       Ready    master   82m   v1.20.0+df9c838
zmaciel-btqbk-master-2       Ready    master   82m   v1.20.0+df9c838
zmaciel-btqbk-worker-4227z   Ready    worker   71m   v1.20.0+df9c838
zmaciel-btqbk-worker-z4hjw   Ready    worker   71m   v1.20.0+df9c838

Validate that the machines are powered on:

$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}'
zmaciel-btqbk-master-0        poweredOn
zmaciel-btqbk-master-1        poweredOn
zmaciel-btqbk-master-2        poweredOn
zmaciel-btqbk-worker-4227z        poweredOn
zmaciel-btqbk-worker-z4hjw        poweredOn

And the cluster operators are available:

$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.13    True        False         False      61m
baremetal                                  4.7.13    True        False         False      81m
cloud-credential                           4.7.13    True        False         False      85m
cluster-autoscaler                         4.7.13    True        False         False      81m
config-operator                            4.7.13    True        False         False      81m
console                                    4.7.13    True        False         False      67m
csi-snapshot-controller                    4.7.13    True        False         False      81m
dns                                        4.7.13    True        False         False      80m
etcd                                       4.7.13    True        False         False      79m
image-registry                             4.7.13    True        False         False      74m
ingress                                    4.7.13    True        False         False      71m
insights                                   4.7.13    True        False         False      74m
kube-apiserver                             4.7.13    True        False         False      78m
kube-controller-manager                    4.7.13    True        False         False      78m
kube-scheduler                             4.7.13    True        False         False      78m
kube-storage-version-migrator              4.7.13    True        False         False      71m
machine-api                                4.7.13    True        False         False      76m
machine-approver                           4.7.13    True        False         False      81m
machine-config                             4.7.13    True        False         False      80m
marketplace                                4.7.13    True        False         False      80m
monitoring                                 4.7.13    True        False         False      70m
network                                    4.7.13    True        False         False      81m
node-tuning                                4.7.13    True        False         False      81m
openshift-apiserver                        4.7.13    True        False         False      55m
openshift-controller-manager               4.7.13    True        False         False      80m
openshift-samples                          4.7.13    True        False         False      73m
operator-lifecycle-manager                 4.7.13    True        False         False      80m
operator-lifecycle-manager-catalog         4.7.13    True        False         False      80m
operator-lifecycle-manager-packageserver   4.7.13    True        False         False      74m
service-ca                                 4.7.13    True        False         False      81m
storage                                    4.7.13    True        False         False      80m

Verifications

Let’s begin with the failure scenario. Two of the master machines (master-0 and master-1 in this example) no longer exist in the VMware environment.

This will result in the OpenShift API going completely offline.

$ oc login https://<api_URL>:6443
error: dial tcp XX.XX.XX.XX:6443 connect: connection refused - verify you have provided the correct host and port and that the server is currently running.
$ curl -k https://<api_URL>:6443
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to <api_URL>.com:6443
$ oc get nodes
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding
$ oc get clusteroperators
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding

The web console is also unavailable.

Recovering Failed Master Nodes

The first step is to SSH into the master node that is still functional:

$ ssh -i ~/.ssh/id_rsa core@XX.XX.XX.XX

Let's show the backup that will be used for recovery:

$ ls /home/core/backup/
snapshot_2021-06-04_194718.db  static_kuberesources_2021-06-04_194718.tar.gz
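
Before restoring, it may be worth confirming that the backup directory holds both files the restore relies on. The check below is a hypothetical pre-flight sketch, not part of the OpenShift tooling; the function name is illustrative and the file patterns are taken from the listing above:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; the function name is illustrative.
# Succeeds only if the directory holds both kinds of file seen in the
# backup listing: an ETCD snapshot and the static pod resources archive.
check_backup_dir() {
  local dir=$1
  local snapshots=( "$dir"/snapshot_*.db )
  local resources=( "$dir"/static_kuberesources_*.tar.gz )
  # An unmatched glob stays as the literal pattern, so test that the file exists.
  if [ ! -f "${snapshots[0]}" ]; then
    echo "no snapshot_*.db found in $dir" >&2
    return 1
  fi
  if [ ! -f "${resources[0]}" ]; then
    echo "no static_kuberesources_*.tar.gz found in $dir" >&2
    return 1
  fi
  echo "backup looks complete: ${snapshots[0]##*/} ${resources[0]##*/}"
}
```

On the surviving master it could be called as check_backup_dir /home/core/backup before any pod is stopped.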

NOTE: After any update of the OpenShift cluster, it is highly recommended to take a new ETCD backup. Restoring a backup taken on an earlier version requires additional steps to recover the cluster correctly.

Before performing the ETCD backup restore, it is necessary to stop the static control plane pods.

Stopping the ETCD pod:

$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp

Validate that the ETCD pod has stopped:

$ sudo crictl ps | grep etcd | grep -v operator

NOTE: When you run the above command, no output should be returned. If the ETCD pod is still running, you must wait until it stops.

Stopping the Kube APIServer pod:

$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp

Validate that the Kube APIServer pod has stopped:

$ sudo crictl ps | grep kube-apiserver | grep -v operator

NOTE: When you run the above command, no output should be returned. If the kube-apiserver pod still appears, you should wait until it stops.
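
The "wait until it stops" step can be scripted as a polling loop around the same crictl check. This is a minimal sketch; the function name, the retry count, and the delay are assumptions chosen for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll crictl until no container matching NAME is listed.
# Mirrors the manual check: crictl ps | grep NAME | grep -v operator
wait_for_container_stop() {
  local name=$1
  local attempts=${2:-60}   # assumed default: 60 polls
  local delay=${3:-5}       # assumed default: 5 seconds between polls
  local i
  for (( i = 0; i < attempts; i++ )); do
    # The pipeline succeeds only while a non-operator container still matches.
    if ! sudo crictl ps | grep "$name" | grep -qv operator; then
      echo "$name has stopped"
      return 0
    fi
    sleep "$delay"
  done
  echo "$name is still running after $attempts checks" >&2
  return 1
}
```

For example, wait_for_container_stop etcd after moving etcd-pod.yaml, and wait_for_container_stop kube-apiserver after moving kube-apiserver-pod.yaml.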

Move the ETCD data directory to a different location:

$ sudo mv /var/lib/etcd/ /tmp

With the backup located, the next step is to restore the ETCD backup to bring the cluster back online, so that the two lost master nodes can then be replaced.

$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
...stopping kube-apiserver-pod.yaml
...stopping kube-controller-manager-pod.yaml
...stopping kube-scheduler-pod.yaml
...stopping etcd-pod.yaml
Waiting for container etcd to stop
complete
Waiting for container etcdctl to stop
complete
Waiting for container etcd-metrics to stop
complete
Waiting for container kube-controller-manager to stop
.complete
Waiting for container kube-apiserver to stop
complete
Waiting for container kube-scheduler to stop
complete
starting restore-etcd static pod
starting kube-apiserver-pod.yaml
static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
starting kube-controller-manager-pod.yaml
static-pod-resources/kube-controller-manager-pod-8/kube-controller-manager-pod.yaml
starting kube-scheduler-pod.yaml
static-pod-resources/kube-scheduler-pod-6/kube-scheduler-pod.yaml

NOTE: Both the ETCD backup and restore scripts already exist on the master nodes by default. Running them requires root privileges.

After restoring the ETCD backup, it is necessary to restart the Kubelet service:

$ sudo systemctl restart kubelet.service

Validate that the kubelet service started correctly:

$ sudo systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-env.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
   Active: active (running) since Fri 2021-06-04 19:56:34 UTC; 24s ago
  Process: 184093 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
  Process: 184091 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
 Main PID: 184095 (kubelet)
    Tasks: 15 (limit: 101998)
   Memory: 75.0M
      CPU: 4.706s
   CGroup: /system.slice/kubelet.service
           └─184095 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --co>

After restoring the backup, communication with the OpenShift API will be possible:

$ oc get nodes
NAME                         STATUS     ROLES    AGE   VERSION
zmaciel-btqbk-master-0       NotReady   master   97m   v1.20.0+df9c838
zmaciel-btqbk-master-1       NotReady   master   97m   v1.20.0+df9c838
zmaciel-btqbk-master-2       Ready      master   97m   v1.20.0+df9c838
zmaciel-btqbk-worker-4227z   Ready      worker   86m   v1.20.0+df9c838
zmaciel-btqbk-worker-z4hjw   Ready      worker   86m   v1.20.0+df9c838
$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}'
zmaciel-btqbk-master-0        poweredOff
zmaciel-btqbk-master-1        poweredOff
zmaciel-btqbk-master-2        poweredOn
zmaciel-btqbk-worker-4227z        poweredOn
zmaciel-btqbk-worker-z4hjw        poweredOn
$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.13    True        True          False      75m
baremetal                                  4.7.13    True        False         False      95m
cloud-credential                           4.7.13    True        False         False      99m
cluster-autoscaler                         4.7.13    True        False         False      94m
config-operator                            4.7.13    True        False         False      95m
console                                    4.7.13    True        False         False      81m
csi-snapshot-controller                    4.7.13    True        False         False      94m
dns                                        4.7.13    True        False         True       94m
etcd                                       4.7.13    True        False         False      93m
image-registry                             4.7.13    True        False         False      88m
ingress                                    4.7.13    True        False         False      85m
insights                                   4.7.13    True        False         False      87m
kube-apiserver                             4.7.13    True        False         False      91m
kube-controller-manager                    4.7.13    True        False         False      92m
kube-scheduler                             4.7.13    True        False         False      92m
kube-storage-version-migrator              4.7.13    True        False         False      85m
machine-api                                4.7.13    True        False         False      90m
machine-approver                           4.7.13    True        False         False      95m
machine-config                             4.7.13    True        False         False      94m
marketplace                                4.7.13    True        False         False      93m
monitoring                                 4.7.13    False       True          True       2m30s
network                                    4.7.13    True        True          False      95m
node-tuning                                4.7.13    True        False         False      94m
openshift-apiserver                        4.7.13    True        False         False      69m
openshift-controller-manager               4.7.13    True        False         False      94m
openshift-samples                          4.7.13    True        False         False      86m
operator-lifecycle-manager                 4.7.13    True        False         False      94m
operator-lifecycle-manager-catalog         4.7.13    True        False         False      94m
operator-lifecycle-manager-packageserver   4.7.13    True        False         False      69s
service-ca                                 4.7.13    True        True          False      95m
storage                                    4.7.13    False       False         False      20s
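
To pick out the machines that need replacing, the machine listing can be filtered by instance state. This is a sketch under the assumption that the provider reports poweredOn for healthy machines, as the vSphere output here does; the function name is illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the node name of every Machine whose
# providerStatus.instanceState is not "poweredOn" (the value vSphere
# reports for healthy machines in this article).
list_powered_off_machines() {
  oc get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.status.nodeRef.name}{"\t"}{.status.providerStatus.instanceState}{"\n"}{end}' \
    | awk -F'\t' 'NF && $2 != "poweredOn" { print $1 }'
}
```

Against the state shown above, this would be expected to list zmaciel-btqbk-master-0 and zmaciel-btqbk-master-1.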

Before proceeding with the next steps, it will be necessary to recreate the master nodes to replace the machines that were lost.

NOTE: This procedure is only possible when the installation method is via IPI.

Get the machines from OpenShift to extract their settings:

$ oc get machines -n openshift-machine-api
NAME                         PHASE     TYPE   REGION   ZONE   AGE
zmaciel-btqbk-master-0       Running                          100m
zmaciel-btqbk-master-1       Running                          100m
zmaciel-btqbk-master-2       Running                          100m
zmaciel-btqbk-worker-4227z   Running                          94m
zmaciel-btqbk-worker-z4hjw   Running                          94m

Extract the settings from the machine that is still functional to recreate the new machines:

$ oc get machine zmaciel-btqbk-master-2 -n openshift-machine-api -o yaml > new_master3.yml
$ oc get machine zmaciel-btqbk-master-2 -n openshift-machine-api -o yaml > new_master4.yml

Adjust both files by removing the following fields:

The status section:

status:
  addresses:
  - address: 10.36.250.79
    type: InternalIP
  - address: 10.36.2.9
    type: InternalIP
  - address: fe80::4c67:9463:4dd1:4e11
    type: InternalIP
  - address: zmaciel-btqbk-master-2
    type: InternalDNS
  lastUpdated: "2021-06-04T20:26:31Z"
  nodeRef:
    kind: Node
    name: zmaciel-btqbk-master-2
    uid: 07cf61ab-e718-472d-86b2-7f82546d023b
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2021-06-04T18:27:55Z"
      lastTransitionTime: "2021-06-04T18:27:55Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: 42011945-f400-ccd8-1f65-688e44a2bafa
    instanceState: poweredOn

The spec.providerID field:

spec:
  metadata: {}
  providerID: vsphere://420156fd-d64a-ac6c-fcd0-0bb30524d146

The metadata.annotations field:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: poweredOn
...

The metadata.generation field:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
...
  generation: 2
...

The metadata.resourceVersion field:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
...
  resourceVersion: "871091"
...

The metadata.uid field:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
...
  uid: 310d6108-b46c-4d3c-a61e-95fa3f2ad07a
...

Once complete, you will need to update two fields in each file (use zmaciel-btqbk-master-4 in new_master4.yml):

Change the metadata.name field to a new name:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
...
  name: zmaciel-btqbk-master-3
...

Update the metadata.selfLink field to match the new name:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
...
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/zmaciel-btqbk-master-3
...
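
The field removals and the two updates described above could also be scripted instead of edited by hand. The sketch below assumes jq is installed on the workstation and works on JSON output rather than YAML; the function name is hypothetical, and oc apply accepts JSON files as well:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a Machine object as JSON on stdin and write a
# cleaned copy, renamed to NEW_NAME, on stdout. Assumes jq is installed.
clean_machine_json() {
  local new_name=$1
  jq --arg name "$new_name" '
    del(.status,
        .spec.providerID,
        .metadata.annotations,
        .metadata.generation,
        .metadata.resourceVersion,
        .metadata.uid)
    | .metadata.name = $name
    | .metadata.selfLink =
        "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/\($name)"
  '
}

# Possible usage (not run here):
#   oc get machine zmaciel-btqbk-master-2 -n openshift-machine-api -o json \
#     | clean_machine_json zmaciel-btqbk-master-3 > new_master3.json
```

Reviewing the generated file before applying it is still advisable, since this only removes the fields listed in this article.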

Create the new machines using the adjusted files:

$ oc apply -f new_master3.yml
machine.machine.openshift.io/zmaciel-btqbk-master-3 created
$ oc apply -f new_master4.yml
machine.machine.openshift.io/zmaciel-btqbk-master-4 created

$ oc get machines -n openshift-machine-api
NAME                         PHASE          TYPE   REGION   ZONE   AGE
zmaciel-btqbk-master-0       Running                               114m
zmaciel-btqbk-master-1       Running                               114m
zmaciel-btqbk-master-2       Running                               114m
zmaciel-btqbk-master-3       Provisioned                           49s
zmaciel-btqbk-master-4       Provisioning                          24s
zmaciel-btqbk-worker-4227z   Running                               107m
zmaciel-btqbk-worker-z4hjw   Running                               107m

$ oc get machines -n openshift-machine-api
NAME                         PHASE     TYPE   REGION   ZONE   AGE
zmaciel-btqbk-master-0       Running                          117m
zmaciel-btqbk-master-1       Running                          117m
zmaciel-btqbk-master-2       Running                          117m
zmaciel-btqbk-master-3       Running                          4m40s
zmaciel-btqbk-master-4       Running                          4m15s
zmaciel-btqbk-worker-4227z   Running                          111m
zmaciel-btqbk-worker-z4hjw   Running                          111m

$ oc get nodes
NAME                         STATUS     ROLES    AGE    VERSION
zmaciel-btqbk-master-0       NotReady   master   116m   v1.20.0+df9c838
zmaciel-btqbk-master-1       NotReady   master   116m   v1.20.0+df9c838
zmaciel-btqbk-master-2       Ready      master   116m   v1.20.0+df9c838
zmaciel-btqbk-master-3       Ready      master   89s    v1.20.0+df9c838
zmaciel-btqbk-master-4       Ready      master   87s    v1.20.0+df9c838
zmaciel-btqbk-worker-4227z   Ready      worker   105m   v1.20.0+df9c838
zmaciel-btqbk-worker-z4hjw   Ready      worker   105m   v1.20.0+df9c838

Delete the unhealthy machines:

$ oc delete machine zmaciel-btqbk-master-0 -n openshift-machine-api
machine.machine.openshift.io "zmaciel-btqbk-master-0" deleted
$ oc delete machine zmaciel-btqbk-master-1 -n openshift-machine-api
machine.machine.openshift.io "zmaciel-btqbk-master-1" deleted

Validate the removal of the machines:

$ oc get nodes
NAME                         STATUS                        ROLES    AGE     VERSION
zmaciel-btqbk-master-0       NotReady,SchedulingDisabled   master   117m    v1.20.0+df9c838
zmaciel-btqbk-master-1       NotReady,SchedulingDisabled   master   117m    v1.20.0+df9c838
zmaciel-btqbk-master-2       Ready                         master   117m    v1.20.0+df9c838
zmaciel-btqbk-master-3       Ready                         master   2m47s   v1.20.0+df9c838
zmaciel-btqbk-master-4       Ready                         master   2m45s   v1.20.0+df9c838
zmaciel-btqbk-worker-4227z   Ready                         worker   106m    v1.20.0+df9c838
zmaciel-btqbk-worker-z4hjw   Ready                         worker   106m    v1.20.0+df9c838

$ oc get machines -n openshift-machine-api
NAME                         PHASE      TYPE   REGION   ZONE   AGE
zmaciel-btqbk-master-0       Deleting                          120m
zmaciel-btqbk-master-1       Deleting                          120m
zmaciel-btqbk-master-2       Running                           120m
zmaciel-btqbk-master-3       Running                           6m50s
zmaciel-btqbk-master-4       Running                           6m25s
zmaciel-btqbk-worker-4227z   Running                           113m
zmaciel-btqbk-worker-z4hjw   Running                           113m

$ oc get nodes
NAME                         STATUS   ROLES    AGE     VERSION
zmaciel-btqbk-master-2       Ready    master   119m    v1.20.0+df9c838
zmaciel-btqbk-master-3       Ready    master   4m37s   v1.20.0+df9c838
zmaciel-btqbk-master-4       Ready    master   4m35s   v1.20.0+df9c838
zmaciel-btqbk-worker-4227z   Ready    worker   108m    v1.20.0+df9c838
zmaciel-btqbk-worker-z4hjw   Ready    worker   108m    v1.20.0+df9c838

$ oc get machines -n openshift-machine-api
NAME                         PHASE     TYPE   REGION   ZONE   AGE
zmaciel-btqbk-master-2       Running                          122m
zmaciel-btqbk-master-3       Running                          9m19s
zmaciel-btqbk-master-4       Running                          8m54s
zmaciel-btqbk-worker-4227z   Running                          116m
zmaciel-btqbk-worker-z4hjw   Running                          116m

After replacing the master nodes, we will need to force the redeployment of the ETCD, Kube APIServer, Kube Controller Manager, and Kube Scheduler.

Force ETCD redeployment:

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched

Verify all nodes are updated to the latest revision:

$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
1 nodes are at revision 5; 2 nodes are at revision 6

Proceed to the next step only when all ETCD pods are at the same revision. This process can take several minutes to complete.

Example output:

$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 6
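
Rather than re-running the check by hand, the wait can be scripted as a polling loop on the same condition. This is a minimal sketch; the function name, retry count, and delay are assumptions, and it queries only the reason field of the NodeInstallerProgressing condition:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll an operator resource (etcd, kubeapiserver,
# kubecontrollermanager) until NodeInstallerProgressing reports
# AllNodesAtLatestRevision.
wait_for_latest_revision() {
  local kind=$1
  local attempts=${2:-60}   # assumed default: 60 polls
  local delay=${3:-30}      # assumed default: 30 seconds between polls
  local reason i
  for (( i = 0; i < attempts; i++ )); do
    reason=$(oc get "$kind" cluster \
      -o=jsonpath='{.status.conditions[?(@.type=="NodeInstallerProgressing")].reason}' 2>/dev/null)
    if [ "$reason" = "AllNodesAtLatestRevision" ]; then
      echo "$kind: all nodes are at the latest revision"
      return 0
    fi
    sleep "$delay"
  done
  echo "$kind: still rolling out after $attempts checks" >&2
  return 1
}
```

The same function should apply to the kubeapiserver and kubecontrollermanager resources, for example wait_for_latest_revision kubeapiserver.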

Force KubeAPIServer redeployment:

$ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubeapiserver.operator.openshift.io/cluster patched

Verify all nodes are updated to the latest revision:

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
2 nodes are at revision 9; 1 nodes are at revision 10

Proceed to the next step only when all Kube APIServer pods are at the same revision. This process can take several minutes to complete.

Example output:

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 10

Force KubeControllerManager redeployment:

$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubecontrollermanager.operator.openshift.io/cluster patched

Verify all nodes are updated to the latest revision:

$ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
3 nodes are at revision 8; 0 nodes have achieved new revision 9

Proceed to the next step only when all Kube Controller Manager pods are at the same revision.

Example output:

$ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 9
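
The list of components to redeploy also names Kube Scheduler, but no scheduler command appears in this walkthrough. Assuming the scheduler operator resource is kubescheduler and follows the same pattern as the three patches above, the step could be wrapped as follows; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the patch command used for the other
# components. KIND is the operator resource, e.g. etcd, kubeapiserver,
# kubecontrollermanager, or kubescheduler (assumed to follow the same pattern).
force_redeploy() {
  local kind=$1
  oc patch "$kind" cluster --type=merge \
    -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}'
}

# Possible usage for the scheduler (not run here):
#   force_redeploy kubescheduler
```

The NodeInstallerProgressing check used for the other components should then work with oc get kubescheduler in the same way.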

Cluster Validation After Recovery

After these procedures are completed, check the ETCD, Kube APIServer, Kube Controller Manager, and Kube Scheduler pods, as well as all cluster operators. If they are all running and available, your cluster has been recovered.

$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
etcd-zmaciel-btqbk-master-2                3/3     Running     0          21m
etcd-zmaciel-btqbk-master-3                3/3     Running     0          19m
etcd-zmaciel-btqbk-master-4                3/3     Running     0          20m

$ oc get pods -n openshift-kube-apiserver | grep kube-apiserver
kube-apiserver-zmaciel-btqbk-master-2       5/5     Running     0          14m
kube-apiserver-zmaciel-btqbk-master-3       5/5     Running     0          11m
kube-apiserver-zmaciel-btqbk-master-4       5/5     Running     0          7m43s

$ oc get pods -n openshift-kube-controller-manager | grep kube-controller-manager
kube-controller-manager-zmaciel-btqbk-master-2   4/4     Running     0          7m21s
kube-controller-manager-zmaciel-btqbk-master-3   4/4     Running     0          6m31s
kube-controller-manager-zmaciel-btqbk-master-4   4/4     Running     0          5m40s

$ oc get pods -n openshift-kube-scheduler | grep openshift-kube-scheduler
openshift-kube-scheduler-zmaciel-btqbk-master-2   3/3     Running     9          136m
openshift-kube-scheduler-zmaciel-btqbk-master-3   3/3     Running     0          32m
openshift-kube-scheduler-zmaciel-btqbk-master-4   3/3     Running     2          30m

Verification of cluster operators:

$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.13    True        False         False      125m
baremetal                                  4.7.13    True        False         False      146m
cloud-credential                           4.7.13    True        False         False      149m
cluster-autoscaler                         4.7.13    True        False         False      145m
config-operator                            4.7.13    True        False         False      146m
console                                    4.7.13    True        False         False      132m
csi-snapshot-controller                    4.7.13    True        False         False      145m
dns                                        4.7.13    True        False         False      145m
etcd                                       4.7.13    True        False         False      144m
image-registry                             4.7.13    True        False         False      139m
ingress                                    4.7.13    True        False         False      136m
insights                                   4.7.13    True        False         False      138m
kube-apiserver                             4.7.13    True        False         False      142m
kube-controller-manager                    4.7.13    True        False         False      143m
kube-scheduler                             4.7.13    True        False         False      143m
kube-storage-version-migrator              4.7.13    True        False         False      136m
machine-api                                4.7.13    True        False         False      141m
machine-approver                           4.7.13    True        False         False      146m
machine-config                             4.7.13    True        False         False      28m
marketplace                                4.7.13    True        False         False      144m
monitoring                                 4.7.13    True        False         False      28m
network                                    4.7.13    True        False         False      146m
node-tuning                                4.7.13    True        False         False      145m
openshift-apiserver                        4.7.13    True        False         False      120m
openshift-controller-manager               4.7.13    True        False         False      145m
openshift-samples                          4.7.13    True        False         False      137m
operator-lifecycle-manager                 4.7.13    True        False         False      145m
operator-lifecycle-manager-catalog         4.7.13    True        False         False      145m
operator-lifecycle-manager-packageserver   4.7.13    True        False         False      52m
service-ca                                 4.7.13    True        False         False      146m
storage                                    4.7.13    True        False         False      46m
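
Scanning a long operator table by eye is error-prone, so the check can be reduced to a list of operators that are not in the healthy state. This is a sketch that assumes the six-column layout shown above with the VERSION column populated; the function names are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every cluster operator that is not
# Available=True, Progressing=False, Degraded=False.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
list_unhealthy_operators() {
  local table
  table=$(oc get clusteroperators --no-headers) || return 1
  printf '%s\n' "$table" \
    | awk 'NF && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Exit status 0 only when the API answered and the list is empty.
all_operators_healthy() {
  local unhealthy
  unhealthy=$(list_unhealthy_operators) || return 1
  [ -z "$unhealthy" ]
}
```

An empty list from list_unhealthy_operators corresponds to the fully recovered table above.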

Final Thoughts

This concludes part 3 on OpenShift disaster recovery. Please note that all the checks in this article are important, because they show you the true status of the cluster.

I am currently researching how to recover a cluster when all masters are lost, but I have not yet succeeded. I therefore recommend never running all master nodes on the same VMware host or in the same zone (AWS/Azure/GCP).

I hope I have contributed to your knowledge.