Red Hat Advanced Cluster Management for Kubernetes: High availability and disaster recovery (part 3)

2023 年 10 月 20 日Andre Rocha10 分 (読了時間の目安)

This article examines tasks related to creating and restoring Red Hat Advanced Cluster Management for Kubernetes backups.

Read parts 1 and 2 here:

Red Hat Advanced Cluster Management for Kubernetes: High availability and disaster recovery (part 1)
Red Hat Advanced Cluster Management for Kubernetes: High availability and disaster recovery (part 2)

Backup schedule

Primary Hub Cluster

A BackupSchedule feature will enable backups to run at scheduled intervals. You must create a backup schedule on your primary hub cluster.

You can see an example of defining a BackupSchedule resource below:

$ cat schedule-acm.yaml

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: 0 */2 * * *    # Run backup every 2 hours
  veleroTtl: 120h                # Optional: delete backups after 120h. If not specified, default is 720h
  useManagedServiceAccount: true # Auto Import Managed Clusters

Adjust the veleroSchedule and veleroTtl values as needed to satisfy any retention and retrieval goals.

Check the scheduled backup:

$ oc get BackupSchedule -n open-cluster-management-backup

NAME           PHASE     MESSAGE
schedule-acm   Enabled   Velero schedules are enabled

Verify that initial backup continuous restore (CR)s were created:

$ oc get backups -n open-cluster-management-backup

NAME                                            AGE
acm-credentials-schedule-20230704201006         22m
acm-managed-clusters-schedule-20230704201006    22m
acm-resources-generic-schedule-20230704201006   22m
acm-resources-schedule-20230704201006           22m
acm-validation-policy-schedule-20230704201006   22m

And finally, check if the backups are being completed:

$ oc get backups -n open-cluster-management-backup -o custom-columns="NAME":.metadata.name,"PHASE":.status.phase

NAME                                            PHASE
acm-credentials-schedule-20230704201006         Completed
acm-managed-clusters-schedule-20230704201006    Completed
acm-resources-generic-schedule-20230704201006   Completed
acm-resources-schedule-20230704201006           Completed
acm-validation-policy-schedule-20230704201006   Completed

Many backups will exist on the object storage side:

Screenshot of a terminal window displaying a tree of hubcluster backups

Backup-restore Red Hat Advanced Cluster Management Policy

Go to Red Hat Advanced Cluster Management and review the policy named backup-restore-enabled to identify violations and templates used. By default, the Red Hat Advanced Cluster Management Backup and Restore Operator installs a policy to check the status of backups and the operator's integrity.

Several templates exist for checking different status types and settings to ensure backups are up and running. A complete list can be consulted here.

Example of a working policy:

Screenshot of a working "backup-restore-enabled" policy working

You can see a full example without violations below:

Screenshot of a full example without violations

Backup-restore process on standby hub clusters

Standby Hub Cluster

Create a custom resource for sync restore purposes on your standby Red Hat Advanced Cluster Management. Remember that the standby hub cluster should already have Red Hat Advanced Cluster Management installed and any necessary customizations already made.

Use the following checklist to help you remember everything on your standby cluster:

Install the Red Hat Advanced Cluster Management Operator
Create the MultiClusterHub
Activate the cluster-backup component in the MultiClusterHub
Create the S3 storage secret
Enable the ManagedServiceAccount component on the MultiClusterEngine
Enable DataProtectionApplication (DPA)

Sync restore CR

Data must be continuously and passively restored on standby hub clusters. This function is activated through a restore CR and also by setting the parameters syncRestoreWithNewBackups and restoreSyncInterval.

Go to your standby hub cluster and create the restore passive sync, as seen below:

$ cat restore-standby-acm-passive-sync.yaml

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-passive-sync
  namespace: open-cluster-management-backup
spec:
  syncRestoreWithNewBackups: true # restore when there are new backups
  restoreSyncInterval: 10m        # checks for backups every 10 minutes
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: skip
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest

Note: The veleroManagedClustersBackupName parameter must always remain set to skip in CR synced restore.

Specifying a backup name here would activate managed clusters on the standby hub cluster, which is undesirable at this time.

Leave the parameters veleroCredentialsBackupName and veleroResourcesBackupName set to latest in these CRs.

Adjust the restoreSyncInterval parameter values to match your desired restore and recovery objectives.

Check if the resource was created:

$ oc get -n open-cluster-management-backup Restore restore-acm-passive-sync

NAME                       PHASE     MESSAGE
restore-acm-passive-sync   Enabled   Velero restores have run to completion, restore will continue to sync with new backups

Note: This item depends on at least one successful backup so that there is information to restore.

Validating restores

Make sure the restore CR is enabled and restores have completed on the standby hub cluster. Use this command:

$ oc get restore -n open-cluster-management-backup

NAME                       PHASE     MESSAGE
restore-acm-passive-sync   Running   Velero restore restore-acm-passive-sync-acm-resources-schedule-20230704220045 is currently executing
restore-acm-passive-sync   Enabled   Velero restores have run to completion, restore will continue to sync with new backups

You should see all of your managed cluster's restore data on the object storage, as seen below:

Screenshot of a managed cluster's restore data in object storage

Here are the key details about the features that should exist in every Red Hat Advanced Cluster Management by now:

Primary Hub Cluster

Backup schedule

Standby Hub Cluster

Sync restore CR

That's it for this chapter. The next section covers activating a standby hub cluster as a primary hub cluster.

Activating a standby Red Hat Advanced Cluster Management cluster

This part looks at tasks related to promoting a Standby cluster to Primary and the old Primary to Standby.

This process consists of defining a standby Red Hat Advanced Cluster Management cluster as the structure's new primary, and making the OpenShift clusters managed by Red Hat Advanced Cluster Management start communicating with the new primary.

Enabling managed OpenShift clusters

When the primary hub cluster is down and a failover must be performed, activating the resources of the managed OpenShift clusters in the standby Red Hat Advanced Cluster Management cluster that is elected as the new primary is necessary.

Once activated, managed OpenShift clusters will start communicating with the new primary hub cluster for information regarding applications, policies, etc.

The CR that was created earlier only restores passive data. Therefore, it is necessary to stop it and create restore-once CRs to recover activation data from managed OpenShift clusters.

Important: These steps can only be performed when the primary hub cluster is completely unavailable. Since managed OpenShift clusters must only communicate with a single Red Hat Advanced Cluster Management hub cluster at a time, the replaced hub cluster must be completely inaccessible. Once confirmed, proceed with activating managed OpenShift clusters in the new primary hub cluster.

Standby pre-activation scenario

Please pay close attention to the next paragraphs.

At this point, your Primary hub cluster should be similar to this one:

Primary Hub Cluster

About the managed clusters:

Cluster ocptesting was created by Red Hat Advanced Cluster Management (by using the Hive API).
All others were manually imported by using all the methods available (kubeconfig, token, and command).
- The test environment was prepared like this on purpose to verify that all managed clusters will be correctly transferred to the standby hub cluster.

And your Standby hub cluster should look like this:

Standby Hub Cluster

Activation process

Be careful not to run the commands on the wrong clusters!

Steps for the Primary Hub Cluster

Verify that the backup schedule is enabled on the primary hub cluster:

$ oc get -n open-cluster-management-backup BackupSchedule schedule-acm
NAME           PHASE     MESSAGE
schedule-acm   Enabled   Velero schedules are enabled

Check the last backup status on the primary hub cluster:

$ oc get backups -n open-cluster-management-backup -o custom-columns="NAME":.metadata.name,"PHASE":.status.phase

Backups about credentials, managed-clusters, resources-generic, resources, and validation must be in the Completed stage, as seen below:

Screenshot of a terminal showing everything in a "Completed" stage

Disable the BackupSchedule:

$ oc delete -n open-cluster-management-backup BackupSchedule schedule-acm

Power down the Primary Hub Cluster. It must be completely inaccessible (you are simulating a failure).

Configure the Standby Hub Cluster

1. Check if Velero restores have run to completion:

$ oc get restore -n open-cluster-management-backup restore-acm-passive-sync

2. Delete the existing rolling restore on the Standby Hub Cluster selected as the new active cluster:

$ oc delete restore restore-acm-passive-sync -n open-cluster-management-backup

3. Create a new restore CR to activate the managed clusters on the new hub cluster. See the following example:

$ cat restore-acm-passive-activate.yaml

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-passive-activate
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest
  veleroCredentialsBackupName: skip
  veleroResourcesBackupName: skip

$ oc create -f restore-acm-passive-activate.yaml
restore.cluster.open-cluster-management.io/restore-acm-passive-activate created

Note that the veleroManagedClustersBackupName will be set to latest so that the latest available restore will be used on the new hub cluster. The other parameters are set to skip as the data contained in them has already been restored via restore CR.

The cleanupBeforeRestore parameter is important as it will ensure that only the most recent information from the specified backup is restored on the hub cluster. It will clear all data on the new hub before restoring it to a state consistent with the most recent backup version.

Once this resource is created, the data will be immediately restored in the new hub cluster, and the managed OpenShift clusters will be imported and shown as Ready in the dashboard.

New Primary Hub Cluster

Check the restore activation status:

$ oc get restore -n open-cluster-management-backup
restore-acm-passive-activate   Finished   All Velero restores have run successfully

Note: It may take a few minutes for the managed clusters to display the Ready status as the resources in the open-cluster-management-agent and open-cluster-management-agent-addon namespaces will be reconfigured.

4. Create the backup schedule:

$ oc create -f schedule-acm.yaml

$ oc get BackupSchedule -n open-cluster-management-backup -w
NAME           PHASE     MESSAGE
schedule-acm   New       Velero schedules are initialized
schedule-acm   Enabled   Velero schedules are enabled

Possible problems

If some of your clusters fail to import automatically, there may be a communication problem (dial tcp timeout).

Screenshot of a terminal window showing a failed import

$ kubectl logs -n open-cluster-management-agent pod/klusterlet-registration-agent-7cb7b98b95-dd5hz
Trace[1455841955]: ---"Objects listed" error:Get "https://api.homelab.rhbrlabs.com:6443/apis/certificates.k8s.io/v1/certificatesigningrequests?limit=500&resourceVersion=0": dial tcp 10.1.224.110:6443: i/o timeout 30002ms (20:39:00.451)

Check your firewall, name resolution, and routing.

New Primary Hub Cluster

After the standby hub cluster activation is complete, it will be set as the new primary hub cluster. The new primary's dashboard should contain complete information about the managed clusters.

At this point, the switching process is complete!

Screenshot of a Grand Theft Auto game "Mission Passed! Respect+" mission completion notification

GTA Feelings :-)

Auto Import feature limitations

There are limitations with the above approach that can result in the managed cluster not automatically importing when moving to a new hub cluster. I recommend reading and familiarizing yourself with the information below.

These situations that can result in the managed cluster not being imported:

1. Since the auto import operation uses the cluster import function through the auto import secret feature, the hub cluster must be able to access the managed cluster to perform the import operation.

2. Since the auto-import-secret created on the restore cluster uses the ManagedServiceAccount token to connect to the managed cluster, the latter must also provide the kube apiserver information. Non-OpenShift Kubernetes clusters typically do not have this configuration. The apiserver must be defined in the ManagedCluster resource, as shown in the following example:

apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: managed-cluster-name
spec:
  hubAcceptsClient: true
  leaseDurationSeconds: 60
  managedClusterClientConfigs:
      url: <apiserver>

Only OpenShift clusters have the apiserver configuration set automatically when the cluster is imported into the hub.

Note: For any other type of managed clusters, such as EKS clusters, this information must be set manually by the user; otherwise, the automatic import feature will ignore these clusters and they will display Import Pending when moved to the restored cluster hub.

3. The backup controller regularly searches for imported managed clusters and creates the ManagedServiceAccount resource in the managed cluster namespace as soon as that managed cluster is found. This should trigger a token creation on the managed cluster. If the managed cluster is not accessible when this operation is performed (e.g., managed cluster hibernating or down), the ManagedServiceAccount will be unable to create the token. As a result, if a hub backup is performed at this time, the backup will not contain a token to automatically import the managed cluster.

4. It is possible that a secret ManagedServiceAccount is not included in a backup if the backup schedule is executed before the backup label is defined in the secret ManagedServiceAccount. ManagedServiceAccount secrets do not have the cluster.open-cluster-management.io/backup label defined during their creation. For this reason, the backup controller regularly looks for ManagedServiceAccount secrets in managed cluster namespaces and adds the backup label if not found.

5. In case the secret token auto-import-account is valid and it is backed up, but the restore operation occurs at a time when the token available with the backup has already expired, the auto-import operation will fail. If this happens, the status of the restore.cluster.open-cluster-management.io resource should report the invalid token issue for each managed cluster in this situation.

Closing notes

At the time of writing, the latest release of Red Hat Advanced Cluster Management is version 2.8. For this version, a Red Hat Advanced Cluster Management standby hub cluster promoted to primary permanently replaces the old cluster that acted in this role. The old cluster cannot be brought back online, as Disaster Recovery (DR) does not work this way. DR processes are performed when the primary site is lost. A function to have two Red Hat Advanced Cluster Management clusters online acting as Active/Standby is still under development.

Wrap up

Red Hat Advanced Cluster Management is a game-changer for organizations seeking to simplify the management of Kubernetes clusters across hybrid cloud environments. With its centralized management, policy-based governance, streamlined application lifecycle management, and built-in observability, Red Hat Advanced Cluster Management empowers administrators to unlock the full potential of Kubernetes while reducing complexity and enhancing operational efficiency. By adopting Red Hat Advanced Cluster Management, organizations can confidently scale their Kubernetes deployments, accelerate application delivery, and drive innovation in today's fast-paced digital landscape.

This article showed how to prepare an HA/DR framework to keep your Red Hat Advanced Cluster Management environment safe from failures.

All the best, and I'll see you next time!

Try Red Hat Advanced Cluster Management for Kubernetes

執筆者紹介

Andre Rocha

Senior Cloud Consultant

Andre Rocha is a Consultant at Red Hat focused on OpenStack, OpenShift, RHEL and other Red Hat products. He has been at Red Hat since 2019, previously working as DevOps and SysAdmin for private companies.

Read full bio

類似検索

ブログ投稿

チャンネル別に見る

すべてのチャンネルを見る