Imagine that your company relies on Argo CD to manage Kubernetes configurations, and a disaster wipes out all of your Argo CD data. Without a disaster recovery strategy in place, you face significant downtime and loss of revenue. Argo CD is a popular GitOps tool for managing Kubernetes configurations. It stores important data and manifests in Kubernetes custom resources, so the availability and integrity of that data is essential. This is why it's critical to protect your Argo CD data and ensure business continuity in the event of a disaster.
Red Hat Advanced Cluster Management for Kubernetes (RHACM) provides a comprehensive disaster recovery strategy for Kubernetes clusters, including support for backing up Argo CD data and manifests using the OpenShift API for Data Protection (OADP) operator. This strategy is an efficient and reliable way to protect and restore Kubernetes cluster data, so you can recover Argo CD data and manifests quickly during a disaster.
The strategy also supports replicating data and manifests to a secondary site, which adds a layer of protection against disasters that might impact the primary site. Depending on the organization's needs, this replication can use periodic backups, real-time replication, or continuous synchronization. This supports business continuity and minimizes downtime in the event of a disaster.
This blog details how to implement a disaster recovery strategy for Argo CD using RHACM and the OADP operator. I cover the components you must back up and offer a high-level architecture diagram. I also provide step-by-step instructions to configure an active-passive RHACM hub cluster setup and to use RHACM governance, acting as an admission controller, to block your passive cluster from syncing applications. By the end of this post, you will better understand how to protect your Argo CD data and maintain business continuity in a disaster.
The following image is the design process for what I cover in this blog:
How does it work?
The cluster backup and restore operator is a component that runs on the hub cluster, and it relies on the OADP Operator to establish a connection with a backup storage location on the same cluster. In addition, the OADP Operator is responsible for installing Velero, which backs up and restores user-created resources on the hub cluster. Use the cluster-backup-chart file to install the cluster backup and restore operator. However, in RHACM version 2.5 or later, you can enable the installation of the cluster backup and restore operator chart by setting the cluster-backup option to true on the MultiClusterHub resource. View the following image for an example:
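As a rough sketch of that setting, the following MultiClusterHub fragment enables the cluster-backup component. It assumes the default resource name (multiclusterhub) and namespace (open-cluster-management), and that your RHACM version exposes the option under spec.overrides.components. Check the MultiClusterHub resource on your own hub before applying anything like it.

```yaml
# Sketch only: name and namespace are assumed defaults, not taken from this post.
apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  overrides:
    components:
      # Enables installation of the cluster backup and restore operator chart
      - name: cluster-backup
        enabled: true
```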
When you install the cluster backup and restore operator chart, the OADP Operator is automatically installed in the same namespace as the backup chart. If you previously installed and used the OADP Operator on your hub cluster, uninstall it. The backup chart now works with the operator already installed in the chart namespace. This change does not affect your old backups and previous work. You can continue to use the same storage location for the DataProtectionApplication resource, which is owned by the OADP Operator and installed with the backup chart. This means you have access to the same backup data as before. The only difference is that Velero backup resources are now loaded in the new OADP Operator namespace on your hub cluster. By default, OADP backs up all custom resource definitions from the following API groups:
Note: The OADP Operator does not back up the Argo CD repositories, ConfigMaps, and secrets, which are necessary for the disaster recovery configuration (users, repositories for applications, access to external clusters, and so on). Therefore, you must label these resources to make sure they are part of the backed-up components.
The operator defines two resources: BackupSchedule.cluster.open-cluster-management.io and Restore.cluster.open-cluster-management.io. These resources are used to configure RHACM backup schedules and process and restore backups, respectively. The operator also sets up the necessary options to back up remote cluster configurations and other hub cluster resources you might want to restore. Refer to the following image of the backup and restore architecture:
The following image depicts a scenario where an active hub cluster manages remote clusters and performs regular backups of hub cluster data. In a disaster or outage, the passive hub cluster can restore this data. The exception is the activation data of the managed clusters, which is restored only when the passive hub cluster takes over, because restoring it transfers the managed clusters to the passive hub cluster. The passive hub cluster can restore the passive data either continuously or as a one-time operation, depending on the specific requirements of the situation.
The Restore.cluster.open-cluster-management.io resource can restore passive data continuously or as a one-time operation. Before restoring passive data, check that the passive cluster contains a policy configured to watch existing AppProject.argoproj.io resources and modify them to contain the SyncWindow block type. That way, Argo CD on the passive cluster does not sync any applications, which avoids race conditions with the active cluster. After you restore the passive data, each AppProject custom resource goes through the following stages:
To use the passive cluster as the primary cluster, remove the policy and create a BackupSchedule.cluster.open-cluster-management.io resource.
Before creating the BackupSchedule.cluster.open-cluster-management.io resource, be sure to meet the following prerequisites on both the active and passive clusters:
- OADP Operator must be installed.
- OpenShift GitOps Operator must be installed.
- All Operators must be installed in the same namespaces on both clusters.
- A DataProtectionApplication custom resource must be created, and the Available status must be displayed.
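To make the last prerequisite more concrete, here is a minimal sketch of a DataProtectionApplication that points Velero at an S3-compatible bucket. Everything specific in it is an assumption for illustration: the resource name, bucket, prefix, region, and credentials secret are hypothetical, and your storage provider may need different plugins or config keys.

```yaml
# Sketch only: all names and storage details below are hypothetical placeholders.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-hub                          # hypothetical name
  namespace: open-cluster-management-backup
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift
        - aws                            # assumes S3-compatible object storage
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: my-hub-backups         # hypothetical bucket
          prefix: argocd-dr              # hypothetical prefix
        config:
          region: us-east-1              # hypothetical region
        credential:
          name: cloud-credentials        # hypothetical secret holding storage credentials
          key: cloud
```

Both the active and passive clusters need to reach the same storage location so that the passive cluster can read the backups that the active cluster writes.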
Use the following steps to implement this design.
Step 1: Know your Argo CD custom resources
Locate all Argo CD resources you want to back up, such as Secrets, ConfigMaps, AppProjects, and any other related custom resources. After you locate them, add the label cluster.open-cluster-management.io/backup: "argocd" so the OADP Operator can identify them. Labeling ensures that the resource is part of the passive data backup:
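For illustration, a labeled Argo CD repository secret might look like the following sketch. Only the backup label comes from this post. The secret name, namespace, and repository URL are hypothetical, and the argocd.argoproj.io/secret-type label is the usual Argo CD convention for repository secrets.

```yaml
# Sketch only: name, namespace, and URL are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: repo-my-apps                     # hypothetical
  namespace: openshift-gitops            # assumes the default OpenShift GitOps namespace
  labels:
    argocd.argoproj.io/secret-type: repository
    # Marks the secret so it is included in the passive data backup
    cluster.open-cluster-management.io/backup: "argocd"
stringData:
  type: git
  url: https://git.example.com/org/my-apps.git   # hypothetical repository
```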
Label your Argo CD resources using the RHACM console, the oc label command, or an RHACM policy. I recommend deploying an RHACM policy that checks whether Argo CD resources are labeled. You can use the policy-argoresources policy from my repository. The following image shows a list of policy violations after the policy-argoresources policy is applied when Argo CD resources are not labeled:
The following image shows that there are no violations in the local cluster because all of the Argo CD resources are labeled:
Step 2: Create a scheduled backup
Create a scheduled backup after you label the resources and the DataProtectionApplication has an Available status on the active and passive clusters. BackupSchedule.cluster.open-cluster-management.io is a resource that sets up an RHACM backup, which is scheduled according to the veleroSchedule value. Create the resource on the active cluster in the open-cluster-management-backup namespace. Run the following command to create the resource from your ScheduledBackup.yaml file:
oc create -f ScheduledBackup.yaml -n open-cluster-management-backup
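A ScheduledBackup.yaml file along these lines would define the schedule. This is a minimal sketch rather than my exact file: the resource name, the two-hour cron expression, and the retention period are assumptions you should adapt.

```yaml
# Sketch only: name, schedule, and TTL are hypothetical values.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm                     # hypothetical name
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "0 */2 * * *"          # cron expression: every two hours (assumed)
  veleroTtl: 120h                        # how long backups are kept (assumed)
```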
At the scheduled time, backups are created and displayed in the Backup section of the OADP Operator. View the following image:
Step 3: Create a syncWindows specification
This stage is critical to avoid problems such as race conditions between the Argo CD instances on the passive and active clusters. Create a block sync policy on your passive cluster. To deploy the policy, use Open Cluster Management as the admission controller. Run the following command:
oc create -f policy.yaml
See my policy-argo-windowsync-block policy for more details. My policy uses Open Cluster Management to locate any custom resource that matches kind: AppProject and mutate it to contain the SyncWindow block type, which is scheduled every day with a 24-hour duration. This stops syncs on the passive cluster. View the following images that show there is no sync from the console:
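As a rough illustration of the result, an AppProject that the policy has modified might contain a sync window like the following fragment. This is a sketch, not the output of my policy. Argo CD expresses a blocking window as kind: deny, and the project name, namespace, cron schedule, and application selector shown here are assumptions. Only the fields relevant to the sync window appear, so do not apply the fragment as a complete AppProject.

```yaml
# Sketch only: fragment of an AppProject showing a deny window that covers the whole day.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default                          # hypothetical project
  namespace: openshift-gitops            # assumes the default OpenShift GitOps namespace
spec:
  syncWindows:
    - kind: deny                         # blocks syncs while the window is open
      schedule: "0 0 * * *"              # opens every day at midnight (assumed)
      duration: 24h                      # stays open for the full day
      applications:
        - "*"                            # applies to every application in the project
      manualSync: false                  # manual syncs are blocked as well
```

Because the window opens daily and lasts 24 hours, it is effectively always open, so the passive Argo CD instance never syncs until the policy and the window are removed.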
Step 4: Restore the data
This is the final stage, where the magic happens. First, look at the differences between the Argo CD instances (passive and active) before restoring the scheduled backup data. View the following images of the Argo CD console:
Verify that nothing on the Argo CD passive cluster can interrupt the process. Next, create a Restore.cluster.open-cluster-management.io resource on the passive cluster. Run the following command to create your Restore resource:
oc create -f Restore.yaml -n open-cluster-management-backup
Your Restore resource might resemble the following file:
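A minimal sketch of such a passive, continuously syncing Restore resource follows. The resource name and the cleanup setting are assumptions. The 15-minute interval and the use of the latest backups follow the behavior described in this post, and the managed cluster activation data is skipped so the managed clusters stay with the active hub.

```yaml
# Sketch only: name and cleanupBeforeRestore value are assumptions.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-passive-sync         # hypothetical name
  namespace: open-cluster-management-backup
spec:
  syncRestoreWithNewBackups: true        # keep restoring as new backups appear
  restoreSyncInterval: 15m               # check for new backups every 15 minutes
  cleanupBeforeRestore: CleanupRestored  # assumed cleanup behavior
  veleroManagedClustersBackupName: skip  # do not restore activation data on the passive hub
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```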
Based on the previous Restore resource, the latest backup in the BackupStorageLocation resource is restored every 15 minutes. After the first backup is restored, you should see that all resources transferred successfully and that the AppProject resources now contain the SyncWindow block. Open the passive Argo CD instance and see what happened. View the following images of the Argo CD console:
Congratulations! You completed the restore process. To set the passive cluster as active, make sure the current active cluster is not available, then remove the policy and the Restore resource, and create a BackupSchedule resource on the new active cluster. If you want to revert the change, complete this procedure again.
This blog post explored how to implement a disaster recovery strategy for Argo CD using RHACM and the OADP operator. I covered the components that must be backed up, showed a high-level architecture diagram, and provided step-by-step instructions for configuring an active-passive hub cluster configuration. You saw how to use the RHACM governance component to block your passive cluster from syncing applications. By following the steps outlined above, organizations can protect their Argo CD data and maintain business continuity during disasters. Whether you choose to replicate data and manifests to a secondary site using periodic backups, real-time replication, or continuous synchronization, this strategy provides an added layer of protection against disasters that may impact the primary site.