Many enterprises look to deploy their business-critical stateful applications on the Red Hat OpenShift platform. Such applications have non-negotiable business requirements, like high availability, data security, disaster recovery, strict performance SLAs and hybrid/multi-cloud operations. But today, we’ll be focusing on just one of these critical areas.

This blog explores disaster recovery (DR). DR is a growing operational concern in the Kubernetes community as the platform extends its support from the predominantly stateless applications commonly deployed today to stateful ones. Ultimately, the goal is to keep operations flowing during a datacenter, availability zone or regional outage. That means business continuity.

In keeping with that goal, OpenShift is now expanding its disaster recovery solutions with the introduction of the Regional-DR solution. It is built on two OpenShift products: cluster management via Red Hat Advanced Cluster Management (RHACM) and persistent data storage using Red Hat OpenShift Data Foundation (RHODF).

Regional Disaster Recovery

The Regional-DR solution is designed to protect your applications against a wide range of large blast radius failures and disaster scenarios, like datacenter failures. It is the most flexible and non-restrictive DR solution provided by Red Hat OpenShift. This approach is built upon dual independent clusters located at two geographically separated datacenters. Failures in any one cluster or datacenter do not propagate to the other cluster, thus offering a robust DR solution.

Because data replication is asynchronous, application performance is not impacted, and the deployment imposes no network restrictions or latency limits. It is offered on all platforms supported by OpenShift and RHODF, both on-premises and in public clouds. DR protection is application-granular: DR operations like failover and failback are applied only to the specified applications, without impacting the rest of the cluster.

Two key metrics govern disaster recovery solutions: recovery point objective (RPO) and recovery time objective (RTO). The former is based on how much data loss an application can tolerate in a disaster situation, and the latter is how much downtime can be afforded for the application.

The Regional-DR solution addresses both concerns.

Figure: Diagram of the regional disaster recovery scenario with 2 regions

This configuration deploys cluster pairs in two data centers connected by a WAN. RHODF, which provides the persistent data volumes (PVs), enables volume-to-volume asynchronous data replication between these cluster pairs. RHACM, which provides cluster management and manages application placement in the clusters, provides automated DR operations for easy recovery.

Data policy-based DR orchestration

A data policy framework provided in the RHACM GUI simplifies DR orchestration. Through these data policies, users can choose a schedule interval for how frequently the application's data volume(s) are replicated between the specified primary and secondary cluster pairs. This schedule interval determines the RPO of the application.

Based on the data policy chosen, the OpenShift DR Operators set up the application's volume(s) for replication and start replicating data on the chosen schedule. Along with the PV data, the OpenShift DR Operators also replicate the corresponding PV metadata. This replicated metadata is preserved at the secondary cluster in an object store provided and managed by RHODF. This metadata is critical for volume failover and restoring applications during a disaster.

While RHODF does not limit how frequently the PVs can be replicated across clusters, in practice, replication schedule intervals tend to be between five and 15 minutes. Anything lower may lead to high hardware and network resource consumption.
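To make the link between the schedule interval and RPO concrete, here is a minimal Python sketch. It rests on a simplified model that is an assumption for illustration, not something taken from RHODF: if a disaster strikes just before an in-flight replication finishes, the newest usable copy on the secondary cluster is one full interval plus one transfer time old. The function names and the transfer-time parameter are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(schedule_interval: timedelta,
                         transfer_time: timedelta) -> timedelta:
    """Upper bound on lost writes under the simplified model above.

    Assumption: the secondary only holds fully transferred copies, so the
    worst case is one schedule interval plus one transfer time.
    """
    if schedule_interval <= timedelta(0):
        raise ValueError("schedule_interval must be positive")
    if transfer_time < timedelta(0):
        raise ValueError("transfer_time must not be negative")
    return schedule_interval + transfer_time


def meets_rpo(schedule_interval: timedelta,
              transfer_time: timedelta,
              rpo_target: timedelta) -> bool:
    """True if the modeled worst-case data loss fits within the RPO target."""
    return worst_case_data_loss(schedule_interval, transfer_time) <= rpo_target


if __name__ == "__main__":
    interval = timedelta(minutes=5)
    transfer = timedelta(minutes=1)   # illustrative value, not a measurement
    target = timedelta(minutes=10)
    print(worst_case_data_loss(interval, transfer), meets_rpo(interval, transfer, target))
```

Under this model, a shorter interval tightens the bound only as long as each transfer still completes within the interval, which is consistent with the resource cost noted above for very short intervals.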

Operator-based DR automation

Figure: Granular step-by-step diagram of OpenShift disaster recovery across regions using Kubernetes Operators

While RHODF is responsible for data volume replication, the application failover is automated by a set of OpenShift DR operators provided with RHACM and RHODF. There are three of these:

  • A central OpenShift DR (ODR) Hub Operator, which is installed on the RHACM hub cluster to manage failover and relocation for applications.
  • An OpenShift DR (ODR) Cluster Operator, which is installed on each managed cluster to manage the lifecycle of protected PVCs for an application.
  • A RHODF Multicluster Orchestrator, which is installed on the RHACM hub cluster to automate numerous DR configuration tasks.

These DR Operators enable applications to fail over or relocate between the clusters by reducing all the actions required for application recovery to a single user-initiated action. This approach minimizes user errors during chaotic disasters and makes sure the applications are recovered consistently and quickly, thus improving recovery time (RTO).

DR functions require an operational RHACM hub. Because of this requirement, a RHACM hub recovery solution is available in case of any RHACM hub failures.

Automated DR failover and failback

Figure: Step-by-step diagram of disaster recovery using OpenShift clusters and Kubernetes Operators

ODR operators automate the sequence of operations required for applications to recover on the secondary cluster. The entire failover sequence requires a single action invoked by the user upon detection of a failure.

The failover operation is driven through RHACM and proceeds in a fixed order: it deletes the application from the failed cluster, breaks the replication relationships between the volumes belonging to the application, populates the metadata of all the application PVs on the DR cluster, mounts the volumes on the DR cluster, and installs the application on the DR cluster from the Git repository. This failover process is application-granular and can be repeated for each application in a sequence controlled by users based on application priorities and dependencies. Failovers are always user-initiated and managed, thus eliminating unintended failovers and possible data loss.
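The ordering described above can be sketched in a few lines of Python. This is only an illustration of the sequencing idea: the step labels paraphrase the paragraph and are not RHACM or RHODF API calls, and the `ProtectedApp` class, its `priority` field, and `plan_failover` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Labels paraphrasing the failover sequence in the text (not real API calls).
FAILOVER_STEPS = (
    "delete application from failed cluster",
    "break volume replication relationships",
    "populate PV metadata on DR cluster",
    "mount volumes on DR cluster",
    "deploy application on DR cluster from Git",
)


@dataclass(frozen=True)
class ProtectedApp:
    name: str
    priority: int  # hypothetical convention: lower number fails over first


def plan_failover(apps):
    """Return (app name, step) pairs: apps by priority, steps in fixed order."""
    ordered = sorted(apps, key=lambda app: (app.priority, app.name))
    return [(app.name, step) for app in ordered for step in FAILOVER_STEPS]


if __name__ == "__main__":
    apps = [ProtectedApp("web", priority=2), ProtectedApp("db", priority=1)]
    for app_name, step in plan_failover(apps):
        print(f"{app_name}: {step}")
```

Here the database is fully failed over before the web tier starts, mirroring the user-controlled, per-application sequencing the post describes.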

Not all application failovers are based on catastrophic failures that incur a possible data loss. Users can initiate a controlled application relocation operation without data loss if an anomaly is detected in the application or cluster or if users observe deterioration in the cluster status. Thanks to the automation provided by the DR Operators, this process is reduced to a single-click operation.

DR failback provides a smooth recovery to the primary cluster. DR failbacks are planned, controlled, and performed without any data loss. Once the user initiates the failback operation, the RHODF Operators resync the data from the DR cluster back to the primary cluster. The PV data and metadata changes are restored to the primary cluster before the application failback sequence commences. RHACM then deletes the application from the DR cluster and redeploys it on the primary cluster. Administrators can use this same process for workload migration across public and private cloud platforms.
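The key constraint in failback is that the resync completes before the application moves. A minimal Python sketch of that guard follows; the `AppPlacement` class and its method names are hypothetical and model only the ordering rule stated above, not how the Operators implement it.

```python
from dataclasses import dataclass


@dataclass
class AppPlacement:
    primary: str
    dr: str
    active_cluster: str
    resynced: bool = False  # True once PV data and metadata are back on primary

    def resync_to_primary(self) -> None:
        """Model the resync of PV data and metadata from DR to primary."""
        if self.active_cluster != self.dr:
            raise RuntimeError("application is not running on the DR cluster")
        self.resynced = True

    def fail_back(self) -> None:
        """Move the application to primary, but only after a completed resync."""
        if not self.resynced:
            raise RuntimeError("resync PV data and metadata before failing back")
        self.active_cluster = self.primary
        self.resynced = False


if __name__ == "__main__":
    placement = AppPlacement(primary="cluster-a", dr="cluster-b", active_cluster="cluster-b")
    placement.resync_to_primary()
    placement.fail_back()
    print(placement.active_cluster)
```

Calling `fail_back()` before `resync_to_primary()` raises an error in this sketch, which is the property that lets a failback proceed without data loss.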

The Regional-DR solution is designed for applications deployed with a declarative model, which many modern applications support. In the declarative model, applications follow and conform to a declared state deployed in the cluster by DevOps tools from a central Git resource, which is a single source of truth for the application configuration. During a DR event, ODR Operators can direct the application redeployment to a healthy cluster with these DevOps tools.

Wrap up

In summary, the Regional-DR solution increases application protection by lowering RPO with RHODF-based ongoing volume replication and lowering RTO with RHACM-based, DR Operator-controlled automation. It also reduces DR costs by improving resource utilization: all clusters stay active during normal ("sunny-day") operation. While an application instance can only be active at one cluster at any point, both clusters can be active, running different sets of applications and protecting each other with cross replication. Regional-DR helps protect all stateful workloads deployed on OpenShift. It is designed to integrate with OpenShift cluster tools like alerting and monitoring, making DR protection an integral part of cluster and application management.


About the author

Venkat Kolli is a Product Manager at Red Hat, managing High Availability and Disaster Recovery solutions for Red Hat OpenShift.  

Venkat has over 30 years of experience in the IT industry with extensive expertise in enterprise storage systems and enterprise software. He joined Red Hat in November 2019 and previously held Product Management positions at VMware, SanDisk, NetApp and Veritas.

Venkat is passionate about storage and data management and distributed system architectures. He has a diverse background in domains spanning cloud computing, container data services, virtualization, hyper converged infrastructure and software defined storage.
