As more and more organizations move from on-premise datacenters to private, public, and hybrid clouds, it is important to understand that high availability is not the same as disaster recovery (DR).
DR planning is needed to recover systems when natural or human-induced disasters hit the primary datacenter/region. Recent public cloud outages suggest that we must have a DR plan in place, even with the high availability provided by the public cloud providers. DR planning should be part of the initial application design discussions, allowing the deployment architecture to accommodate unforeseen events.
Leaving it as an afterthought could lead to maintainability challenges, and non-compliance with regulatory, security and industry standards.
Some common challenges organizations might face while implementing end-to-end DR plans include:
Not allocating adequate IT budgets to match DR expectations, limiting the recovery speeds application teams can design within cost constraints.
Application development teams prioritizing feature construction, testing, and deployment into production and not considering DR until an internal audit or escalation identifies this business-critical gap.
Maintaining DR environments with the complexity of multiple public cloud vendors, as there is no single control plane to manage across environments.
Key terminology and initial considerations of DR planning
Understanding key DR terminology and its context helps in defining and planning a sound DR approach.
Recovery Time Objective (RTO) - the maximum time for restoring infrastructure and application services after an unanticipated event, aligned with a company’s business continuity requirements.
Recovery Point Objective (RPO) - the amount of data loss that a company can sustain without a business impact. RPO drives data backup and data replication strategies, critical to maintaining the business and meeting internal and external compliance requirements.
Service Level Agreement (SLA) - a commitment between a service provider and a client. Generally, SLAs are enforced by the compliance team in the IT and communications department to align with the business mission. SLAs are closely tied to DR and business continuity, as they define the need for service resiliency.
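To make RTO and RPO concrete, here is a minimal Python sketch that checks a backup schedule and a measured recovery time against them. It assumes periodic backups, where the worst-case data loss equals the interval between backups; the class and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrObjectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: with periodic backups, a failure just before the next
    # backup loses everything written since the previous one.
    return backup_interval


def meets_objectives(
    objectives: DrObjectives,
    backup_interval: timedelta,
    measured_recovery_time: timedelta,
) -> bool:
    # Both conditions must hold: data loss within RPO, recovery within RTO.
    return (
        worst_case_data_loss(backup_interval) <= objectives.rpo
        and measured_recovery_time <= objectives.rto
    )


if __name__ == "__main__":
    objectives = DrObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
    # Backups every 6 hours cannot satisfy a 1-hour RPO.
    print(meets_objectives(objectives, timedelta(hours=6), timedelta(hours=2)))
    # Backups every 30 minutes with a 2-hour recovery satisfy both.
    print(meets_objectives(objectives, timedelta(minutes=30), timedelta(hours=2)))
```

The measured recovery time would normally come from a DR test rather than an estimate.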
With key DR terminology outlined, what might be the initial considerations in defining an effective DR plan?
Identify the RPO and RTO of your application based on business requirements, compliance requirements, and the cost (revenue loss) of downtime. This can help identify the right DR tier for your application. According to time-tested industry standards, DR tiers span from 0 to 7. Tier 7 is assigned to the most critical applications, with zero to near-zero data loss tolerance and a service recovery time of just minutes. Tier 0 is assigned to applications where recovery can be unpredictable and sometimes not possible. A higher DR tier usually means a more expensive DR environment.
Identify the upstream and downstream application dependencies. It’s especially important to know when an outage of one application will cause a full or partial service outage to others. An enterprise level DR plan will help identify these dependencies and assign the right DR tiers to the applications.
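One way to reason about dependencies is to model them as a graph and ask which applications are affected when one of them fails. The sketch below is illustrative only: the application names and the `depends_on` mapping are hypothetical, and it treats every dependency as a hard one (an outage always propagates).

```python
from collections import deque


def impacted_by_outage(depends_on: dict[str, set[str]], failed_app: str) -> set[str]:
    """Return every application that directly or transitively depends on failed_app."""
    # Invert the graph: for each application, who depends on it?
    dependents: dict[str, set[str]] = {}
    for app, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(app)

    # Breadth-first walk from the failed application through its dependents.
    impacted: set[str] = set()
    queue = deque([failed_app])
    while queue:
        current = queue.popleft()
        for app in dependents.get(current, ()):
            if app != failed_app and app not in impacted:
                impacted.add(app)
                queue.append(app)
    return impacted


if __name__ == "__main__":
    # Hypothetical application landscape.
    depends_on = {
        "storefront": {"checkout", "catalog"},
        "checkout": {"payments"},
        "reporting": {"catalog"},
        "payments": set(),
        "catalog": set(),
    }
    print(sorted(impacted_by_outage(depends_on, "payments")))
```

In this example an outage of `payments` reaches `checkout` and, through it, `storefront`, which suggests `payments` needs a DR tier at least as strict as the applications that rely on it.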
Standard DR patterns
Once DR tiers have been identified, a DR plan specific to each application and its data recovery requirements helps ensure that only acceptable data loss happens and time to recover is within the SLA for its DR tier. Presented below are standard DR patterns and implementations, in order of increasing resiliency and cost, and decreasing data loss and time to recover.
Backup and Restore: The least expensive and slowest DR pattern, this can be used for applications able to withstand service outages of up to 24 hours. Usually, scheduled backups are stored in a DR region and used to restore service after an outage, or they are stored on-premises and deployed to infrastructure in a DR region to restore full application service when needed.
Pilot Light: The application’s core services run in the DR region at all times, initially offering reduced functionality when a disaster occurs but converting to full recovery as the remaining services are provisioned. This DR type is best suited for applications that can withstand service degradation for several hours.
Warm Standby (multi-region active-passive): This setup allows for a scaled down version of a fully functional environment that’s up and running at all times in the DR region, which can mean faster recovery time but higher cost compared to Pilot Light. This implementation should be used for business-critical applications that need to be fully restored within a few minutes.
Hot Standby (multi-region active-active with full replication): In this setup, fully scaled and functional environments (infrastructure and application) are running in two geographically different regions (primary and secondary) with asynchronous replication of data to the secondary region. Because traffic is routed to both regions, it has the fault tolerance and resiliency to withstand an outage that removes an entire region, making it the choice for mission-critical systems, albeit at greater expense.
Please note that this is not an exhaustive list of DR setups, and an organization can creatively combine these techniques to best fit the DR needs of a specific system and to optimize costs.
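The trade-off among these patterns can be sketched as picking the least expensive pattern whose typical recovery time still fits within the application's RTO. In the Python sketch below, only the 24-hour figure for Backup and Restore comes from the descriptions above; the other durations are illustrative assumptions standing in for "several hours", "a few minutes", and near-zero, and the function name is hypothetical.

```python
from datetime import timedelta

# Patterns ordered from least to most expensive, each with a typical
# recovery time. Apart from the 24-hour value, these durations are
# assumptions made for illustration, not vendor or industry figures.
DR_PATTERNS = [
    ("Backup and Restore", timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=4)),
    ("Warm Standby", timedelta(minutes=10)),
    ("Hot Standby", timedelta(0)),
]


def cheapest_pattern(rto: timedelta) -> str:
    """Return the least expensive pattern whose typical recovery time fits the RTO."""
    if rto < timedelta(0):
        raise ValueError("RTO must be non-negative")
    for name, typical_recovery in DR_PATTERNS:
        if typical_recovery <= rto:
            return name
    # Not reachable: Hot Standby has a typical recovery time of zero.
    raise AssertionError("no pattern matched")


if __name__ == "__main__":
    for rto in (timedelta(hours=24), timedelta(hours=6),
                timedelta(minutes=30), timedelta(minutes=1)):
        print(rto, "->", cheapest_pattern(rto))
```

A real selection would also weigh RPO, compliance requirements, and budget, so treat this as a starting point for discussion rather than a decision rule.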
Minimizing DR costs
While public clouds provide inherent flexibility to scale up or scale down on demand, reducing capital expenditure, it is important to understand that DR is not built-in, and application teams must explicitly implement a DR plan themselves. In this section we'll look at things to consider.
Stick with the DR tier SLAs
Meet, without exceeding, SLA expectations of the DR tier assigned. Control storage costs by carefully choosing the right storage class, according to the cloud provider’s life cycle rules, and deleting data and backups that exceed your SLAs.
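As an illustration of trimming backups to what the SLA requires, the following Python sketch selects backups that are older than a retention window. The retention period and timestamps are hypothetical; in practice most cloud providers let you express the same rule declaratively in storage life cycle policies.

```python
from datetime import datetime, timedelta, timezone


def backups_to_delete(
    backup_times: list[datetime],
    retention: timedelta,
    now: datetime,
) -> list[datetime]:
    """Return, oldest first, the backups that fall outside the retention window."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)


if __name__ == "__main__":
    now = datetime(2024, 6, 30, tzinfo=timezone.utc)
    backups = [
        datetime(2024, 6, 1, tzinfo=timezone.utc),
        datetime(2024, 6, 15, tzinfo=timezone.utc),
        datetime(2024, 6, 29, tzinfo=timezone.utc),
    ]
    # Hypothetical SLA: keep 20 days of backups.
    for expired in backups_to_delete(backups, timedelta(days=20), now):
        print("expired:", expired.date())
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to verify during DR testing.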
Automate your infrastructure and processes
Automation plays a key role in the success of a DR implementation because, when a disaster hits, shorter recovery time is critical. Time to recover the service is directly proportional to potential revenue loss and can result in penalties for not meeting compliance requirements. DR involves precisely integrating several components like infrastructure, application and data synchronization, routing, security, and compliance and often can be expedited with automation.
Using infrastructure as code (IaC) frameworks, automation can bring up multiple components of a system on demand in a DR environment located in a different datacenter or cloud region. This can help minimize costs and meet required recovery SLAs. Automation can put the DR environment into operation more rapidly when needed and scale it back down when the primary region is available again, tightly controlling costs. It also builds confidence by eliminating manual steps and relying on capabilities already proven during the DR testing phase.
Depending on the DR tier assigned to your application, the time it takes to initiate the whole DR environment on demand using IaC may not be fast enough. In such cases, one must consider which DR infrastructure components should be spun up on demand.
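A simple way to frame that decision is to compare each component's provisioning time with the RTO. The Python sketch below assumes components can be provisioned in parallel, so a component can be created on demand only if its own provisioning time fits within the RTO minus a safety margin; the component names and durations are hypothetical.

```python
from datetime import timedelta


def plan_components(
    provision_times: dict[str, timedelta],
    rto: timedelta,
    safety_margin: timedelta = timedelta(0),
) -> tuple[list[str], list[str]]:
    """Split components into (on_demand, keep_running) lists.

    Assumes parallel provisioning: each component only has to fit
    within the time budget on its own.
    """
    budget = rto - safety_margin
    on_demand = sorted(n for n, t in provision_times.items() if t <= budget)
    keep_running = sorted(n for n, t in provision_times.items() if t > budget)
    return on_demand, keep_running


if __name__ == "__main__":
    # Hypothetical provisioning times measured during a DR test.
    provision_times = {
        "database": timedelta(minutes=45),
        "app_servers": timedelta(minutes=10),
        "load_balancer": timedelta(minutes=5),
    }
    on_demand, keep_running = plan_components(
        provision_times,
        rto=timedelta(minutes=30),
        safety_margin=timedelta(minutes=10),
    )
    print("on demand:", on_demand)
    print("keep running:", keep_running)
```

If components must be provisioned one after another, the budget check would need to use cumulative time along the longest chain instead.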
Regardless of the DR configuration selected, it is paramount to perform DR tests regularly to make sure SLAs can be met before a real disaster happens.
Red Hat Ansible Automation Platform is an enterprise automation platform that can help reduce cost and risk across infrastructure, network and engineering. It is infrastructure-agnostic with a framework and language that can be applied across applications, compute, network, cloud, containers, and virtual machines. It can help improve operations and can work well with technologies already in use.
Use containerized applications
Containers are lightweight software components often used with microservices architectures to build cloud-native applications where services are loosely coupled and can be scaled independently.
Kubernetes is the industry standard for managing and orchestrating containers. Red Hat OpenShift Platform is the leading enterprise Kubernetes platform and can help build, deploy, run and manage intelligent applications more securely at scale across major cloud providers.
OpenShift can also offer cluster autoscaling to resize your cluster based on the demands of your workload. Ideally suited for Pilot Light and Warm Standby DR scenarios, autoscaling contains the cost of your primary and DR environment and provides on-demand scalability when traffic is routed to the DR environment in a disaster scenario.
Match the choice between zonal clusters and regional clusters in public clouds to the criticality of your application to further lower costs
A zonal cluster has a single control plane in a single zone, which manages nodes in that zone or across multiple zones. A regional cluster runs replicas of the control plane in several zones of a given region, incurring extra cost but making the cluster more tolerant to zonal outages.
Red Hat OpenShift stands out as a leader with a security-focused, supported Kubernetes platform. Red Hat OpenShift Dedicated is a fully managed cloud service for Red Hat OpenShift on cloud providers like AWS and Google Cloud. You can also choose among fully managed offerings that help reduce complexity, like Microsoft Azure Red Hat OpenShift, Red Hat OpenShift on IBM Cloud, and Red Hat OpenShift Service on AWS. Managed cloud services using Red Hat OpenShift allow you to focus on building and scaling applications that add value to your business with the flexibility to choose your cloud vendor.
Red Hat Advanced Cluster Management for Kubernetes helps reduce operational costs with a unified management interface for managing Kubernetes clusters across on-premises, bare metal, vSphere and public cloud deployments. Features like multi-cluster health monitoring can help scan clusters across environments to identify and resolve issues impacting workloads, and to remediate service outages more quickly.
Consider implementing a serverless configuration
Serverless is a relatively new computing model used for building and running applications without having to manage the servers on which the application runs. Applications in a serverless environment are sets of one or more functions that are executed on demand and scaled dynamically, without the need for capacity planning.
A serverless computing model can be the model of choice for stateless, highly dynamic, asynchronous or concurrent workloads. Also, serverless fits well for infrequent, sporadic workloads where the request traffic is unpredictable and needs dynamic compute capacity. It eliminates the overhead of server maintenance and companies pay only for compute time consumed. When used for DR environments, set-up costs are minimal and charges are incurred only when events are routed to it.
To learn more about serverless, its benefits, and common use cases, you can refer to the whitepaper published by the CNCF Serverless Working Group.
Including DR and automation as part of initial application design and development will help inject the required design patterns and DevOps practices to build a resilient cloud-native application, reducing the operations cost and bringing consistent availability to the application.
The absence of a DR plan can lead to unexpected revenue loss and compliance issues, and it could also lead to loss of trust in the customer base. Identifying the proper DR tier, making good architectural choices, and using infrastructure-agnostic automation tools can reduce the cost of maintaining a DR environment, while delivering the resilience to quickly recover in case of a disaster.
About the author
Anil Yeluguri has 18 years of information technology experience, designing and implementing solutions for telecom, healthcare, hospitality, utilities and bioinformatics industries. He is experienced in various open source and application modernization technologies including cloud-native architecture and distributed enterprise application design.