In my previous blog post, I shared the vision of Disaster Recovery as a Service for OpenStack (DRaaS), an umbrella topic that describes what needs to be done to protect workloads running in an OpenStack cloud from a large-scale disaster.

Last week we shared this vision in several sessions at the OpenStack Summit. While attendees were discussing infrastructure disaster recovery in Hong Kong, Typhoon Haiyan (also known as Typhoon Yolanda), one of the strongest tropical cyclones on record, devastated multiple coastal cities in the Philippines, took thousands of lives, and forced millions to evacuate. The storm destroyed entire cities and villages, along with airports, roads, and power and communications infrastructure.

If there’s one thing that history has not only taught us but keeps teaching us every year, it is that catastrophic events do happen, and that if we don’t invest in preventative measures now, we will pay a hefty price later.

What would happen to your organization if this type of calamity hit?

It is hard enough to protect hosted workloads even from something as ordinary as an overheated datacenter, which can knock out production servers and deeply affect your operations and revenue-generating activities. For service providers, downtime is not an option: every hour that your production service is down, you can lose not only business but also your reputation.

It is one thing to put your application workload in the cloud, but how can you guarantee that when the hosting service goes down, you can provide the right safety net and business continuity for your customers?
Although an entire datacenter can in fact go down in a disaster, what service providers should care about, from a user’s point of view, is how to protect their own data and keep their services running after such events.

Elastic clouds are all about adapting to workload changes by dynamically provisioning and decommissioning resources. The more dynamic and elastic the cloud platform is, the harder it becomes to keep your data and services highly available in the event of a disaster. Data recovery is usually not the end goal; the goal is to restore the services that use this data. For service providers, this is the hosted workload.

The Replication Targets
Disaster Recovery between a primary cloud and a target cloud requires the data to be available in at least two geographically dispersed, independent sites in a shared-nothing model.

OpenStack replication targets can include:

  • Private cloud to Private cloud
  • Private cloud to Public cloud
  • Public cloud to Public cloud
  • Bare-metal environments to Public cloud

Figure: Example of replication between OpenStack clouds in different sites

As our recovery target is the hosted workload, we should look at ways to achieve DR at the workload level. Imagine selecting a DR service-level flavor for a workload, such as applying a “Gold” profile to an application service that requires the highest protection level, with the shortest recovery point objective (RPO) and the shortest recovery time objective (RTO). Such a DR policy can be based on synchronous replication and a hot backup site. Or imagine selecting other policies for application services that need lower protection levels and can tolerate longer RPOs and RTOs, such as “Silver”, based on periodic replication, or “Bronze”, based on asynchronous replication with a low-capacity standby site.

Figure: Recovery Point Objective and Recovery Time Objective
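
To make the idea of DR service levels concrete, here is a minimal Python sketch of how a workload's objectives could be mapped to a profile. Everything in it is an assumption made for illustration: the `DRPolicy` class, the `pick_policy` helper, the RPO and RTO numbers, and the "warm standby" attached to Silver are hypothetical and not part of any OpenStack API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DRPolicy:
    """Hypothetical DR service level; every value below is illustrative only."""
    name: str
    replication: str   # "sync", "periodic", or "async"
    standby: str       # kind of recovery site kept ready
    rpo_seconds: int   # worst-case window of lost data
    rto_seconds: int   # worst-case time to restore service


# Ordered from least to most protective, so the first match is the cheapest fit.
POLICIES = [
    DRPolicy("bronze", "async", "low-capacity standby", 4 * 3600, 24 * 3600),
    DRPolicy("silver", "periodic", "warm standby", 15 * 60, 4 * 3600),
    DRPolicy("gold", "sync", "hot standby", 0, 5 * 60),
]


def pick_policy(max_rpo_seconds: int, max_rto_seconds: int) -> Optional[DRPolicy]:
    """Return the least protective policy that still meets both objectives."""
    for policy in POLICIES:
        if (policy.rpo_seconds <= max_rpo_seconds
                and policy.rto_seconds <= max_rto_seconds):
            return policy
    return None  # no tier is strict enough for this workload
```

With these made-up numbers, a workload that can lose at most an hour of data and must be back within six hours lands on Silver, while one that cannot lose any data needs Gold.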

The first step for Disaster Recovery enablement in OpenStack is the ability to support data and state (metadata) replication. Several approaches may be applicable, such as application-based replication, host-based replication (at the hypervisor or VM level), and of course array-based replication.

Replicating Data
OpenStack Swift object storage, deployed as a globally distributed cluster, can be used to replicate Glance virtual machine images. Swift is currently designed to work within a single region, where a region is defined by low-latency links between Swift zones. As long as the sites are close to each other, zones can be distributed across multiple sites.

Another option for replicating virtual machine images is Glance’s multiple image locations feature. Starting with the OpenStack Havana release, the Image Service can store an image in multiple locations. This enables efficient consumption of image data and the use of a backup copy if the primary image location fails.
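
The fallback behavior that multiple locations make possible can be sketched in a few lines of Python. This is not Glance's API: `fetch_image`, `fake_download`, and the `swift://...example` URLs are hypothetical names used only to illustrate trying each stored location in turn.

```python
from typing import Callable, Iterable, Optional


def fetch_image(
    locations: Iterable[str],
    download: Callable[[str], bytes],
) -> Optional[bytes]:
    """Try each stored location in order; return the first successful download."""
    for url in locations:
        try:
            return download(url)
        except OSError:
            # This location is unreachable (for example, the primary site is
            # down), so move on to the next one.
            continue
    return None


# Hypothetical stand-in for a real download; the primary site always fails.
def fake_download(url: str) -> bytes:
    if "primary" in url:
        raise OSError("primary site unreachable")
    return b"image-bytes-from-" + url.encode()


image = fetch_image(
    ["swift://primary.example/images/web", "swift://recovery.example/images/web"],
    fake_download,
)
```

Here the first location raises an error, so the image is served from the copy at the recovery site.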

Cinder can be extended to support storage-array-based replication in the following ways:

  1. Utilize the scheduler to create “protected” volumes on storage arrays that are continuously replicating
  2. Use volume types to create replicated volumes where drivers support volume level granularity for replication
  3. Replicate data in two independent volumes (across different storage backends and possibly sites) using hypervisor-based replication
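
The first two options rely on matching what a volume asks for against what a backend offers. In Cinder, volume types carry extra specs that the scheduler compares with the capabilities each backend reports. The sketch below imitates that matching in plain Python. The backend names, the `replication` capability key, and the `schedule` function are assumptions for illustration and are not actual Cinder code or configuration.

```python
from typing import Dict, List, Optional

# Hypothetical backends and the capabilities they report to the scheduler.
BACKENDS: List[Dict] = [
    {"name": "array-a", "capabilities": {"replication": "none"}, "free_gb": 900},
    {"name": "array-b", "capabilities": {"replication": "sync"}, "free_gb": 400},
    {"name": "array-c", "capabilities": {"replication": "async"}, "free_gb": 700},
]

# Hypothetical volume types; the extra specs state what a backend must offer.
VOLUME_TYPES: Dict[str, Dict[str, str]] = {
    "standard": {},
    "protected-sync": {"replication": "sync"},
    "protected-async": {"replication": "async"},
}


def schedule(volume_type: str, size_gb: int) -> Optional[str]:
    """Pick the backend with the most free space that satisfies the volume type."""
    required = VOLUME_TYPES[volume_type]
    candidates = [
        b for b in BACKENDS
        if b["free_gb"] >= size_gb
        and all(b["capabilities"].get(k) == v for k, v in required.items())
    ]
    if not candidates:
        return None  # no backend can host a volume of this type and size
    return max(candidates, key=lambda b: b["free_gb"])["name"]
```

A request for a 100 GB "protected-sync" volume can only land on the array that replicates synchronously, while a "standard" volume goes wherever there is the most free space.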

Replicating OpenStack Services State

Disaster Recovery in OpenStack should include support for:

  • Capturing the metadata relevant to the protected workloads and resources, either as point-in-time snapshots or through continuous replication. Without capturing the state of the different OpenStack services, we cannot achieve a complete failover of the hosted workloads to the recovery site.
  • Ensuring consistency of the replicated data and metadata by using checkpoints

Examples of OpenStack metadata that require replication include:

  • Nova: VM flavors and SSH keys
  • Keystone: Identities of tenants and users
  • Neutron: Virtual networks between VMs
  • Cinder: Volume types and pairing
  • Glance: Registry and image metadata
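
As a rough illustration of a point-in-time metadata snapshot tied to a checkpoint, consider the Python sketch below. The record layout, the `snapshot_metadata` and `is_consistent` helpers, and the sample values are all hypothetical. A real implementation would read this state from each service rather than from a hand-written dictionary.

```python
import hashlib
import json
from typing import Dict


def snapshot_metadata(checkpoint_id: int, services: Dict[str, dict]) -> dict:
    """Bundle per-service metadata into one checkpointed, checksummed record."""
    payload = json.dumps(services, sort_keys=True)
    return {
        "checkpoint": checkpoint_id,
        "services": services,
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }


def is_consistent(snapshot: dict, data_checkpoint: int) -> bool:
    """Accept a snapshot only if it is intact and matches the data checkpoint."""
    payload = json.dumps(snapshot["services"], sort_keys=True)
    intact = hashlib.sha256(payload.encode()).hexdigest() == snapshot["sha256"]
    return intact and snapshot["checkpoint"] == data_checkpoint


# Hypothetical sample state for the services listed above.
snap = snapshot_metadata(
    checkpoint_id=42,
    services={
        "nova": {"flavors": ["m1.small"], "keypairs": ["ops-key"]},
        "keystone": {"tenants": ["acme"], "users": ["alice"]},
        "neutron": {"networks": ["app-net"]},
        "cinder": {"volume_types": ["protected-sync"]},
        "glance": {"images": ["web-frontend"]},
    },
)
```

At the recovery site, the metadata would be used for failover only if `is_consistent` confirms that it is intact and belongs to the same checkpoint as the replicated volumes.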

We note that metadata changes are less frequent than application data changes, and that different mechanisms can handle replication of different portions of the metadata and data (volumes, images, etc.).

Disaster Recovery is a complex task in which different applications and use cases have different requirements: some use cases can be supported easily, while others are more complex. For that reason, this is targeted as a long-term effort with incremental steps.

Some APIs and features are expected to be integrated into existing projects such as Nova (DR features for compute). Some functionality, such as DR orchestration, may be part of Heat, part of a new project, or even outside the scope of OpenStack.

Enabling Cinder storage replication in the OpenStack Icehouse release is just the first step in protecting workloads running in OpenStack clouds, ensuring business continuity while preparing for the worst-case scenario.