In my previous article, I shared the fundamental traits of excellent software architectures. As I wrote there, a great architecture must be resilient and reliable enough to handle failures. Indeed, in the tale of almost any software architecture's evolution, disaster recovery tends to be the epiphany: the moment of striking realization. Let me explain this in a bit more detail.
The concept of disaster recovery began early in computing's history. In the era of batch-processing mainframes, the general practice was to keep an offsite contingency mainframe operating in parallel to the primary one and to continuously back up the primary mainframe on magnetic tapes. If a failure made it impossible for the primary mainframe to continue its operations, the backup tapes were transported and loaded onto the contingency machine. This method of resuming business operations from a contingency machine with the help of backed-up data eventually evolved into the most common practice for ensuring business continuity. This process is usually referred to as a disaster recovery process.
However, the word "evolve" carries little value when it comes to disaster recovery. Most organizations still maintain two distant sites, either with similar processing capacity or with one superior to the other as a design choice. This is not entirely different from using two distant mainframes. When a site fails, you still have to resume operations from the other site.
One thing that has changed is the speed at which organizations can resume operations. In the past, resuming operations took days or even weeks; today it can be done within minutes or hours (although not "immediately," as you might expect). This astounding drop in expected downtime became possible only in recent decades, alongside the revolutionary progress in networking, storage, and other communication technologies.
Disaster recovery's evolution has been dichotomous: The methods have not changed much, but recovery speed is significantly faster. The question is: Why?
Delays and missing data
The revolutionary drop in expected downtime and the capability to operate multiple sites in parallel are largely due to the evolution of high-speed networks and storage, not to changes in disaster recovery itself.
We can now continuously synchronize the state of geographically separated sites. However, in a process like this, where storage is an integral component, the storage becomes a single point of failure if it lacks resilience.
Despite the revolutionary changes in computing, I/O-related complexities, combined with the distance between peering sites, still force data to be replicated asynchronously at regular intervals. This creates some delay before a contingency site can take over processing after a failure. Hence, there is no such thing as "immediate" recovery in a large-scale disaster.
In addition to the delay, there is always a chunk of data written to the primary disk after the last synchronization and therefore not yet replicated to the contingency site. This chunk of data is lost if the primary site fails. Additionally, sites are synchronized with data at rest (persistent data), while data in flight is usually treated as an orphan object. Since the disaster recovery process doesn't synchronize this ephemeral state, it adds to the potential data loss.
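To make that gap concrete, here is a deliberately simplified sketch in Java. It is a toy model, not tied to any real replication product: writes land on the primary immediately, but they reach the secondary only at each periodic sync, so anything written after the last sync is gone if the primary fails.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of interval-based asynchronous replication.
 * Illustrative only; real replication happens at the storage or database layer.
 */
public class AsyncReplicationSketch {
    static final List<String> primary = new ArrayList<>();
    static final List<String> secondary = new ArrayList<>();

    // Normal operations write to the primary site only.
    static void write(String record) { primary.add(record); }

    // Periodic replication copies the primary's state to the contingency site.
    static void sync() { secondary.clear(); secondary.addAll(primary); }

    public static void main(String[] args) {
        write("order-1");
        write("order-2");
        sync();            // last successful replication
        write("order-3");  // written after the last sync...

        // ...primary site fails here: the contingency site takes over
        // with only order-1 and order-2. order-3 is the lost chunk.
        System.out.println("Recovered state: " + secondary);
    }
}
```

The shorter the replication interval, the smaller the chunk at risk, but the gap never fully disappears.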
The problem is that the industry has accepted this shortcoming as fate. A deep dive into disaster recovery shows that the concept, as practiced today, is self-contradictory: a process meant to guarantee business continuity accepts losing data by design. This is not to dismiss some of the popular attempts to overcome these issues, even though they have fallen short of their objectives.
Data replication options
There are numerous practices available for replicating storage from one site to another so that operations can continue after a disaster. These include active-active storage clustering, which replicates data synchronously, and active-passive clustering, which does so asynchronously. While these methods work both in theory and in practice, they have inherent deficiencies.
Active-passive storage clustering increases the risk of data loss because replication isn't synchronous. The active-active topology tries to address this risk with synchronous replication but ends up adding I/O overhead, and this overhead grows as data volume increases and the distance between peering sites widens. These shortcomings are inherent to any physical storage solution; one that avoids both essentially does not exist.
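As a rough illustration of this trade-off, here is another toy sketch (not a real storage driver; the 50 ms sleep simply stands in for the inter-site round trip, which grows with distance). The synchronous, active-active path pays that round trip on every write, while the asynchronous, active-passive path returns immediately and leaves a replication lag behind.

```java
import java.util.concurrent.CompletableFuture;

/** Toy contrast of synchronous vs. asynchronous replication write paths. */
public class ReplicationTopologySketch {

    // Stand-in for shipping a write to the peer site; the delay models the round trip.
    static void remoteReplicate(String block) {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Active-active style: block until the remote copy is acknowledged.
    // No lost chunk, but every write pays the inter-site latency.
    static void activeActiveWrite(String block) {
        remoteReplicate(block);
    }

    // Active-passive style: return after the local commit, replicate in the background.
    // Lower latency, but writes not yet shipped are lost if this site fails now.
    static void activePassiveWrite(String block) {
        CompletableFuture.runAsync(() -> remoteReplicate(block));
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        activeActiveWrite("block-1");
        long t1 = System.nanoTime();
        activePassiveWrite("block-2");
        long t2 = System.nanoTime();
        System.out.printf("sync write: %d ms, async write: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        // If the JVM exits here, the background replication of block-2 may never
        // complete, which is exactly the active-passive data-loss risk.
    }
}
```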
The emergence of cloud computing has opened another option: Disaster Recovery as a Service (DRaaS). This augments the conventional disaster recovery process with vendors hosting contingent sites, storage, or both in the cloud with asynchronous replication of state.
While this comes with the added reliability inherited from the cloud, it only promises near-real-time recovery point objective (RPO) and recovery time objective (RTO). If you've read this far, you probably recognize that DRaaS inherits the same old flaw: potential data loss.
Also, I have yet to come across a DRaaS offering that works across multiple continents. Perhaps a more pertinent question is: Will it ever be possible to implement a disaster recovery process that spans continents, especially with the stringent data protection regulations now being enforced in certain regions and countries?
Given all the arguments so far, it is evident that the root cause of this deficiency is that sites are synchronized too late during operations. There is a way to bridge this gap with software.
In-memory database solutions
In-memory database solutions, essentially an enhanced form of caching, have been around for a while now. Initially, they were designed to overcome the performance woes of IT systems that process large volumes of data. However, by combining in-memory databases with features like application clustering, they have become an extremely productive vehicle for synchronizing state across a multinode install base.
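As a minimal sketch of that multinode synchronization, assuming the embedded Java API of Infinispan (one of the in-memory databases discussed below), every node running this code joins the same cluster and sees the same replicated cache. The cache name and keys are made up for illustration.

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ReplicatedCacheSketch {
    public static void main(String[] args) {
        // Clustered cache manager; nodes running this discover each other via JGroups.
        DefaultCacheManager manager =
            new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // A synchronously replicated cache: every write is propagated to all cluster members.
        manager.defineConfiguration("sessions",
            new ConfigurationBuilder().clustering().cacheMode(CacheMode.REPL_SYNC).build());

        Cache<String, String> sessions = manager.getCache("sessions");
        sessions.put("user-42", "logged-in"); // hypothetical entry, visible on every node
        System.out.println(sessions.get("user-42"));

        manager.stop();
    }
}
```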
Until recently, there remained a downside: Despite its ability to persist data through the cache and to synchronize multinode clusters, an in-memory database solution lacked scalability beyond the hosting site.
However, new cross-site data replication features offered by open source in-memory databases such as Infinispan and Hazelcast IMDG have allowed the disaster recovery process to enter a whole new era. These add resilience by leveraging underlying features that persist data through the cache and synchronize state across multinode clusters and geographically distant sites.
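For example, here is a minimal sketch of Infinispan's programmatic cross-site backup configuration. The site name "LON" is a placeholder, and the JGroups RELAY2 transport setup that actually connects the sites is omitted.

```java
import org.infinispan.configuration.cache.BackupConfiguration;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class CrossSiteBackupSketch {
    public static void main(String[] args) {
        ConfigurationBuilder builder = new ConfigurationBuilder();

        // A distributed cache whose writes are also shipped to a backup cluster at another site.
        builder.clustering().cacheMode(CacheMode.DIST_SYNC);

        // SYNC waits for the remote site to acknowledge each write (no lost chunk);
        // ASYNC trades that guarantee for lower write latency.
        builder.sites().addBackup()
               .site("LON")
               .strategy(BackupConfiguration.BackupStrategy.SYNC);

        Configuration config = builder.build();
        System.out.println(config);
    }
}
```

With a synchronous backup strategy, the state that conventional disaster recovery would have shipped to the contingency site too late is already there when failover happens, which is what makes the near-zero RPO claim plausible.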
This means a scenario that was generally considered impossible—0% downtime with 100% data protection—has become a reality. Despite all the excitement that this concept creates, it remains to be seen how well it ages.
About the author
Iqbal is a software architecture enthusiast, serving as a senior middleware architect at Red Hat.