In my previous article, I shared the fundamental traits of excellent software architectures. As I wrote there, a great architecture must be resilient and reliable enough to handle failures. Indeed, in the tale of almost any software architecture's evolution, disaster recovery tends to be the epiphany: the moment of striking realization. Let me explain this in a bit more detail.
The concept of disaster recovery began early in computing's history. In the era of batch-processing mainframes, the general practice was to keep an offsite contingency mainframe operating in parallel to the primary one and to continuously back up the primary mainframe on magnetic tapes. If a failure made it impossible for the primary mainframe to continue its operations, the backup tapes were transported and loaded onto the contingency machine. This method of resuming business operations from a contingency machine with the help of backed-up data eventually evolved into the most common practice for ensuring business continuity. This process is usually referred to as a disaster recovery process.
However, the word "evolve" carries little value when it comes to disaster recovery. Most organizations still maintain two distant sites, either with similar processing capacity or with one superior to the other as a design choice. This is not entirely different from using two distant mainframes. When a site fails, you still have to resume operations from the other site.
One thing that has changed is the speed at which organizations can resume operations. In the past, resuming operations took days or even weeks; today it can be done within minutes or hours (although not "immediately," as you might expect). This astounding drop in expected downtime became possible only in recent decades, alongside the revolutionary progress in networking, storage, and other communication technologies.
Disaster recovery's evolution has been dichotomous: The methods have not changed much, but recovery speed is significantly faster. The question is: Why?
Delays and missing data
The revolutionary drop in expected downtime and the capability to operate multiple sites in parallel are largely due to the evolution of high-speed networks and storage, not to changes in disaster recovery itself.
We can now continuously synchronize the state of geographically separated sites. However, in a process like this, where storage is an integral component, the storage becomes a single point of failure if it lacks resilience.
Despite the revolutionary changes in computing, I/O-related complexities, combined with the distance between peering sites, still force data to be replicated asynchronously at regular intervals. This creates some delay before a contingency site can take over processing after a failure. Hence, there is no such thing as "immediate" recovery in a large-scale disaster.
In addition to the delay, there is always a chunk of data written to the primary disk after the last synchronization and therefore not yet replicated to the contingency site. This chunk of data is lost if the primary site fails. Additionally, sites are synchronized with data at rest (persistent data), while data in flight is usually treated as an orphan object. Since the disaster recovery process doesn't synchronize this ephemeral state, it adds to the potential data loss.
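To make that gap concrete, here is a deliberately simplified sketch in Java. It is a toy model, not tied to any real replication product: writes land on the primary immediately, but they reach the secondary only at each periodic sync, so anything written after the last sync is gone if the primary fails.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of interval-based asynchronous replication.
 * Illustrative only; real replication happens at the storage or database layer.
 */
public class AsyncReplicationSketch {
    static final List<String> primary = new ArrayList<>();
    static final List<String> secondary = new ArrayList<>();

    // Normal operations write to the primary site only.
    static void write(String record) { primary.add(record); }

    // Periodic replication copies the primary's state to the contingency site.
    static void sync() { secondary.clear(); secondary.addAll(primary); }

    public static void main(String[] args) {
        write("order-1");
        write("order-2");
        sync();            // last successful replication
        write("order-3");  // written after the last sync...

        // ...primary site fails here: the contingency site takes over
        // with only order-1 and order-2. order-3 is the lost chunk.
        System.out.println("Recovered state: " + secondary);
    }
}
```

The shorter the replication interval, the smaller the chunk at risk, but the gap never fully disappears.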
The problem is that the industry has accepted this shortcoming as fate. A deep dive into disaster recovery shows that the concept, as practiced today, is self-contradictory: a process meant to guarantee business continuity accepts losing data by design. This is not to dismiss some of the popular attempts to overcome these issues, even though they have fallen short of their objectives.
Data replication options
There are numerous practices available for replicating storage from one site to another so that operations can continue after a disaster. These include active-active storage clustering, which replicates data synchronously, and active-passive clustering, which does so asynchronously. While these methods work both in theory and in practice, they have inherent deficiencies.
Active-passive storage clustering increases the risk of data loss because replication isn't synchronous. The active-active topology tries to address this risk with synchronous replication but ends up adding I/O overhead, and this overhead grows as data volume increases and the distance between peering sites widens. These shortcomings are inherent to any physical storage solution; one that avoids both essentially does not exist.
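As a rough illustration of this trade-off, here is another toy sketch (not a real storage driver; the 50 ms sleep simply stands in for the inter-site round trip, which grows with distance). The synchronous, active-active path pays that round trip on every write, while the asynchronous, active-passive path returns immediately and leaves a replication lag behind.

```java
import java.util.concurrent.CompletableFuture;

/** Toy contrast of synchronous vs. asynchronous replication write paths. */
public class ReplicationTopologySketch {

    // Stand-in for shipping a write to the peer site; the delay models the round trip.
    static void remoteReplicate(String block) {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Active-active style: block until the remote copy is acknowledged.
    // No lost chunk, but every write pays the inter-site latency.
    static void activeActiveWrite(String block) {
        remoteReplicate(block);
    }

    // Active-passive style: return after the local commit, replicate in the background.
    // Lower latency, but writes not yet shipped are lost if this site fails now.
    static void activePassiveWrite(String block) {
        CompletableFuture.runAsync(() -> remoteReplicate(block));
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        activeActiveWrite("block-1");
        long t1 = System.nanoTime();
        activePassiveWrite("block-2");
        long t2 = System.nanoTime();
        System.out.printf("sync write: %d ms, async write: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        // If the JVM exits here, the background replication of block-2 may never
        // complete, which is exactly the active-passive data-loss risk.
    }
}
```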
The emergence of cloud computing has opened another option: Disaster Recovery as a Service (DRaaS). This augments the conventional disaster recovery process with vendors hosting contingent sites, storage, or both in the cloud with asynchronous replication of state.
While this comes with the added reliability inherited from the cloud, it only promises near-real-time recovery point objective (RPO) and recovery time objective (RTO). If you've read this far, you probably recognize that DRaaS inherits the same old flaw: potential data loss.
Also, I have yet to come across a DRaaS offering that works across multiple continents. Perhaps a more pertinent question is: Will it ever be possible to implement a disaster recovery process that spans continents, especially with the stringent data protection regulations now being enforced in certain regions and countries?
Given all the arguments so far, it is evident that the root cause of this deficiency is that sites are synchronized too late during operations. There is a way to bridge this gap with software.
In-memory database solutions
In-memory database solutions, essentially an enhanced form of caching, have been around for a while now. Initially, they were designed to overcome the performance woes of IT systems that process large volumes of data. However, by combining in-memory databases with features like application clustering, they have become an extremely productive vehicle for synchronizing state across a multinode install base.
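As a minimal sketch of that multinode synchronization, assuming the embedded Java API of Infinispan (one of the in-memory databases discussed below), every node running this code joins the same cluster and sees the same replicated cache. The cache name and keys are made up for illustration.

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ReplicatedCacheSketch {
    public static void main(String[] args) {
        // Clustered cache manager; nodes running this discover each other via JGroups.
        DefaultCacheManager manager =
            new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // A synchronously replicated cache: every write is propagated to all cluster members.
        manager.defineConfiguration("sessions",
            new ConfigurationBuilder().clustering().cacheMode(CacheMode.REPL_SYNC).build());

        Cache<String, String> sessions = manager.getCache("sessions");
        sessions.put("user-42", "logged-in"); // hypothetical entry, visible on every node
        System.out.println(sessions.get("user-42"));

        manager.stop();
    }
}
```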
Until recently, there remained a downside: Despite its ability to persist data through the cache and to synchronize multinode clusters, an in-memory database solution lacked scalability beyond the hosting site.
However, new cross-site data replication features offered by open source in-memory databases such as Infinispan and Hazelcast IMDG have allowed the disaster recovery process to enter a whole new era. These add resilience by leveraging underlying features that persist data through the cache and synchronize state across multinode clusters and geographically distant sites.
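For example, here is a minimal sketch of Infinispan's programmatic cross-site backup configuration. The site name "LON" is a placeholder, and the JGroups RELAY2 transport setup that actually connects the sites is omitted.

```java
import org.infinispan.configuration.cache.BackupConfiguration;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class CrossSiteBackupSketch {
    public static void main(String[] args) {
        ConfigurationBuilder builder = new ConfigurationBuilder();

        // A distributed cache whose writes are also shipped to a backup cluster at another site.
        builder.clustering().cacheMode(CacheMode.DIST_SYNC);

        // SYNC waits for the remote site to acknowledge each write (no lost chunk);
        // ASYNC trades that guarantee for lower write latency.
        builder.sites().addBackup()
               .site("LON")
               .strategy(BackupConfiguration.BackupStrategy.SYNC);

        Configuration config = builder.build();
        System.out.println(config);
    }
}
```

With a synchronous backup strategy, the state that conventional disaster recovery would have shipped to the contingency site too late is already there when failover happens, which is what makes the near-zero RPO claim plausible.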
This means a scenario that was generally considered impossible—0% downtime with 100% data protection—has become a reality. Despite all the excitement that this concept creates, it remains to be seen how well it ages.
About the author
Iqbal is a software architecture enthusiast, serving as a senior middleware architect at Red Hat.