How to build redundancy into your network (and what to avoid)

Redundancy with automated failover is good. But making the wrong decisions can make a high-availability solution worse than no redundancy at all.

Before joining Red Hat as a senior technical account manager (TAM) in 2015, I spent more than 20 years building customized, redundant firewall solutions. Nobody cares about firewalls, so I called them software-defined perimeter (SDP) appliances. I figured all good solutions need an acronym, and "software-defined" was a hot buzzword for a long time. I learned that I'm better at technology than marketing. I also learned lessons about doing redundancy right.

High-availability (HA) solutions and I go back to the very first VAXclusters in 1984. Technology has come a long way since those early days, but the concepts are still the same. Something has to pick up the load if a component dies, which means somebody has to make a failover decision. The challenge is automating the decision. Too eager, and transient conditions might start an ugly failover loop, with systems flapping up and down like a software yo-yo. Not eager enough, and redundancy is worthless because it doesn't respond to failures fast enough.

Even worse is split-brain, where both a designated active and standby system "think" they should assume the active role. Split-brain is inconvenient at best. More often, it's a disaster.

Let's say a system named Alice is active and a system named Sam makes failover decisions. Sam might be an overall manager, monitoring, say, a cluster of hypervisors with virtual machines. Or Sam might be another node in an HA cluster. Or maybe Sam is a standby firewall, network router, or other partner in an active/standby pair. In most HA solutions, Alice and Sam exchange hello messages over a special network that we'll call the Hello network. If Alice stops responding, Sam starts a failover.
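To make the Hello exchange concrete, here is a minimal Python sketch of how Sam might track Alice's hello messages. It is an illustration under assumptions, not the mechanism of any particular HA product: the class name `HelloMonitor` and the parameters `interval_s` and `missed_limit` are hypothetical. The `missed_limit` value is where the eagerness trade-off lives: set it too low and a brief delay looks like a failure, set it too high and a real failure goes unnoticed for too long.

```python
import time


class HelloMonitor:
    """Tracks when the peer last said hello (hypothetical helper)."""

    def __init__(self, interval_s=1.0, missed_limit=3, clock=time.monotonic):
        self.interval_s = interval_s      # assumed time between hello messages
        self.missed_limit = missed_limit  # assumed misses tolerated before reacting
        self.clock = clock                # injectable so the logic can be tested
        self.last_hello = clock()

    def record_hello(self):
        """Call this whenever a hello message arrives from the peer."""
        self.last_hello = self.clock()

    def missed_hellos(self):
        """Number of whole hello intervals that have passed without a hello."""
        return int((self.clock() - self.last_hello) // self.interval_s)

    def peer_silent(self):
        """True once the peer has missed `missed_limit` hellos in a row."""
        return self.missed_hellos() >= self.missed_limit
```

Note that `peer_silent()` only reports silence on the Hello network. It says nothing about why the peer went silent.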

It's never that simple

A few seconds after Alice goes silent, somebody realizes they accidentally pulled the power cord for the Hello network switch. Alice didn't die; Sam just lost touch with her because the Hello network had a problem. Someone quickly plugs the switch back in, hoping nothing bad happened. But it's too late: Sam has already acted on his failover decision, and somebody, maybe Sam, has assumed the active role, while Alice never relinquished it.

And now split-brain kicks in because both Alice and Sam "think" they're active. Maybe multiple hypervisors now run copies of the same virtual machines. Or maybe cluster members directly run applications against the same storage. Or maybe they're redundant network routers. Whatever the application, if both Alice and Sam "think" they have exclusive access to storage, and they're both writing to it, then storage will turn into an electronic pile of junk after a few milliseconds, and the whole system will die. Hope for good backups and a speedy recovery. This is not the outcome you want from redundancy.

Can you use quorum?

Clusters use a concept called quorum to protect against split-brain. Just like a parliamentary meeting, each system contributes a specified number of votes, usually one, and the cluster operates only if the active votes form a majority, that is, more than half of the total votes. Two-node clusters usually allocate a third vote to a quorum disk shared by both systems, giving the cluster three total votes, so a surviving node plus the quorum disk still holds two of three votes and can continue operating if the other node dies.
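The vote arithmetic can be sketched in a few lines of Python. This assumes the common majority rule (more than half of all configured votes); the function names `votes_needed` and `has_quorum` are made up for illustration and do not come from any cluster product.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that is more than half of the total."""
    return total_votes // 2 + 1


def has_quorum(active_votes: int, total_votes: int) -> bool:
    """True if the currently active votes form a majority."""
    return active_votes >= votes_needed(total_votes)


# Two nodes and no quorum disk: a lone survivor holds 1 of 2 votes.
print(has_quorum(active_votes=1, total_votes=2))  # False

# Two nodes plus a quorum disk: survivor + disk hold 2 of 3 votes.
print(has_quorum(active_votes=2, total_votes=3))  # True
```

The two printed cases show why the quorum disk matters: without the third vote, a two-node cluster cannot survive the loss of either node.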

But even with a quorum, if Sam takes the active application role without Alice relinquishing it, the application could still collapse into a split-brain nightmare. To mitigate that problem, Alice could relinquish her active role if she loses touch with Sam. That solves split-brain but creates an opposite problem. Maybe somebody took Sam down for maintenance, or perhaps Sam's power failed. If Alice voluntarily relinquishes her active role every time she loses touch with Sam, the whole operation might shut down. Instead of improving reliability, poorly implemented redundancy makes it worse.

What about fencing?

Some clusters use an Intelligent Platform Management Interface (IPMI) network and a technique called fencing to erect a software fence around a problem node so surviving nodes can pick up the load. IPMI devices such as HPE iLO or Dell DRAC attach to servers and offer management functions such as remote consoles and power management. Some datacenters also use smart power distribution systems with programmable power outlets. If Alice stops responding on the Hello network, then Sam or another system in the cluster asks Alice's IPMI device to power-cycle Alice. The industry uses a colorful acronym for this, STONITH (shoot the other node in the head). After a cold boot, Alice will hopefully be healthy again while surviving nodes absorb the application load.

Sometimes fencing is over-eager. If Alice gets busy and falls behind answering Hello messages, another node power-cycles Alice. And so Alice's application load moves to one or more surviving nodes. If Alice's application load overburdens a surviving node, then somebody also power-cycles that node, and the application load moves somewhere else and overburdens another node. With each application restart, thousands of people call the help desk, wondering what's going on. The cycle continues until a human system admin steps in to stop the chaos.

So, now what?

The fundamental problem is that the Hello network itself is a single point of failure. When standby Sam loses touch with active Alice over the Hello network, that information by itself doesn't tell Sam or Alice anything about why they lost contact. A dead Alice is one of many possibilities. Maybe the Hello network hiccuped, or maybe Sam is isolated from the world and Alice is fine. But too many HA solutions assume Alice is dead, and that's why their failovers are overly eager.

What if HA solutions kept a list of well-known hosts and used that list to characterize the environment before starting a failover? If Sam loses touch with Alice over the Hello network, Sam could probe the well-known host list. If at least one well-known host answers and Alice does not answer on any of her other interfaces, Sam can conclude that Alice has a problem, and Sam can start a failover. If Alice answers on any of her other interfaces, then Sam can notify the world that Alice stopped answering on the Hello network. Alice might be in trouble, but she is still alive, and somebody needs to apply human judgment and make a decision. And if nobody answers Sam's probes, then Sam knows he is isolated, and he needs to keep checking until he can interact with the world again.
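Sam's three outcomes can be written down as a small decision function. The following Python is a minimal sketch under assumptions, not an implementation from any HA product: `SamAction`, `sam_decide`, and the `probe` callable are hypothetical names, and `probe(host)` stands in for whatever reachability check you trust (for example, a ping or a TCP connect) that returns `True` when the host answers.

```python
from enum import Enum


class SamAction(Enum):
    START_FAILOVER = "start failover"
    ALERT_HUMANS = "alert humans"
    KEEP_PROBING = "keep probing"


def sam_decide(alice_other_addrs, well_known_hosts, probe):
    """Decide what Sam does after losing Alice on the Hello network.

    alice_other_addrs: Alice's addresses on her non-Hello interfaces
    well_known_hosts:  hosts Sam expects to be reachable in normal times
    probe:             hypothetical callable, probe(host) -> bool
    """
    # Alice still answers somewhere: she is alive, so a human should decide.
    if any(probe(addr) for addr in alice_other_addrs):
        return SamAction.ALERT_HUMANS
    # Alice is silent everywhere, but Sam can still see the world.
    if any(probe(host) for host in well_known_hosts):
        return SamAction.START_FAILOVER
    # Nobody answers at all: Sam is the isolated one.
    return SamAction.KEEP_PROBING
```

The order of the checks matters in this sketch: evidence that Alice is alive always wins over evidence that the network is healthy, so Sam never starts a failover while Alice is still answering.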

What about Alice? If Alice cannot find Sam on any interface, and none of the well-known hosts answer, then Alice knows she's isolated, and so she relinquishes her active role and hopes Sam takes over. Alice has no way to know whether Sam took over, but she can find out and make her own failover decision once she's back online. But if Alice can see anyone on the network, then she keeps her active role and notifies the world something unusual is going on.
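Alice's side of the logic is even shorter. As before, this Python is only a sketch under assumptions: `AliceAction`, `alice_decide`, and the `probe` callable are hypothetical, and `probe(host)` is assumed to return `True` when the host answers. The function is meant to run after Alice has already lost Sam on the Hello network.

```python
from enum import Enum


class AliceAction(Enum):
    STAY_ACTIVE_AND_ALERT = "stay active and alert"
    RELINQUISH = "relinquish active role"


def alice_decide(sam_addrs, well_known_hosts, probe):
    """Decide what Alice does after losing Sam on the Hello network.

    sam_addrs:        Sam's addresses on every interface Alice can try
    well_known_hosts: hosts Alice expects to be reachable in normal times
    probe:            hypothetical callable, probe(host) -> bool
    """
    sees_sam = any(probe(addr) for addr in sam_addrs)
    sees_world = any(probe(host) for host in well_known_hosts)
    if sees_sam or sees_world:
        # Alice can see somebody, so she keeps working and raises an alarm.
        return AliceAction.STAY_ACTIVE_AND_ALERT
    # Alice sees nobody: she is isolated and steps down.
    return AliceAction.RELINQUISH
```

With this rule, taking Sam down for maintenance does not stop the service, because Alice can still reach the well-known hosts and stays active.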

Redundancy with automated failover is good. But failover decisions are not always straightforward, and getting them wrong can make an HA solution worse than no redundancy at all. So, when you set up your HA environment, think it through.

D. Greg Scott

On weekdays, D. Greg Scott helps the world's largest open source software company support the world's largest telecom companies.