Teamwork might be the most powerful business tool on the planet. For evidence to back that up, I'll share a great story that illustrates what happens when ad-hoc teams around the world spontaneously come together to solve a critical customer problem.
I'm a Red Hat technical account manager (TAM). A few months ago, a customer contacted me about a major outage when system boot drives failed on dozens of hypervisor systems at four different sites—simultaneously. The outage killed hundreds of virtual machines and stressed backup sites across the United States.
This led to the first mystery. How do dozens of system boot drives simultaneously fail for multiple hypervisors at multiple sites? It also led to a goal: Get those sites back up and running ASAP, because downtime costs money.
After the hardware vendor replaced the failed boot drives and updated system firmware at all affected sites, it was time to recover. Our plan—reinstall a special Red Hat Enterprise Linux (RHEL) package called Red Hat Virtualization Hypervisor (RHV-H) from bare metal and connect it to the existing central storage. RHV-H is an appliance offering enough of RHEL to host virtual machines, managed by a central engine named RHV-M. We needed to set up RHV-M as a virtual machine hosted inside the environment it was to manage. This setup is called a self-hosted engine.
The first site had a running self-hosted engine on a not-yet-updated hypervisor with a bad boot disk. Maybe we could rescue this environment by building another hypervisor and migrating the self-hosted engine to it? Sam from the virtualization support team helped us work through it, but without a working boot disk on the still-running hypervisor, the rescue effort proved fruitless. So we shut it down and rebuilt the first site from bare metal.
While watching one of the affected systems go through its power-on self test (POST), I noticed the manufacturer's name for the boot drive. The POST messages said the boot "disk" was a set of solid-state drives (SSDs) and not spinning hard drives.
Later, a few members of the Red Hat Enhanced Support team and I happened to talk about the situation, and I mentioned the SSD vendor. A few seconds later, Jacob shared an article describing a firmware bug that turned batches of SSDs manufactured before March 2020 into useless bricks after 40,000 hours of operation. The SSD manufacturer and the system vendors who used those SSDs wrote advisories, issued patches, and caught most affected systems.
But not all of them.
That explained the simultaneous failures. It wasn't the first time buggy firmware took down a storage system, and it probably won't be the last.
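The 40,000-hour figure also explains the timing: drives installed together accumulate power-on hours together, so they hit the threshold together. A quick back-of-the-envelope check shows what that threshold means in calendar time:

```shell
# Convert 40,000 power-on hours into days and years of 24x7 operation
hours=40000
days=$((hours / 24))                                   # full days of uptime
years=$(awk "BEGIN { printf \"%.2f\", $hours / (24 * 365) }")
echo "$hours hours = $days days = about $years years of continuous uptime"
```

In other words, a fleet of drives deployed at the same time about four and a half years earlier would all fail within days of each other.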
Armed with know-how from the first site, the customer and I started rebuilding the next site. But four of the 10 network interface cards (NICs) did not show up in the list of active NICs for some of the hypervisors. This was maddening because the lspci command showed all the NICs.
That led to the next mystery. RHEL saw the raw cards on the PCI bus, so why didn't the network activate them? And what made them different from the first site, where all the NICs worked as expected? Sara and Theresa on the network team went through the logs and found messages like this:
ixgbe 0000:07:00.0: failed to load because an unsupported SFP+ or QSFP module type was detected.
This led to a knowledgebase article about unsupported small form-factor pluggable (SFP) transceivers on Intel 10Gb NICs. The networking industry offers countless 10Gb connector choices, and NIC vendors would face a logistical nightmare building cards to support every possible connector. So NIC vendors use SFPs to mate cards with an evolving list of fiber or copper connectors. But Intel supports only a few SFP choices, and the SFPs in these NICs were not on Intel's supported list. The first site must have used supported SFPs. That was the difference.
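On a live system, a quick way to spot this failure mode is to search the kernel messages for the driver's SFP complaint. Here is a minimal sketch; a captured sample log stands in for the real dmesg output so the pipeline is self-contained:

```shell
# On a real system you would run: dmesg | grep -c 'unsupported SFP'
# The sample below mimics what the kernel log showed in this incident.
sample_log='ixgbe 0000:07:00.0: failed to load because an unsupported SFP+ or QSFP module type was detected.
ixgbe 0000:07:00.1: Intel(R) 10 Gigabit Network Connection'

echo "$sample_log" | grep -c 'unsupported SFP'
```

A nonzero count tells you which PCI addresses (and therefore which NIC ports) the driver refused to bring up.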
The knowledgebase article offered several suggestions to work around the problem by loading the ixgbe NIC driver with a special parameter that allows unsupported SFPs. Unloading and reloading the driver by hand with that parameter worked. However, every suggestion for automatically loading it that way at boot time failed.
Time for creativity. We needed a way to automate unloading and reloading the driver with the correct parameter at startup. This called for a systemd unit file. Albert, with the Red Hat Enhanced Support team, offered a draft he had tucked away in a Git repository. I customized Albert's draft to fit this scenario, and after some testing, everything worked. Then I updated the knowledgebase article with the new tactics.
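To make the approach concrete, here is a minimal sketch of what such a unit might look like. This is not Albert's actual draft; the unit name and ordering choices are illustrative, and you should verify the allow_unsupported_sfp parameter against your ixgbe driver version before relying on it:

```ini
# /etc/systemd/system/ixgbe-sfp-reload.service (illustrative sketch)
[Unit]
Description=Reload ixgbe with allow_unsupported_sfp for non-Intel SFPs
# Run before network setup so the NICs bind with the parameter in place
Before=network-pre.target
Wants=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/modprobe -r ixgbe
ExecStart=/usr/sbin/modprobe ixgbe allow_unsupported_sfp=1

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload` and `systemctl enable ixgbe-sfp-reload.service`, the driver reload happens automatically at every boot, which is exactly what the manual workaround could not do.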
We weren't finished. Because SSDs with bad firmware crashed these sites, others might also be sitting on the same ticking time bomb. So, I documented the SSD problem in another knowledgebase article and included links to vendor bulletins with firmware updates.
But since system vendors disguise raw SSDs with their own model and version numbers, how would anyone know whether their raw SSDs were affected? Dwight came up with a method, at least for one vendor's server offering, and edited the original article with his suggestion. Other team members offered clarification in further edits, too.
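One generic starting point (not necessarily Dwight's vendor-specific method) is to read each drive's SMART power-on hours with smartctl and compare it against the 40,000-hour threshold. The sketch below parses a fabricated sample of smartctl output so it is self-contained; on a real system you would run `smartctl -i -A /dev/sdX`, and drives behind RAID controllers may need extra `-d` options:

```shell
# Sample stands in for: smartctl -i -A /dev/sda
# Model and firmware strings are placeholders, not real affected part numbers.
sample='Device Model:     EXAMPLE-SSD-MODEL
Firmware Version: X.Y.Z
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       39500'

hours=$(echo "$sample" | awk '/Power_On_Hours/ { print $NF }')
echo "Power-on hours: $hours"
if [ "$hours" -ge 39000 ]; then
    echo "WARNING: approaching the 40,000-hour failure threshold"
fi
```

Pairing the power-on hours with the reported firmware version against the vendor bulletins is what tells you whether a given drive needs the update before it bricks itself.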
Meanwhile, those four sites are back up and running.
In many organizations, people talk about finding ways to break down walls, foster teamwork, and unleash creativity. Here's the secret—you don't need any secret formula. You need a bunch of people, including managers all the way up to the CEO, who are passionate about embracing open.
Transparency is good. So is accountability, paired with inclusive leadership to build trust. Teams with mutual accountability and trust can achieve far more than any individual working alone. Plenty of books offer advice on how to make it work. I recommend starting with The Great Game of Business by Jack Stack and The Open Organization by Jim Whitehurst.
As a TAM, I depend on Red Hat's culture of teamwork all the time to deliver amazing solutions to customers. Because, truth be told, I'm not smart enough to do it on my own. Not even close.
Thank you to Jennifer, Vicki, Tyler, and others who helped polish this article. Teamwork for the win. In open organizations, teamwork isn't a big deal because it's a natural part of everyone's job. And that's why teamwork is a big deal.