The day the virtual machine manager died
Late on a Tuesday afternoon, I had somewhere to be after work that made driving all the way home and then back again a waste of time. So, I was in my office late, killing some time and getting some work done before I had to go. I went to log into my Red Hat Enterprise Virtualization (RHEV) 3.6 manager to do something or other. My login was rejected. Which was odd, but I’d seen this before.
See, my RHEV manager was a VM running on a stand-alone Kernel-based Virtual Machine (KVM) host, separate from the cluster it manages. I had been running RHEV since version 3.0, before hosted engines were a thing, and I hadn’t gone through the effort of migrating. I was already in the process of building a new set of clusters with a new manager, but this older manager was still controlling most of our production VMs. It had filled its disk again, and the underlying database had stopped itself to avoid corruption.
See, for whatever reason, we had never set up disk space monitoring on this system. It's not like it was an important box, right?
So, I logged into the KVM host that ran the VM, and started the well-known procedure of creating a new empty disk file and then attaching it via virsh. The procedure goes something like this: Become root, use dd to write a stream of zeros to a new file, of the proper size, in the proper location, then use virsh to attach the new disk to the already-running VM. Then, of course, log into the VM and do your disk expansion.
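That procedure looks roughly like this. This is a sketch, not the exact commands from that night: the VM name, image name, and vdc target are hypothetical, and the size is parameterized so you can dry-run it small.

```shell
# Hypothetical names throughout; adjust to your own host and VM.
IMG_DIR=${IMG_DIR:-.}        # normally /var/lib/libvirt/images
SIZE_MB=${SIZE_MB:-64}       # normally something like 40960 (~40 GB)

new_disk="$IMG_DIR/vmname-disk3.img"

# The of= target is the dangerous part: dd will overwrite an existing
# image without any warning, so check before writing.
if [ -e "$new_disk" ]; then
    echo "refusing to overwrite $new_disk" >&2
    exit 1
fi
dd if=/dev/zero of="$new_disk" bs=1M count="$SIZE_MB" 2>/dev/null

# Live-attach the new file to the running VM (only works on a libvirt host).
if command -v virsh >/dev/null 2>&1; then
    virsh attach-disk vmname "$new_disk" vdc --subdriver raw --persistent ||
        echo "attach failed (no such domain on this machine)"
fi

ls -l "$new_disk"
```

After the attach, you'd finish inside the guest: pvcreate the new device, vgextend, lvextend, and grow the filesystem.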
I logged in, ran sudo -i, and started my work. I ran cd /var/lib/libvirt/images, ran ls -l to find the existing disk images, and then started carefully crafting my dd command: dd bs=1k count=40000000 if=/dev/zero ...
Which was the next disk again? <Tab> of=vmname-disk2.img <Back arrow, Back arrow, Back arrow, Back arrow, Backspace> Don't want to dd over the existing disk, that'd be bad. Let's change that 2 to a 3, and <Enter>. OH CRAP, I CHANGED THE 2 TO A 2, NOT A 3!
I still get sick thinking about this. I'd done the stupidest thing I possibly could have done: I started dd, as root, over the top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! (The kind that's at work late, trying to get this one little thing done before he heads off to see his friend. The kind that thinks he knows better, and thought he was careful enough not to make such a newbie mistake. Gah.)
So, how fast does dd start writing zeros? Faster than I could move my fingers from the Enter key to Ctrl+C. I tried a number of things to recover the running disk from memory, but I think all I did was make things worse. The system was still up, but still broken like it was before I touched it, so it was useless.
Since my VMs were still running, and I’d already done enough damage for one night, I stopped touching things and went home. The next day I owned up to the boss and co-workers pretty much the moment I walked in the door. We started taking an inventory of what we had, and what was lost. I had taken the precaution of setting up backups ages ago. So, we thought we had that to fall back on.
I opened a ticket with Red Hat support and filled them in on how dumb I'd been. I can only imagine the reaction of the support person when they read my ticket. I worked a help desk for years; I know how this usually goes. They probably gathered their closest coworkers to mourn my loss, or to get some entertainment out of the guy who'd been so foolish. (I say this in jest. Red Hat's support was awesome through this whole ordeal, as I'll soon explain.)
So, I figured the next thing I would need from my broken server, which was still running, was the backups I'd diligently been collecting. They were on the VM, but on a separate virtual disk, so I figured they were safe. The disk I'd overwritten was the last disk I'd added to expand the volume the database was on, so that logical volume was toast, but I've always set up my servers so that the main mounts, like /root, were all separate logical volumes. In this case, /backup was an entirely separate virtual disk. So, I scp -r'd the entire /backup mount to my laptop. It copied, and I felt a little sigh of relief. All of my production systems were still running, and I had my backup. My hope was that these factors would mean a relatively simple recovery: Build a new VM, install RHEV-M, and restore my backup. Simple, right?
By now, my boss had involved the rest of the directors, and let them know that we were looking down the barrel of a possibly bad time. We started organizing a team meeting to discuss how we were going to get through this. I returned to my desk and looked through the backups I had copied from the broken server. All the files were there, but they were tiny. Like, a couple hundred kilobytes each, instead of the hundreds of megabytes or even gigabytes that they should have been.
Happy feeling, gone.
Turns out, my backups were running, but at some point after an RHEV upgrade, the database backup utility had changed. Remember how I said this system had existed since version 3.0? Well, 3.0 didn't have an engine-backup utility, so in my RHEV training, we'd learned how to make our own. Mine broke when the tools changed, and for who knows how long, it had been collecting an incomplete backup: just some files. No database. Ohhhh ... Fudge. (I didn't say "Fudge.")
I updated my support case with the bad news and started wondering what it would take to break through one of these 4th-floor windows right next to my desk. (Ok, not really.)
At this point, we basically had three RHEV clusters with no manager. One of those was for development work, but the other two were all production. We started using these team meetings to discuss how to recover from this mess. I don't know what the rest of my team was thinking about me, but I can say that everyone was surprisingly supportive and un-accusatory. I mean, with one typo I'd thrown off the entire department. Projects were put on hold and workflows were disrupted, but at least we had time: we couldn't reboot machines, couldn't change configurations, and couldn't get to VM consoles, but at least everything was still up and operating.
Red Hat support had escalated my SNAFU to an RHEV engineer, a guy I'd worked with in the past. I don't know if he remembered me, but I remembered him, and he came through yet again. About a week in, for some unknown reason (we never figured out why), our Windows VMs started dropping offline. They were still running as far as we could tell, but they'd dropped off the network. Just boom. Offline. In the course of a workday, we lost about a dozen Windows systems. All of our Red Hat Enterprise Linux (RHEL) machines were working fine; it was only some of the Windows machines, and not even all of them.
Well great, how could this get worse? Oh right, add a ticking time bomb. Why were the Windows servers dropping off? Would they all eventually drop off? Would the RHEL systems eventually drop off? I made a panicked call back to support, emailed my account rep, and called in every favor I’d ever collected from contacts I had within Red Hat to get help as quickly as possible.
I ended up on a conference call with two support engineers, and we got to work. After about 30 minutes on the phone, we'd worked out the most insane recovery method. We had the newer RHEV manager I mentioned earlier, which was still up and running and had two new clusters attached to it. Our recovery goal was to get all of our workloads moved from the broken clusters to these two new clusters.
Want to know how we ended up doing it? Well, as our Windows VMs were dropping like flies, the engineers and I came up with this plan. My clusters used a Fibre Channel Storage Area Network (SAN) for their storage domains. We took a machine that was not in use, but had a Fibre Channel host bus adapter (HBA) in it, and attached the logical unit numbers (LUNs) for both the old clusters' storage domains and the new clusters' storage domains to it. The plan was to make a new VM on the new clusters, attach blank disks of the proper size to the new VM, and then use dd (the irony is not lost on me) to block-for-block copy the old, broken VM's disk over to the newly created empty VM disk.
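A rough sketch of that copy step, using small scratch files in place of the real LUN-backed logical volumes (the device paths in the comments are hypothetical, not the actual storage-domain paths):

```shell
# Stand-ins for the two LUN-backed logical volumes; on the real host these
# were LVM paths like /dev/<old-domain-uuid>/<image-uuid> (hypothetical).
SRC=${SRC:-src-disk.img}
DST=${DST:-dst-disk.img}

# Build a small demo source if we're not on the real host.
[ -e "$SRC" ] || dd if=/dev/urandom of="$SRC" bs=1M count=8 2>/dev/null

# The actual move: a block-for-block copy. conv=fsync forces the data out
# to the destination before dd reports success.
dd if="$SRC" of="$DST" bs=4M conv=fsync 2>/dev/null

# Never trust a copy you haven't compared.
cmp "$SRC" "$DST" && echo "copy verified"
```

On a real multi-gigabyte LUN you'd also want status=progress on the dd so you can see it's alive, and you'd size the destination disk at least as large as the source before copying.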
I don’t know if you’ve ever delved deeply into an RHEV storage domain, but under the covers it’s all Logical Volume Manager (LVM). The problem is, the LVs aren’t human-readable. They’re just universally unique identifiers (UUIDs) that the RHEV manager’s database links from VM to disk. The VMs were running, but we didn’t have the database to reference. So how do you get this data?
Luckily, I had managed KVM and Xen clusters long before RHEV was even viable, so I was no stranger to the virsh utility. With the proper authentication, which the engineers gave to me, I was able to run virsh dumpxml on a source VM while it was running, get all the info I needed about its memory, disk, CPUs, and even MAC address, and then create an empty clone of it on the new clusters.
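For reference, what virsh dumpxml returns is the full libvirt domain XML, which carries everything needed to rebuild the VM elsewhere. Here's the extraction step run against a small made-up sample of that XML (the VM name, device path, sizes, and MAC address are all hypothetical):

```shell
# A minimal, hypothetical stand-in for `virsh dumpxml vmname > vmname.xml`.
cat > sample-domain.xml <<'EOF'
<domain type='kvm'>
  <name>vmname</name>
  <memory unit='KiB'>8388608</memory>
  <vcpu placement='static'>4</vcpu>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/dev/old_domain/1234-uuid'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <mac address='52:54:00:aa:bb:cc'/>
    </interface>
  </devices>
</domain>
EOF

# Pull out the facts needed to recreate the VM on the new cluster:
# memory, vCPU count, backing device, and MAC address.
grep -E '<memory|<vcpu|source dev|mac address' sample-domain.xml
```

Keeping the MAC address identical matters here: it's what lets the rebuilt VM come back up with the same DHCP lease and network identity as the original.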
Once I felt everything was perfect, I would shut down the VM on the broken cluster, either with virsh shutdown or by logging into the VM and shutting it down. The catch was that if I'd missed something and shut down that VM, there was no way I'd be able to power it back on. Once the data was no longer in memory, the config would be completely lost, since that information all lived in the database, and I'd hosed that. Once I had everything, I'd log into my migration host (the one connected to both storage domains) and use dd to copy, bit for bit, the source storage domain disk over to the destination storage domain disk. Talk about nerve-wracking, but it worked! We picked one of the broken Windows VMs and followed this process, and within about half an hour we'd completed all of the steps and brought it back online.
We did hit one snag, though. See, we'd used snapshots here and there, and RHEV snapshots are LVM snapshots. Consolidating them without the RHEV manager was a bit of a chore, and it took even more legwork and research before we could dd those disks. I had to mimic the snapshot tree by creating symbolic links in the right places, and then start the dd process. I worked that one out late in the evening, after the engineers were off, probably enjoying time with their families. They asked me to write the process up in detail later. I suspect it turned into some internal Red Hat documentation, never to be given to a customer because of the chance of royally hosing your storage domain.
Somehow, over the course of 3 months and probably a dozen scheduled maintenance windows, I managed to migrate every single VM (of about 100 VMs) from the old zombie clusters to the working clusters. This migration included our Zimbra collaboration system (10 VMs in itself), our file servers (another dozen VMs), our Enterprise Resource Planning (ERP) platform, and even Oracle databases.
We didn’t lose a single VM and had no more unplanned outages. The RHEL systems, and even the remaining Windows systems, never fell to the mysterious drop-off that took those dozen or so Windows servers early on. During this ordeal, though, I had trouble sleeping. I was so stressed out, and felt so guilty for creating all this work for my co-workers, that I even had trouble eating. No exaggeration: I lost 10 lbs.
So, don’t be like Nate. Monitor your important systems, check your backups, and for all that’s holy, double-check your dd output file. That way, you won’t have drama, and can truly enjoy Sysadmin Appreciation Day!
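If you want a belt-and-suspenders version of that last bit of advice, here's a tiny wrapper sketch (my own suggestion, not part of any Red Hat procedure) that refuses to zero-fill a path that already exists:

```shell
# guard_dd: write zeros to a NEW file only; bail out if the target exists.
guard_dd() {
    out=$1
    size_mb=$2
    if [ -e "$out" ]; then
        echo "ERROR: $out already exists -- not overwriting" >&2
        return 1
    fi
    dd if=/dev/zero of="$out" bs=1M count="$size_mb" 2>/dev/null
}

guard_dd new-disk.img 4 && echo "created new-disk.img"
guard_dd new-disk.img 4 || echo "blocked second attempt"
```

One typo-proofing function like this, sourced into root's shell on the KVM hosts, would have turned that whole story into a one-line error message.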