Linux troubleshooting: Navigating in a perfect storm
In DevOps, the most difficult problems are those that actively block their own investigation. Running a close second are problems that occur only intermittently. This article is an adventure story from a time when Podman's upstream Continuous Integration/Continuous Deployment (CI/CD) system encountered both simultaneously.
A perfect storm
Once upon a time, Podman's test automation started outgrowing its "big-boy" pants. This occurred years ago, when virtually all commodity CI/CD systems were container-based. Podman, being a container (and pod) management (and debugging) tool, cannot be fully exercised within a container. Worse, the majority of commodity automation services only supported Ubuntu. That combination quickly became an absolute non-starter: Podman needed to run on virtual machines (VMs), and it needed to run against multiple distributions, including Red Hat's upstream distribution, Fedora.
In my experience, CI/CD workflows grafted on top of a cloud-sdk + SSH (and/or Ansible) setup for cloud access tend to break under their own complexity. Then one day, I stumbled upon Cirrus-CI. Cirrus-CI is a Git-centric automation tool capable of orchestrating containers and VMs across a plethora of cloud providers. It let the Podman team pay for and manage the cloud service independently of its orchestration: We retained full control over the VMs and log data, with Cirrus-CI managing only the orchestration.
The overall workflow goes something like this:
- A developer submits proposed code changes upstream into Podman's Git repository.
- Cirrus-CI notices, reads a configuration file, and then spins up the necessary VMs in our cloud.
- Native cloud metadata services manage script execution and logging back to Cirrus-CI.
- Cirrus-CI takes down the VMs and provides pass/fail feedback to the developer.
- Changes are made, and the cycle repeats until the code is accepted.
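The cycle above is driven by a single YAML file in the repository. As an illustrative sketch only, it looks something like the following; the project, image, and script names here are hypothetical stand-ins, not Podman's actual configuration:

```yaml
# .cirrus.yml (illustrative sketch; all names and values are made up)
task:
  name: "Integration testing"
  gce_instance:                       # VMs provisioned in our own cloud project
    image_project: example-ci-project # hypothetical cloud project
    image_name: fedora-ci-image       # the VM image rotated in for each cycle
    zone: us-central1-a
    cpu: 2
    memory: 4Gb
  test_script: make localintegration  # hypothetical test entrypoint
```

Cirrus-CI reads this file on every push, provisions the named image, and treats each `*_script` step's exit status as the pass/fail signal reported back to the developer.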
For years, this workflow operated nearly flawlessly. Most problems centered on tests and scripts, faults the Podman team easily manages. Rarely, issues within Cirrus-CI, the Internet, and/or our cloud provider resulted in orphaned VMs (ones that failed to be removed). Otherwise, the support staff at Cirrus Labs have been fantastic: highly approachable, with top-class responsiveness and accountability.
Then one day in October 2020, after rotating in a set of freshly updated VM images (i.e., the disk image copied for every new VM instance), seemingly random jobs started failing. Initial investigation of script output provided no information. Literally, all output would suddenly stop, with no discernible pattern relative to other failures. As expected, Cirrus-CI would diligently clean up the VM and present the resulting failure back to the developer to figure out. Often upon re-executing the failed job, it would succeed without incident.
This situation went on for several weeks, with no correlation to any infrastructure outages in our cloud, GitHub, or Cirrus. The issue was fairly infrequent, perhaps a handful of failures per day out of hundreds of successful jobs. Fault-finding was difficult, and perpetually re-running jobs could not be a long-term workaround. Beyond regular incident reports from my team, I could not discern any high-level pattern to the failures. Since the new VM images provided significant benefits, the cost of rolling back would also have been high.
For nearly any systemic problem like this, finding behavior patterns is a key element of successful troubleshooting. Since reliable job success is the desired state, having at least some notion, or a hint, was a hard requirement. Unfortunately, our ability to observe patterns was severely limited by the failures' randomness and by an otherwise highly desirable feature: cleaning up disused cloud VMs, which waste real money.
Simple, repeated, manual test execution did not reproduce the problem at all. Nor did throwing more CPU and memory resources at the problem affect the behavior. I spent days pondering and brainstorming options to gather additional data associated with the failures. Finally, it struck me that I needed a way to selectively interrupt VM cleanup, but only in cases where the tests failed to finish running.
In other words, I needed to somehow associate successful test completion (not just pass/fail) with permitting cleanup to occur. That is when I remembered a little check box I once saw while poking around the WebUI of our cloud: Deletion Protection. When this flag is set, Cirrus-CI will complain loudly because it is blocked from removing a VM, but otherwise, it will leave things undisturbed.
I needed to instrument our workflow so the VMs themselves could set and unset their own deletion protection flag. Something like this:
- VM enables deletion protection on itself.
- Run tests.
- VM disables its own deletion protection.
- Cirrus-CI cleans up or fails to clean up the VM and makes a stink about it.
In this way, when the tests complete, Cirrus-CI would happily remove the VM as usual. However, if the problem occurred, the tests would not complete, and Cirrus-CI would run into the (still) enabled deletion-protection. Fortunately, this workflow was entirely possible to realize via simple command line options to the cloud-management utility program I could install on the VMs.
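A minimal sketch of that wrapper follows, assuming the cloud is Google's (as Cirrus-CI's GCE backend implies) and that `gcloud` is installed on the VM. The instance name, zone, and test entrypoint are illustrative placeholders, not our actual values:

```shell
#!/bin/bash
set -euo pipefail

# All names here are illustrative; a real VM would learn its own name
# and zone from the cloud metadata service.
GCLOUD="${GCLOUD:-gcloud}"                      # overridable for dry runs
INSTANCE="${INSTANCE:-$(hostname)}"             # assumption: name == hostname
ZONE="${ZONE:-us-central1-a}"                   # hypothetical zone
TEST_CMD="${TEST_CMD:-make localintegration}"   # hypothetical test entrypoint

protected_run() {
    # Step 1: the VM enables deletion protection on itself.
    "$GCLOUD" compute instances update "$INSTANCE" --zone="$ZONE" \
        --deletion-protection

    # Step 2: run the tests, remembering (not masking) the exit status.
    local rc=0
    $TEST_CMD || rc=$?

    # Step 3: only reached when the tests actually finish. A hung VM
    # never gets here, so its protection flag stays set and the orphan
    # survives for inspection.
    "$GCLOUD" compute instances update "$INSTANCE" --zone="$ZONE" \
        --no-deletion-protection
    return "$rc"
}

# Only act when invoked with "run", so the file is safe to source.
if [[ "${1:-}" == "run" ]]; then
    protected_run
fi
```

The key design point is that step 3 doubles as a completion marker: pass or fail, any test run that terminates unprotects the VM, while a wedged VM keeps its flag.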
With this workflow instrumentation in place, all I needed to do was repeatedly trigger the test matrix and wait for orphan VMs to show up. The fates were smiling on me that day because the problem reproduced nearly instantly. However, I ran it a few more times, so I'd have more than a few orphan VMs to inspect.
Unexpectedly, this is when the situation became even more interesting: I could not SSH into the orphan VMs. In fact, they wouldn't even respond to pings from inside or outside our cloud. I did a hard-reset on one VM and checked the system logs once it booted up. There was nothing. Zip. Nada. Not one single iota of a clue other than what we already knew: The tests simply stopped executing, dead, along with the rest of the system.
Since it was already my lucky day, I decided to push it and again went poking around my cloud's WebUI. Eventually, I found another incredibly handy little setting: Serial-console output log. This was basically a direct, low-level communication line straight into the kernel. If something horrible enough to cause a complete system hang were happening, surely the kernel would be screaming out of its virtual serial port. Bingo, Yahtzee, and huzzah!
[ 1203.905090] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1203.912275] #PF: supervisor read access in kernel mode
[ 1203.917588] #PF: error_code(0x0000) - not-present page
[ 1203.922881] PGD 8000000124e2d067 P4D 8000000124e2d067 PUD 138d5e067 PMD 0
[ 1203.929939] Oops: 0000 [#1] SMP PTI
...blah...blah...blah
[ 1204.052766] Call Trace:
[ 1204.055440] bfq_idle_extract+0x52/0xb0
[ 1204.059468] bfq_put_idle_entity+0x12/0x60
[ 1204.063722] bfq_bfqq_served+0xb0/0x190
[ 1204.067718] bfq_dispatch_request+0x2c2/0x1070
Why, hello, old friend! It was indeed a very lucky day!
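For the record, clicking through the WebUI is not required to get at this log. As a sketch (the instance name, zone, and helper name are mine, not real values), the same serial output can be pulled and filtered from the command line:

```shell
#!/bin/bash
# On Google's cloud, the serial log for a wedged VM can be saved with
# (hypothetical instance name and zone):
#   gcloud compute instances get-serial-port-output ci-orphan-vm \
#       --zone=us-central1-a > serial.log

panic_summary() {
    # Keep just the oops header and call-trace lines from a saved log,
    # discarding the surrounding boot noise.
    grep -E 'BUG:|Oops:|Call Trace:|bfq_' "$1"
}
```

Filtering for the `bfq_` symbols in the trace makes it quick to confirm whether each orphan died of the same panic.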
This was a kernel panic I had seen a year earlier and had worked tirelessly for months with upstream kernel engineering to fix. It was a bug in the kernel's storage subsystem, the part responsible for improving efficiency and guaranteeing that precious 0s and 1s actually get written to disk. In this case, rather than corrupt storage, the kernel threw up a white flag and halted everything dead in its tracks.
Having worked nearly this exact same problem before, I was already aware of the one-line workaround. There was no need to drag my team through months of debugging again. I could simply report the recurrence and confidently deploy the workaround, knowing it would 100% address the problem:
echo mq-deadline > /sys/block/sda/queue/scheduler
Now, I do not recommend running that command willy-nilly on every Linux system you own. In this specific case, I knew from past experience that it was completely safe to swap the I/O scheduler (also called an I/O elevator). No Podman developer would even notice the change in any test results.
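For the curious, the same sysfs file also shows which elevator is currently active: the bracketed entry in its space-separated list. A small sketch (the paths are the usual Linux ones; the helper name is my own):

```shell
#!/bin/bash
# Print the active I/O scheduler from a sysfs scheduler file such as
# /sys/block/sda/queue/scheduler, whose contents look like:
#   mq-deadline kyber [bfq] none
# with the active scheduler in brackets.
current_scheduler() {
    tr ' ' '\n' < "$1" | sed -n 's/^\[\(.*\)\]$/\1/p'
}

# Typical use (writing requires root):
#   current_scheduler /sys/block/sda/queue/scheduler
#   echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
#
# To make the swap survive reboots, a udev rule is the usual approach:
#   ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="mq-deadline"
```

Checking the file before and after writing to it is a cheap way to confirm the workaround actually took effect on every VM image.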
Should you or someone you love ever encounter a kernel panic, I recommend looking at this recent article on the topic. Otherwise, the main takeaway is this: When troubleshooting blind, solving the aspects that obfuscate the cause is absolutely vital. Second is making the problem more reproducible. You will have a very difficult time doing the latter without help from the former.