8 tips for reliable Linux system automation
The basis for this list is multiple years of experience supporting automation for the upstream container-runtimes team (podman, buildah, skopeo, etc.). I will not take full credit, as many of these tips are based on an amalgam of evolved experience and individual contributions from a large community of users and developers.
Most items below can be boiled down to a single principle: Eliminate or reduce complexity. This concept is based on a compound application of Murphy's Law: The more "breakable things" you have, the more likely Murphy will show up. Here are eight ways to avoid those chance encounters.
1. Reduce network dependencies
Reduce networking dependencies, especially on third-party services you have no control over. Even first- and second-party network services should be treated as "avoid if possible." There are two reasons behind this recommendation:
- From all perspectives, networking is a very complex system of related components and real-time relationships. Generally speaking, these all must operate nearly flawlessly from one end to the other, or you might have a bad day.
- Again, broadly speaking, networking failures are often transient and time-dependent (everybody wants them fixed quickly). This can make after-the-fact debugging incredibly difficult. Even with extensive logging, effects that nobody observed may be what begins your bad day.
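Where a network call cannot be removed, one common mitigation (not something this article prescribes) is a bounded retry that logs every attempt, so a transient failure at least leaves a timestamped trace. The following is a minimal sketch; the `retry` helper and its arguments are hypothetical names chosen for illustration.

```bash
# Hypothetical helper: retry a command a bounded number of times and log
# every failed attempt, so transient failures leave a trace for later debugging.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts delay attempt
    max_attempts=$1
    delay=$2
    attempt=1
    shift 2
    while true; do
        if "$@"; then
            return 0
        fi
        # Log to stderr so the record stays separate from the command's output.
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) attempt ${attempt}/${max_attempts} failed: $*" >&2
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A call might look like `retry 3 5 curl --fail --silent --show-error https://registry.example.com/v2/`, where the URL is a placeholder. Retrying hides a flaky dependency rather than removing it, so the logged attempts are worth reviewing.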
2. Reduce software dependencies
Where possible, reduce software dependencies, especially on third-party libraries. This includes both your core automation subject and any shared automation code. Unless you version-lock every single component up and down the stack, you risk breakage due to unexpected behavior or API changes somewhere. The situation is slightly better when you control the included code but still presents a risk.
Note: I will acknowledge that this tip can be fairly controversial, and it certainly doesn't make sense in many situations. Consider it a "think twice" reminder, especially when you find yourself wanting a library to bring in one simple function.
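As a rough illustration of checking for version locks, the sketch below flags any dependency that is not pinned to an exact version. It assumes a hypothetical requirements file with one `name==version` entry per line; the file format and the function names are assumptions, so adapt the pattern to whatever your packaging tool actually uses.

```bash
# Print every requirement that is NOT locked to an exact version.
# Assumed (hypothetical) format: one "name==version" entry per line,
# with blank lines and "#" comments allowed.
unpinned_requirements() {
    local file
    file=$1
    # Drop blank lines and comments, then keep only entries lacking "name==version".
    grep -Ev '^[[:space:]]*(#|$)' "$file" \
        | grep -Ev '^[A-Za-z0-9._-]+==[A-Za-z0-9._+-]+[[:space:]]*$' \
        || true
}

# Return non-zero (and explain why) when anything is unpinned.
check_pinned() {
    local unpinned
    unpinned=$(unpinned_requirements "$1")
    if [ -n "$unpinned" ]; then
        echo "Unpinned dependencies in $1:" >&2
        echo "$unpinned" >&2
        return 1
    fi
}
```

Running `check_pinned` as an early job turns "somebody forgot to lock a version" into a fast, obvious failure instead of a surprise later on.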
3. Arrange automation jobs
Arrange automation jobs in order of descending failure consequence. In other words, try to catch the items with the greatest negative impact as early as possible. The idea is to save resources (including time) by catching high-impact, low-detection-cost "Whoopsies" first. Some examples for VCS continuous integration (CI) testing:
- Are your third-party network services reachable? For example, can they be pinged and do their SSL certificates validate?
- Does your vendor or included code actually match the documented and configured list of requirements?
- Did somebody accidentally leave a "FIXME" comment in the newly-committed code?
- Are all the new commits signed?
- Do the changes match the execution context? For example, is there an undocumented change during release testing, or a code change that is missing its documentation or test updates?
Over time, the effect of this workflow is that important checks will receive the most attention and most reliable maintenance (since failures tend to hold up the entire train). In turn, developers will also be able to cycle faster. For example, they won't be waiting a long time just to find out they misspelled their own name.
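One of the cheap, early checks listed above, the leftover "FIXME" comment, could be sketched as follows. The function name is hypothetical, and the sketch assumes the change arrives as a unified diff on standard input.

```bash
# Print added lines (from a unified diff on stdin) that contain "FIXME".
# Returns non-zero when any are found, so it can gate later, costlier jobs.
fixme_in_added_lines() {
    local hits
    # Added lines start with "+"; lines starting with "+++ " are file headers.
    # (Sketch limitation: an added line whose content begins with "++ " is skipped.)
    hits=$(grep '^+' | grep -v '^+++ ' | grep -F 'FIXME' || true)
    if [ -n "$hits" ]; then
        echo "$hits"
        return 1
    fi
}
```

In a git-based CI job, this might be fed with something like `git diff origin/main...HEAD | fixme_in_added_lines`, where the branch name is an assumption about your repository.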
4. Keep jobs short
Keep jobs as short as possible and in easily repeatable "chunks." How you do this depends largely on the orchestration software, but most tools allow multiple stages of execution. To use another CI testing example: if you have unit, integration, and system tests to run (in that order), avoid running them all together, one after another, in a single script. With separate stages, if the integration step fails, users aren't forced to re-run the unit tests. This improves reliability because redundant operations aren't re-executed, which would needlessly invite Murphy back into the automation gear-train.
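How stages are declared depends on your orchestrator, but the script side of the idea can be sketched as a single entry point that runs exactly one stage per invocation. All function names here are hypothetical, and the `echo` lines are placeholders for real test commands.

```bash
# Placeholder stage implementations; replace the echo lines with real commands.
run_unit_tests()        { echo "unit tests"; }
run_integration_tests() { echo "integration tests"; }
run_system_tests()      { echo "system tests"; }

# Run exactly one stage, so a failed stage can be repeated on its own
# without re-executing the stages that already passed.
run_stage() {
    case "$1" in
        unit)        run_unit_tests ;;
        integration) run_integration_tests ;;
        system)      run_system_tests ;;
        *)
            echo "Unknown stage '$1' (expected: unit, integration, system)" >&2
            return 2
            ;;
    esac
}
```

The orchestrator would then call `run_stage unit`, `run_stage integration`, and `run_stage system` as three separate steps, so only the failed one needs to be re-run.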
5. Avoid non-essential operations at runtime
Avoid non-essential operations (like install or configuration) at runtime. Instead, prepare your execution environments with all necessary bits ahead of time. This not only makes things run more efficiently, but it also helps to adhere to other tips in this article. It also permits observation and testing of the pre-built environment at build-time. If your environments are shared across jobs with some differing requirements, consider caching those components/packages within the image. Installing at runtime from a local cache is far safer and more reliable than hitting a remote repository over the network.
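One way to hold a job to this rule is to have it verify its environment up front and fail fast, instead of quietly installing whatever is missing. The sketch below uses hypothetical function names and checks only that commands exist on `PATH`.

```bash
# Print each required command that is not available on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Fail fast with a clear message rather than installing anything at runtime.
require_tools() {
    local missing
    missing=$(missing_tools "$@")
    if [ -n "$missing" ]; then
        echo "Environment is missing required tools: $(echo "$missing" | tr '\n' ' ')" >&2
        return 1
    fi
}
```

The same check can run when the image is built, for example `require_tools podman buildah skopeo`, which catches an incomplete environment long before any job depends on it.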
6. Use the right tools
Use the most basic tools available for the task at hand. For example, if you need to verify binary flags after applying a bit-mask, don't attempt to do this in a bash script. Similarly, if your C++ program simply executes a series of commands, use bash instead. This improves reliability by not exposing operations to side-effects unrelated to the core purpose of the job.
7. Track failures
Track failures based on the frequency of their signature. Most (but not all) of the time, automation failures will result in some indication being logged somewhere. Identify and classify (e.g., by request name) these so you can keep a centralized record of occurrence. Arguably this takes quite a bit of work to pull off, possibly requiring you to learn and interface with multiple services and APIs. However, with the results sorted by signature frequency, you will quickly spot which problems are affecting the greatest number of people. Those items should receive the most attention and will have the greatest impact on automation reliability.
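Collecting the failure messages is the hard part. Once they are in one place, the frequency ranking itself can be quite small. The sketch below assumes one failure message per line on standard input and, as a deliberately crude assumption, treats every run of digits as noise so that messages differing only in a number share a signature.

```bash
# Read failure messages on stdin (one per line) and print "count signature",
# most frequent first. Collapsing digits to "N" is a crude normalization
# chosen for this sketch; real logs usually need more careful rules.
failure_signatures() {
    sed -E 's/[0-9]+/N/g' \
        | sort \
        | uniq -c \
        | sort -rn \
        | sed -E 's/^ *([0-9]+) /\1 /'
}
```

Feeding it a file of collected messages, for instance `failure_signatures < failures.log` (a hypothetical file name), puts the most widespread problems at the top of the list.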
8. Use comments effectively
Comment why, not how. Assume any reader of your code can determine the way it functions. They cannot determine what you (the author) were thinking when you wrote the code. Automation involves lots of moving parts. Some of the relationships may not be obvious to an uninitiated reader. Comments are especially useful when they inform on component relationships.
For example, consider the following comment:
# Default variable value comes from CI unless executed manually.
# Detect this (`$CI == false`) to ensure the user did not leave
# the value blank.
You can easily imagine the code this adorns: some form of variable definition or validation. Further, it hints at an additional information source, "CI" (whatever that means in the script's context).
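For illustration only, the code beneath that comment might look something like the sketch below. `TARGET_NAME` and `validate_target` are hypothetical names invented here, and treating an unset `$CI` as `false` is an assumption.

```bash
# Default variable value comes from CI unless executed manually.
# Detect this (`$CI == false`) to ensure the user did not leave
# the value blank.
validate_target() {
    # TARGET_NAME is a hypothetical variable invented for this sketch;
    # an unset $CI is assumed to mean a manual run.
    if [ "${CI:-false}" = "false" ] && [ -z "${TARGET_NAME:-}" ]; then
        echo "TARGET_NAME must be set when running outside of CI" >&2
        return 1
    fi
}
```

The code says what is checked; only the comment explains where the value normally comes from and why a blank one is suspicious.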
Helpful comments like this don't need to adorn every line of your script; target them. Focus on items impacted by external files or forces (including solar-flares). These details make automation more reliable by ensuring the "secret sauce" is continuously passed down to anyone charged with future enhancements or maintenance.
In most situations, it will be impossible to follow all of these tips. They are intended as guidelines for compromise when alternative implementations are reasonable. At other times, violating some of these principles will be necessary to best serve your stakeholders. Still others (like writing good comments) will tend to have a subtle but steady effect over time. I will be the first to admit that doing things simply is often far more difficult than slapping on duct tape. However, given time, most duct tape goes dry and crusty, requiring you to re-fix the problem. Do your future self a favor: spend the time refactoring toward simplicity from the beginning.