The basis for this list is multiple years of experience supporting automation for the upstream container-runtimes team (podman, buildah, skopeo, etc.). I will not take full credit, as many of these tips are based on an amalgam of evolved experience and individual contributions from a large community of users and developers.
Most items below can be boiled down to a single principle: Eliminate or reduce complexity. This concept is based on a compound application of Murphy's Law: The more "breakable things" you have, the more likely Murphy will show up. Here are eight ways to avoid those chance encounters.
1. Reduce network dependencies
Reduce networking dependencies, especially on third-party services you have no control over. Even first- and second-party network services should be treated as "avoid if possible." There are two reasons behind this recommendation:
- From all perspectives, networking is a very complex system of related components and real-time relationships. Generally speaking, these all must operate nearly flawlessly from one end to the other, or you might have a bad day.
- Again, broadly speaking, networking failures are often transient and time-dependent (everybody wants them fixed quickly). This can make after-the-fact debugging incredibly difficult. Even with extensive logging, the effects nobody observed may be what starts your bad day.
2. Reduce software dependencies
Where possible, reduce software dependencies, especially on third-party libraries. This includes both your core automation subject and any shared automation code. Unless you version-lock every single component up and down the stack, you risk breakage due to unexpected behavior or API changes somewhere. The situation is slightly better when you control the included code but still presents a risk.
Note: I will acknowledge that this tip can be fairly controversial, and it certainly doesn't make sense in many situations. Consider it a "think twice" reminder, especially when you find yourself wanting a library to bring in one simple function.
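If you do version-lock, it helps to verify the lock actually holds. Here is a minimal sketch, assuming two hypothetical text files with one "name version" pair per line: one listing what you pinned, the other listing what is really installed. The function name and file format are my own illustration, not part of any particular packaging tool.

```shell
# Hypothetical file format: one "name version" pair per line.
# Prints every pinned entry that is absent from the installed list
# and returns non-zero if any such drift exists.
check_pins() {
    pinned=$1
    installed=$2
    # -x: whole-line match, -F: fixed strings, -v: keep the non-matching lines.
    drift=$(grep -v -x -F -f "$installed" "$pinned" || true)
    if [ -n "$drift" ]; then
        printf 'version drift:\n%s\n' "$drift" >&2
        return 1
    fi
    return 0
}
```

Something like `check_pins pinned.txt installed.txt` could then run as an early, cheap job.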
3. Arrange automation jobs
Arrange automation jobs in order of descending failure consequence. In other words, try to catch the items with the greatest negative impact as early as possible. The idea is to save resources (including time) by catching the high-impact, low-detection-cost "Whoopsies" before anything expensive runs. Some examples for VCS continuous integration (CI) testing:
- Are your third-party network services reachable? For example, can they be pinged and do their SSL certificates validate?
- Does your vendored or included code actually match the documented and configured list of requirements?
- Did somebody accidentally leave a "FIXME" comment in the newly-committed code?
- Are all the new commits signed?
- Do changes match the execution context? For example, look for an undocumented change during release testing, or a code change that is missing documentation or test updates.
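One of the checks above, the stray "FIXME" comment, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the file list is passed as arguments, whereas a real CI job would get it from the VCS.

```shell
# Hypothetical helper: fail if any of the given files contains "FIXME".
check_no_fixme() {
    rc=0
    for file in "$@"; do
        if grep -q 'FIXME' "$file"; then
            echo "FIXME found in: $file" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Hypothetical ordering: the cheap check runs first, and anything
# slow or expensive only starts once it has passed.
run_early_checks() {
    check_no_fixme "$@" || return 1
    # ...longer-running jobs would be started from here.
    return 0
}
```

The check costs almost nothing to run, so a failure is reported within seconds instead of after a full test cycle.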
Over time, the effect of this workflow is that important checks will receive the most attention and most reliable maintenance (since failures tend to hold up the entire train). In turn, developers will also be able to cycle faster. For example, they won't be waiting a long time just to find out they misspelled their own name.
4. Keep jobs short
Keep jobs as short as possible and in easily repeatable "chunks." How you do this largely depends on the orchestration software, but most apps allow multiple stages of execution. Using another CI testing example, if you have unit, integration, and system tests to run (in that order), avoid running them all together, one after another, in a single script. With separate stages, if the integration step fails, users aren't forced to re-run the unit tests. This improves reliability, because re-executing redundant operations needlessly invites Murphy back into the automation gear-train.
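As a rough sketch of that idea, a single entry point can run exactly one stage by name, leaving the orchestrator free to retry just the stage that failed. The stage functions here are hypothetical placeholders that only print a label; real ones would call your test runners.

```shell
# Hypothetical stage functions; real ones would invoke actual test runners.
run_unit()        { echo "unit tests"; }
run_integration() { echo "integration tests"; }
run_system()      { echo "system tests"; }

# Run exactly one stage, selected by name, so a failed stage can be
# retried without repeating the stages that already passed.
run_stage() {
    case "$1" in
        unit)        run_unit ;;
        integration) run_integration ;;
        system)      run_system ;;
        *)           echo "unknown stage: $1" >&2; return 2 ;;
    esac
}
```

Each stage in the orchestrator would then call something like `run_stage integration` on its own.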
5. Avoid non-essential operations at runtime
Avoid non-essential operations (like install or configuration) at runtime. Instead, prepare your execution environments with all necessary bits ahead of time. This not only makes things run more efficiently, but it also helps to adhere to other tips in this article. It also permits observation and testing of the pre-built environment at build-time. If your environments are shared across jobs with some differing requirements, consider caching those components/packages within the image. Installing at runtime from a local cache is far safer and more reliable than hitting a remote repository over the network.
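Testing the pre-built environment at build-time can be as simple as confirming the tools are there. Below is a minimal sketch with a hypothetical function name; the list of commands to check is whatever your jobs actually need.

```shell
# Hypothetical build-time check: confirm every tool the jobs need is
# already present in the image, so nothing gets installed at runtime.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run as the last step of the image build, for example `require_commands git make`, a non-zero return stops a broken image from ever reaching a job.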
6. Use the right tools
Use the most basic tools available for the task at hand. For example, if you need to verify binary flags after applying a bit-mask, don't attempt to do this in a bash script. Similarly, if your C++ program simply executes a series of commands, use bash instead. This improves reliability by not exposing operations to side-effects unrelated to the core purpose of the job.
7. Track failures
Track failures based on the frequency of their signature. Most (but not all) of the time, automation failures will result in some indication being logged somewhere. Identify and classify (e.g., by request name) these so you can keep a centralized record of occurrence. Arguably this takes quite a bit of work to pull off, possibly requiring you to learn and interface with multiple services and APIs. However, with the results sorted by signature frequency, you will quickly spot which problems are affecting the greatest number of people. Those items should receive the most attention and will have the greatest impact on automation reliability.
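The sorting part is the easy bit. A minimal sketch, assuming you have already extracted one failure signature per line (that extraction is the real work and depends entirely on your services); the function name is my own:

```shell
# Hypothetical input: one failure signature per line on stdin.
# Output: "COUNT SIGNATURE" lines, most frequent first.
rank_signatures() {
    # uniq -c pads the count with leading spaces; sed strips them.
    sort | uniq -c | sort -rn | sed 's/^ *//'
}
```

Piping a collected list through `rank_signatures` puts the most common failure on the first line.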
8. Use comments effectively
Comment why, not how. Assume any reader of your code can determine the way it functions. What they cannot determine is what you (the author) were thinking when you wrote it. Automation involves lots of moving parts, and some of the relationships may not be obvious to an uninitiated reader. Comments are especially useful when they explain how components relate to each other.
For example, consider the following comment:
# Default variable value comes from CI unless executed manually.
# Detect this (`$CI == false`) to ensure the user did not leave
# the value blank.
You can easily imagine the code this adorns: some form of variable definition or validation. Further, it hints at an additional information source, "CI" (whatever that means in the script's context).
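To make that concrete, here is one shape the adorned code might take. It is purely illustrative: `IMAGE_TAG` and the function name are hypothetical, and I am assuming `CI` is set to "true" under automation and is unset or "false" otherwise.

```shell
# Hypothetical names: IMAGE_TAG stands in for whatever value CI would
# normally supply; CI is assumed to be "true" under automation.

# Default variable value comes from CI unless executed manually.
# Detect this (`$CI == false`) to ensure the user did not leave
# the value blank.
validate_image_tag() {
    if [ "${CI:-false}" = "false" ] && [ -z "${IMAGE_TAG:-}" ]; then
        echo "IMAGE_TAG must be set when running outside CI" >&2
        return 1
    fi
    return 0
}
```

Notice that the code alone tells you how the check works, but only the comment tells you why a blank value is acceptable in one context and an error in another.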
Helpful comments like this don't need to adorn every line of your script; target them. Focus on items impacted by external files or forces (including solar-flares). These details make automation more reliable by ensuring the "secret sauce" is continuously passed down to anyone charged with future enhancements or maintenance.
Wrap up
In most situations, it will be impossible to follow all these tips. They are intended to serve as guidelines for compromise when alternative implementations are reasonable. Otherwise, to best serve your stakeholders, violating some of these principles will sometimes be necessary. Still, others (like writing good comments) will tend to have a subtle but steady effect over time. I will be the first to admit that doing things simply is often far more difficult than slapping on duct-tape. However, given time, most duct-tape goes dry and crusty, requiring you to re-fix the problem. Do your future self a favor: spend the time refactoring toward simplicity from the beginning.
About the author
Linux geek since Windows '98, tinkering professionally since 2004 at Red Hat. Red Hat Certified Architect, battle-hardened in support. Working the past five years as senior automation guru for the OpenShift container-runtimes team, focused mainly on podman and buildah CI/CD.