Let me begin by saying that I was neither hired nor trained to be a sysadmin. But even before I started working at Red Hat, I was interested in the systems side of things, such as virtualization, cloud, and related technologies. I am a Senior Software Engineer in Test (Software Quality Engineering), but Red Hat is uniquely positioned: its products are primarily used by sysadmins (or people with similar job responsibilities), and most of them focus on backend systems rather than user applications. Our testing efforts include routine interaction with Red Hat Virtualization, OpenStack, Ansible Tower, and Hyperconverged Infrastructure.
When I was hired, I focused purely on testing Red Hat CloudForms, which is management software for the aforementioned environments. But when one of our senior software engineers departed to take on another role within Red Hat, I saw an opportunity that interested me. I had already been helping him and learning sysadmin tasks, so given my progress and interest, I was a natural successor from my team's perspective. And so I ended up becoming a sysadmin who works partly as a software engineer in testing.
With that background, I’ll explain what my day looks like.
Every day when I come in, I check our monitoring system's dashboard to see whether any of our 50 most important hosts are complaining about any of the more than 490 checks that run against them. If anything is wrong, I try to fix it, or I delegate the task to someone who can fix it for me. When nothing is complaining, I try to improve our current setup, write more comprehensive automation to monitor the infrastructure, and think about how to automate remediation when things break or complain.
What I mean by complaining is really an alert sent by our monitoring tool about the state of a given host or service. Since we live on the bleeding edge of Red Hat’s technology, we always need to have our systems up-to-date with the latest (mostly internal) builds of Red Hat Virtualization and Red Hat’s cloud software.
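The article doesn't name our monitoring tool, so purely as an illustration, here is a Nagios-style sketch of what one of those several hundred checks might look like: a small program that reports a status and a one-line message. The thresholds are made up.

```python
import shutil

# Hypothetical disk-usage check, following the common Nagios plugin
# convention for status values: 0 = OK, 1 = WARNING, 2 = CRITICAL.
WARN_PCT = 80  # made-up thresholds, not our real ones
CRIT_PCT = 90

def check_disk(path="/"):
    """Return a (status, message) pair for disk usage on `path`."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used * 100 / usage.total
    if used_pct >= CRIT_PCT:
        return 2, f"CRITICAL - {path} is {used_pct:.0f}% full"
    if used_pct >= WARN_PCT:
        return 1, f"WARNING - {path} is {used_pct:.0f}% full"
    return 0, f"OK - {path} is {used_pct:.0f}% full"

status, message = check_disk("/")
print(message)
```

A check that starts "complaining" is simply one that stops returning 0, at which point the monitoring tool turns the message into an alert.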
Some of that process has already been automated, and I try to leverage automation to redeploy things. But as I stated in a previous article on infrastructure-as-code, my code can break when running against new builds, and I need to debug and fix the automation when that happens. Despite our automation, some of my time is lost to the fact that people on my team create resources and forget to clean them up. We have various cleanup scripts to remove things, but when resources don't match the conditions for cleanup, we end up removing them manually.
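The condition matching in such a cleanup script can be sketched roughly like this; the naming convention, age limit, and "keep" tag below are hypothetical stand-ins for our actual rules:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cleanup policy: remove test VMs older than two days whose
# names follow a throwaway-resource convention, but never touch anything
# explicitly tagged "keep". Resources that miss these conditions are
# exactly the ones that end up being removed by hand.
MAX_AGE = timedelta(days=2)

def vms_to_clean(vms, now=None):
    """Return the names of VMs that match every cleanup condition."""
    now = now or datetime.now(timezone.utc)
    doomed = []
    for vm in vms:
        matches_convention = vm["name"].startswith("test-")
        too_old = now - vm["created"] > MAX_AGE
        protected = "keep" in vm.get("tags", ())
        if matches_convention and too_old and not protected:
            doomed.append(vm["name"])
    return doomed

now = datetime.now(timezone.utc)
inventory = [
    {"name": "test-cfme-5.9", "created": now - timedelta(days=5), "tags": []},
    {"name": "test-longlived", "created": now - timedelta(days=9), "tags": ["keep"]},
    {"name": "prod-gateway", "created": now - timedelta(days=30), "tags": []},
]
print(vms_to_clean(inventory))  # prints ['test-cfme-5.9']
```

Only the first VM satisfies all three conditions; the other two illustrate the resources that slip through and have to be handled manually.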
When none of the above is happening, we sometimes run into bugs or issues that were previously unknown and require a fix from the engineering team to get systems working again. In that case, I need to engage with relevant people to:
- Prove the problem is reproducible and indeed is a bug.
- Keep in touch with engineering to get a fix ready and apply the updates.
- Use any workaround, if one exists, to keep the systems from going insane.
From time to time, as you might guess, we run into hardware malfunctions or limitations that we must carefully understand and remediate. Handling these includes, but is not limited to:
- Taking stock of what hardware resources we have.
- Determining our requirements.
- Coming up with a design.
- Checking the design with the datacenter team for feasibility.
- Talking to vendors to procure new hardware.
- Talking to support when things are not looking good.
Every now and then, I also need to educate people on what to do and what not to do with our systems, as not everyone necessarily knows what each system does or why it is designed and used the way it is. If I still have time after all of that, I try to spend it automating tests for CloudForms using Python and Selenium.
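To give a flavor of that last part, here is a minimal sketch of a Selenium-driven login step. The URL and element IDs are made-up placeholders rather than CloudForms' real locators, and `driver` is expected to be a Selenium WebDriver instance (for example, `selenium.webdriver.Firefox()`):

```python
def login(driver, base_url, username, password):
    """Open the appliance login page and submit credentials.

    `driver` is any Selenium WebDriver; the element IDs below are
    hypothetical placeholders for whatever the real page uses.
    """
    driver.get(base_url)
    # Selenium's find_element takes a locator strategy and a value;
    # find_element("id", ...) is equivalent to find_element(By.ID, ...).
    driver.find_element("id", "user_name").send_keys(username)
    driver.find_element("id", "user_password").send_keys(password)
    driver.find_element("id", "login").click()
```

A real test would follow this with assertions on the landing page, such as checking that a dashboard element is visible.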
The thing I love about my job is two-fold. One reason is that our infrastructure is internal and does not require pager duty, as people can often wait for me to come online to fix things. The other reason is that my team is globally distributed, so I can rely on others (although I am the only sysadmin) to try and fix problems or create workarounds in the meantime.
In Red Hat’s CloudForms quality engineering, we maintain a lot of infrastructure assembled from different internal products. To my knowledge, no other team has so much diversity in its infrastructure. Many teams only need to have their own product deployed, or maybe one or two others. This distinction has given me a lot of opportunities to learn.
All in all, if things are not working, people will surely know your name and come find you. Any system malfunction can cause my team to miss targets, slip release dates, and lose revenue for Red Hat. And when we do release on time, I can see that I made an impact by keeping our systems available 24x7.
System administration is a position where you have great power with great responsibility. I like to think I use it wisely.