How we saved days of work with IT automation: A case study

September 21, 2021 | Kedar Vijay Kulkarni | 4-minute read

In 2020, I was working on a team automating the process of creating new virtual machine (VM) images for the latest Red Hat Satellite builds. Our goal was to automate VM deployments, snapshots, cleanup, and template creation. It sounds easy, but it was a lot of work. Automation was obviously needed to save time for our team, and we picked Red Hat Ansible Automation Platform as the automation interface. That's where this story begins.

If you have worked with Red Hat Ansible Automation Platform, you know there are many things you need to configure to make it useful. For example, you need to set up login and authentication, then projects, credentials, inventories, inventory sources, job and workflow templates, notifications, schedules, and so on. All that work is why I helped create Red Hat Ansible Automation Platform Configuration-as-Code.

This automation project expresses everything you would otherwise do by hand in the Ansible Automation Platform user interface as YAML. A single playbook run then applies those settings and takes your entire Ansible Automation Platform from a fresh install to a fully functional service.

This is a huge win. Why? Once the configuration is written, the time it takes to stand up a new instance using the Configuration-as-Code method is less than 30 minutes. Prior to using this approach, it took us a day or longer (depending on who you asked to do it and their level of expertise) to deploy, set up, and configure a new instance and make it production-ready.

Before I developed the Config-as-Code method, a manual deployment could take one to three hours, and the configuration could take the rest of the day. It was often a two-person effort, just to speed things up. For example, with one project, five credentials, two inventories, two inventory sources, 20 to 40 job templates, and five to ten workflows, it could take hours to create everything through a mouse-driven user interface. Say you did it once, painfully. What happens if you lose your instance? If the configuration isn't written down, reproducing it depends entirely on your memory or your team's documentation.

This is why we found it essential to write the configuration first. Getting the configuration written correctly became a learning opportunity for my team. Why? The configuration isn't written in a general-purpose programming language, so the team had to learn the YAML schema for each kind of object. Once we got past that learning curve, we became more efficient.

Now that we had automated the setup, we were confident that, with the proper config files, we could be back up and running in no time if a disaster happened. But what about getting to fully completed and tested YAML configurations in the first place?

To put this challenge into perspective, if you write a new playbook that runs in Ansible Automation Platform as a job template, you need to add the appropriate project to the projects YAML file, then add the proper credentials, inventories, and job templates in their respective files. That is a minimum of about 50 lines of code. Figuring out and writing this code can take anywhere from 30 minutes (if you know what you are doing) to three or four hours (if you are new).
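Because one job template pulls in objects defined in several files, a small consistency check can catch a missing or misspelled reference before anything is deployed. The following is a minimal sketch in Python, not part of the project described here. The dictionary keys, field names, and object names (such as `vm-images` and `build-satellite-template`) are hypothetical stand-ins, not the schema of any particular collection. In a real repository the data would be loaded from the separate YAML files rather than written inline.

```python
"""Sketch: check that each job template only references objects that are defined.

All keys and names below are illustrative assumptions, not a real schema.
"""

# In a real repository these would be loaded from separate YAML files.
config = {
    "projects": [{"name": "vm-images"}],
    "credentials": [{"name": "hypervisor-login"}],
    "inventories": [{"name": "lab-hosts"}],
    "job_templates": [
        {
            "name": "build-satellite-template",
            "project": "vm-images",
            "inventory": "lab-hosts",
            "credentials": ["hypervisor-login"],
            "playbook": "build_template.yml",
        }
    ],
}


def find_missing_references(cfg):
    """Return a list of (template name, kind, missing name) tuples."""
    # Collect the names defined for each kind of object.
    defined = {
        kind: {item["name"] for item in cfg.get(kind, [])}
        for kind in ("projects", "credentials", "inventories")
    }
    missing = []
    for template in cfg.get("job_templates", []):
        if template["project"] not in defined["projects"]:
            missing.append((template["name"], "project", template["project"]))
        if template["inventory"] not in defined["inventories"]:
            missing.append((template["name"], "inventory", template["inventory"]))
        for cred in template.get("credentials", []):
            if cred not in defined["credentials"]:
                missing.append((template["name"], "credential", cred))
    return missing


if __name__ == "__main__":
    for name, kind, ref in find_missing_references(config):
        print(f"{name}: undefined {kind} '{ref}'")
```

With the sample data the check reports nothing, since every reference resolves. Renaming the project in only one place would produce one line per broken reference.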

Writing the code gets quicker with practice. It is worth the time spent, though, because you get repeatability and consistency. All the pros and cons of Infrastructure-as-Code apply here too.

Next, you want to have your configuration (code) tested. This is where my team spent another couple of hours standing up a test instance that looked like the production instance and contained all the proposed changes. Then we'd figure out which jobs were needed to test the merge request fully. Finally, we would merge it. In total, this added up to a day or two of work.

To tackle testing with automation, we devised an approach using GitLab continuous integration (CI). Every time a new pull request (PR; a merge request, in GitLab's terminology) was opened, GitLab CI would create a new test instance for that PR. That saved the two to four hours someone would otherwise have spent deploying the instance by hand, depending on who was tasked with it. Because GitLab did the deployment, nobody on the team had to spend that time at all.

The next challenge was figuring out how each PR should be tested. With smaller PRs, it was easy enough to quickly figure out what to test. Complex PRs touched more than a dozen files, and it was tough to anticipate what might break if a PR was not properly tested before being merged to the master branch. Keep in mind that the production instance ran from the code on the master branch.

To get over that challenge and save hours spent analyzing and then testing PRs, we devised a new project called Ansible Genealogist (in a private repository at the time of publishing), which examines PRs in minutes and documents what needs to be tested.
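The general idea of working out what a PR touches can be illustrated with a small Python sketch. This is not Ansible Genealogist, whose code was private at the time of publishing. It is only an assumed, simplified version of the technique: keep a map from each job template to the files it depends on, then intersect that map with the files changed in the PR. The file paths, template names, and the `TEMPLATE_DEPENDENCIES` map are all hypothetical.

```python
"""Sketch: map the files changed in a PR to the job templates worth re-testing.

This is NOT Ansible Genealogist; paths and the dependency map are hypothetical.
"""

# Hypothetical map: which files each job template depends on.
TEMPLATE_DEPENDENCIES = {
    "deploy-vm": {"playbooks/deploy_vm.yml", "config/credentials.yml"},
    "snapshot-vm": {"playbooks/snapshot_vm.yml", "config/inventories.yml"},
    "cleanup-vm": {"playbooks/cleanup_vm.yml", "config/inventories.yml"},
}


def templates_to_test(changed_files, dependencies=TEMPLATE_DEPENDENCIES):
    """Return the sorted names of templates that depend on any changed file."""
    changed = set(changed_files)
    return sorted(
        name for name, files in dependencies.items() if files & changed
    )


if __name__ == "__main__":
    # For example, the output of `git diff --name-only` against the target branch.
    changed = ["config/inventories.yml", "README.md"]
    print(templates_to_test(changed))  # ['cleanup-vm', 'snapshot-vm']
```

A shared file such as an inventory definition fans out to every template that uses it. That fan-out is what makes large PRs hard to reason about by hand.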

Task	Time spent manually	Time spent using automation
Deploying a new, production-ready Ansible Automation Platform instance	~1-2 days	<30-45 mins
Deploying and configuring a test instance for testing new configuration before pushing it to production	~4-6 hours	<30-45 mins
Determining what needs to be tested for each new PR	~1-2 hours	<5-10 mins
Running tests	~2-6 hours (or longer for complex PRs)	<5 mins (just start the automated test script and come back later to check the outcome)
Redeploying a production instance after losing the running one to an outage	No estimate; it's a disaster, all hands on deck (maybe ~1-2 days if your team members know what to do and manage to get it all done)	<30-45 mins
Making changes to production, such as adding a new job template or updating a credential	A dreadful task, because mistakes hit production directly. If you decide to test changes before updating production, you are looking at ~1 day of work	<30 mins, as your changes have already been tested as part of the PR process; pushing to production is essentially the CD part of CI/CD
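As a rough worked example, the three testing-related rows of the table above can be added up for a single PR. The short Python sketch below does only that arithmetic. It assumes the three tasks happen one after another, reads "<30-45 mins" and "<5-10 mins" as upper bounds of 45 and 10 minutes, and treats the automated figures as hands-on time rather than elapsed time. Those readings are assumptions, not figures stated in the table.

```python
# Hands-on time per task, in minutes, copied from the table above.
# Manual figures are (low, high) ranges; automated figures are upper bounds.
manual = {
    "deploy test instance": (4 * 60, 6 * 60),
    "determine what to test": (1 * 60, 2 * 60),
    "run tests": (2 * 60, 6 * 60),
}
automated_upper = {
    "deploy test instance": 45,
    "determine what to test": 10,
    "run tests": 5,
}

manual_low = sum(low for low, _ in manual.values())
manual_high = sum(high for _, high in manual.values())
automated_total = sum(automated_upper.values())

print(f"manual: {manual_low / 60:.0f}-{manual_high / 60:.0f} hours")
print(f"automated: up to {automated_total} minutes of hands-on time")
```

Under those assumptions the manual path comes to 7 to 14 hours per PR, which lines up with the "day or two" mentioned earlier, against about an hour of hands-on time with automation.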

As you can see, automation took tasks from days to minutes. And no, we didn't automate ourselves out of a job, because we kept getting more tasks to automate. The goal of our group was to automate standard administrative tasks for virtual machines: deployments, templates, snapshots, and so on. Saving time was a critical part of this project. We also wanted the process to be repeatable in case of disasters. Automation and templates let us be far more efficient in disaster-recovery situations.
