From ProdOps to DevOps: Surviving and thriving

A reluctant former ProdOps pro's story of learning to love DevOps.

Posted: August 23, 2019 by Scott McBrien (Red Hat)

I recall it vividly, and my former director enjoys reminding me. Several years ago, he was talking about DevOps. My reaction? This is a direct quote:

"Oh, you mean ProdOps done poorly?"

For many of us in Production Operations (ProdOps), change is the enemy. If something changes, there is now an opportunity for things that were working just fine to experience problems. It is like a game of Jenga. When will the tower fall because a seemingly minor change unbalances the whole stack of pieces? ProdOps teams hate change so much that countless frameworks have been invented to "manage" changes; in reality, these frameworks make the procedure for effecting a change so onerous that most people give up and accept the status quo.

Actually, that statement is a bit unfair. These frameworks are an attempt to wrap planning and consensus around production changes, thus minimizing potential downtime caused by random or rogue changes (see Why the lone wolf mentality is a sysadmin mistake).

In today’s developer-driven workplaces, completely unchanging environments are no longer something business owners and developers will accept. At the time, coming from a ProdOps background, the thought of churning changes driven by our development team into our customer-facing environments was nightmare-inducing.

Over the next year or so, we performed a forklift upgrade every six months; that is, we replaced the production systems in our infrastructure with entirely new ones. The idea was that we could do all of our work and testing on the new system, and when it was ready, swap it in for the older, now obsolete systems.

However, without fail after one of these upgrades, the operations and development teams would then spend three to 10 days applying updates, tweaks, and fixes to this new system that was now live and in use by customers. Why? Because during the six-month lifespan of one of these environments, small changes were made to resolve issues, but somehow, these changes weren’t added into the build process for making the replacement systems.

The previous paragraph is a prime example of ProdOps being done poorly. Not because the application developers were wedging code fixes into our customer-facing production systems. Not because during the six-month lifespan of a system we were making configuration or service changes, either. It was because we didn’t have a way to account for these changes—this drift—over time. This problem was caused partially by the process but also by the people. Someone would get woken up or called on the weekend and they’d quickly "fix" the issue, but later when building its replacement, we’d have no organizational reference to that update.
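To make the idea of drift concrete, here is a minimal Python sketch of the kind of comparison we were missing. It is an illustration only, not the tooling we actually used: the find_drift helper, the setting names, and the values are all hypothetical. It compares what the build process would produce against what is actually running and reports every difference.

```python
from __future__ import annotations


def find_drift(baseline: dict[str, str], live: dict[str, str]) -> dict[str, tuple]:
    """Return every setting whose live value differs from the build baseline.

    Each entry maps a setting name to (value in the build, value in production).
    A value of None means the setting is missing on that side.
    """
    drift = {}
    for key in sorted(baseline.keys() | live.keys()):
        expected = baseline.get(key)
        actual = live.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical settings: what the build process produces...
baseline = {"max_connections": "100", "tls": "on"}
# ...versus production after a few late-night "fixes".
live = {"max_connections": "250", "tls": "on", "swap": "off"}

report = find_drift(baseline, live)
for key, (expected, actual) in report.items():
    print(f"{key}: build has {expected!r}, production has {actual!r}")
```

Every line in a report like this is a change that would be lost at the next forklift upgrade unless someone carries it back into the build.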

After a year or 18 months of this, I went to the director and said: "Hey, I want to move us to a more iterative, less invasive release process. I want to do DevOps."

As you can imagine, after some good-natured "I told you so-ing," my team set out to improve the process. When you think about DevOps, you might also think about Continuous Integration and Continuous Delivery (CI/CD). That was our first major improvement: change the way the environment was built. In addition to tooling changes, we also committed to behavior changes. When we wedged a fix into production, it was also committed to the codebase, and consequently, it went into the build testing for the next deployment so we could verify that the issue had been permanently resolved.

As systems administrators, we adopted a new team motto:

"Trust, but verify."

We trusted what the development team told us about the release. We trusted what the development team told us about backported updates being reconciled. However, as system administrators, it was our responsibility to develop the test scenarios, post-update, to verify that those assertions were accurate.

Over the next year, we evolved our development and deployment processes so that we were making updates every two weeks (to coincide with the end of our Agile development sprints). Because these updates were much smaller and more iterative, rather than wholesale replacements, those several days where we pulled our hair out troubleshooting random problems disappeared.

Since our build process was heavily automated, these updates were also fast, typically leaving only 20 to 30 seconds of "weird" system state after the update was applied. This fact meant that instead of spending a weekend doing the upgrade, we could get one deployed and tested prior to leaving for the day. We also moved deployments to Wednesdays so that we would have a couple of days of normal on-call time to handle issues if any arose prior to being away for the weekend. (Customers used our service over the weekend as well.) Lastly, in addition to improving our deployment and testing ability, we also improved the deployment's rollback ability. If a deployment did not function properly, we could revert the environment in less than five minutes.
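The deploy, verify, and revert flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual pipeline: the deploy_with_rollback function, the in-memory env dictionary standing in for a real environment, and the two checks are all hypothetical.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def deploy_with_rollback(
    deploy: Callable[[], None],
    checks: List[Check],
    rollback: Callable[[], None],
) -> List[str]:
    """Apply a release, run every post-update check, and revert if any fail.

    Returns the names of the failed checks (an empty list means success).
    """
    deploy()
    failed = [name for name, check in checks if not check()]
    if failed:
        rollback()
    return failed


# Hypothetical stand-in for a real environment.
env = {"version": "1.0", "healthy": True}


def deploy_bad_release() -> None:
    env["version"] = "1.1"
    env["healthy"] = False  # simulate a release that breaks the service


def rollback() -> None:
    env["version"] = "1.0"
    env["healthy"] = True


# "Trust, but verify": operations owns these post-update checks.
checks = [
    ("version updated", lambda: env["version"] == "1.1"),
    ("service healthy", lambda: env["healthy"]),
]

failed = deploy_with_rollback(deploy_bad_release, checks, rollback)
print("failed checks:", failed)
print("running version:", env["version"])
```

The point of the design is that rollback is an automated, rehearsed step of every deployment rather than an emergency procedure someone improvises under pressure.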

In the end, moving to a DevOps CI/CD model made the development and system administration teams happier. From a developer’s standpoint, they got to see their code and features in place quickly after they delivered them. If there were issues or bugs in the features, not only could the changes be reverted quickly, but the developers typically had a faster turnaround time on fixing the issues because they had recently been working on the code. They also didn’t have to participate in change approval calls and a variety of other opaque processes.

From the system administration side, our environment was much more robust and less prone to unplanned outages. As an added bonus, we weren’t losing weekends for maintenance operations and received fewer calls during non-work hours. If a deployment didn’t function properly, we’d work with the developers, who were also still at work because deployments now happened during the workday, to resolve the issue. If we were unable to resolve it during the 30-minute maintenance window, we reverted it and tried again at the next update two weeks later.

Moving to DevOps was a drastic shift away from my ProdOps background, but ultimately, it improved my job and my relationships with the other teams, and it reduced unplanned interruptions to my personal time. The process was scary and at times frustrating, but ultimately worthwhile.

Why the lone wolf mentality is a sysadmin mistake

Lone wolf sysadmins cause short- and long-term problems in team environments. Here's an example of where things went wrong, and also when things are done right.