A cloud architect's guide to operations

September 9, 2021 | Johan Swanepoel, Adam Goossens, Mohammad Ahmad, John Apple II, Maurice Burrows | 5-minute read

Because a private cloud is considered critical infrastructure, you not only need a robust cloud architecture but also an effective operations team supporting it. This team needs to be empowered by policies and practices that allow proactive measures to be constantly explored and integrated.

In the first article in this series on building cloud architectures, we highlighted various aspects of cloud architecture, and in the second article, we described self-services delivery. In this third article, we'll talk about another critical component of private cloud management practices: private cloud operations. We will describe cloud operations related to running and delivering a cloud platform and provide some practical tips on how to increase your team's operational efficiency.

If your organization is considering operating a private cloud, this article offers some insights that you may wish to integrate into your operations policies, practices, and culture.

Cloud operations

Cloud operations is the process of managing and delivering a cloud platform. It establishes and consistently applies procedures to ensure that the platform meets service-level objectives (SLOs) over time. It also includes strategies to ensure that the cloud is scalable, resilient, repairable, secure, and satisfies compliance requirements.

The first article in this series outlined the Capability Maturity Model Integration (CMMI) framework. CMMI includes more than a dozen process areas; the ones most relevant to cloud operations are:

Configuration Management (CM)
Causal Analysis and Resolution (CAR)
Measurement and Analysis (MA)
Decision Analysis and Resolution (DAR)
Supplier Agreement Management (SAM)

Overall, the impact and success of your private cloud project will be measured by its effect on real business problems.

Consistently apply operations practices

Successful teams try to avoid a "hero culture," because it creates burnout, slows problem resolution, and can keep subject matter experts from delivering high-value work.

These teams establish standard operating procedures (SOPs) that they use for causal analysis and resolution. As a first phase, the teams document and test the SOPs manually and later automate them.

The teams also perform root-cause analysis to find ways to rectify issues, identify opportunities for proactive monitoring, or schedule work to address issues later.

"There is a difference between documenting and automating something for yourself versus for other people to run. Test the quality of the documentation and automation by letting others try to resolve issues with your documentation and automation."

—Australia-based OpenStack customer

Tips for operations expertise

We asked some experts to share their knowledge about cloud operations. Here are their key practices:

1. Define and practice recovery scenarios

Capture a set of failure scenarios to practice and test operational procedures against. First, a team member goes in and disrupts components in a test lab environment. Then, other team members test the troubleshooting and restoration procedures.

Some examples to start with might include:

Stopping the OpenStack messaging backbone or flooding a queue by stopping a consuming service
Terminating an etcd instance in a Kubernetes cluster and performing recovery
Dropping a database table from the OpenStack Nova database
Starting duplicate admin services on worker nodes causing intermittent connectivity issues (for example, neutron agents in OpenStack)
Causing SSL certificate issues by either expiring them on purpose or blacklisting SSLv3

This exercise serves two purposes: it skills up the whole team and identifies improvements in operational practices.
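One way to keep track of such drills is a small scenario catalog that records each practice run and suggests what to rehearse next. The following is a minimal sketch in Python, not a prescribed tool: the `Scenario` class, the `pick_next` helper, the `SOP-...` identifiers, and the 30-minute restore target are hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A failure scenario and the history of drills run against it."""
    name: str
    procedure: str  # hypothetical identifier of the SOP under test
    results: list = field(default_factory=list)  # (restored, minutes) tuples

    def record(self, restored: bool, minutes: float) -> None:
        """Store the outcome of one drill."""
        self.results.append((restored, minutes))

    def needs_attention(self, max_minutes: float) -> bool:
        """True if never practiced, or if the latest drill failed or ran long."""
        if not self.results:
            return True
        restored, minutes = self.results[-1]
        return (not restored) or minutes > max_minutes


def pick_next(scenarios, rng=random):
    """Prefer the scenarios practiced least often; break ties at random."""
    fewest = min(len(s.results) for s in scenarios)
    return rng.choice([s for s in scenarios if len(s.results) == fewest])


# Illustrative catalog based on two of the examples above.
catalog = [
    Scenario("Stop the OpenStack messaging backbone", "SOP-001"),
    Scenario("Terminate an etcd instance", "SOP-002"),
]
catalog[0].record(restored=True, minutes=25)

next_drill = pick_next(catalog)
print(next_drill.name)
print(catalog[0].needs_attention(max_minutes=30))
```

Recording the restore time alongside the outcome lets the team see whether a procedure is getting faster as it is practiced, which supports both purposes of the exercise.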

2. Measure and monitor

Most private cloud teams can report on the core system's health. They know that services are up and the communication backbone is healthy, and they have impressive dashboards showing the current load on compute, networking, and storage. Mature teams extend their monitoring to measure service-level indicators (SLIs) that align with SLOs. They implement end-to-end testing of key parts of the stack for trending and health-check purposes. They also set internal alert thresholds and targets that are stricter than the SLOs agreed with consumers, so that problems surface before a consumer SLO is breached.
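The idea of alerting on a threshold tighter than the consumer SLO can be sketched as follows. This is an illustrative Python example under stated assumptions: the availability indicator is computed from end-to-end probe results, and the function names, the 99.5% consumer SLO, and the 99.9% internal target are hypothetical values, not figures from any particular platform.

```python
from dataclasses import dataclass


@dataclass
class SloCheck:
    """Result of comparing a measured SLI against two thresholds."""
    sli: float
    status: str  # "ok", "warn" (internal target missed), or "breach"


def availability_sli(successful: int, total: int) -> float:
    """Fraction of successful end-to-end probes; 1.0 when nothing was measured."""
    if total == 0:
        return 1.0
    return successful / total


def evaluate(sli: float, consumer_slo: float, internal_target: float) -> SloCheck:
    """Classify an SLI. The internal target must not be looser than the SLO."""
    if internal_target < consumer_slo:
        raise ValueError("internal target must be at least as strict as the consumer SLO")
    if sli < consumer_slo:
        return SloCheck(sli, "breach")
    if sli < internal_target:
        return SloCheck(sli, "warn")
    return SloCheck(sli, "ok")


# Hypothetical numbers: 9,985 of 10,000 probes succeeded (99.85%).
result = evaluate(
    availability_sli(9985, 10000),
    consumer_slo=0.995,
    internal_target=0.999,
)
print(result.status)
```

With these numbers the check reports a warning rather than a breach: the platform still meets the consumer SLO, but the operations team is alerted while there is room to act.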

3. Plan on certification and compliance

Certification and compliance require early and open engagement with the internal security team, a secure supply chain for software packages, automation, and the ability to secure the provided services by default.

Engage early, openly, and frequently with security teams in your organization. Involve them in discussions during design and deployment, as this helps identify risks and address issues earlier in the delivery process. It is not uncommon for security teams to avoid providing prescriptive measures upfront; rather, they can use industry-standard content and tools, such as Security Content Automation Protocol (SCAP) profiles, as a basis for these conversations.

Document all security-related aspects of your private cloud, including:

IP addresses, ports, and protocols in use
Firewall rules required elsewhere within your organization
API endpoint security (for example, Transport Layer Security)
User authentication and authorization
Service accounts necessary for integration with third-party services
People with access to your environment at the infrastructure layer (that is, your cloud administrators with root privileges)
Patching plans, including frequency
Data management procedures such as encryption at rest or encryption in transit

Ensure you have a secure, trusted software supply chain. Be sure you understand the software provider's secure pipeline practices. Do they backport security fixes to older versions? If so, third-party scanning software that relies on version numbers may report false positives. Do they apply industry standards such as FedRAMP or FISMA in their engineering practices? Do they sign packages and provide a way to ingest them in a trusted way? Can you track the provenance of packages and code changes?
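As one small piece of trusted ingestion, a downloaded artifact can be compared against a digest that the provider publishes. The sketch below uses only the Python standard library; the file name and payload are hypothetical, and a checksum comparison is an assumption made here for illustration. It complements, and does not replace, verifying the provider's package signatures with their own tooling.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, published_hex: str) -> bool:
    """Compare the file's digest with the published value in constant time."""
    return hmac.compare_digest(sha256_of(path), published_hex.strip().lower())


# Demonstration with a throwaway file standing in for a downloaded package.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "example-package.rpm"  # hypothetical file name
    package.write_bytes(b"example payload")
    published = hashlib.sha256(b"example payload").hexdigest()

    verified = matches_published_digest(package, published)
    tampered = matches_published_digest(package, "0" * 64)

print(verified, tampered)
```

A check like this only helps if the published digest itself arrives over a trusted channel, which is why the questions about signing and provenance matter.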

Where possible, work towards automated compliance. This includes using capabilities such as OpenSCAP, automation tools, and continuous compliance operators for Kubernetes-based clouds.

The workload is more malleable than the cloud infrastructure, and capabilities inherit security compliance from the components they run on. This means it's essential to provide certification guidance and assistance for the teams using the platform.

4. Engage the software supplier

High-performing teams understand the benefits of working within an ecosystem, whether they use vendor- or community-provided software. The combination of hardware and software providers means the team might want to take on integration work themselves. Most of the time, this integration work adds little value to the business. Hence, the team is advised to look to its partners to perform the integration work and perform only essential tasks and value-adding custom integrations.

These teams also attempt to engage proactively with the vendor and community to help shape solutions and include them in their planning and architecture discussions where possible.

Vendors have a range of people and programs that can provide value in different ways, such as:

Dedicated account architects that understand their environments: They can help advocate solutions and capabilities within the customer and vendor environments.
Cloud success architects to help drive validated architectures that are supportable: They are typically engaged by an account architect or project team to help ensure they have a supported architecture and design.
Customer success managers that understand the customer's objectives and success criteria: They make recommendations on activities, training, updates, and briefings that will assist the team.
High-touch and developer programs that provide closer access to engineering and product-management resources: These are specifically useful for helping achieve longer-term objectives.
Support teams that resolve issues in your environment: Engage with them actively and give them as much visibility and access as possible. Special agreements could be required for remote access or access to sensitive logs.

Teams that engage early and jointly drive design and architecture with their vendors' teams typically have a smoother transition from build to production.

Conclusion

Operations teams that progressively build the habits and processes described in this article will be on their way to building an effective practice for private cloud architecture or any other scalable system. Start with a clear goal of where you would like to end up and keep momentum by measuring your progress over time.

If you already have these things in place, consider applying the principles of chaos engineering. But before you do, ensure that you have resiliency built into every layer of your organization, including network, data, infrastructure, people, and culture.

About the authors

Johan Swanepoel

Johan has 19 years of experience in Information Technology in various sectors including Banking and Finance, and Government. For Red Hat, he worked as a Federal Government Technology Specialist. He successfully used CMMI models to establish a team that operated on DevOps principles for one of Australia's largest retail organizations. The practices used by his team became a catalyst for change in the broader IT organization.

Adam Goossens

Mohammad Ahmad

Senior Consultant

Mohammad has 20+ years of experience in multi-tiered system development and automated solutions. He has extensive experience in online services that use open source, Linux-based software. He is primarily focused on IT infrastructure and has a background in open source web development. Mohammad has dedicated the last 5+ years to emerging technologies, primarily Kubernetes using OpenShift and automation using Ansible.

John Apple II

John, a Senior Technical Support Engineer, has 16 years of systems administration, operations, and IT management experience around UNIX, Linux, Performance/Capacity Management, Automation, Configuration Management, and OpenStack private clouds. Between 2010 and 2015, he spent 4.5 years as a Service Delivery Manager operating IT for a major financial institution in Australia using ITILv3 methodologies with team members stretching across 5 countries. In 2017, John joined Red Hat as a consultant.

Maurice Burrows

Maurice is a Senior Consultant at Red Hat with 30+ years of experience in Information Technology. He has worked for vendors, system integrators, and end-user organizations and has experienced the challenges of each.

He's well versed in server hardware and operating systems, particularly Linux, which has been his dominant focus for the last 15 years. Maurice has dedicated the last 5 years of his career to private cloud deployment and operations.
