Because a private cloud is considered critical infrastructure, you not only need a robust cloud architecture but also an effective operations team supporting it. This team needs to be empowered by policies and practices that allow proactive measures to be constantly explored and integrated.
In the first article in this series on building cloud architectures, we highlighted various aspects of cloud architecture, and in the second article, we described self-service delivery. In this third article, we'll talk about another critical component of private cloud management practices: private cloud operations. We will describe cloud operations related to running and delivering a cloud platform and provide some practical tips on how to increase your team's operational efficiency.
If your organization is considering operating a private cloud, this article offers some insights that you may wish to integrate into your operations policies, practices, and culture.
Cloud operations is the process of managing and delivering a cloud platform. It establishes and consistently applies procedures to ensure that the platform meets service-level objectives (SLOs) over time. It also includes strategies to ensure that the cloud is scalable, resilient, repairable, secure, and satisfies compliance requirements.
- Configuration Management (CM)
- Causal Analysis and Resolution (CAR)
- Measurement and Analysis (MA)
- Decision Analysis and Resolution (DAR)
- Supplier Agreement Management (SAM)
Overall, the impact and success of your private cloud project will be measured by its effect on real business problems.
Consistently apply operations practices
Successful teams try to avoid a "hero culture," as it creates burnout, slows problem resolution, and can keep subject matter experts from delivering high-value work.
These teams establish standard operating procedures (SOPs) that they use for causal analysis and resolution. As a first phase, the teams document and test the SOPs manually and later automate them.
The teams also perform root-cause analysis to seek ways to rectify issues, identify proactive monitoring, or schedule work to address issues later.
"There is a difference between documenting and automating something for yourself versus for other people to run. Test the quality of the documentation and automation by letting others try to resolve issues with your documentation and automation."
—Australia-based OpenStack customer
Tips for operations expertise
We asked some experts to share their knowledge about cloud operations. Here are their key practices:
1. Define and practice recovery scenarios
Capture a set of failure scenarios to practice and test operational procedures against. First, a team member goes in and disrupts components in a test lab environment. Then, other team members test the troubleshooting and restoration procedures.
Some examples to start with might include:
- Stopping the OpenStack messaging backbone or flooding a queue by stopping a consuming service
- Terminating an etcd instance in a Kubernetes cluster and performing recovery
- Dropping a database table from the OpenStack Nova database
- Starting duplicate admin services on worker nodes causing intermittent connectivity issues (for example, neutron agents in OpenStack)
- Causing SSL certificate issues by either expiring them on purpose or blacklisting SSLv3
This exercise serves two purposes: it skills up the whole team and identifies improvements in operational practices.
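One way to keep such drills organized is a small scenario catalog that records each outcome. The following is a minimal sketch, not a prescribed tool: the `Scenario` class, the SOP identifiers, and the 60-minute recovery budget are all hypothetical and exist only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One failure to inject in the test lab, plus the SOP that should fix it."""
    name: str
    sop: str  # hypothetical identifier of the procedure under test
    results: list = field(default_factory=list)  # (recovered, minutes) per drill

    def record(self, recovered: bool, minutes: float) -> None:
        self.results.append((recovered, minutes))


def pick_scenario(catalog, rng):
    """Prefer the scenarios drilled least often so coverage stays even."""
    fewest = min(len(s.results) for s in catalog)
    candidates = [s for s in catalog if len(s.results) == fewest]
    return rng.choice(candidates)


def needs_improvement(catalog, max_minutes):
    """Names of scenarios whose latest drill failed or ran over budget."""
    flagged = []
    for scenario in catalog:
        if scenario.results:
            recovered, minutes = scenario.results[-1]
            if not recovered or minutes > max_minutes:
                flagged.append(scenario.name)
    return flagged


# Hypothetical catalog based on the example scenarios listed above.
catalog = [
    Scenario("message queue flooded", "SOP-MQ-01"),
    Scenario("etcd member terminated", "SOP-ETCD-01"),
    Scenario("Nova database table dropped", "SOP-DB-01"),
]

rng = random.Random(7)  # fixed seed so the example is repeatable
drill = pick_scenario(catalog, rng)
drill.record(recovered=True, minutes=95.0)  # recovered, but slowly
print(needs_improvement(catalog, max_minutes=60.0))
```

A drill that recovers the service but exceeds the time budget is still flagged, which reflects the second purpose of the exercise: finding procedures that need improvement.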
2. Measure and monitor
Most private cloud teams can report on the core system's health. They know that services are up and the communication backbone is healthy, and they have impressive dashboards showing the current load on compute, networking, and storage. Mature teams extend their monitoring to measure service-level indicators (SLIs) that align with SLOs. They implement end-to-end testing of key parts of the stack for trending and health-check purposes. They also set internal alert thresholds that are stricter than the SLOs agreed with consumers, so that problems surface before a consumer SLO is breached.
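The idea of alerting on a threshold stricter than the consumer SLO can be sketched as follows. This is an illustration under assumptions: the `Slo` class, the "instance-create" probe name, and the 99.5% and 99.9% targets are invented for the example and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    consumer_target: float  # target agreed with consumers (assumed value)
    internal_target: float  # stricter early-warning threshold (assumed value)


def availability(successes: int, total: int) -> float:
    """SLI: fraction of end-to-end probe runs that succeeded."""
    if total == 0:
        # No data should not be read as "healthy".
        raise ValueError("no probe results to evaluate")
    return successes / total


def evaluate(slo: Slo, sli: float) -> str:
    """Classify an SLI against the consumer SLO and the internal threshold."""
    if sli < slo.consumer_target:
        return "breach"  # the consumer SLO itself is violated
    if sli < slo.internal_target:
        return "warn"  # still within the SLO, but time to act
    return "ok"


slo = Slo("instance-create", consumer_target=0.995, internal_target=0.999)
sli = availability(successes=1994, total=2000)  # 0.997
print(evaluate(slo, sli))  # prints "warn"
```

The gap between the two targets is the team's reaction time: the wider it is, the earlier the warning, at the cost of more alerts to triage.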
3. Plan on certification and compliance
Certification and compliance require early and open engagement with the internal security team, a secure supply chain for software packages, automation, and the ability to secure the provided services by default.
Engage early, openly, and frequently with security teams in your organization. Involve them in discussions during design and deployment, as this helps identify risks and address issues earlier in the delivery process. Security teams often avoid giving prescriptive measures upfront; in that case, industry-standard content and tools such as Security Content Automation Protocol (SCAP) profiles can serve as a basis for these conversations.
Document all security-related aspects of your private cloud, including:
- IP addresses, ports, and protocols in use
- Firewall rules required elsewhere within your organization
- API endpoint security (for example, Transport Layer Security)
- User authentication and authorization
- Service accounts necessary for integration with third-party services
- People with access to your environment at the infrastructure layer (that is, your cloud administrators with root privileges)
- Patching plans, including frequency
- Data management procedures such as encryption at rest or encryption in transit
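Keeping this documentation in a machine-readable form makes gaps easy to spot. The sketch below assumes the documentation is kept as a simple dictionary; the section names are hypothetical labels for the items in the list above, and the sample values are placeholders.

```python
# Hypothetical section names, one per item in the list above.
REQUIRED_SECTIONS = [
    "network_endpoints",      # IP addresses, ports, and protocols
    "firewall_rules",         # rules required elsewhere in the organization
    "api_endpoint_security",  # for example, TLS settings
    "authn_authz",            # user authentication and authorization
    "service_accounts",       # third-party integration accounts
    "privileged_access",      # people with infrastructure-level access
    "patching_plan",          # including frequency
    "data_management",        # encryption at rest and in transit
]


def missing_sections(doc: dict) -> list:
    """Return required sections that are absent or left empty."""
    return [key for key in REQUIRED_SECTIONS if not doc.get(key)]


# Placeholder inventory with several sections not yet documented.
inventory = {
    "network_endpoints": ["203.0.113.10:443/tcp"],  # documentation-range address
    "api_endpoint_security": "TLS on all public endpoints",
    "patching_plan": {"frequency_days": 30},
}
print(missing_sections(inventory))
```

A check like this could run whenever the documentation changes, so that an incomplete inventory is noticed before a security review rather than during it.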
Ensure you have a secure, trusted software supply chain. Be sure you understand the software provider's secure pipeline practices. Do they backport security fixes to older versions? If so, third-party scanning software that relies on version numbers alone may report false positives. Do they apply industry standards such as FedRAMP or FISMA in their engineering practices? Do they sign packages and provide a way to ingest them in a trusted way? Can you track the provenance of packages and code changes?
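As a small illustration of trusted ingestion, the sketch below checks a downloaded file against a published SHA-256 digest before accepting it. This is a simpler stand-in for real signature verification, which is what a vendor's signing tooling provides, and it only helps if the digest itself arrives over a trusted channel. The file name and payload are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large packages do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted(path, published_digest):
    """Compare the file's digest with the one published by the supplier."""
    expected = published_digest.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)


# Demonstration with a throwaway file standing in for a downloaded package.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "example-package.rpm"  # hypothetical file name
    package.write_bytes(b"example payload")
    published = hashlib.sha256(b"example payload").hexdigest()
    print(is_trusted(package, published))  # True: digests match
    print(is_trusted(package, "0" * 64))   # False: digest mismatch
```

A mismatch should stop the ingestion pipeline rather than log a warning, since a package that fails verification cannot be traced back to its supplier.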
Where possible, work towards automated compliance. This includes using capabilities such as OpenSCAP, automation tools, and continuous compliance operators for Kubernetes-based clouds.
The workload is more malleable than the cloud infrastructure, and capabilities inherit security compliance from the components they run on. This means it's essential to provide certification guidance and assistance for the teams using the platform.
4. Engage the software supplier
High-performing teams understand the benefits of working within an ecosystem, whether they use vendor- or community-provided software. With multiple hardware and software providers involved, the team might be tempted to take on the integration work themselves. Most of the time, this integration work adds little value to the business. Hence, teams should look to their partners to do the integration work and take on only essential tasks and value-adding custom integrations themselves.
These teams also attempt to engage proactively with the vendor and community to help shape solutions and include them in their planning and architecture discussions where possible.
Vendors have a range of people and programs that can provide value in different ways, such as:
- Dedicated account architects that understand their environments: They can help advocate solutions and capabilities within the customer and vendor environments.
- Cloud success architects to help drive validated architectures that are supportable: They are typically engaged by an account architect or project team to help ensure they have a supported architecture and design.
- Customer success managers that understand the customer's objectives and success criteria: They make recommendations on activities, training, updates, and briefings that will assist the team.
- High-touch and developer programs that provide closer access to engineering and product-management resources: These are specifically useful for helping achieve longer-term objectives.
- Support teams that engage actively: Work to give the support team as much visibility and access as possible. Special agreements could be required for remote access or access to sensitive logs.
Teams that engage early and jointly drive design and architecture with their vendors' teams typically have a smoother transition from build to production.
Operations teams that progressively build the habits and processes described in this article will be on their way to building an effective practice for private cloud architecture or any other scalable system. Start with a clear goal of where you would like to end up and keep momentum by measuring your progress over time.
If you already have these things in place, consider applying the principles of chaos engineering. But before you do, ensure that you have resiliency built into every layer of your organization, including network, data, infrastructure, people, and culture.