Managing a private cloud infrastructure can bring immense value to an organization. That value spans cost, data privacy, flexibility, and an enriching experience for both the people and the organization.
However, such a complex solution demands multiple skillsets, technical expertise, human resources, adequate training, a supportive culture, appropriate risk management, and a continuous improvement process.
In the first article in this series, we highlighted various aspects of cloud architecture; in the second article, we described self-services delivery; in the third article, we explored operations; in the fourth article, we tackled resource and capacity management; and in the fifth article, we discussed the elements of successful lifecycle management for a private cloud platform.
This article wraps up the series by exploring each of these aspects to arrive at a holistic approach to infrastructure governance.
What is infrastructure governance?
Governance describes the system under which the cloud team operates and is held to account. Ethics, risk management, compliance, and administration are all elements of governance. Ultimately, governance ensures that the organization's investments support its business objectives.
As we wrote in the first article, Capability Maturity Model Integration (CMMI) provides a framework for the maturity of the processes that combine the people, procedures, and tools to deliver capabilities. The CMMI process areas we recommend for infrastructure governance are:
- Risk Management (RSKM)
- Strategic Service Management (STSM)
- Process and Product Quality Assurance (PPQA)
- Organizational Process Performance (OPP)
- Organizational Training (OT)
- Quantitative Work Management (QWM)
- Organizational Performance Management (OPM)
Effective governance ensures that all the private cloud team members understand the team's culture, procedures, processes, and roles.
Team composition and sizing
Your team's size depends upon your need for continuity of people, the number of clouds to manage, the level of cloud consumer engagement, support requirements, and bespoke customizations.
You can keep teams small by strictly observing automation and standardization, so that multiple clouds all look and operate the same. It's best to minimize or eliminate deployment of "unicorn" or bespoke cloud environments. A Site Reliability Engineer (SRE) is key to continually improving the team's day-to-day functions, which allows small teams to manage more environments without increasing their cognitive burden.
Our general rule of thumb advises a team of at least six people for private clouds with higher support burdens and customer engagement. For smaller cloud implementations with standard configurations, we recommend a minimum team of four. These numbers assume minimal bespoke customization and effective use of your cloud software vendor's support programs and relationships. Mature operations and lifecycle management practices improve the person-to-cloud environment ratio.
Typical roles required to manage cloud platforms include:
- Cloud architect: This role leads strategy for cloud adoption, cloud application design (OpenStack/multi-cloud/hybrid), management, and operations. This role also works closely with the software and hardware vendors to drive strategy and capability based on the organizational requirements.
- Cloud engineer: This role covers a broad range of capabilities, such as designing and building the automation to install, scale, and manage the cloud platform. As the environments and teams grow, the focus of this role shifts to maturing practices, especially around lifecycle management and operations. This role often includes a development capability to deliver bespoke components that fill integration gaps.
- Cloud operators: This role (beyond OpenStack or Kubernetes specialized knowledge) needs to be highly skilled in network debugging to assist with root-cause analysis. This role is also required to document and automate support processes.
- Cloud SRE: This role spans operational break-fix activities with an architect's view to identify areas of improvement that increase the platform's stability and scalability. It also tests features and develops capabilities to increase stability and scalability. This person must be your most experienced cloud engineer and place a heavy emphasis on improving the team's day-to-day operations.
- Project and change administration: This role ensures that the organization uses its technical professionals most effectively. Your level-three engineers should not be managing change-control processes, just as surgeons do not get involved with hospital administration tasks.
- Full-stack engineer: This role can onboard cloud consumers and design application capabilities and integrations, such as development pipelines. They should advise on application architectures that make the best use of the platform. This person is responsible for frictionless onboarding.
These roles should be augmented with hardware- and software-vendor subject matter experts to assist during the architecture and design stages, support health checks, and drive operational maturity. The team should ensure that knowledge transfer and co-working are part of the success criteria of vendor engagements.
Training and sustainment
OpenStack and Kubernetes are delivered as distributions with unique implementation features, much as Linux operating systems are delivered as distributions. Many skills are transferable between distributions; however, each distribution has specific ways of doing certain tasks, and recommended practices may vary.
We highly recommend that organizations invest in training and certification for the people who deliver their cloud platforms. Proactive training and certification combined with well-documented processes and a continuous relationship with your software and hardware vendors help create a sustainable platform.
Culture
Culture defines the set of acceptable behaviors for a group. Culture has a massive impact on the team's willingness to go the extra mile when needed, the way they act towards consumers of your cloud platform, their inclination to share knowledge within the team, and their willingness to innovate.
"Culture is to a group what personality is to an individual."
Successful teams intentionally seek to build a strong culture. It is a process that takes time and requires ownership from the whole team. The Open Practice Library has resources to help you define and create your team culture, an environment of collaboration, and technical engineering practices.
We recommend you start with a clear understanding of how the team supports the organization's mission and then create a mission statement for the team from that. Then establish a social contract to drive the team's internal and external interaction.
At the very least, everyone on the team should be able to answer the following questions:
- What are we here to do?
- How does that help the organization's mission?
- What is my role in that?
- How do I interact with my team?
- What do we value in our team members?
Everyone on the team, and especially the team leader, should act out these values consistently.
Work and risk management
Private clouds are often large projects that integrate with several external systems. A project-management approach helps you consider all required use cases and address them before "go-live."
Beyond project management, you need to apply decision-management practices. A decision framework defines the criteria against which decisions are evaluated, and those criteria should align with the mission. This helps reduce decision bias and the influence of personal preferences from stronger-willed individuals on the team. You can incorporate standards such as ISO/IEC 9126 (since superseded by ISO/IEC 25010) into evaluating technical solutions to ensure that the team thinks through all the aspects of a solution. You must document the architectural challenges, decisions, and evaluation criteria.
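One way to make such a decision framework concrete is a weighted scoring matrix. The sketch below is illustrative only: it uses the six ISO/IEC 9126 quality characteristics as criteria, while the weights, option names, and scores are hypothetical assumptions that a real team would replace with values derived from its own mission.

```python
from dataclasses import dataclass

# The six ISO/IEC 9126 quality characteristics, used here as evaluation criteria.
# The weights are hypothetical; a real team would derive them from its mission.
WEIGHTS = {
    "functionality": 0.25,
    "reliability": 0.25,
    "usability": 0.10,
    "efficiency": 0.10,
    "maintainability": 0.20,
    "portability": 0.10,
}


@dataclass(frozen=True)
class Option:
    """A candidate solution with a 1 (poor) to 5 (excellent) score per criterion."""
    name: str
    scores: dict


def weighted_score(option, weights):
    """Return the weighted sum of an option's scores across all criteria."""
    missing = set(weights) - set(option.scores)
    if missing:
        # Force the team to score every criterion rather than skip the awkward ones.
        raise ValueError(f"{option.name} is missing scores for: {sorted(missing)}")
    return sum(weights[criterion] * option.scores[criterion] for criterion in weights)


def rank_options(options, weights):
    """Return the options ordered from highest to lowest weighted score."""
    return sorted(options, key=lambda o: weighted_score(o, weights), reverse=True)


# Hypothetical options and scores, for illustration only.
options = [
    Option("vendor_plugin", {
        "functionality": 3, "reliability": 4, "usability": 4,
        "efficiency": 3, "maintainability": 5, "portability": 3,
    }),
    Option("bespoke_component", {
        "functionality": 5, "reliability": 3, "usability": 3,
        "efficiency": 4, "maintainability": 2, "portability": 2,
    }),
]

for option in rank_options(options, WEIGHTS):
    print(f"{option.name}: {weighted_score(option, WEIGHTS):.2f}")
```

Agreeing on the weights before anyone scores the options is the part that counters bias: the criteria are fixed by the mission rather than by whoever argues hardest. The weights and scores then become part of the documented decision record.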
Driving continuous improvement
Mature teams can articulate their current and ideal maturity levels. This allows them to create a plan that increases their maturity over time.
We recommend using a model such as Capability Maturity Model Integration for Services (CMMI-SVC) to measure and map out your maturity journey. CMMI-SVC draws on concepts and practices from other service-focused standards and models, including:
- Information Technology Infrastructure Library (ITIL)
- ISO/IEC 20000: Information technology—Service management
- Control Objectives for Information and Related Technologies (COBIT)
- Information Technology Services Capability Maturity Model (ITSCMM)
Teams rate their current maturity against the CMMI maturity level definitions and the five areas of practice, and then state their desired level of maturity. You can use it as a simple visual tool (as shown below) to communicate the current state and direction with the team and management.
Once you assess your team's maturity, use the CMMI and CMMI-SVC process areas to help your team identify areas of improvement in conjunction with the recommendations from this article.
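A self-assessment like this can be captured in a few lines of code to show where the largest gaps are. The following is a minimal sketch, not a formal CMMI appraisal: it assumes, as a simplification, that each process area is rated on the 1 to 5 maturity scale, and the ratings themselves are hypothetical.

```python
# CMMI maturity levels, 1 to 5.
MATURITY_LEVELS = {
    1: "Initial",
    2: "Managed",
    3: "Defined",
    4: "Quantitatively Managed",
    5: "Optimizing",
}

# Hypothetical self-assessment: process area -> (current level, desired level).
assessment = {
    "RSKM": (2, 3),
    "STSM": (1, 3),
    "PPQA": (2, 4),
    "OT": (3, 3),
    "OPM": (1, 2),
}


def improvement_gaps(assessment):
    """Return (area, current, desired, gap) tuples for areas below their target.

    Results are ordered by largest gap first, then alphabetically by area.
    """
    gaps = []
    for area, (current, desired) in assessment.items():
        for level in (current, desired):
            if level not in MATURITY_LEVELS:
                raise ValueError(f"{area}: {level} is not a valid maturity level")
        gap = desired - current
        if gap > 0:
            gaps.append((area, current, desired, gap))
    return sorted(gaps, key=lambda g: (-g[3], g[0]))


for area, current, desired, gap in improvement_gaps(assessment):
    print(
        f"{area}: {MATURITY_LEVELS[current]} -> {MATURITY_LEVELS[desired]} "
        f"(gap of {gap})"
    )
```

Ordering by gap size is only one possible prioritization; a team might instead weight each process area by its risk or by its cost to improve.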
We believe that these processes and recommendations will help you deliver a successful private cloud platform in your organization.
The right team composition, with the required range of skillsets and appropriate team sizing, plays a significant role in the successful governance of a private cloud. When training and sustainment, the right culture, risk management, and continuous improvement processes are managed together in an orchestrated manner, they can multiply the value that a private cloud has to offer.