A machine learning model for system capacity planning

November 14, 2022 | 3-minute read | Automation and management, AI/ML

Finding out the capacity of a system and planning for a deployment layout that meets the production traffic requirements is critical in industrial environments. Both the physical environment and the individual performance of the system's constituent components influence the system's capacity.

Three main factors influence a large-scale system's capacity:

The appropriate configuration values for each software component.
The appropriate configuration values for the compute node hosting the software component (such as how much CPU or RAM is required).
The number of instances of the software components (and compute nodes) needed to meet the capacity requirements.

While the configuration parameters for the individual components provide flexibility for tuning the system's performance, finding the appropriate values is difficult. The search space of parameter values grows multiplicatively with every parameter you add, which makes exhaustive testing challenging and costly. How can you load test the system and calculate its capacity so that you find the right configuration for each component as efficiently as possible?
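To get a feel for how quickly the search space grows, here is a minimal Python sketch. The parameter names, the candidate values, and the 30-minute duration of a load-test run are all assumptions made up for illustration; they do not come from a real system.

```python
from itertools import product
from math import prod

# Hypothetical tuning parameters and candidate values (illustrative only).
search_space = {
    "worker_threads": [4, 8, 16, 32],
    "connection_pool_size": [10, 25, 50, 100],
    "cpu_cores": [2, 4, 8],
    "memory_gib": [4, 8, 16],
    "replicas": [1, 2, 3, 4, 5],
}

# The number of distinct configurations is the product of the option counts.
total_configs = prod(len(values) for values in search_space.values())

# Assumed duration of a single load-test run, in minutes.
minutes_per_run = 30
total_hours = total_configs * minutes_per_run / 60

print(f"{total_configs} configurations, about {total_hours:.0f} hours of load testing")

# Configurations can be enumerated lazily; this builds only the first one.
first_config = dict(zip(search_space, next(product(*search_space.values()))))
print(first_config)
```

Even with only five parameters and a handful of values each, testing every combination would take weeks of machine time, which is why reducing the number of runs matters.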

Having well-tuned software at the right capacity in production environments greatly reduces operational expenditures (OPEX). It also makes the system less likely to miss important key performance indicators (KPIs), minimizing the risk that you'll breach the service-level agreements (SLAs) that form part of business contracts. Using automation in load testing can make the testing cycle more efficient and potentially shorter in subsequent releases (after achieving a stable baseline).

From an architectural perspective, it is important to design software components to integrate with advanced frameworks and tools to help make the software development process as efficient as possible.

One way to tackle this problem is with intelligent capacity planning, which combines load-test automation with machine learning.

MLASP: machine learning for capacity planning

I wrote a research paper published in Springer's Empirical Software Engineering (EMSE) journal (DOI information, direct link to the full text), describing a process called Machine Learning Assisted System Performance and Capacity Planning (MLASP). In an industrial setting, this process has been shown to make load testing and capacity management tasks more efficient.

The process overview:

[Figure: MLASP process overview]

As the diagram above depicts, there are three major areas of the MLASP process:

Automated load testing: Besides executing the load tests, this stage generates and collects the data for the next stage, where the machine learning activities take place. Load-test engineers carry out this stage.
Machine learning modeling and training: A data scientist uses the data generated during load testing to create a model that can be used for predictions.
ML model serving (inferencing): This area uses the trained model to provide predictions for two scenarios:
- What-if scenario: The model provides a prediction based on specific inputs. Use this to find out which values the model predicts for the measured KPIs, given a fixed set of configuration values.
- Find a configuration with a given (percentage) deviation from a desired target: Use this when you know the target value for a metric or KPI of interest, but don't know which set of system configurations will yield that output. Because the model solves a regression problem, you generally cannot hit the target exactly; you can only find an answer that is close enough, within a defined deviation (for example, within 3% of the desired 100 transactions per second per node in the cluster's configuration). This is extremely important in production environments, as operational teams can prepare in advance for higher demands on the system during special events like the holiday shopping season.
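The two inference scenarios can be sketched in a few lines of Python. This is only an illustration under stated assumptions: predict_tps is a hypothetical stand-in for a trained model (its formula is invented), the parameter names and values are made up, and the exhaustive grid search is one simple way to look for matching configurations, not necessarily the method used in the paper.

```python
from itertools import product

def predict_tps(config):
    """Hypothetical stand-in for a trained model.

    Returns the predicted transactions per second per node. In practice this
    would call the model fitted on load-test data; the formula is invented.
    """
    effective_threads = min(config["worker_threads"],
                            config["connection_pool_size"] / 2)
    return 5.6 * effective_threads + 2.8 * config["cpu_cores"]

def find_configurations(target_tps, max_deviation, search_space):
    """Return (deviation, prediction, config) tuples within max_deviation of the target."""
    names = list(search_space)
    matches = []
    for values in product(*search_space.values()):
        config = dict(zip(names, values))
        predicted = predict_tps(config)
        deviation = abs(predicted - target_tps) / target_tps
        if deviation <= max_deviation:
            matches.append((deviation, predicted, config))
    # Closest predictions first; sort on the deviation only.
    matches.sort(key=lambda match: match[0])
    return matches

# Scenario 1, what-if: predict the KPI for one specific configuration.
what_if_config = {"worker_threads": 8, "connection_pool_size": 20, "cpu_cores": 4}
what_if_tps = predict_tps(what_if_config)
print(f"What-if prediction: {what_if_tps:.1f} TPS per node")

# Scenario 2: find configurations predicted to land within 3% of 100 TPS per node.
candidate_values = {
    "worker_threads": [4, 8, 16, 32],
    "connection_pool_size": [10, 20, 32, 40],
    "cpu_cores": [2, 4, 8],
}
matches = find_configurations(target_tps=100.0, max_deviation=0.03,
                              search_space=candidate_values)
for deviation, predicted, config in matches:
    print(f"{predicted:.1f} TPS ({deviation:.1%} off target): {config}")
```

Because each prediction is cheap compared with a real load-test run, you can evaluate many candidate configurations against the model and reserve actual load tests for validating the few that come out close to the target.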

Various tools exist that can be used to perform the associated tasks at each stage in the process. The process is extensible and adaptable according to the needs of a specific project.

From an architectural perspective, MLASP can therefore make software development lifecycle processes in a software ecosystem more efficient, because it helps architects choose the right integrations for the systems they design.

MLASP benefits

In summary, the benefits of the MLASP process are:

At the program or project level, it offers a mathematical model-based benchmarking and capacity-planning tool.
For project development, it can reduce the time and effort spent on load testing, because you automate the tests and need fewer load-testing runs. This shortens delivery time and reduces overall project costs.
Finally, for operations, a tuned system increases platform efficiency because the system is not over-provisioned. It also reduces SLA violations. Both effects contribute to lower operational costs.

If you're interested in learning more about implementing MLASP on Red Hat OpenShift, you can check out my article "MLASP: Machine learning assisted capacity planning: An industrial experience report" or the full-length, step-by-step guide in my repo.

About the author

Arthur Vitui

Arthur is a senior data scientist specialist solution architect at Red Hat Canada. With the help of open source software, he is helping organizations develop intelligent application ecosystems and bring them into production using MLOps best practices.

He has over 15 years of experience in the design, development, integration, and testing of large-scale service enablement applications.

Arthur is pursuing his PhD in computer science at Concordia University, and he is a research assistant in the Software Performance Analysis and Reliability (SPEAR) Lab. His research interests are related to AIOps, with a focus on performance and scalability optimization.