Finding out the capacity of a system and planning for a deployment layout that meets the production traffic requirements is critical in industrial environments. Both the physical environment and the individual performance of the system's constituent components influence the system's capacity.
Three main factors influence a large-scale system's capacity:
- The appropriate configuration values for each software component.
- The appropriate configuration values for the compute node hosting the software component (such as how much CPU or RAM is required).
- The number of instances of the software components (and compute nodes) needed to meet the capacity requirements.
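As a rough illustration of the third factor, the sketch below estimates how many instances a target load requires. The function name, the `headroom` parameter, and all the numbers are hypothetical and do not come from the MLASP paper; real sizing would rely on throughput measured during load tests.

```python
import math


def instances_needed(target_tps: float, tps_per_instance: float,
                     headroom: float = 0.2) -> int:
    """Estimate the instance count for a target load.

    headroom is the fraction of each instance's measured throughput
    held in reserve (an assumption for this sketch, not an MLASP rule).
    """
    if tps_per_instance <= 0:
        raise ValueError("tps_per_instance must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_tps = tps_per_instance * (1 - headroom)
    # Round up: a fractional instance still needs a whole node.
    return math.ceil(target_tps / usable_tps)


# Hypothetical numbers: 1,000 TPS target, 100 TPS measured per instance.
print(instances_needed(1000, 100, headroom=0.0))  # 10
print(instances_needed(1000, 100, headroom=0.2))  # 13
```

The per-instance throughput is itself a function of the first two factors, which is why the configuration values and the instance count have to be planned together.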
While the configuration parameters for the individual components provide flexibility for tuning the system's performance, finding the appropriate values is difficult. The huge search space of parameter values makes this process challenging and costly. How can you do the load testing and calculate the software system's capacity to find the right software configurations for each component as efficiently as possible?
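To give a feel for how quickly the search space grows, here is a minimal sketch that counts the configurations in a small parameter grid and estimates the time needed to load test all of them. The parameter names, candidate values, and the 30-minute run duration are illustrative assumptions, not figures from the MLASP study.

```python
import itertools
import math

# Hypothetical tunable parameters for one component; names and candidate
# values are illustrative assumptions, not taken from the MLASP study.
param_grid = {
    "thread_pool_size": [8, 16, 32, 64],
    "connection_pool_size": [10, 20, 50, 100],
    "heap_size_mb": [512, 1024, 2048, 4096],
    "batch_size": [1, 10, 100],
    "cpu_cores": [1, 2, 4, 8],
}

# Total number of distinct configurations: the product of the list lengths.
total_configs = math.prod(len(values) for values in param_grid.values())

# Assumed duration of one load-test run, in minutes.
minutes_per_run = 30
total_days = total_configs * minutes_per_run / 60 / 24

print(f"{total_configs} configurations")
print(f"{total_days:.1f} days to test them all, one run at a time")


def iter_configs(grid):
    """Yield each configuration in the grid as a dict, lazily."""
    names = list(grid)
    for combo in itertools.product(*grid.values()):
        yield dict(zip(names, combo))


# First configuration in iteration order (the first value of every list).
first = next(iter_configs(param_grid))
print(first)
```

Even this small grid of five parameters yields 768 combinations, and adding one more four-valued parameter would quadruple that number.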
Having well-tuned software at the right capacity in production environments helps reduce operational expenditures (OPEX). It also makes the system less likely to miss important key performance indicators (KPIs), minimizing the risk that you'll breach the service-level agreements (SLAs) that form business contracts. Using automation in load testing can make the testing cycle more efficient and potentially shorter in subsequent releases (after achieving a stable baseline).
From an architectural perspective, it is important to design software components to integrate with advanced frameworks and tools to help make the software development process as efficient as possible.
One way to tackle this problem is with intelligent capacity planning, which combines load-test automation with machine learning.
MLASP: machine learning for capacity planning
I wrote a research paper published in Springer's EMSE Journal (DOI information, direct link to the full text), describing a process called Machine Learning Assisted System Performance and Capacity Planning (MLASP). This process has been proven in an industrial setting to provide good results in making load testing and capacity management tasks more efficient.
The process overview (diagram not reproduced here) shows three major areas of the MLASP process:
- Automated load testing: Besides running the load tests, this stage generates and collects the data used in the next stage, where the machine learning activities take place. Load-test engineers are involved in this stage of the process.
- Machine learning modeling and training: A data scientist uses the data generated during load testing to create a model that can be used for predictions.
- ML model serving (inferencing): This area uses the trained model to provide predictions for two scenarios:
  - What-if scenario: The model provides a prediction based on specific inputs. This is used to find the outcome for the measured KPIs the model predicts from a determined set of configuration values.
  - Find a configuration with a given (percentage) deviation from a desired target: This is used when you know a target value for a metric or KPI of interest, but don't know what set of system configurations will yield that output. Given the mathematical problem the machine learning model is trying to solve, you can only find an answer close enough to the desired target, having a defined deviation (for example, within 3% of the desired 100 transactions per second per node in the cluster's configuration). This is extremely important in production environments, as the operational teams may prepare in advance for higher demands on the system during special events like the holiday shopping season.
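The two serving scenarios above can be sketched as follows. This is only an illustration under stated assumptions: `predict_tps` is a made-up stand-in for a trained model, the parameter names and grid values are hypothetical, and the exhaustive grid search is a simplification that is not necessarily the search method used in the MLASP paper.

```python
import itertools


def predict_tps(config: dict) -> float:
    """Stand-in for a trained model's predict call.

    The formula is invented for this sketch; in practice this would be
    a regression model trained on load-test data.
    """
    threads = config["thread_pool_size"]
    cores = config["cpu_cores"]
    # Throughput grows with threads but is capped by the available cores.
    return min(threads * 3.5, cores * 60.0)


def find_configs(grid: dict, target: float, max_deviation_pct: float) -> list:
    """Return (config, prediction, deviation_pct) tuples whose prediction
    is within max_deviation_pct of target, closest first."""
    names = list(grid)
    matches = []
    for combo in itertools.product(*grid.values()):
        config = dict(zip(names, combo))
        predicted = predict_tps(config)
        deviation_pct = abs(predicted - target) / target * 100
        if deviation_pct <= max_deviation_pct:
            matches.append((config, predicted, deviation_pct))
    return sorted(matches, key=lambda m: m[2])


# Scenario 1, what-if: predict the KPI for one specific configuration.
what_if = predict_tps({"thread_pool_size": 16, "cpu_cores": 2})
print(f"What-if prediction: {what_if} TPS")

# Scenario 2: find configurations within 3% of a 100 TPS target.
grid = {"thread_pool_size": [16, 20, 24, 28, 32], "cpu_cores": [1, 2, 4]}
matches = find_configs(grid, target=100.0, max_deviation_pct=3.0)
for config, predicted, deviation in matches:
    print(f"{config} -> {predicted} TPS ({deviation:.1f}% off target)")
```

No configuration in this toy grid hits 100 TPS exactly; the search returns the ones that predict 98 TPS, which is why the deviation tolerance is part of the query.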
Various tools exist that can be used to perform the associated tasks at each stage in the process. The process is extensible and adaptable according to the needs of a specific project.
Therefore, from an architectural perspective, MLASP can be a very powerful tool to make the software development lifecycle processes more efficient in a software ecosystem, as it can help architects choose the right integrations for the systems they design.
MLASP benefits
In summary, the MLASP process' benefits are:
- At the program or project level, it offers a mathematical model-based benchmarking and capacity-planning tool.
- For project development, it can reduce time and effort for performing load testing as you use automation and reduce the number of load-testing runs. This leads to shorter delivery time and reduces the overall project costs.
- Finally, having a tuned system for operations means there is an increase in platform efficiency as the system is not over-dimensioned. This also reduces SLA violations, and both contribute to reduced operational costs.
If you're interested in learning more about implementing MLASP on Red Hat OpenShift, you can check out my article "MLASP: Machine learning assisted capacity planning: An industrial experience report" or the step-by-step guide in my full-length repo.
About the author
Arthur is a senior data scientist specialist solution architect at Red Hat Canada. With the help of open source software, he is helping organizations develop intelligent application ecosystems and bring them into production using MLOps best practices.
He has over 15 years of experience in the design, development, integration, and testing of large-scale service enablement applications.
Arthur is pursuing his PhD in computer science at Concordia University, and he is a research assistant in the Software Performance Analysis and Reliability (SPEAR) Lab. His research interests are related to AIOps, with a focus on performance and scalability optimization.