Machine learning has become increasingly crucial in various sectors as more organizations leverage its power to make data-driven decisions. However, training models with large datasets can be time-consuming and computationally expensive. This is where distributed training and Open Data Hub come into play.
Open Data Hub is an open-source platform for running data science workloads on OpenShift. It includes a suite of tools and services that make it easy to manage and scale machine learning workloads. Open Data Hub can be deployed on premises or in the cloud, and it is compatible with a wide range of data science tools and frameworks.
Distributed training involves training machine learning models on multiple nodes simultaneously, speeding up the process and allowing for larger datasets and more complex models. With the introduction of the new Distributed Workloads component, Open Data Hub provides a scalable and flexible solution for running AI/ML model batch training jobs on large datasets. In this article, we will explore the capabilities that make up this Distributed Workloads component: Multi-Cluster Application Dispatcher (MCAD) queueing and job management, and dynamic provisioning of GPU infrastructure with InstaScale. We'll introduce how you can leverage these new technologies to enable your own AI/ML model batch training.
Batch training is a popular technique for training machine learning models on large datasets. However, batch training workflows tend to be "all or nothing": when a batch training job is submitted, its worker processes need to be scheduled and executed concurrently. Sufficient compute resources must be available when the job begins to execute, or the training job will fail. Furthermore, different data science groups often share the same compute infrastructure, and certain workloads may need to preempt others for that valuable but limited compute time. The new Distributed Workloads component of Open Data Hub solves these problems. It provides a queue onto which jobs can be submitted; a job is not scheduled until appropriate resources are available. Open Data Hub can dynamically scale the OpenShift cluster to meet the resource requirements of these jobs, and all of this happens within the confines of administrator-managed resource quotas, job groups, and priorities.
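The "all or nothing" queueing behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the gang-scheduling idea (hold a job until its entire resource request fits at once), not MCAD's actual implementation; the class and method names are invented for this sketch:

```python
from collections import deque

class GangQueue:
    """Toy job queue: a job starts only when its *entire* resource
    request fits in the cluster's free capacity (gang scheduling)."""

    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self.pending = deque()

    def submit(self, name, gpus):
        """Queue a job, then dispatch whatever now fits."""
        self.pending.append((name, gpus))
        return self.dispatch()

    def dispatch(self):
        """Start every queued job whose full request fits right now."""
        started = []
        still_pending = deque()
        while self.pending:
            name, gpus = self.pending.popleft()
            if gpus <= self.free_gpus:
                self.free_gpus -= gpus      # all workers scheduled together
                started.append(name)
            else:
                still_pending.append((name, gpus))  # wait; never start partially
        self.pending = still_pending
        return started

    def finish(self, gpus):
        """Release a finished job's resources and re-dispatch."""
        self.free_gpus += gpus
        return self.dispatch()

q = GangQueue(total_gpus=8)
print(q.submit("job-a", gpus=6))   # fits -> ['job-a'] starts immediately
print(q.submit("job-b", gpus=4))   # only 2 GPUs free -> [] (queued, not started)
print(q.finish(gpus=6))            # job-a done -> ['job-b'] now starts
```

The key point is that `job-b` never launches two of its four workers and idles waiting for the rest; it stays queued until all four fit.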
Open Data Hub’s new Distributed Workloads stack comprises the following features, which make AI/ML model batch training at scale easy and efficient:
Ease of use with the CodeFlare SDK: The CodeFlare SDK is integrated into the out-of-the-box ODH notebook images and provides an interactive client for data scientists to define resource requirements (GPU, CPU, and memory) and to submit and manage training jobs.
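In practice, a data scientist drives this stack from a notebook along these lines. This is a sketch assuming the codeflare-sdk version bundled with the ODH notebook images; exact parameter names can differ between SDK releases, and it only runs inside a notebook connected to an OpenShift cluster with the Distributed Workloads stack installed:

```python
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Describe the Ray cluster the training job needs; MCAD queues this
# request until quota and cluster capacity allow it to be dispatched.
cluster = Cluster(ClusterConfiguration(
    name="mnist-training",
    namespace="my-project",          # hypothetical project name
    num_workers=2,
    min_cpus=4, max_cpus=4,
    min_memory=8, max_memory=8,      # GiB per worker
    num_gpus=1,
    instascale=True,                 # let InstaScale add nodes if needed
    machine_types=["m5.xlarge", "g4dn.xlarge"],
))

cluster.up()           # submit the request to MCAD's queue
cluster.wait_ready()   # blocks until all workers are scheduled together
print(cluster.details())
cluster.down()         # release the resources when training is done
```

Because the SDK submits the whole cluster definition as one unit, the job inherits MCAD's guarantees: it either gets all of its workers or stays queued.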
Batch management with MCAD: MCAD manages a queue of training jobs that data scientists submit. It ensures that a job is not started until all of its required compute resources are available on the cluster, that a given team has not requested more aggregate resources than its quota allows, and that the highest-priority jobs are executed first. Finally, MCAD makes sure that all processes necessary to execute a distributed run are scheduled concurrently, so compute cycles aren't wasted waiting for processes to be scheduled.
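Under the hood, each submission reaches MCAD as an AppWrapper custom resource. A trimmed sketch of what one might look like is below; it assumes the `mcad.ibm.com/v1beta1` API, and field names vary across MCAD versions, so treat this as illustrative rather than a copy-paste manifest:

```yaml
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: mnist-training
  namespace: my-project
spec:
  priority: 9                  # higher-priority jobs are dispatched first
  resources:
    GenericItems:
      - replicas: 1
        custompodresources:    # aggregate request MCAD checks against quota
          - replicas: 2
            requests:
              cpu: 4
              memory: 8Gi
              nvidia.com/gpu: 1
        generictemplate:
          # the wrapped RayCluster / Job manifest goes here
```

MCAD holds the AppWrapper in its queue until the `custompodresources` request fits within both the team's quota and the cluster's free capacity, then creates everything in `generictemplate` at once.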
Dynamic scaling with InstaScale: InstaScale works alongside MCAD to make sure the OpenShift cluster contains sufficient compute resources to execute a job. InstaScale optimizes compute costs by launching right-sized compute instances for a given job and releasing those instances when they are no longer needed.
Flexibility: Open Data Hub supports a wide range of machine learning frameworks, including TensorFlow, PyTorch, and Ray (KubeRay).
Conclusion
In conclusion, the Distributed Workloads component available in the newly released Open Data Hub version 1.6 adds the capability to run and manage distributed batch training of AI/ML models. This is made possible by features like MCAD queueing and job management and InstaScale dynamic scaling. By leveraging these new capabilities, data scientists can greatly improve the efficiency of their machine learning workloads, leading to better results and faster innovation. To get started using this new stack, check out the Open Data Hub Quick Start guide.