Machine learning has become increasingly crucial in various sectors as more organizations leverage its power to make data-driven decisions. However, training models with large datasets can be time-consuming and computationally expensive. This is where distributed training and Open Data Hub come into play.
Open Data Hub is an open-source platform for running data science workloads on OpenShift. It includes a suite of tools and services for managing and scaling machine learning workloads. Open Data Hub can be deployed on-premises or in the cloud, and it is compatible with a wide range of data science tools and frameworks.
Distributed training means training a machine learning model on multiple nodes simultaneously, which shortens training time and makes larger datasets and more complex models practical. With the introduction of the new Distributed Workloads component, Open Data Hub provides a scalable and flexible solution for running AI/ML model batch training jobs on large datasets. In this article, we will explore the capabilities that make up the Distributed Workloads component, namely queueing and job management with the Multi-Cluster Application Dispatcher (MCAD), and dynamic provisioning of GPU infrastructure with InstaScale. We’ll also introduce how you can use these technologies for your own AI/ML model batch training.
Batch training is a popular technique for training machine learning models on large datasets. However, batch training workflows tend to be “all or nothing”: when a batch training job is submitted, all of its worker processes need to be scheduled and executed concurrently. Sufficient compute resources must be available when the job begins to execute; otherwise, the training job will fail. Furthermore, different data science groups often share the same compute infrastructure, and certain workloads may need to preempt others for that valuable but limited compute time. The new Distributed Workloads component of Open Data Hub addresses these problems. It provides a queue to which jobs are submitted and in which they are held until appropriate resources are available. Open Data Hub can dynamically scale the OpenShift cluster to meet the resource requirements of these jobs, and all of this is done within the confines of administrator-managed resource quotas, job groups, and priorities.
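To make the “all or nothing” requirement concrete, here is a minimal Python sketch of a queue that starts a job only when every one of its workers fits at once, and otherwise keeps the job waiting instead of letting it fail. This is a toy model for illustration only, not how Open Data Hub is implemented; the names `Job`, `can_start`, and `admit` are hypothetical, and GPUs are treated as the only resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a distributed training job."""
    name: str
    workers: int          # worker processes that must run concurrently
    gpus_per_worker: int

    @property
    def total_gpus(self) -> int:
        return self.workers * self.gpus_per_worker


def can_start(job: Job, free_gpus: int) -> bool:
    """All-or-nothing check: the job starts only if all workers fit at once."""
    return job.total_gpus <= free_gpus


def admit(queue: list[Job], free_gpus: int) -> tuple[list[Job], list[Job], int]:
    """Walk the queue once; start what fits, keep the rest queued."""
    started: list[Job] = []
    waiting: list[Job] = []
    for job in queue:
        if can_start(job, free_gpus):
            free_gpus -= job.total_gpus
            started.append(job)
        else:
            # The job is held in the queue rather than partially scheduled.
            waiting.append(job)
    return started, waiting, free_gpus


queue = [Job("a", 4, 1), Job("b", 8, 1), Job("c", 2, 1)]
started, waiting, remaining = admit(queue, free_gpus=6)
print([j.name for j in started], [j.name for j in waiting], remaining)
```

In this example, job "b" needs eight GPUs and only six are free, so it is never partially started; it simply stays in the queue until enough capacity exists.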
Open Data Hub’s new Distributed Workloads stack consists of the following features, which make AI/ML model batch training at scale easy and efficient:
Ease of use with the CodeFlare SDK: The CodeFlare SDK is integrated into the out-of-the-box ODH notebook images and provides an interactive client that data scientists use to define resource requirements (GPU, CPU, and memory) and to submit and manage training jobs.
Batch management with MCAD: MCAD manages the queue of training jobs that data scientists submit. It ensures that jobs are not started until all required compute resources are available on the cluster, that a given team has not requested more aggregate resources than its quota allows, and that the highest-priority jobs are executed first. Finally, MCAD ensures that all processes necessary to execute a distributed run are scheduled concurrently, so compute cycles aren’t wasted waiting for processes to be scheduled.
Dynamic scaling with InstaScale: InstaScale works alongside MCAD to make sure that the OpenShift cluster contains sufficient compute resources to execute a job. InstaScale optimizes compute costs by launching right-sized compute instances for a given job and releasing those instances when they are no longer needed.
Flexibility: Open Data Hub supports a wide range of machine learning frameworks, including TensorFlow, PyTorch, and Ray (KubeRay).
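The interplay of priorities, team quotas, and on-demand capacity described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not MCAD's or InstaScale's actual algorithm or API: the names `QueuedJob` and `dispatch` are hypothetical, GPUs are the only resource, and "provisioning" is modeled as a counter of extra GPUs requested when a job does not fit.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    """Hypothetical queue entry; the fields are assumptions for this sketch."""
    name: str
    team: str
    gpus: int        # total GPUs the job needs, all at once
    priority: int    # larger value means more important


def dispatch(jobs: list[QueuedJob], free_gpus: int, quotas: dict[str, int]):
    """Start jobs in priority order, enforcing quotas and scaling up on demand."""
    # heapq is a min-heap, so negate the priority; the index breaks ties
    # in submission order and keeps QueuedJob objects from being compared.
    heap = [(-job.priority, i, job) for i, job in enumerate(jobs)]
    heapq.heapify(heap)

    used = {team: 0 for team in quotas}
    started: list[str] = []
    held: list[str] = []
    provisioned = 0  # extra GPUs requested (the scale-up step in this model)

    while heap:
        _, _, job = heapq.heappop(heap)
        # Quota check: a team may not exceed its aggregate allowance.
        if used.get(job.team, 0) + job.gpus > quotas.get(job.team, 0):
            held.append(job.name)
            continue
        used[job.team] = used.get(job.team, 0) + job.gpus
        # Capacity check: provision only the shortfall, then run the job.
        shortfall = max(0, job.gpus - free_gpus)
        provisioned += shortfall
        free_gpus = free_gpus + shortfall - job.gpus
        started.append(job.name)

    return started, held, provisioned


jobs = [
    QueuedJob("low", team="a", gpus=2, priority=1),
    QueuedJob("high", team="b", gpus=4, priority=10),
    QueuedJob("big", team="a", gpus=8, priority=5),
]
started, held, provisioned = dispatch(jobs, free_gpus=4, quotas={"a": 8, "b": 4})
print(started, held, provisioned)
```

Here the highest-priority job consumes the four free GPUs, the next job triggers a request for eight more, and the last job is held because its team's quota is already used up. The sketch leaves out releasing instances after a job finishes, which the text above describes as part of InstaScale's cost optimization.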
Conclusion
The Distributed Workloads component available in the newly released Open Data Hub version 1.6 adds the capability to run and manage distributed batch training of AI/ML models. This is made possible by features such as MCAD queueing and job management, and InstaScale. By using these new capabilities, data scientists can improve the efficiency of their machine learning workloads, leading to better results and faster innovation. To get started with this new stack, check out the Open Data Hub Quick Start guide.