This is a guest blog by Dr. Ronen Dar, CTO and co-founder of Run:AI.

At each stage of the machine learning / deep learning (ML/DL) process, researchers complete specific tasks, and each of those tasks places its own demands on the AI infrastructure. Data scientists' compute needs typically align with the following tasks:

  • Build sessions - data scientists consume GPU power in interactive sessions to develop and debug their models. These sessions need GPUs to be instantly and always available, but demand relatively little compute and memory.
  • Training models - DL models are generally trained in long sessions. Training is highly compute-intensive, can run on multiple GPUs, and typically requires very high GPU utilization. Performance (in terms of time-to-train) matters greatly. Over a project's lifecycle there are long periods during which many concurrent training workloads are running (e.g. while optimizing hyperparameters), but also long idle stretches in which only a small number of experiments are utilizing GPUs.
  • Inference - in this phase, trained DL models serve requests from real-time applications or from periodic systems that run offline batch inference. Inference typically has low GPU utilization and a small memory footprint (compared to training sessions).

[Chart: typical GPU compute and memory needs across the build, training, and inference phases]

Two for me, two for you

As you can see in the chart above, to accelerate AI research, data scientists need access to as much or as little GPU compute as their development phase requires. Unfortunately, the standard Kubernetes scheduler typically allocates a static, fixed number of GPUs - 'two for me, two for you'. This "one size fits all" approach to scheduling and allocating GPU compute means that when scientists are building models or running inference they have too many GPUs, but when they train models they often have too few. In these environments, a significant number of GPUs remain idle at any given time. Simple resource sharing, like using another data scientist's GPUs while they sit idle, is not possible with static allocations.
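To make the static-allocation pattern concrete, here is a minimal sketch of how a workload typically pins down a fixed number of whole GPUs with the default scheduler. It uses the official Kubernetes Python client; the namespace, image, and GPU count are illustrative, not from the original post:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. after `oc login`).
config.load_kube_config()

# A training pod that statically requests two whole GPUs. The default
# scheduler reserves both for this pod until it terminates, whether or
# not the GPUs are actually busy.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="team-a"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:23.10-py3",  # illustrative image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "2"}  # 'two for me' - fixed allocation
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```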

Making GPU allocations dynamic

At Run:AI, we envisaged a scheduler that solves the static-allocation challenge with a concept we call "guaranteed quotas". Guaranteed quotas let users go over their static quota as long as idle GPUs are available.

How does a guaranteed quota system work?

Projects with guaranteed quotas of GPUs, as opposed to projects with a static allocation, can use more GPUs than their quota specifies. The system allocates available resources to a job submitted to a queue even if that queue is already over quota. When a job is submitted to an under-quota queue and there are not enough free resources to launch it, the scheduler gets smarter: it pauses a job from a queue that is over quota, taking priorities and fairness into account.
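To illustrate that decision flow, here is a simplified sketch of the logic described above. It is not Run:AI's actual scheduler code, and the queue and job structures are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Queue:
    name: str
    quota: int       # guaranteed number of GPUs for this project
    allocated: int   # GPUs its running jobs currently hold

    @property
    def over_quota(self) -> bool:
        return self.allocated > self.quota

@dataclass
class Job:
    queue: Queue
    gpus: int        # GPUs requested by the job

def schedule(job: Job, free_gpus: int, queues: List[Queue]) -> str:
    # 1. Idle GPUs are handed out even if the requesting queue is over quota.
    if free_gpus >= job.gpus:
        job.queue.allocated += job.gpus
        return "started on idle GPUs"

    # 2. No idle capacity: a job that still fits inside its own guaranteed
    #    quota may reclaim GPUs by pausing jobs in queues running over quota
    #    (a real scheduler would also weigh priorities and fairness here).
    if job.queue.allocated + job.gpus <= job.queue.quota:
        reclaimable = sum(
            q.allocated - q.quota
            for q in queues
            if q.over_quota and q is not job.queue
        )
        if free_gpus + reclaimable >= job.gpus:
            return "preempting over-quota jobs, then starting"

    # 3. Otherwise the job simply waits in its queue.
    return "queued"

# Example: team-b runs 5 GPUs against a quota of 2; team-a (quota 4, using 1)
# submits a 3-GPU job while no GPUs are idle.
team_a = Queue("team-a", quota=4, allocated=1)
team_b = Queue("team-b", quota=2, allocated=5)
print(schedule(Job(team_a, gpus=3), free_gpus=0, queues=[team_a, team_b]))
# -> "preempting over-quota jobs, then starting"
```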

Guaranteed quotas essentially break the boundaries of fixed allocations and make data scientists more productive, freeing them from limits on the number of concurrent experiments they can run or the number of GPUs they can use for multi-GPU training sessions. Researchers accelerate their data science, IT keeps control over the full GPU cluster, and better scheduling greatly increases utilization of the overall cluster.

How do Run:AI and Red Hat OpenShift Container Platform work together?

Run:AI creates an acceleration layer over GPU resources that manages granular scheduling, prioritization and allocation of compute power. A dedicated Kubernetes-based batch scheduler, running on top of OpenShift Container Platform (OCP), manages GPU-based workloads. It includes mechanisms for creating multiple queues, setting fixed and guaranteed resource quotas, and managing priorities, policies, and multi-node training. It provides an elegant solution that simplifies complex ML scheduling processes. Companies using OpenShift-managed Kubernetes clusters can easily install Run:AI using the operator available from the Red Hat Container Catalog.
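As a rough sketch of what that install step can look like on OpenShift, an OperatorHub Subscription can be created with the Kubernetes Python client. The package name, channel, and catalog source below are assumptions for illustration; take the exact values from the Red Hat Container Catalog entry:

```python
from kubernetes import client, config

config.load_kube_config()

# Subscribe to the Run:AI operator from the certified operator catalog.
# NOTE: package name, channel, and source are assumed here, not confirmed
# by the original post; check the catalog entry for the real values.
subscription = {
    "apiVersion": "operators.coreos.com/v1alpha1",
    "kind": "Subscription",
    "metadata": {"name": "runai-operator", "namespace": "openshift-operators"},
    "spec": {
        "channel": "stable",                # assumed update channel
        "name": "runai-operator",           # assumed package name
        "source": "certified-operators",    # Red Hat certified catalog source
        "sourceNamespace": "openshift-marketplace",
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="operators.coreos.com",
    version="v1alpha1",
    namespace="openshift-operators",
    plural="subscriptions",
    body=subscription,
)
```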

Learn more about advanced scheduling for GPUs on OpenShift Container Platform at www.run.ai



 