IMPORTANT: the examples in this blog are only valid for the corresponding version of OpenShift. If you have a newer version of OpenShift, such as 3.9, see this blog.
Running general-purpose compute workloads on Graphics Processing Units (GPUs) has become increasingly popular across a wide range of application domains, mirroring the increasing ubiquity of deploying applications in Linux containers. Thanks to community participant Clarifai, Kubernetes gained the ability to schedule workloads that depend on GPUs beginning with version 1.3, letting us develop applications on the cutting edge of both trends with Kubernetes or OpenShift.
When folks talk about GPU-accelerated workloads (at least as of now), they are generally referring to NVIDIA-based GPUs, and applications developed leveraging the CUDA toolchain. These apps are typically stateful and run on dedicated resources, not the kind of stateless microservice greenfield apps we think of when we think of where Kubernetes shines today. But, the industry at-large is demonstrating a desire to expand the base of applications that can be run optimally in containers, orchestrated by Kubernetes.
In the past, Red Hat has experimented with technologies like Intel DPDK and Solarflare OpenOnload, and NVIDIA's progress in containerizing CUDA alongside its hardware is a microcosm of the technical challenges facing those other pieces of hardware, and Kubernetes in general: it follows the familiar patterns of closed source software seeking to integrate with the open source community.
For example -- distributions must be concerned with licensing, version management, QA procedures, kernel module and ABI/symbol conflicts that occur with any closed source driver and stack. These challenges precisely mirror those faced by many other hardware vendors, whether it's co-processors, FPGA, bypass accelerators or similar.
All of that said, the benefits of GPUs and other hardware accelerators over generic CPUs are often dramatic, with jobs potentially completing orders of magnitude faster. The demand for blending the benefits of hardware accelerators with data-center-wide workload orchestration is reaching a fever pitch. Typically, this line of thinking ends in an exercise in density and efficiency, and often in the power savings attributable to those efficiency gains.
I should note that due to the "alpha" state of GPU support in Kubernetes, the following run-through on how to connect OpenShift, running on RHEL, with an NVIDIA adapter inside an EC2 instance, is currently unsupported. Polishing some of the sharp corners is a community responsibility, and indeed there is plenty of work underway.
If you're interested in following upstream developments, I encourage you to monitor Kubernetes sig-node.
Environment
- RHEL 7.3, RHEL 7.3 container image
- OpenShift 3.5.0.17
- OpenShift Master: EC2 m4.xlarge instance
- OpenShift Infra: EC2 m4.xlarge instance
- OpenShift Node: EC2 g2.2xlarge instance
Howto
As with any nascent/alpha technology, documentation is somewhat lacking and there are a lot of disparate moving pieces to line up. Here is how we're able to get a basic smoke test of a GPU going on OpenShift 3.5:
Install the nvidia-docker RPM:
# rpm -ivh --nodeps https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0/nvidia-docker-1.0.0-1.x86_64.rpm
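The RPM ships an nvidia-docker systemd service; if you want it to come up automatically on future boots, you can also enable it here (optional; the explicit restart later in this walkthrough is what matters for the smoke test):
# systemctl enable nvidia-docker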
Set up a yum repo on the host that has the GPU card. This is because we will have to install proprietary NVIDIA drivers.
/etc/yum.repos.d/nvidia.repo:

[NVIDIA]
name=NVIDIA
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
failovermethod=priority
enabled=1
Install the driver and the devel headers on the host. Note that this step takes about four minutes to complete (it rebuilds the kernel module):
# yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel
On the node with the GPU, ensure the new modules are loaded. On RHEL, the nouveau module will load by default. This prevents the nvidia-docker service from starting. The nvidia-docker service blacklists the nouveau module, but does not unload it. So you can either reboot the node, or remove the nouveau module manually:
# modprobe -r nouveau
# nvidia-modprobe
# systemctl restart nvidia-docker
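To keep nouveau from loading again on the next boot, a common optional extra step is to blacklist it persistently and regenerate the initramfs. A minimal sketch (the file name is arbitrary):
# echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
# echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
# dracut --force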
On the node that has the GPU, update /etc/origin/node/node-config.yaml to set a single NVIDIA GPU in node capacity/allocatable. Note the kubelet flag is named experimental, and that this was a manual change. Also note that at the time of this writing, Kubernetes only supports a single GPU per node. These rough edges are where the “alpha” nature of GPU support in Kubernetes becomes apparent.
We hope to arrive, along with the community, at a unified hardware-discovery feature (perhaps a pod or agent) that feeds the scheduler all of the hardware information needed to make intelligent workload-routing decisions going forward.
In /etc/origin/node/node-config.yaml:

kubeletArguments:
  experimental-nvidia-gpus:
  - '1'
Then restart the openshift-node service so this setting takes effect.
# systemctl restart atomic-openshift-node
Here is what the updated node capacity looks like. You can see that there's a new capacity field, and this can now be used by the Kubernetes scheduler to route pods accordingly.
# oc describe node ip-x-x-x-x.us-west-2.compute.internal
<snip>
Capacity:
alpha.kubernetes.io/nvidia-gpu: 1
cpu: 8
memory: 14710444Ki
pods: 250
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 1
cpu: 8
memory: 14710444Ki
pods: 250
And here is an example pod file that requests the GPU device. The default command is "sleep infinity" so that we can connect to the pod after it is created (using the "oc rsh" command) to do some manual inspection.
# cat openshift-gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: openshift-gpu-test
spec:
  containers:
  - command:
    - sleep
    - infinity
    name: openshift-gpu
    image: rhel7
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
Create a pod using the above definition:
# oc create -f openshift-gpu-test.yaml
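Before connecting, it's worth confirming that the scheduler actually placed the pod on the GPU node; the wide output includes the node name and the pod status:
# oc get pod openshift-gpu-test -o wide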
Connect to the pod:
# oc rsh openshift-gpu-test
Inside the pod, install the EPEL, RHEL, and NVIDIA repos, then install CUDA (note that we could have used the nvidia/cuda:centos7 container image here). This is again a place where the experience could be smoothed out by providing an all-in-one container with the GPU/ML toolchains that developers can consume.
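As a rough sketch of the repo setup (assuming the RHEL repos are already available via the host's subscription, and reusing the same NVIDIA repo definition we put on the host earlier):
# rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# cat > /etc/yum.repos.d/nvidia.repo <<EOF
[NVIDIA]
name=NVIDIA
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
failovermethod=priority
enabled=1
EOF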
# yum install cuda -y
The cuda packages include some test utilities we can use to verify that the GPU can be accessed from inside the pod:
sh-4.2# /usr/local/cuda-8.0/extras/demo_suite/deviceQuery
/usr/local/cuda-8.0/extras/demo_suite/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4036 MBytes (4232052736 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

sh-4.2# /usr/local/cuda-8.0/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GRID K520
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 8003.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5496.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 119111.3

Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Some more snooping around to make sure the cgroups are set up correctly. On the host running the pod, get the container ID:
# docker ps | grep rhel7
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
deb709448449 rhel7 "sleep infinity" 44 minutes ago Up 44 minutes k8s_nvidia-gpu.134f09f4_openshift-gpu-test_default_cd8446e9-ed69-11e6-86f5-02fdcf6b20ab_da81cc2c
4d2608e1808c registry.fqdn/openshift3/ose-pod:v3.5.0.17 "/pod" 44 minutes ago Up 44 minutes k8s_POD.17e9e6be_nvidia-gpu-test_default_cd8446e9-ed69-11e6-86f5-02fdcf6b20ab_0fc36347
Check out the major/minor device numbers for the NVIDIA hardware. Note that these devices are created by the proprietary NVIDIA drivers installed earlier on the host system:
# ls -al /dev/nvidia*
crw-rw-rw-. 1 root root 195, 0 Feb 7 13:54 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Feb 7 13:54 /dev/nvidiactl
crw-rw-rw-. 1 root root 247, 0 Feb 7 13:47 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 247, 1 Feb 7 13:47 /dev/nvidia-uvm-tools

# egrep '247|195' /sys/fs/cgroup/devices/system.slice/docker-deb709448449bf1ef1366c08addc2e0d68188225d9973f4eb87f2e4658f85571.scope/devices.list
c 195:0 rwm
c 195:255 rwm
c 247:0 rwm
Summary
While GPU support is still in an alpha state in both Kubernetes and OpenShift (and therefore unsupported), and there are some rough edges, it does work well, and it is making progress toward full support in the future.
Some of the important gaps that the community needs to resolve include:
- Proper handling of proprietary drivers (some DKMS or privileged-init-container-like technology to build/rebuild/securely handle modules).
- Manual configuration of the kubelet, necessitated by the lack of a hardware-discovery facility.
- A maximum of one GPU pod per node is allowed; we should eventually be able to provide secure, multi-tenant access to multiple GPUs.
- For those interested in top performance and the best possible efficiency, Kubernetes should be able to understand the physical NUMA topology of a system and affine workload processes accordingly (a manual sketch of what that looks like today follows this list).
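As a point of reference for that last item, the affinity that Kubernetes cannot yet manage can be approximated by hand today. A hypothetical sketch using the GPU's PCI address from the deviceQuery output above (domain 0, bus 0, device 3, i.e. 0000:00:03.0), and assuming the numactl package is installed and the device reports NUMA node 0 on this single-socket instance:
# cat /sys/bus/pci/devices/0000:00:03.0/numa_node
# numactl --cpunodebind=0 --membind=0 /usr/local/cuda-8.0/extras/demo_suite/bandwidthTest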
About the author
A 20+ year tech industry veteran, Jeremy is a Distinguished Engineer within the Red Hat OpenShift AI product group, building Red Hat's AI/ML and open source strategy. His role involves working with engineering and product leaders across the company to devise a strategy that will deliver a sustainable open source, enterprise software business around artificial intelligence and machine learning.