Abstract
In this whitepaper we run the MLPerf Training v0.6 AI workload on the Red Hat® OpenShift® Container Platform with SUPERMICRO® hardware and compare our results to the MLPerf Training v0.6 results published by NVIDIA. Red Hat OpenShift is a leading enterprise Kubernetes platform, powered by open source innovation and designed to make container adoption and orchestration simple.
This Supermicro and Red Hat reference architecture for OpenShift with NVIDIA GPUs describes how this AI infrastructure allows you to run and monitor MLPerf Training v0.6 in containers based on Red Hat® Enterprise Linux®. To our knowledge, this is the first time Red Hat Enterprise Linux-based containers were created for MLPerf Training v0.6 and run on a Supermicro multi-GPU system with a 100G network, as opposed to the commonly used NVIDIA NGC containers on an NVIDIA DGX-1. In addition to excellent performance, we demonstrate how OpenShift provides easy access to high-performance machine learning model training when running on this SUPERMICRO reference architecture.
Executive Summary
AI infrastructure increasingly requires higher performance computing. Companies are creating and training increasingly complex deep learning neural networks (DNNs) in their datacenters. These DNNs are getting a lot of attention because they can outperform humans at classifying images and can beat the world’s best Go player. Training DNNs is very compute-intensive, so it is worthwhile to reduce training times with hardware accelerators and software tuning. Last year a group of researchers from industry and universities developed MLPerf, a suite of compute-intensive AI benchmarks representing real-world AI workloads, for measuring the performance of AI infrastructure. In Supermicro’s Cloud Center of Excellence (CCoE) in San Jose, CA, we created a reference architecture running Red Hat OpenShift Container Platform software and ran the latest version of MLPerf Training to assess its performance relative to published results from NVIDIA.
The Red Hat/Supermicro MLPerf Training v0.6 benchmark results closely match the NVIDIA DGX-1 closed division published results (within -6.13% to +2.29%; where negative means slower, and positive values are faster than the NVIDIA results). This outcome demonstrates that application containerization provides the benefit of software portability, improved collaboration, and data reproducibility without significantly affecting performance.
Solution Reference Architecture
The Supermicro Red Hat OpenShift Deep Learning solution is based on industry-leading GPU servers with the latest Intel® Xeon® processors, NVIDIA® Volta® GPUs, and NVLink technology, making it an ultimate powerhouse for all your AI needs. By incorporating Supermicro’s BigTwin as a fundamental building block for OpenShift alongside Supermicro’s GPU servers, this solution becomes one of the industry-leading containerized solutions for AI and Deep Learning. Developed in partnership with NVIDIA, the reference architecture features the latest NVIDIA GPUs. Each compute node utilizes NVIDIA® Tesla® V100 GPUs for maximum parallel compute performance, resulting in reduced training time for Deep Learning workloads. In addition, this solution presents a scale-out architecture with 10G/25G/100G networking options that can scale to fit future growth.
Benchmark Results
Our MLPerf Training results (Figure 7) demonstrate that running Red Hat Enterprise Linux (RHEL)-based containers on Red Hat OpenShift Container Platform software and Supermicro hardware very closely matches the performance of the NVIDIA published results (Table 2). NVIDIA’s published results are the bar we strive to meet. Our timings are within -6.13% to +2.29% of the NVIDIA published results, where negative means slower and positive means faster than the NVIDIA results. For the longer-running Mask R-CNN model our results were faster, while in the three shorter benchmarks our results were slightly slower. Overall, this shows that containerizing the benchmarks did not add significant overhead. One explanation for our comparable results is that containers add very little overhead, because essentially a container is just a process restricted to a private root filesystem and process namespace. Other possible reasons for the differences include slightly different software versions or hardware differences. For this paper we performed four of the six MLPerf Training v0.6 benchmarks. However, we did not have time to run the full suite of experiments needed to formally submit these results to the MLPerf consortium, hence our results are unverified by the MLPerf consortium.
Our logs are available at our GitLab site (https://gitlab.com/opendatahub/gpu-performance-benchmarks), and we have provided our best timings in comparison to NVIDIA’s formal closed-division submission. Our results demonstrate that we can match bare-metal MLPerf Training v0.6 timings with containerized MLPerf Training v0.6 on OpenShift when running on Supermicro hardware, and that containerization and the OpenShift Kubernetes platform do not add significant overhead.
We quantify the deep learning training performance of the Red Hat Enterprise Linux-based containerized MLPerf Training v0.6 benchmarking suite running on Supermicro Super Servers with NVIDIA V100 GPUs, and compare these results with NVIDIA’s published DGX-1 MLPerf Training v0.6 results. MLPerf Training workloads are performance-sensitive, and GPUs are particularly effective at running them efficiently by exploiting the parallelism inherent in their neural network models. In our lab we ran containerized MLPerf Training v0.6 to demonstrate the efficiency of training real-world AI models on OpenShift with Supermicro GPU-optimized Super Servers and the BigTwin and Ultra servers optimized for Red Hat OpenShift. We demonstrate that model training with Red Hat Enterprise Linux-based containers runs as efficiently on OpenShift and Supermicro hardware as NVIDIA’s published MLPerf Training v0.6 NGC container results. In Figure 8, we show NVIDIA timeseries metrics displayed in Grafana dashboards for easy understanding of performance data and identification of bottlenecks.
The advantage of containerizing these models is that it provides software portability, improved collaboration, and data reproducibility without significantly affecting performance. Using Red Hat Enterprise Linux containers to run AI/ML applications hides the plumbing from users and allows them to concentrate on data science so they can innovate faster. AI models can be difficult for non-experts to install and train due to many dependencies on specific versions of libraries and compilers. Additionally, maintaining multiple runtime environments so that models with different dependencies can run on the same cluster is a burden for the cluster administrator. Containerizing these models helps overcome these challenges.
Conclusions and Future Work
In this paper we ran the AI workload MLPerf Training v0.6 to compare the performance of Supermicro hardware and Red Hat Enterprise Linux-based containers on OpenShift with NVIDIA’s published MLPerf Training v0.6 results. We were able to show that the Red Hat Enterprise Linux-based containers running on the OpenShift Container Platform and Supermicro hardware ran faster for the Mask R-CNN benchmark than NVIDIA’s MLPerf Training v0.6 results, and only slightly slower (0.5%-6%) for the shorter tests. Our GNMT timing was 6.13% slower, which could have been due to the initial random seed used by the benchmark. Future work could include analysis of the stability of GNMT for different random seeds.
For more information
APPENDIX
Installing the GPU device plug-in in OpenShift Container Platform
The following instructions were originally developed by Zvonko Kaiser (Red Hat Inc).
Host Preparation
NVIDIA drivers for Red Hat Enterprise Linux must be installed on the host with GPUs as a prerequisite for using GPUs with OpenShift. To prepare the GPU-enabled host we begin by installing NVIDIA drivers and the NVIDIA container enablement. The following procedures will make containerized GPU workloads possible in Red Hat OpenShift 3.11.
Part 1: NVIDIA Driver Installation
Step 1: NVIDIA drivers are compiled from source. The build process requires the kernel-devel package to be installed.
# yum -y install kernel-devel-`uname -r`
The NVIDIA driver package requires the DKMS package from EPEL (Extra Packages for Enterprise Linux). DKMS is not supported or packaged by Red Hat. Work is underway to remove the NVIDIA driver requirement on DKMS for Red Hat distributions. DKMS can be installed from the EPEL repository.
Step 2: Install the EPEL repository.
# yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
Step 3: Install the NVIDIA drivers. The newest NVIDIA drivers are located in the referenced repository.
# yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
Step 4: Install the auxiliary tools and libraries contained in the following packages. This will also install the nvidia-kmod package, which includes the NVIDIA kernel modules.
# yum -y install nvidia-driver nvidia-driver-cuda nvidia-modprobe
Step 5: Remove the nouveau kernel module; otherwise the NVIDIA kernel module will not load. The installation of the NVIDIA driver package blacklists the driver in the kernel command line (nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off), so that the nouveau driver will not be loaded on subsequent reboots.
# modprobe -r nouveau
Step 6: Load the NVIDIA module and the unified memory kernel modules.
# nvidia-modprobe && nvidia-modprobe -u
Step 7: Verify that the installation and the drivers are working. Extracting the name of the GPU can later be used to label the node in OpenShift.
# nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
Tesla-V100-SXM2-16GB
Adding the nvidia-container-runtime-hook
The version of docker shipped by Red Hat includes support for OCI runtime hooks. Because of this, we only need to install the nvidia-container-runtime-hook package. On other distributions of docker, additional steps may be necessary. See NVIDIA’s documentation for more information.
Step 1: Install the libnvidia-container and nvidia-container-runtime repositories.
# curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo | tee /etc/yum.repos.d/nvidia-container-runtime.repo
Step 2: Install an OCI prestart hook. The prestart hook is responsible for making NVIDIA libraries and binaries available in a container (by bind-mounting them in from the host). Without the hook, users would have to include the libraries and binaries in each and every container image that might use a GPU. Hooks simplify management of container images by ensuring that only a single copy of the libraries and binaries is required. The prestart hook is triggered by the presence of certain environment variables in the container: NVIDIA_DRIVER_CAPABILITIES=compute,utility.
# yum -y install nvidia-container-runtime-hook
This package will also install the config/activation files for docker/podman/cri-o. Beware that the hook JSON in the package will only work with cri-o >= 1.12; for cri-o 1.11, use the following JSON file:
# cat <<'EOF' >> /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
"hasbindmounts": true,
"hook": "/usr/bin/nvidia-container-runtime-hook",
"stage": [ "prestart" ]
}
EOF
Adding the SELinux policy module
To run NVIDIA containers contained and not privileged, we have to install an SELinux policy tailored for running CUDA GPU workloads. The policy creates a new SELinux type (nvidia_container_t) under which the container will run.
Furthermore, we can drop all capabilities and prevent privilege escalation. See the invocation below for a glimpse of how to start an NVIDIA container.
Next, install the SELinux policy module on all GPU worker nodes.
# wget https://raw.githubusercontent.com/zvonkok/origin-ci-gpu/master/selinux/nvidia-container.pp
# semodule -i nvidia-container.pp
Check and restore the labels on the node
The new SELinux policy heavily relies on the correct labeling of the host. Therefore, we must make sure that the required files have the correct SELinux label.
Step 1: Restorecon all files that the prestart hook will need.
# nvidia-container-cli -k list | restorecon -v -f -
Step 2: Restorecon all accessed devices.
# restorecon -Rv /dev
Step 3: Restorecon all files that the device plugin will need.
# restorecon -Rv /var/lib/kubelet
Everything is now set up for running a GPU-enabled container.
Verify SELinux and prestart hook functionality
To verify correct operation of driver and container enablement, try running a cuda-vector-add container. We can run the container with docker or podman.
# podman run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL \
  --security-opt label=type:nvidia_container_t \
  docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
# docker run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL \
  --security-opt label=type:nvidia_container_t \
  docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
If the test passes, the drivers, hooks and the container runtime are functioning correctly and we can move on to configuring Red Hat OpenShift.
Part 2: OpenShift 3.11 with the GPU Device Plugin
Install the NVIDIA device plugin. To schedule the device plugin on nodes that include GPUs, label the node as follows:
# oc label node <node-with-gpu> openshift.com/gpu-accelerator=true
node "<node-with-gpu>" labeled
This label will be used in the next step.
Deploy the NVIDIA Device Plugin Daemonset
The next step is to deploy the NVIDIA device plugin. Note that the NVIDIA Device Plugin (and more generally, any hardware manufacturer’s plugin) is supported by the vendor, and is not shipped or supported by Red Hat.
Clone the following repository; there are several yaml files for future use.
# git clone https://github.com/redhat-performance/openshift-psap.git
# cd openshift-psap/blog/gpu/device-plugin
Here is an example daemonset (device-plugin/nvidia-device-plugin.yml) which will use the label we created in the last step so that the plugin pods will only run where GPU hardware is available.
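The full manifest is in the cloned repository; the abbreviated sketch below shows the key parts (the image tag and the security settings are assumptions and may differ from the file in the repository). The nodeSelector matches the openshift.com/gpu-accelerator=true label, and the hostPath volume gives the plugin access to the kubelet device-plugin socket directory.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Only schedule plugin pods on nodes labeled in the previous step.
      nodeSelector:
        openshift.com/gpu-accelerator: "true"
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:1.11   # assumed image tag; check the repository file
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        # The plugin registers with the kubelet through the device-plugin socket.
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins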
Now create the NVIDIA device plugin daemonset.
# oc create -f nvidia-device-plugin.yml
Let’s verify the correct execution of the device plugin. You can see there is only one running since only one node was labeled in the previous step.
# oc get pods -n kube-system
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-s9ngg 1/1 Running 0 1m
Once the pod is running, let’s have a look at the logs.
# oc logs nvidia-device-plugin-daemonset-s9ngg -n kube-system
2018/07/12 12:29:38 Loading NVML
2018/07/12 12:29:38 Fetching devices.
2018/07/12 12:29:38 Starting FS watcher.
2018/07/12 12:29:38 Starting OS watcher.
2018/07/12 12:29:38 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/07/12 12:29:38 Registered device plugin with Kubelet
At this point the node itself will advertise the nvidia.com/gpu extended resource in its capacity:
# oc describe node <gpu-node> | egrep 'Capacity|Allocatable|gpu'
Capacity:
nvidia.com/gpu: 2
Allocatable:
nvidia.com/gpu: 2
Nodes that do not have GPUs installed will not advertise GPU capacity.
Deploy a pod that requires a GPU
Let’s run a GPU-enabled container on the cluster. We can use the cuda-vector-add image that was used in the host preparation step. Use the following file (device-plugin/cuda-vector-add.yaml) as a pod description for running the cuda-vector-add image in OpenShift. Note the last line requests one NVIDIA GPU from OpenShift. The OpenShift scheduler will see this and schedule the pod to a node that has a free GPU. Once the pod-create request arrives at a node, the Kubelet will coordinate with the device plugin to start the pod with a GPU resource.
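For reference, a minimal sketch of such a pod description is shown below; the authoritative version is device-plugin/cuda-vector-add.yaml in the cloned repository, and fields other than the image and the GPU resource request are assumptions.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
    resources:
      limits:
        nvidia.com/gpu: 1   # requests one NVIDIA GPU from OpenShift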
First create a project that will group all of our GPU work.
# oc new-project nvidia
The next step is to create and start the pod.
# oc create -f cuda-vector-add.yaml
After a couple of seconds, the container finishes.
# oc get pods
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 3s
nvidia-device-plugin-daemonset-s9ngg 1/1 Running 0 9m
Check the logs for any errors. We are looking for Test PASSED in the pod logs.
# oc logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
This output is the same as when we ran the container directly using podman or docker. If you see a permission denied error, check to see that you have the correct SELinux label.