Note: The following procedure can also be used to deploy the NVIDIA GPU Operator, since it follows the same prerequisites as the SRO operator. Docs are here.
The job of the Performance and Latency Sensitive Applications (PSAP) team at Red Hat is optimizing Red Hat OpenShift, the industry’s most comprehensive enterprise Kubernetes platform, to run compute-intensive enterprise workloads and HPC applications effectively and efficiently. As a team of Linux and performance enthusiasts who are always pushing the limits of what is possible with the latest and greatest upstream technologies, we are operating at the forefront of innovation with compelling proof-of-concept (POC) implementations and advanced deployment scenarios.
Overview
Driver containers are a novel way of including device-specific kernel modules (kmods) within an OCI container. Since these kmods have close dependencies on kernel versions (and kernel headers), they need to be (re)compiled on the target host. The Special Resource Operator (SRO for short) was designed for this purpose.
However, the SRO needs access to RHEL source code from the target host. While this is fully automated in environments that can access the internet, and therefore the RHEL source code, setting it up for disconnected environments requires some more configuration.
This blog post details the deployment of SRO/driver containers in disconnected environments (both truly disconnected and proxied).
Prerequisites
You must have access to the internet to obtain the data that populates the mirror repository. In this procedure, you will place the mirror registry on a bastion host that has access to both your network and the internet. If you do not have access to a bastion host, use the method that best fits your restrictions to bring the contents of the mirror registry into your restricted network. You must also have a Red Hat Enterprise Linux (RHEL) server on your network to use as the registry host. The registry host MUST be able to access the internet, or at least be allowed to access the URLs mentioned throughout this guide.
The cluster must be properly configured and entitled as seen in:
Part 1 - Setting the Mirror Registry and OLM Catalog
Procedure
[Bastion host]
Step 1: Create a Mirror Registry
Follow installation-creating-mirror-registry_samples-operator-alt-registry.
Note: You must ensure that your registry hostname is registered in the same DNS and that it resolves to the expected IP address. Otherwise, pulls will fail with an x509 error, because the certificate is issued for that hostname and not for a public name.
Step 2: Authenticate the Mirror Registry
[Bastion host/Local host]
Now, let’s allow our cluster to reference images from the mirror registry we just built.
Follow installation-adding-registry-pull-secret_samples-operator-alt-registry.
[Optional] To authenticate your mirror registry, you need to configure additional trust stores for image registry access in your OCP cluster. You can create a ConfigMap in the openshift-config namespace and use its name in additionalTrustedCA in the image.config.openshift.io resource. This provides additional CAs that should be trusted when contacting external registries.
The ConfigMap key is the hostname and port of a registry for which this CA is to be trusted, and the base64-encoded certificate is the value, for each additional registry CA to trust.
You can configure additional CAs with the following procedure:
bash
$ oc create configmap registry-config --from-file=<external_registry_address>=ca.crt -n openshift-config
$ oc edit image.config.openshift.io cluster
spec:
  additionalTrustedCA:
    name: registry-config
Note: If your <external_registry_address> contains a port such as ':5000', write it as '..5000' in the ConfigMap key to avoid this error:
bash
error: "xxxxxxxxxx::5000" is not a valid key name for a ConfigMap: a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')
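The substitution described in the note above can be scripted so that the key is always valid. The following is a minimal sketch: the function name `registry_to_cm_key` and the address `registry.example.com:5000` are illustrative assumptions and not part of the original procedure. It only prints the resulting `oc create configmap` command so you can review it before running it.

```bash
#!/usr/bin/env bash
# Convert a registry address such as "host:5000" into a valid ConfigMap key
# by replacing the ':' in front of the port with '..'.
# (Hypothetical helper, named here for illustration only.)
registry_to_cm_key() {
  local address="$1"
  # ${var//:/..} replaces every ':' in the address with '..'
  printf '%s\n' "${address//:/..}"
}

# Example with a placeholder registry address: print the command for review.
key="$(registry_to_cm_key 'registry.example.com:5000')"
echo "oc create configmap registry-config --from-file=${key}=ca.crt -n openshift-config"
```

An address without a port passes through unchanged, so the helper can be applied to every registry address.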
Step 3: Building an Operator Catalog Image
Note: For now, we need to specify the architecture we want to mirror into the registry when using the oc CLI. To do so, pass the flag --filter-by-os='linux/amd64' during both steps:
oc adm catalog build --filter-by-os='linux/amd64' ….
oc adm catalog mirror --filter-by-os='linux/amd64' ….
This prevents a known error caused by the docker registry not supporting multi-architecture manifests.
[Optional] Mirror Images for HELM Deployment
After deploying the mirror image registry in step 2:
Mirror the images listed at: https://github.com/NVIDIA/gpu-operator/blob/master/bundle/manifests/gpu-operator.clusterserviceversion.yaml#L128
yaml
relatedImages:
  - name: gpu-operator-image
    image: nvcr.io/nvidia/gpu-operator@sha256:1a1c95d392ea2c055b09c9d074ab4d577a42d5d338109234d7a868bf2ebdfa8d
  - name: dcgm-exporter-image
    image: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88
  - name: container-toolkit-image
    image: nvcr.io/nvidia/k8s/container-toolkit@sha256:b3f48033d7d9e1d5703b6ecffe35d219a45a17bdcf85374d78924dee9c8917be
  - name: driver-image
    image: nvcr.io/nvidia/driver@sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934
  - name: device-plugin-image
    image: nvcr.io/nvidia/k8s-device-plugin@sha256:45b459c59d13a1ebf37260a33c4498046d4ade7cc243f2ed71115cd81054cd85
  - name: gpu-feature-discovery-image
    image: nvcr.io/nvidia/gpu-feature-discovery@sha256:82e6f61b715d710c60ac14be78953336ea5dbc712244beb51036139d1cc8d526
  - name: cuda-sample-image
    image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
  - name: dcgm-init-container-image
    image: nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
Then follow this guide: https://docs.openshift.com/container-platform/4.6/openshift_images/image-configuration.html to configure the `registrySources` of OpenShift to pull those images from the mirror registry.
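When mirroring the images listed above, it can help to compute the destination path in the mirror registry for each image instead of typing it by hand. The sketch below is one possible way to do that in plain bash. The helper names `mirror_dest` and `print_mappings` and the registry `mirror.example.com:5000` are assumptions made for illustration, and the SRC=DST output form is only a convenient notation: check which mapping format your mirroring tool expects before feeding it these lines.

```bash
#!/usr/bin/env bash
# Rewrite an upstream image reference so that it points at the mirror registry.
# Drops the source registry host and the digest, keeps the repository path.
# (Hypothetical helper; the naming scheme for the mirror is an assumption.)
mirror_dest() {
  local image="$1" mirror="$2"
  local repo="${image%%@*}"   # strip the "@sha256:..." digest
  repo="${repo#*/}"           # strip the leading registry host (e.g. nvcr.io)
  printf '%s/%s\n' "$mirror" "$repo"
}

# Read image references from stdin and print one SRC=DST line per image.
print_mappings() {
  local mirror="$1" image
  while IFS= read -r image; do
    [ -n "$image" ] || continue   # skip empty lines
    printf '%s=%s\n' "$image" "$(mirror_dest "$image" "$mirror")"
  done
}

# Example using two of the images from the list above and a placeholder mirror.
print_mappings 'mirror.example.com:5000' <<'EOF'
nvcr.io/nvidia/driver@sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934
nvcr.io/nvidia/k8s/dcgm-exporter@sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88
EOF
```

Keeping the repository path unchanged under the mirror host makes it easier to map the original references to the mirrored ones later.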
Part 2 - Setting the YUM Mirror and Driver Container
Note: Part 2 is only needed for SRO or the NVIDIA GPU Operator; the NFD operator does not need this step.
To set up a YUM mirror, we can either use Red Hat Satellite or create a custom mirror.
The packages we need to host in our mirror are:
- elfutils-libelf.${HOST_ARCH}
- elfutils-libelf-devel.${HOST_ARCH}
- kernel-headers-${GPU_NODE_KERNEL_VERSION}
- kernel-devel-${GPU_NODE_KERNEL_VERSION}
- kernel-core-${GPU_NODE_KERNEL_VERSION}
These packages are needed to run the driver container, as can be seen at: https://gitlab.com/nvidia/container-images/driver/-/blob/master/rhel8/nvidia-driver .
Note: You can get the $HOST_ARCH and $GPU_NODE_KERNEL_VERSION from `oc describe node` on one of the nodes.
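As a rough illustration of the note above, the kernel version can be pulled out of the `oc describe node` output and expanded into the package list. This is a sketch under stated assumptions: the helper names `node_field` and `required_packages` are made up for this example, the node name is a placeholder, and deriving `HOST_ARCH` from the suffix of the kernel version (for example `.x86_64`) assumes the usual RHEL kernel release naming.

```bash
#!/usr/bin/env bash
# Extract the value of a "Label: value" field from `oc describe node`
# output supplied on stdin. (Hypothetical helper for illustration.)
node_field() {
  local label="$1"
  awk -F': *' -v label="$label" '
    { key = $1; sub(/^[ \t]+/, "", key) }   # trim leading indentation
    key == label { print $2; exit }         # print the first matching value
  '
}

# Print the packages the mirror must host for a given arch and kernel version.
required_packages() {
  local arch="$1" kver="$2"
  printf '%s\n' \
    "elfutils-libelf.${arch}" \
    "elfutils-libelf-devel.${arch}" \
    "kernel-headers-${kver}" \
    "kernel-devel-${kver}" \
    "kernel-core-${kver}"
}

# Usage against a live cluster (the node name is a placeholder):
#   desc="$(oc describe node <gpu-node-name>)"
#   GPU_NODE_KERNEL_VERSION="$(printf '%s\n' "$desc" | node_field 'Kernel Version')"
#   HOST_ARCH="${GPU_NODE_KERNEL_VERSION##*.}"   # assumption: version ends in the arch
#   required_packages "$HOST_ARCH" "$GPU_NODE_KERNEL_VERSION"
```

Note that the Architecture field reported for a node may use a different naming scheme (such as amd64) than the RPM architecture, which is why this sketch takes the architecture from the kernel version string instead.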
With the YUM-mirror in place, the next step is to add the repository configuration to the driver container:
1. First, we create a ConfigMap containing the repository configuration file (my_mirror.repo):
bash
oc create configmap yum-repos-d --from-file /path/to/my_mirror.repo
2. Add the mirror repository to the operator buildConfig. For SRO this information must be added to: https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/1000-state-driver.yaml
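The repository configuration file referenced in step 1 is a regular yum .repo file. As a hedged sketch of what my_mirror.repo could contain, the helper below writes a minimal definition. The function name `write_repo_file`, the repository ID `my-mirror`, and the URL are placeholders, and `gpgcheck=0` is an assumption chosen to keep the example short; enable signature checking if your mirror serves signed packages.

```bash
#!/usr/bin/env bash
# Write a minimal yum repository definition to the given path.
# (Hypothetical helper; all values passed in below are placeholders.)
write_repo_file() {
  local path="$1" repo_id="$2" base_url="$3"
  cat > "$path" <<EOF
[${repo_id}]
name=${repo_id}
baseurl=${base_url}
enabled=1
gpgcheck=0
EOF
}

# Example: write the file to a temporary directory and show the result.
repo_file="${TMPDIR:-/tmp}/my_mirror.repo"
write_repo_file "$repo_file" my-mirror http://mirror.example.com/rhel8/x86_64/
cat "$repo_file"
```

The resulting file is what the `--from-file` argument of the ConfigMap command in step 1 would point to.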
For the NVIDIA GPU Operator v1.4 and above (currently 1.5.2), use the following steps. For versions before 1.4, follow the same instructions as for SRO.
1. Create a ConfigMap with the custom repo list:
bash
oc create configmap repo-config -n gpu-operator-resources --from-file /path/to/my_mirror.repo
2. Specify repoConfig in values.yaml (if deploying from HELM):
yaml
driver:
  repository: nvcr.io/nvidia
  image: driver
  version: "450.80.02"
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/yum.repos.d
Or edit the driver.repoConfig entry in the ClusterPolicy CR.
3. Deploy the operator via HELM.
4. Verify that the ConfigMap is mounted successfully in the driver container.
Now you are ready to deploy the SRO / GPU Operator to your disconnected OCP cluster.
We believe that Linux containers and container orchestration engines, most notably Kubernetes, are well positioned to power future software applications spanning multiple industries and verticals. Red Hat has embarked on a mission to enable some of the most critical workloads, like machine learning, deep learning, artificial intelligence, big data analytics, high-performance computing, and telecommunications, with Red Hat OpenShift. The PSAP team is supporting this mission across multiple footprints (public, private, and hybrid cloud), industries, and application types.
Troubleshooting
- The documentation does not always mention it, but it is a good idea to start by deploying a medium-sized instance to host the registry.
Relevant links
- https://wiki.centos.org/HowTos/CreateLocalMirror
- https://docs.openshift.com/container-platform/4.1/builds/running-entitled-builds.html#running-builds-with-satellite-subscriptions
- https://docs.openshift.com/container-platform/4.4/builds/running-entitled-builds.html#builds-source-secrets-entitlements_running-entitled-builds