
Note: The following procedure can also be used to deploy the NVIDIA GPU Operator, since it has the same prerequisites as the Special Resource Operator (SRO).

The job of the Performance and Latency Sensitive Applications (PSAP) team at Red Hat is optimizing Red Hat OpenShift, the industry’s most comprehensive enterprise Kubernetes platform, to run compute-intensive enterprise workloads and HPC applications effectively and efficiently. As a team of Linux and performance enthusiasts who are always pushing the limits of what is possible with the latest and greatest upstream technologies, we are operating at the forefront of innovation with compelling proof-of-concept (POC) implementations and advanced deployment scenarios.  

Overview

Driver containers are a novel way of packaging device-specific kernel modules (kmods) within an OCI container. Since these kmods depend closely on the kernel version (and kernel headers), they need to be (re)compiled on the target host. The Special Resource Operator (SRO) was designed for this purpose.

However, the SRO needs access to RHEL source code from the target host. This is fully automated in environments that can reach the internet, and therefore the RHEL source code, but setting it up for disconnected environments requires some additional configuration.

This blog post details the deployment of SRO and driver containers in disconnected environments (both fully disconnected and behind a proxy).

Prerequisites

You must have access to the internet to obtain the data that populates the mirror repository. In this procedure, you place the mirror registry on a bastion host that has access to both your network and the internet. If you do not have access to a bastion host, use the method that best fits your restrictions to bring the contents of the mirror registry into your restricted network. You must also have a Red Hat Enterprise Linux (RHEL) server on your network to use as the registry host. The registry host MUST be able to access the internet, or at least the URLs mentioned throughout this guide.

The cluster must be properly configured and entitled as seen in:

Part 1 - Setting the Mirror Registry and OLM Catalog

Procedure

[Bastion host]

Step 1: Create a Mirror Registry

Follow installation-creating-mirror-registry_samples-operator-alt-registry.

Note: You must ensure that your registry hostname is served by the same DNS and that it resolves to the expected IP address. Otherwise, pulls will fail with an x509 error, because the certificate is issued for the hostname and not for a public name.

Step 2: Authenticate the Mirror Registry

[Bastion host/Local host]

Now, let’s allow our cluster to reference images from the mirror registry we just built.

Follow installation-adding-registry-pull-secret_samples-operator-alt-registry.  

[Optional] To authenticate your mirror registry, you need to configure additional trust stores for image registry access in the OCP cluster. You can create a ConfigMap in the openshift-config namespace and use its name in additionalTrustedCA in the image.config.openshift.io resource. This provides additional CAs that should be trusted when contacting external registries.

The ConfigMap key is the hostname + port of a registry for which this CA is to be trusted, and the base64-encoded certificate is the value for each additional registry CA to trust.

You can configure additional CAs with the following procedure:

bash
$ oc create configmap registry-config --from-file=<external_registry_address>=ca.crt -n openshift-config
$ oc edit image.config.openshift.io cluster
spec:
  additionalTrustedCA:
    name: registry-config

Note: If your <external_registry_address> contains a port such as ':5000', write the ':' as '..' (for example, '..5000') to avoid this error:

bash
error: "xxxxxxxxxx::5000" is not a valid key name for a ConfigMap: a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name',  or 'KEY_NAME',  or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')
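
The key substitution described in the note above can be scripted so that it is not forgotten. The following is a minimal sketch, assuming a hypothetical helper named registry_configmap_key and a placeholder registry address; the commented oc patch call is shown only as a non-interactive alternative to the oc edit step and is not part of the original procedure.

```shell
# Hypothetical helper: turn "<host>:<port>" into a valid ConfigMap key
# by replacing every ':' with '..' (for example, ':5000' becomes '..5000').
registry_configmap_key() {
  printf '%s\n' "$1" | sed 's/:/../g'
}

# Example usage (registry.example.com:5000 is a placeholder address):
#   key="$(registry_configmap_key registry.example.com:5000)"
#   oc create configmap registry-config --from-file="${key}=ca.crt" -n openshift-config
#   oc patch image.config.openshift.io/cluster --type=merge \
#     -p '{"spec":{"additionalTrustedCA":{"name":"registry-config"}}}'
```

A hostname without a port passes through unchanged, so the helper can be applied to any registry address.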

Step 3: Building an Operator Catalog Image

  1. Follow Building an Operator catalog image

  2. Follow Mirroring the OpenShift Container Platform image repository

Note: For now, you must specify which architecture to mirror into the registry when using the oc CLI. To do this in both steps, pass the flag --filter-by-os='linux/amd64':

oc adm catalog build --filter-by-os='linux/amd64' ….
oc adm catalog mirror --filter-by-os='linux/amd64' ….

This prevents a known error caused by the Docker registry not supporting multi-architecture manifests.

[Optional] Mirror Images for Helm Deployment

After deploying the mirror image registry in step 2:

Mirror the images listed at: https://github.com/NVIDIA/gpu-operator/blob/master/bundle/manifests/gpu-operator.clusterserviceversion.yaml#L128 

yaml
relatedImages:
   - name: gpu-operator-image
     image: nvcr.io/nvidia/gpu-operator@sha256:1a1c95d392ea2c055b09c9d074ab4d577a42d5d338109234d7a868bf2ebdfa8d
   - name: dcgm-exporter-image
     image: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88
   - name: container-toolkit-image
     image: nvcr.io/nvidia/k8s/container-toolkit@sha256:b3f48033d7d9e1d5703b6ecffe35d219a45a17bdcf85374d78924dee9c8917be
   - name: driver-image
     image: nvcr.io/nvidia/driver@sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934
   - name: device-plugin-image
     image: nvcr.io/nvidia/k8s-device-plugin@sha256:45b459c59d13a1ebf37260a33c4498046d4ade7cc243f2ed71115cd81054cd85
   - name: gpu-feature-discovery-image
     image: nvcr.io/nvidia/gpu-feature-discovery@sha256:82e6f61b715d710c60ac14be78953336ea5dbc712244beb51036139d1cc8d526
   - name: cuda-sample-image
     image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
   - name: dcgm-init-container-image
     image: nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59

Then follow this guide: https://docs.openshift.com/container-platform/4.6/openshift_images/image-configuration.html to configure the `registrySources` of OpenShift to pull those images from the mirror registry.
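
Mirroring the images listed above one by one is repetitive, so it can help to generate the commands. The following is a rough sketch rather than a tested procedure: the helper names and the mirror address are hypothetical, the destination layout (same repository path, source host replaced by the mirror) is an assumption, and the commands are only printed so they can be reviewed before being run.

```shell
# Hypothetical helper: map an nvcr.io image reference to the same
# repository path on the mirror registry, dropping the digest suffix.
mirror_destination() {
  src="$1"                 # e.g. nvcr.io/nvidia/driver@sha256:<digest>
  mirror="$2"              # e.g. registry.example.com:5000 (placeholder)
  repo="${src#nvcr.io/}"   # strip the source registry host
  repo="${repo%@*}"        # strip the @sha256:<digest> suffix
  printf '%s/%s\n' "$mirror" "$repo"
}

# Print (do not run) one "oc image mirror" command per image read from stdin.
print_mirror_commands() {
  mirror="$1"
  while IFS= read -r image; do
    [ -n "$image" ] || continue   # skip empty lines
    printf 'oc image mirror %s %s\n' "$image" "$(mirror_destination "$image" "$mirror")"
  done
}

# Example usage (images.txt holds one image reference per line):
#   print_mirror_commands registry.example.com:5000 < images.txt
```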

Part 2 - Setting the YUM Mirror and Driver Container 

Note: Part 2 is only needed for SRO or the NVIDIA GPU Operator; the NFD operator does not need this step.

To set up a YUM mirror, we can either use Red Hat Satellite or create a custom mirror.

The packages we need to host in our mirror are:

  • elfutils-libelf.${HOST_ARCH} 
  • elfutils-libelf-devel.${HOST_ARCH}
  • kernel-headers-${GPU_NODE_KERNEL_VERSION}
  • kernel-devel-${GPU_NODE_KERNEL_VERSION}
  • kernel-core-${GPU_NODE_KERNEL_VERSION}

These packages are needed to run the driver container, as can be seen at: https://gitlab.com/nvidia/container-images/driver/-/blob/master/rhel8/nvidia-driver .
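
To avoid typing the package names by hand, the list above can be expanded for a concrete architecture and kernel version. This is a minimal sketch; the helper name driver_container_packages and the sample values in the comments are hypothetical.

```shell
# Hypothetical helper: expand the package list for one architecture
# and one kernel version, printing one package name per line.
driver_container_packages() {
  host_arch="$1"        # e.g. x86_64
  kernel_version="$2"   # e.g. 4.18.0-193.29.1.el8_2.x86_64
  printf '%s\n' \
    "elfutils-libelf.${host_arch}" \
    "elfutils-libelf-devel.${host_arch}" \
    "kernel-headers-${kernel_version}" \
    "kernel-devel-${kernel_version}" \
    "kernel-core-${kernel_version}"
}

# Example usage:
#   driver_container_packages "$HOST_ARCH" "$GPU_NODE_KERNEL_VERSION"
```

The output can then be fed to whatever download or sync tool you use to populate the mirror.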

Note: You can get the $HOST_ARCH and $GPU_NODE_KERNEL_VERSION from `oc describe node` on one of the nodes.
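
The two values can also be extracted from the `oc describe node` output instead of being copied manually. The sketch below uses hypothetical helper names and assumes the kernel version string ends in ".<arch>" (for example ".x86_64"); note that the Architecture field reported by Kubernetes uses names such as amd64, while RPM package names use x86_64, which is why the architecture is derived from the kernel version here.

```shell
# Hypothetical helper: read `oc describe node <node>` output on stdin and
# print the value of the "Kernel Version:" line.
node_kernel_version() {
  awk -F': *' '/^ *Kernel Version:/ { print $2; exit }'
}

# Hypothetical helper: derive the RPM architecture from the kernel version
# suffix (assumes the version string ends in ".<arch>").
arch_from_kernel_version() {
  printf '%s\n' "${1##*.}"
}

# Example usage (<node_name> must be replaced with one of your GPU nodes):
#   GPU_NODE_KERNEL_VERSION="$(oc describe node <node_name> | node_kernel_version)"
#   HOST_ARCH="$(arch_from_kernel_version "$GPU_NODE_KERNEL_VERSION")"
```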

With the YUM-mirror in place, the next step is to add the repository configuration to the driver container: 

1. First, create a ConfigMap containing the repository configuration file (my_mirror.repo):
bash
oc create configmap yum-repos-d --from-file /path/to/my_mirror.repo
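
If you do not yet have a my_mirror.repo file, the following sketch shows one way to generate a minimal one. The helper name, repository ID, and base URL are all placeholders, and gpgcheck=0 is used only to keep the example short; enable GPG checking if your mirror serves signed packages and the key is available to the driver container.

```shell
# Hypothetical helper: print a minimal yum repository definition.
# The repository ID and base URL are supplied by the caller.
print_mirror_repo() {
  repo_id="$1"
  base_url="$2"
  cat <<EOF
[${repo_id}]
name=${repo_id}
baseurl=${base_url}
enabled=1
gpgcheck=0
EOF
}

# Example usage (placeholder values):
#   print_mirror_repo my-mirror http://mirror.example.com/rhel8 > my_mirror.repo
```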

2. Add the mirror repository to the operator buildConfig. For SRO this information must be added to: https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/1000-state-driver.yaml 

and:
https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/0000-state-driver-buildconfig.yaml 

For the NVIDIA GPU Operator v1.4 and above (currently 1.5.2), use the steps below. For versions before 1.4, follow the same instructions as for SRO.

1. Create a ConfigMap with the custom repository list:

bash
oc create configmap repo-config -n gpu-operator-resources --from-file /path/to/my_mirror.repo

2. Specify repoConfig in values.yaml (if deploying with Helm):
yaml
driver:
 repository: nvcr.io/nvidia
 image: driver
 version: "450.80.02"
 repoConfig:
   configMapName: repo-config
   destinationDir: /etc/yum.repos.d

Alternatively, edit the driver.repoConfig entry in the ClusterPolicy CR.

3. Deploy the operator with Helm.

4. Verify that the ConfigMap is mounted successfully in the driver container.
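
One way to perform the verification step is to list the repository directory inside a running driver pod. This is only a sketch: the helper name is hypothetical, the namespace and directory are taken from the values used above, and the pod name must be looked up first (for example with `oc get pods -n gpu-operator-resources`).

```shell
# Hypothetical helper: list the repository files visible inside a driver pod.
# The namespace and mount directory follow the values used in this guide;
# the pod name is supplied by the caller.
list_driver_repo_files() {
  pod="$1"
  oc -n gpu-operator-resources exec "$pod" -- ls /etc/yum.repos.d
}

# Example usage (replace <driver_pod_name> with a pod from your cluster):
#   list_driver_repo_files <driver_pod_name>
```

If the mount worked, my_mirror.repo should appear in the listing.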

Now you are ready to deploy the SRO or the GPU Operator to your disconnected OCP cluster.

We believe that Linux containers and container orchestration engines, most notably Kubernetes, are well positioned to power future software applications spanning multiple industries and verticals. Red Hat has embarked on a mission to enable some of the most critical workloads, like machine learning, deep learning, artificial intelligence, big data analytics, high-performance computing, and telecommunications, with Red Hat OpenShift. The PSAP team is supporting this mission across multiple footprints (public, private, and hybrid cloud), industries, and application types.

Troubleshooting

  • The documentation does not always mention this, but it is a good idea to start by deploying a medium-sized instance to host the registry.
