Note: The following procedure can also be used to deploy the NVIDIA GPU Operator, since it shares the same prerequisites as the SRO. Docs are here.
The job of the Performance and Latency Sensitive Applications (PSAP) team at Red Hat is to optimize Red Hat OpenShift, the industry’s most comprehensive enterprise Kubernetes platform, to run compute-intensive enterprise workloads and HPC applications effectively and efficiently. As a team of Linux and performance enthusiasts who are always pushing the limits of what is possible with the latest and greatest upstream technologies, we are operating at the forefront of innovation with compelling proof-of-concept (POC) implementations and advanced deployment scenarios.
Overview
Driver containers are a novel way of shipping device-specific kernel modules (kmods) within an OCI container. Because these kmods depend closely on the kernel version (and kernel headers), they need to be (re)compiled against the target host's kernel. The Special Resource Operator (SRO) was designed for this purpose.
However, the SRO needs access to the RHEL source code matching the target host. While this is fully automated in environments that can reach the internet (and therefore the RHEL source code), disconnected environments require some additional configuration.
This blog post details the deployment of SRO/driver containers in disconnected environments (both truly disconnected and proxy-based).
Prerequisites
You must have access to the internet to obtain the data that populates the mirror repository. In this procedure, you will place the mirror registry on a bastion host that has access to both your network and the internet. If you do not have access to a bastion host, use the method that best fits your restrictions to bring the contents of the mirror registry into your restricted network. You also must have a Red Hat Enterprise Linux (RHEL) server on your network to use as the registry host. The registry host MUST be able to access the internet, or at least allow access to the needed URLs mentioned throughout this guide.
The cluster must be properly configured and entitled (see the entitled-builds documentation in the links at the end of this post).
Part 1 - Setting the Mirror Registry and OLM Catalog
Procedure
[Bastion host]
Step 1: Create a Mirror Registry
Follow - installation-creating-mirror-registry_samples-operator-alt-registry
Note: You must ensure that your registry hostname is registered in DNS and resolves to the expected IP address. Otherwise, image pulls will fail because the x509 certificate is issued for that hostname rather than a public name.
Step 2: Authenticate the Mirror Registry
[Bastion host/Local host]
Now, let’s allow our cluster to reference images from the mirror registry we just built.
Follow installation-adding-registry-pull-secret_samples-operator-alt-registry.
[Optional] If your mirror registry uses a custom certificate authority, you need to configure additional trust stores for image registry access in your OCP cluster. Create a ConfigMap in the openshift-config namespace and reference its name in additionalTrustedCA in the image.config.openshift.io resource. This provides additional CAs that should be trusted when contacting external registries.
The ConfigMap key is the hostname + port of a registry for which this CA is to be trusted, and the base64-encoded certificate is the value for each additional registry CA to trust.
You can configure additional CAs with the following procedure:
```bash
$ oc create configmap registry-config --from-file=<external_registry_address>=ca.crt -n openshift-config
$ oc edit image.config.openshift.io cluster
```

```yaml
spec:
  additionalTrustedCA:
    name: registry-config
```
Note: if your <external_registry_address> contains a port (for example ':5000'), the colon must be written as '..' (for example '..5000'), because ':' is not a valid character in a ConfigMap key. Otherwise you will hit this error:
```text
error: "xxxxxxxxxx::5000" is not a valid key name for a ConfigMap: a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')
```
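The substitution can be scripted; a minimal sketch, where the registry address is a hypothetical placeholder:

```bash
# ':' is not allowed in a ConfigMap key, so the port separator
# is written as '..' instead (hypothetical registry address):
registry="mirror.example.com:5000"
key="${registry/:/..}"   # replace ':' with '..'
echo "$key"              # mirror.example.com..5000
# Then use it as the ConfigMap key, e.g.:
# oc create configmap registry-config --from-file="${key}"=ca.crt -n openshift-config
```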
Step 3: Building an Operator Catalog Image
Note: For now, you need to specify the architecture to mirror into the registry using the oc CLI. To do so, pass the flag --filter-by-os='linux/amd64' during both steps:

```bash
oc adm catalog build --filter-by-os='linux/amd64' ...
oc adm catalog mirror --filter-by-os='linux/amd64' ...
```
This prevents a known error caused by the Docker registry not supporting multi-architecture manifests.
[Optional] Mirror Images for HELM Deployment
After deploying the mirror registry in Step 2:
Mirror the images listed at: https://github.com/NVIDIA/gpu-operator/blob/master/bundle/manifests/gpu-operator.clusterserviceversion.yaml#L128
```yaml
relatedImages:
  - name: gpu-operator-image
    image: nvcr.io/nvidia/gpu-operator@sha256:1a1c95d392ea2c055b09c9d074ab4d577a42d5d338109234d7a868bf2ebdfa8d
  - name: dcgm-exporter-image
    image: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88
  - name: container-toolkit-image
    image: nvcr.io/nvidia/k8s/container-toolkit@sha256:b3f48033d7d9e1d5703b6ecffe35d219a45a17bdcf85374d78924dee9c8917be
  - name: driver-image
    image: nvcr.io/nvidia/driver@sha256:324e9dc265dec320207206aa94226b0c8735fd93ce19b36a415478c95826d934
  - name: device-plugin-image
    image: nvcr.io/nvidia/k8s-device-plugin@sha256:45b459c59d13a1ebf37260a33c4498046d4ade7cc243f2ed71115cd81054cd85
  - name: gpu-feature-discovery-image
    image: nvcr.io/nvidia/gpu-feature-discovery@sha256:82e6f61b715d710c60ac14be78953336ea5dbc712244beb51036139d1cc8d526
  - name: cuda-sample-image
    image: nvcr.io/nvidia/k8s/cuda-sample@sha256:2a30fe7e23067bc2c3f8f62a6867702a016af2b80b9f6ce861f3fea4dfd85bc2
  - name: dcgm-init-container-image
    image: nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
```
Then follow this guide: https://docs.openshift.com/container-platform/4.6/openshift_images/image-configuration.html to configure the `registrySources` of OpenShift to pull those images from the mirror registry.
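The mirroring itself can be scripted. The sketch below only prints the commands for the first two images; MIRROR_REGISTRY is a hypothetical placeholder, and `oc image mirror` is one possible tool (skopeo works as well):

```bash
# Build "oc image mirror" commands that copy each related image
# into the mirror registry, keeping the repository path.
MIRROR_REGISTRY="mirror.example.com:5000"   # placeholder: your mirror registry
images=(
  "nvcr.io/nvidia/gpu-operator@sha256:1a1c95d392ea2c055b09c9d074ab4d577a42d5d338109234d7a868bf2ebdfa8d"
  "nvcr.io/nvidia/k8s/dcgm-exporter@sha256:85016e39f73749ef9769a083ceb849cae80c31c5a7f22485b3ba4aa590ec7b88"
)
for src in "${images[@]}"; do
  repo="${src%%@*}"                     # drop the digest
  dst="${MIRROR_REGISTRY}/${repo#*/}"   # swap nvcr.io for the mirror host
  echo "oc image mirror ${src} ${dst}"
done
```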
Part 2 - Setting the YUM Mirror and Driver Container
Note: Part 2 is only needed for SRO or the NVIDIA GPU Operator; the NFD operator does not need this step.
For setting up a YUM mirror, we can choose to use Red Hat Satellite or create a custom-made mirror (for example, following the CentOS local mirror guide linked at the end of this post).
The packages we need to host in our mirror are:
- elfutils-libelf.${HOST_ARCH}
- elfutils-libelf-devel.${HOST_ARCH}
- kernel-headers-${GPU_NODE_KERNEL_VERSION}
- kernel-devel-${GPU_NODE_KERNEL_VERSION}
- kernel-core-${GPU_NODE_KERNEL_VERSION}
These packages are needed to run the driver container, as can be seen at: https://gitlab.com/nvidia/container-images/driver/-/blob/master/rhel8/nvidia-driver .
Note: You can get the $HOST_ARCH and $GPU_NODE_KERNEL_VERSION from `oc describe node` on one of the nodes.
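Parsing those two values can be done with standard shell tools; a sketch using a hypothetical sample line (in a real cluster, pipe the actual `oc describe node` output instead):

```bash
# Hypothetical "Kernel Version" line as printed by `oc describe node`:
sample="Kernel Version:  4.18.0-193.29.1.el8_2.x86_64"

# The version string is the third whitespace-separated field...
GPU_NODE_KERNEL_VERSION="$(echo "$sample" | awk '{print $3}')"
# ...and the architecture is its last dot-separated component.
HOST_ARCH="${GPU_NODE_KERNEL_VERSION##*.}"

echo "kernel: $GPU_NODE_KERNEL_VERSION"  # kernel: 4.18.0-193.29.1.el8_2.x86_64
echo "arch: $HOST_ARCH"                  # arch: x86_64
```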
With the YUM-mirror in place, the next step is to add the repository configuration to the driver container:
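The repository configuration file (my_mirror.repo) might look like the following; this is a hypothetical sketch, where the repo id and baseurl are placeholders for your internal mirror:

```ini
# my_mirror.repo -- hypothetical example; baseurl points at the internal YUM mirror
[rhel-8-mirror]
name=RHEL 8 local mirror
baseurl=http://yum-mirror.example.com/rhel8/
enabled=1
gpgcheck=0
```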
1. First, we create a ConfigMap containing the repository configuration file (my_mirror.repo):

```bash
oc create configmap yum-repos-d --from-file /path/to/my_mirror.repo
```
2. Add the mirror repository to the operator BuildConfig. For SRO, this information must be added to: https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/1000-state-driver.yaml
For versions of the NVIDIA GPU Operator before 1.4, follow the same instructions as for SRO. For v1.4 and above (currently 1.5.2):

1. Create a ConfigMap with the custom repo list:

```bash
oc create configmap repo-config -n gpu-operator-resources --from-file /path/to/my_mirror.repo
```

2. Specify repoConfig in values.yaml (if deploying via Helm):

```yaml
driver:
  repository: nvcr.io/nvidia
  image: driver
  version: "450.80.02"
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/yum.repos.d
```
Alternatively, edit the driver.repoConfig entry in the ClusterPolicy CR.
3. Deploy the operator via Helm.
4. Verify that the ConfigMap is mounted in the driver container.
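For OLM-based deployments, the driver.repoConfig entry mentioned above is set in the ClusterPolicy CR. A hypothetical fragment (the resource name is a placeholder):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy   # placeholder name
spec:
  driver:
    repoConfig:
      configMapName: repo-config
      destinationDir: /etc/yum.repos.d
```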
Now you are ready to deploy SRO / the GPU Operator to your disconnected OCP cluster.
We believe that Linux containers and container orchestration engines, most notably Kubernetes, are well positioned to power future software applications spanning multiple industries and verticals. Red Hat has embarked on a mission to enable some of the most critical workloads, like machine learning, deep learning, artificial intelligence, big data analytics, high-performance computing, and telecommunications, with Red Hat OpenShift. The PSAP team is supporting this mission across multiple footprints (public, private, and hybrid cloud), industries, and application types.
Troubleshooting
- It is not explicitly mentioned in the documentation, but it is a good idea to start with at least a medium-sized instance to host the registry.
Relevant links
- https://wiki.centos.org/HowTos/CreateLocalMirror
- https://docs.openshift.com/container-platform/4.1/builds/running-entitled-builds.html#running-builds-with-satellite-subscriptions
- https://docs.openshift.com/container-platform/4.4/builds/running-entitled-builds.html#builds-source-secrets-entitlements_running-entitled-builds