This post was written by: Swati Sehgal, Alexey Perevalov, Killian Muldoon & Francesco Romani. This is part 3 of the series; see also Part 1 and Part 2.
Topology Aware Scheduling is all about enhancing the hardware awareness of Kubernetes at the cluster control plane level. The hard details, like everything else related to the infrastructure underlying a cluster, are held by the kubelet.
Topology Aware Scheduling will bring changes to the way kubelet exposes this information to make hardware aware extensions easier to develop and consistent with Kubernetes’ understanding of the infrastructure it is managing. To do all this, we are relying on the Pod Resources API.
The Kubernetes pod resources API is a kubelet API introduced in Kubernetes 1.13 that enables monitoring applications to track the resource allocation of the pods.
The service that enables the API is implemented as a gRPC endpoint listening on a Unix domain socket, returning information about the kubelet's assignment of devices to containers.
The API as originally introduced offered a single call to list all the pods and learn about their resource assignments, which fits the proposed use case for monitoring applications.
A few months later, an effort to introduce more topology-aware scheduling began.
To support it, the scheduler needs to learn more detailed information about node resource availability and allocation.
Topology-aware scheduling aims to provide a framework for generic topology-aware resource allocation, but the first step toward solving this problem is working out the simpler case of NUMA-aware allocation. Throughout this post, however, we will use “topology zone” as a generic alias for the simpler and more familiar term “NUMA zone” (or “NUMA cell” or “NUMA node,” which we treat as synonymous).
The kubelet is the source of truth with respect to the node resource state; thus, extracting the resource state information from the kubelet quickly emerged as the right approach.
To implement this resource reporting, a few approaches were discussed, including introducing a new API. But we realized it is possible to generalize the pod resources API to export all the information topology-aware scheduling requires while remaining true to the spirit of the pod resources API itself.
To report enough data to enable topology-aware scheduling, we need to export more information than the current API does.
When listing pod resources allocation:
- We need to attach topology information to all the allocated devices to properly account for resource availability in each topology zone on the node.
- In the case of exclusive CPU allocation, we need to expose which CPUs are exclusively allocated to containers in the pods. This is a bit of a special case, considering that all other resources that carry topology information fit into the more generic “resource” reporting (for example, devices and memory).
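The two additions above can be sketched in plain Python. The records below are hand-built stand-ins for a List response (no gRPC involved), with field names loosely mirroring the v1 podresources API; the resource names and pod names are hypothetical. The helper shows why per-device NUMA annotations matter: they let a consumer account for allocations per topology zone.

```python
# Hand-built stand-ins for the extended List response: per-container
# device allocations annotated with NUMA zones, plus exclusive CPU IDs.
from collections import defaultdict

list_response = [  # one entry per container, flattened across pods
    {
        "pod": "dpdk-app", "container": "worker",
        "devices": [
            # each device carries the NUMA zone(s) it is attached to
            {"resource_name": "vendor.com/nic", "device_ids": ["nic-0"],
             "numa_nodes": [0]},
        ],
        "cpu_ids": [2, 3],  # exclusively allocated CPUs (the special case)
    },
    {
        "pod": "gpu-app", "container": "trainer",
        "devices": [
            {"resource_name": "vendor.com/gpu", "device_ids": ["gpu-1"],
             "numa_nodes": [1]},
        ],
        "cpu_ids": [10, 11],
    },
]

def devices_per_zone(containers):
    """Count allocated devices per (NUMA zone, resource name) pair."""
    usage = defaultdict(int)
    for c in containers:
        for dev in c["devices"]:
            for zone in dev["numa_nodes"]:
                usage[(zone, dev["resource_name"])] += len(dev["device_ids"])
    return dict(usage)

print(devices_per_zone(list_response))
# {(0, 'vendor.com/nic'): 1, (1, 'vendor.com/gpu'): 1}
```

Without the `numa_nodes` annotation, the same response would only tell us *how many* devices each container holds, not *where* they sit on the node.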
To make proper placement decisions, the scheduler needs to know the available resources in each NUMA zone. The kubelet reports available resources at per-physical-node granularity, which is too coarse grained.
To overcome this limitation, a new Pod Resource API was added: GetAllocatableResources.
This API complements the List API and allows the consumers (the scheduler) to track allocated and allocatable resources on a per-NUMA zone basis.
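Combining the two calls is straightforward: GetAllocatableResources gives per-zone capacity, List gives current allocations, and their difference is the per-zone availability the scheduler needs. Here is a minimal sketch with illustrative dictionaries standing in for the real gRPC responses (resource names and counts are invented for the example).

```python
# Illustrative stand-ins: zone -> resource -> count
allocatable = {  # from GetAllocatableResources: total capacity per zone
    0: {"vendor.com/nic": 2, "cpu": 8},
    1: {"vendor.com/gpu": 4, "cpu": 8},
}
allocated = {  # aggregated from List: what is currently assigned
    0: {"vendor.com/nic": 1, "cpu": 2},
    1: {"vendor.com/gpu": 1, "cpu": 2},
}

def free_per_zone(allocatable, allocated):
    """Per-zone availability = allocatable minus allocated."""
    return {
        zone: {res: total - allocated.get(zone, {}).get(res, 0)
               for res, total in resources.items()}
        for zone, resources in allocatable.items()
    }

print(free_per_zone(allocatable, allocated))
# {0: {'vendor.com/nic': 1, 'cpu': 6}, 1: {'vendor.com/gpu': 3, 'cpu': 6}}
```

This per-zone view is exactly what the node-level reporting cannot provide on its own: a whole-node count of 2 free NICs says nothing about whether they share a zone with the free CPUs.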
Even with these important additions, the API is still not ideal because both List and GetAllocatableResources require the monitoring application to poll the kubelet.
If the monitoring application polls too slowly, the scheduler is likely to act on stale information; on the other hand, if the monitoring application polls very frequently, it adds extra load on the kubelet and on the system in general.
To further improve this, another extension to Pod Resources API is being developed.
The idea is to add Watch endpoints, which will report a stream of events to the monitoring application when resource allocation changes (for example, when pods are created or deleted) or if the resource availability changes (for example, if new device plug-ins are added or deleted). This further extension is planned to be submitted during 2021.
The expected flow to consume these APIs is as follows.
Polling approach for applications that do not need to react quickly to allocation changes:
- Connect to podresources endpoint
- Initial resource assessment: call GetAllocatableResources and List to learn about the resources available on this node.
- Loop forever:
- Optionally, call GetAllocatableResources to fully reconcile the resource state. This is optional because for some applications, the initial GetAllocatableResources call and proper resource tracking using List may be sufficient.
- Call List to learn about the current resource allocation
- Perform the business logic comparing the available resources and the allocated resources
- Sleep as needed
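The polling steps above can be sketched as a small control loop. The two RPCs are replaced by injected callables (hypothetical stand-ins for the real gRPC stubs), so the flow is visible without a live kubelet; the periodic full reconcile is the optional GetAllocatableResources call from the list above.

```python
import time

def poll_loop(get_allocatable, list_pods, handle, iterations,
              interval=0.0, reconcile_every=10):
    """One polling consumer: stubs stand in for the gRPC calls."""
    allocatable = get_allocatable()  # initial resource assessment
    for i in range(iterations):
        if i and i % reconcile_every == 0:
            allocatable = get_allocatable()  # optional full reconcile
        allocated = list_pods()              # current resource allocation
        handle(allocatable, allocated)       # business logic
        time.sleep(interval)                 # sleep as needed

# Stubbed usage: track free CPUs seen on each iteration.
seen = []
poll_loop(lambda: {"cpu": 8}, lambda: {"cpu": 2},
          lambda cap, used: seen.append(cap["cpu"] - used["cpu"]),
          iterations=3)
print(seen)  # [6, 6, 6]
```

The `interval` and `reconcile_every` knobs are exactly the trade-off discussed earlier: longer intervals mean staler data, shorter ones mean more load on the kubelet.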
Event-based approach for applications that do need to react quickly to allocation changes:
- Connect to podresources endpoint
- Initial resource assessment: call GetAllocatableResources and List to learn about the resources available on this node.
- Register by calling Watch (and optionally WatchAllocatable) to subscribe to allocation changes
- Wait forever:
- Both Watch endpoints provide events reflecting changes in resource allocation or availability (for example, a new device plug-in being registered)
- The APIs provide enough data to intelligently and deliberately reconcile the information coming from Watch streams with what was provided from GetAllocatableResources and List.
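The event-based flow boils down to updating a cached allocation map incrementally instead of rebuilding it by polling. Since the Watch extension is still being developed, the event shape below is a hypothetical sketch, not the real wire format; it only illustrates the reconciliation idea.

```python
def apply_event(state, event):
    """Update a cached per-pod allocation map from one Watch-style event."""
    if event["kind"] == "pod_added":
        state[event["pod"]] = event["cpu_ids"]
    elif event["kind"] == "pod_deleted":
        state.pop(event["pod"], None)
    return state

# Seed state would come from the initial GetAllocatableResources + List;
# subsequent events keep it current without re-polling.
state = {}
events = [
    {"kind": "pod_added", "pod": "dpdk-app", "cpu_ids": [2, 3]},
    {"kind": "pod_added", "pod": "gpu-app", "cpu_ids": [10, 11]},
    {"kind": "pod_deleted", "pod": "dpdk-app"},
]
for ev in events:
    apply_event(state, ev)
print(state)  # {'gpu-app': [10, 11]}
```

The cached state stays consistent as long as every event is applied in order; a periodic List call can still be used as a safety net to catch missed events.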
Should you want to learn more about the pod resources API and the changes proposed for topology-aware scheduling, you can start from the KEPs:
- The pod resources KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/606-compute-device-assignment
- The topology-aware scheduling extensions KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments
- The GetAllocatableResources addition KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2403-pod-resources-allocatable-resources
Other References: Survey of Resource Management in Kubernetes for Performance Critical Workloads