This post was written by: Swati Sehgal, Alexey Perevalov, Killian Muldoon & Francesco Romani. This is part 3; here are Part 1 and Part 2.

Topology Aware Scheduling is all about enhancing the hardware awareness of Kubernetes at the cluster control plane level. The hard details, like everything else related to the infrastructure underlying a cluster, are held by the kubelet.

Topology Aware Scheduling will bring changes to the way kubelet exposes this information to make hardware aware extensions easier to develop and consistent with Kubernetes’ understanding of the infrastructure it is managing. To do all this, we are relying on the Pod Resources API.

The Kubernetes pod resources API is a kubelet API, introduced in Kubernetes 1.13, that enables monitoring applications to track the resource allocation of pods.

The service that enables the API is implemented as a gRPC service listening on a Unix domain socket, returning information about the kubelet's assignment of devices to containers.

The API as originally introduced offered a single call, List, to enumerate all the pods and learn about their resource assignment, which fits the proposed use case for monitoring applications.

A few months later, an effort to introduce more topology-aware scheduling began. For this, the scheduler needs to learn more detailed information about node resource availability and allocation.

Topology-aware scheduling aims to provide a framework for generic topology-aware resource allocation, but the first step is solving the simpler case of NUMA-aware allocation. Throughout this post, we use "topology zone" interchangeably with the more familiar term "NUMA zone" (or "NUMA cell" or "NUMA node," which we consider synonymous).

The kubelet is the source of truth with respect to the node resource state; thus, extracting the resource state information from the kubelet quickly emerged as the right approach.

To implement this resource reporting, a few approaches were discussed, including introducing a new API. But we realized it is possible to generalize the pod resources API to export all the information topology-aware scheduling requires while remaining true to the spirit of the pod resources API itself.

To report enough data to enable topology-aware scheduling, the API needs to export more information than it currently does.

When listing pod resources allocation:

  • We need to attach topology information to all the allocated devices to properly account for the resource availability of the topology zones on the node.
  • In the case of exclusive CPU allocation, we need to expose which CPUs are exclusively allocated to containers in the pods. This is a bit of a special case, considering that all other resources that have topology information fit into the more generic "resource" reporting (for example, devices and memory).
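To make the shape of this extended List data concrete, the sketch below uses simplified plain Go structs that mirror the pod resources API messages (the real types are generated from the protobufs in `k8s.io/kubelet/pkg/apis/podresources/v1`; field names here are illustrative, not the generated client types), plus a small accounting helper of the kind a consumer might write:

```go
package main

import "fmt"

// Simplified mirrors of the v1 pod resources API messages.
type NUMANode struct{ ID int64 }

type TopologyInfo struct{ Nodes []NUMANode }

type ContainerDevices struct {
	ResourceName string
	DeviceIDs    []string
	Topology     TopologyInfo // per-device topology information
}

type ContainerResources struct {
	Name    string
	Devices []ContainerDevices
	CPUIDs  []int64 // exclusively allocated CPUs, if any
}

// devicesByNUMAZone groups allocated device IDs by the NUMA zone they
// are attached to -- the accounting the scheduler needs per zone.
func devicesByNUMAZone(containers []ContainerResources) map[int64][]string {
	byZone := map[int64][]string{}
	for _, c := range containers {
		for _, d := range c.Devices {
			for _, n := range d.Topology.Nodes {
				byZone[n.ID] = append(byZone[n.ID], d.DeviceIDs...)
			}
		}
	}
	return byZone
}

func main() {
	allocated := []ContainerResources{{
		Name: "app",
		Devices: []ContainerDevices{{
			ResourceName: "vendor.com/nic", // hypothetical device plug-in resource
			DeviceIDs:    []string{"dev0"},
			Topology:     TopologyInfo{Nodes: []NUMANode{{ID: 0}}},
		}},
		CPUIDs: []int64{2, 3},
	}}
	fmt.Println(devicesByNUMAZone(allocated)) // map[0:[dev0]]
}
```

Note how exclusive CPUs travel as a separate `CPUIDs` list on the container, while every other topology-aware resource flows through the generic device reporting.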

To make proper placement decisions, the scheduler needs to know about the available resources in each NUMA zone. The kubelet reports available resources at whole-node granularity, which is too coarse-grained.

To overcome this limitation, a new Pod Resources API endpoint was added: GetAllocatableResources.

This endpoint complements the List API and allows consumers (such as the scheduler) to track allocated and allocatable resources on a per-NUMA-zone basis.
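A sketch of how a consumer might combine the two endpoints: subtract the per-zone allocations derived from List from the per-zone capacity derived from GetAllocatableResources. The `ZoneResources` type and per-resource counting here are a simplification of the real accounting, not the API's wire format:

```go
package main

import "fmt"

// ZoneResources maps NUMA zone ID -> resource name -> count.
type ZoneResources map[int64]map[string]int

// add records n units of a resource in a zone, allocating the
// inner map lazily.
func add(zr ZoneResources, zone int64, resource string, n int) {
	if zr[zone] == nil {
		zr[zone] = map[string]int{}
	}
	zr[zone][resource] += n
}

// free computes per-zone availability: allocatable minus allocated.
// Reading a missing zone/resource yields zero, so zones with no
// allocations just report full capacity.
func free(allocatable, allocated ZoneResources) ZoneResources {
	out := ZoneResources{}
	for zone, resources := range allocatable {
		for name, capacity := range resources {
			add(out, zone, name, capacity-allocated[zone][name])
		}
	}
	return out
}

func main() {
	// Capacity, as GetAllocatableResources would report it.
	allocatable := ZoneResources{}
	add(allocatable, 0, "vendor.com/nic", 2)
	add(allocatable, 1, "vendor.com/nic", 2)

	// Current allocations, as derived from List.
	allocated := ZoneResources{}
	add(allocated, 0, "vendor.com/nic", 1)

	// Zone 0 has 1 NIC free, zone 1 has 2.
	fmt.Println(free(allocatable, allocated))
}
```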

Even with these important additions, the API is still not ideal because both List and GetAllocatableResources require the monitoring application to poll the kubelet.

If the monitoring application's polling loop is too slow, the scheduler likely gets stale information; on the other hand, if it polls very frequently, it adds extra load to the kubelet and to the system in general.

To further improve this, another extension to the Pod Resources API is being developed.

The idea is to add Watch endpoints, which will report a stream of events to the monitoring application when resource allocation changes (for example, when pods are created or deleted) or if the resource availability changes (for example, if new device plug-ins are added or deleted). This further extension is planned to be submitted during 2021.

The expected flow to consume these APIs is as follows.

Polling approach for applications that do not need to react quickly to allocation changes:

  1. Connect to podresources endpoint
  2. Initial resource assessment: call GetAllocatableResources and List to learn about the resources available on this node.
  3. Loop forever:
    1. Optionally, call GetAllocatableResources to fully reconcile the resource state. This is optional because for some applications, the initial GetAllocatableResources call and proper resource tracking using List may be sufficient.
    2. Call List to learn about the current resource allocation
    3. Perform the business logic comparing the available resources and the allocated resources
    4. Sleep as needed
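The polling flow above could be wired up roughly as follows. `PodResourcesClient`, `Snapshot` and `pollOnce` are hypothetical names standing in for the generated gRPC client and its responses; the real client is dialed over the kubelet's Unix domain socket (by default `/var/lib/kubelet/pod-resources/kubelet.sock`) and its methods take a context and return protobuf messages:

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot is a hypothetical summary of one polling iteration,
// counting resources by name.
type Snapshot struct {
	Allocatable map[string]int
	Allocated   map[string]int
}

// PodResourcesClient is a stand-in for the generated gRPC client.
type PodResourcesClient interface {
	GetAllocatableResources() (map[string]int, error)
	List() (map[string]int, error)
}

// pollOnce performs one iteration of the polling loop: re-read the
// allocatable state (optional in the real flow), then List the
// current allocations.
func pollOnce(c PodResourcesClient) (Snapshot, error) {
	allocatable, err := c.GetAllocatableResources()
	if err != nil {
		return Snapshot{}, err
	}
	allocated, err := c.List()
	if err != nil {
		return Snapshot{}, err
	}
	return Snapshot{Allocatable: allocatable, Allocated: allocated}, nil
}

// fakeClient lets us exercise the loop without a kubelet.
type fakeClient struct{}

func (fakeClient) GetAllocatableResources() (map[string]int, error) {
	return map[string]int{"vendor.com/nic": 2}, nil
}
func (fakeClient) List() (map[string]int, error) {
	return map[string]int{"vendor.com/nic": 1}, nil
}

func main() {
	c := fakeClient{}
	for i := 0; i < 2; i++ { // "loop forever", shortened for the example
		snap, err := pollOnce(c)
		if err != nil {
			panic(err)
		}
		// Business logic: compare available vs. allocated.
		fmt.Printf("free NICs: %d\n",
			snap.Allocatable["vendor.com/nic"]-snap.Allocated["vendor.com/nic"])
		time.Sleep(10 * time.Millisecond) // "sleep as needed"
	}
}
```

Programming against a small interface like this also makes the reconciliation logic testable without a live kubelet.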

Event-based approach for applications that do need to react quickly to allocation changes:

  1. Connect to podresources endpoint
  2. Initial resource assessment: call GetAllocatableResources and List to learn about the resources available on this node.
  3. Register by calling Watch (and optionally WatchAllocatable) to subscribe to allocation changes
  4. Wait forever:
    1. Both Watch endpoints provide events reflecting changes in resource allocation or availability (for example, a new device plug-in registering)
    2. The APIs provide enough data to intelligently and deliberately reconcile the information coming from Watch streams with what was provided from GetAllocatableResources and List.
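Since the Watch endpoints are still being designed, the event schema is not settled; as a purely illustrative sketch, a consumer might fold hypothetical allocation events into the state it seeded from the initial List call like this (`AllocationEvent` and its fields are invented for illustration, not the KEP's actual types):

```go
package main

import "fmt"

// AllocationEvent is a hypothetical event shape; the real Watch
// schema is still under design.
type AllocationEvent struct {
	Kind         string // "added" or "deleted"
	ResourceName string
	Count        int
}

// apply folds one event into the running allocation state that was
// seeded from the initial List call.
func apply(allocated map[string]int, ev AllocationEvent) {
	switch ev.Kind {
	case "added":
		allocated[ev.ResourceName] += ev.Count
	case "deleted":
		allocated[ev.ResourceName] -= ev.Count
	}
}

func main() {
	// State seeded from the initial List call.
	allocated := map[string]int{"vendor.com/nic": 1}

	// Events as they might arrive from a Watch stream.
	events := []AllocationEvent{
		{Kind: "added", ResourceName: "vendor.com/nic", Count: 1},
		{Kind: "deleted", ResourceName: "vendor.com/nic", Count: 2},
	}
	for _, ev := range events {
		apply(allocated, ev)
	}
	fmt.Println(allocated["vendor.com/nic"]) // 0
}
```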

Should you want to learn more about the pod resources API and the changes proposed for topology-aware scheduling, you can start from the KEPs.

Other References: Survey of Resource Management in Kubernetes for Performance Critical Workloads

 

