Today, we are contributing llm-d to the Cloud Native Computing Foundation (CNCF) as a Sandbox project.

This isn't just a hand-off of code. It’s a commitment to making high-performance AI serving a core, portable capability of the cloud-native stack. When we launched llm-d in May 2025, we set out to close the massive gap between AI experimentation and mission-critical production inference at scale. By moving llm-d into the CNCF, we’re expanding the reach of a multi-vendor coalition—including CoreWeave, IBM, Google, and NVIDIA—to build the open standard for distributed inference.

Inference powers the agentic era

As we enter an agentic future, the AI inference underpinning vast domains of enterprise agents is poised to expand dramatically. It will become critical that the cost and complexity of inference don’t outweigh the business value of the agents themselves. But inference can be incredibly expensive, consuming vast amounts of specialized accelerator capacity, and at scale those costs soar further. The advanced capabilities of llm-d directly address this, delivering against enterprise service level objectives while maximizing infrastructure efficiency. Moreover, organizations need the flexibility to deploy inference wherever it makes sense—data center, cloud, or edge—on their choice of hardware. This flexibility is only possible if the underlying ecosystem is built on open source and open standards.

Bridging the gap in the cloud-native landscape

While Kubernetes is the industry standard for orchestration, it wasn't originally built for the unique, stateful demands of large language model (LLM) inference. In a traditional microservice, a request is a request – each replica can process each one equally well. In generative AI, the cost of a request varies wildly based on prompt and output token lengths, model size and architecture, cache locality, and whether the model is in the prefill (compute-bound) or decode (memory-bound) phase.
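
To make that cost variance concrete, here is a rough back-of-envelope sketch in Python. The throughput constants and the simple additive model are illustrative assumptions, not benchmarks of any engine; the point is only that prefill cost scales with prompt length while decode cost scales with output length, so two requests hitting the same replica can differ in cost by an order of magnitude.

    # Rough, illustrative estimate of how per-request cost diverges for an LLM.
    # The throughput constants are assumptions for illustration, not benchmarks.

    def estimate_request_cost(prompt_tokens: int, output_tokens: int,
                              prefill_tokens_per_sec: float = 10_000.0,
                              decode_tokens_per_sec: float = 50.0) -> float:
        """Return a naive estimate of GPU-seconds for one request.

        Prefill processes the whole prompt in parallel (compute-bound), while
        decode generates output tokens one at a time (memory-bound), so the two
        phases scale very differently with prompt and output length.
        """
        prefill_seconds = prompt_tokens / prefill_tokens_per_sec
        decode_seconds = output_tokens / decode_tokens_per_sec
        return prefill_seconds + decode_seconds

    if __name__ == "__main__":
        # A short chat turn vs. a long summarization request on the same replica.
        print(estimate_request_cost(prompt_tokens=200, output_tokens=50))      # ~1.0 GPU-second
        print(estimate_request_cost(prompt_tokens=8_000, output_tokens=1_000)) # ~20.8 GPU-seconds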

Standard service routing is blind to these dynamics, which leads to inefficient placement and unpredictable latency. This is where llm-d bridges the gap. It functions as a specialized data-plane orchestration layer between high-level control planes like KServe and low-level engines like vLLM. Using Kubernetes-native primitives like Gateway API and LeaderWorkerSet (LWS), it transforms complex distributed inference into a manageable, observable cloud-native workload.

Strengthening the ecosystem through contribution

By contributing llm-d to the CNCF, we’re establishing well-lit paths—proven, replicable blueprints that turn fragmented AI components into modular, interoperable microservices. This contribution is about more than a single project; it's about enriching the entire cloud-native landscape so that inference becomes a first-class citizen of the same environment as traditional container-based applications.

A central part of this work is the endpoint picker (EPP). llm-d serves as a primary implementation of the Kubernetes Gateway API Inference Extension (GAIE), and the EPP enables programmable, inference-aware routing. This means the system makes routing decisions based on the actual state of the engine—optimizing for KV cache hit rates and hardware accelerator characteristics. This is a fundamental requirement for sustaining throughput under strict service level objectives.
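
To make inference-aware endpoint picking concrete, here is a minimal sketch of the kind of scoring an EPP-style picker might apply. The endpoint fields, weights, and formula are illustrative assumptions, not llm-d's actual implementation; they simply show how queue depth, KV cache pressure, and prefix cache locality can be folded into a single routing decision.

    # Minimal sketch of inference-aware endpoint picking. The fields, weights,
    # and formula are illustrative assumptions, not llm-d's actual scorer.
    from dataclasses import dataclass

    @dataclass
    class Endpoint:
        name: str
        queue_depth: int             # requests waiting at this vLLM replica
        kv_cache_utilization: float  # fraction of KV cache in use, 0.0-1.0
        prefix_cache_hit: bool       # replica already holds this prompt's prefix

    def score(ep: Endpoint) -> float:
        """Higher is better: prefer warm prefix caches, short queues,
        and headroom in the KV cache."""
        s = 2.0 if ep.prefix_cache_hit else 0.0  # reuse cached prefill work
        s -= 0.1 * ep.queue_depth                # long queues hurt TTFT SLOs
        s -= 1.0 * ep.kv_cache_utilization       # avoid replicas near cache pressure
        return s

    def pick(endpoints: list[Endpoint]) -> Endpoint:
        return max(endpoints, key=score)

    if __name__ == "__main__":
        pool = [
            Endpoint("pod-a", queue_depth=4, kv_cache_utilization=0.9, prefix_cache_hit=False),
            Endpoint("pod-b", queue_depth=6, kv_cache_utilization=0.4, prefix_cache_hit=True),
        ]
        print(pick(pool).name)  # pod-b: the warm prefix cache outweighs its longer queue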

llm-d complements and extends the existing landscape within the CNCF:

  • Kubernetes: Provides the primary infrastructure platform for AI workloads.
  • Gateway API: Drives upstream alignment for AI-specific routing, ensuring that traffic management stays a core open component.
  • KServe: Acts as the high-level control plane that integrates with llm-d to support advanced features like disaggregated serving and prefix caching.
  • LeaderWorkerSet: Uses Kubernetes-native primitives to orchestrate complex multi-node replicas and expert parallelism, transforming engines like vLLM into manageable cloud-native workloads.
  • Prometheus & Grafana: Export specialized metrics like time to first token (TTFT) to bring enterprise-grade observability to generative AI (a minimal export sketch follows this list).
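
As a small illustration of the observability point above, the following sketch exports a time-to-first-token histogram with the Python prometheus_client library. The metric name, buckets, and request handler are assumptions made for this example, not llm-d's actual metric definitions.

    # Sketch: exporting a time-to-first-token (TTFT) histogram with the Python
    # prometheus_client library. The metric name, buckets, and handler are
    # assumptions made for this example, not llm-d's actual metric definitions.
    import time
    from prometheus_client import Histogram, start_http_server

    TTFT_SECONDS = Histogram(
        "llm_ttft_seconds",  # hypothetical metric name
        "Time from request arrival to the first generated token",
        buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
    )

    def handle_request(generate_first_token) -> None:
        start = time.monotonic()
        generate_first_token()  # stand-in for real prefill plus the first decode step
        TTFT_SECONDS.observe(time.monotonic() - start)

    if __name__ == "__main__":
        start_http_server(9100)  # exposes /metrics for Prometheus to scrape
        while True:
            handle_request(lambda: time.sleep(0.2))  # simulate a 200 ms TTFT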

Scaling the future of inference together

Collaboration has been at the core of llm-d from its inception. When we announced llm-d last year at Red Hat Summit, the joint efforts of the project’s founding contributors, industry leaders, and academic supporters were a point of pride for Red Hat – not only for launching llm-d, but also for establishing a future-ready, collaborative foundation. In the 10 months since, llm-d has been adopted both for enterprise private model-as-a-service (MaaS) deployments and for large-scale AI initiatives. More importantly, the project’s open roots continue to deepen with a growing ecosystem of contributors and partners. Developers and companies are putting their trust in llm-d, and contributing the project to the CNCF helps ensure that future stays open. The road to successful, open source AI innovation is long, but together we’re building the infrastructure to get there.


About the author

Brian Stevens is Red Hat's Senior Vice President and Chief Technology Officer (CTO) for AI, where he drives the company's vision for an open, hybrid AI future. His work empowers enterprises to build and deploy intelligent applications anywhere, from the datacenter to the edge. As Red Hat’s CTO of Engineering (2001-2014), Brian was central to the company’s initial growth and the expansion of its portfolio into cloud, middleware, and virtualization technologies.

After helping scale Google Cloud as its VP and CTO, Brian’s passion for transformative technology led him to become CEO of Neural Magic, a pioneer in software-based AI acceleration. Red Hat’s strategic acquisition of Neural Magic in 2025 brought Brian back to the company, uniting his leadership with Red Hat's mission to make open source the foundation for the AI era.
