Why we’re contributing llm-d to the CNCF: Standardizing the future of AI

March 24, 2026 | Brian Stevens | 3-minute read

Today, we are contributing llm-d to the Cloud Native Computing Foundation (CNCF) as a Sandbox project.

This isn't just a hand-off of code. It's a commitment to making high-performance AI serving a core, portable capability of the cloud-native stack. When we launched llm-d in May 2025, we set out to close the wide capability gap between AI experimentation and mission-critical production inference at scale. By moving llm-d into the CNCF, we're widening the scope of a multi-vendor coalition, including CoreWeave, IBM, Google, and NVIDIA, to build the open standard for distributed inference.

Inference powers the agentic era

As we enter an agentic future, the volume of AI inference behind enterprise agents is poised to grow dramatically. It will be critical that the cost and complexity of inference don't outweigh the business value of the agents themselves. Inference can be very expensive: it consumes large numbers of specialized accelerators, and costs climb further at scale. llm-d's advanced capabilities address this directly, meeting enterprise service level objectives (SLOs) while maximizing infrastructure efficiency. Moreover, organizations need the flexibility to deploy inference wherever it makes sense (data center, cloud, or edge) on their choice of hardware. This flexibility is only possible if the underlying ecosystem is built on open source and open standards.

Bridging the gap in the cloud-native landscape

While Kubernetes is the industry standard for orchestration, it wasn't originally built for the unique, stateful demands of large language model (LLM) inference. In a traditional microservice, a request is a request – each replica can process each one equally well. In generative AI, the cost of a request varies wildly based on prompt and output token lengths, model size and architecture, cache locality, and whether the model is in the prefill (compute-bound) or decode (memory-bound) phase.
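To make the variation concrete, here is a back-of-envelope sketch. It is not taken from llm-d; the `Request` class, the function names, and the 8-billion-parameter model are all hypothetical. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, ignores attention cost, and assumes that at batch size 1 the model weights are read once per generated token.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical description of one inference request."""
    prompt_tokens: int
    output_tokens: int
    cached_prompt_tokens: int = 0  # prompt tokens already held in a replica's KV cache


def prefill_flops(req: Request, params: float) -> float:
    """Rough prefill compute: ~2 FLOPs per parameter per uncached prompt token."""
    uncached = max(req.prompt_tokens - req.cached_prompt_tokens, 0)
    return 2.0 * params * uncached


def decode_bytes_read(req: Request, params: float, bytes_per_param: float = 2.0) -> float:
    """Rough decode memory traffic: weights are re-read for each generated token."""
    return params * bytes_per_param * req.output_tokens


PARAMS = 8e9  # assumed model size, for illustration only

chat_turn = Request(prompt_tokens=50, output_tokens=20)
long_doc = Request(prompt_tokens=8000, output_tokens=500)
long_doc_warm = Request(prompt_tokens=8000, output_tokens=500, cached_prompt_tokens=7500)

for name, req in [("chat_turn", chat_turn), ("long_doc", long_doc), ("long_doc_warm", long_doc_warm)]:
    print(name, f"prefill={prefill_flops(req, PARAMS):.2e} FLOPs",
          f"decode={decode_bytes_read(req, PARAMS):.2e} bytes")
```

Even under these simplifications, the long request needs 160 times the prefill compute of the short one, and a warm cache removes most of that work. A router that treats both requests as identical cannot account for either effect.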

Standard service routing is blind to these dynamics, which leads to inefficient placement and unpredictable latency. This is where llm-d bridges the gap. It functions as a specialized data-plane orchestration layer between high-level control planes like KServe and low-level engines like vLLM. Using Kubernetes-native primitives like Gateway API and LeaderWorkerSet (LWS), it transforms complex distributed inference into a manageable, observable cloud-native workload.

Strengthening the ecosystem through contribution

By contributing llm-d to the CNCF, we’re establishing well-lit paths—proven, replicable blueprints that turn fragmented AI components into modular, interoperable microservices. This contribution is about more than a single project; it's about enriching the entire cloud-native landscape so that inference becomes a first-class citizen of the same environment as traditional container-based applications.

A central part of this work is the endpoint picker (EPP). llm-d acts as a primary implementation of the Kubernetes Gateway API Inference Extension (GAIE), and the EPP allows for programmable, inference-aware routing. This means the system makes routing decisions based on the actual state of the engine, optimizing for KV cache hit rates and hardware accelerator characteristics. This is a fundamental requirement for maintaining sustained throughput under strict service level objectives.
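The idea of inference-aware routing can be sketched in a few lines. This is an illustration only, not the EPP's actual interface or scoring logic: the `Endpoint` class, the string-prefix cache model, and the weights are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical view of one model replica as seen by a router."""
    name: str
    queue_depth: int  # requests currently waiting on this replica
    cached_prefixes: set[str] = field(default_factory=set)  # stand-in for KV cache contents


def prefix_hit_ratio(prompt: str, endpoint: Endpoint) -> float:
    """Fraction of the prompt covered by the longest cached prefix on this endpoint."""
    if not prompt:
        return 0.0
    best = max((len(p) for p in endpoint.cached_prefixes if prompt.startswith(p)), default=0)
    return best / len(prompt)


def pick_endpoint(prompt: str, endpoints: list[Endpoint],
                  cache_weight: float = 1.0, load_weight: float = 0.1) -> Endpoint:
    """Prefer replicas with a warm cache, penalize replicas with long queues."""
    def score(ep: Endpoint) -> float:
        return cache_weight * prefix_hit_ratio(prompt, ep) - load_weight * ep.queue_depth
    return max(endpoints, key=score)


system_prompt = "You are a helpful assistant."
prompt = system_prompt + " Summarize this."

warm = Endpoint("replica-a", queue_depth=1, cached_prefixes={system_prompt})
cold = Endpoint("replica-b", queue_depth=0)
print(pick_endpoint(prompt, [warm, cold]).name)  # replica-a: the cache hit outweighs one queued request

warm.queue_depth = 10
print(pick_endpoint(prompt, [warm, cold]).name)  # replica-b: the warm replica is now overloaded
```

A plain round-robin router would ignore both signals. The point of a programmable picker is that the scoring function, whatever form it takes in practice, can see engine state that a generic load balancer cannot.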

llm-d complements and extends the existing landscape within the CNCF:

Kubernetes: Provides the primary infrastructure platform for AI workloads.
Gateway API: Drives upstream alignment for AI-specific routing, ensuring that traffic management stays a core open component.
KServe: Acts as the high-level control plane that integrates with llm-d to support advanced features like disaggregated serving and prefix caching.
LeaderWorkerSet: Uses Kubernetes-native primitives to orchestrate complex multi-node replicas and expert parallelism, transforming engines like vLLM into manageable cloud-native workloads.
Prometheus & Grafana: Exports specialized metrics like time to first token (TTFT) to bring enterprise-grade observability to generative AI.
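As a rough illustration of what a metric like TTFT captures, the sketch below times a stream of tokens. It does not reflect how llm-d or vLLM instrument their engines; `measure_stream` and the dictionary keys are hypothetical names, and the token source is any Python iterable standing in for a streaming response.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report latency figures for it."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the moment the first token arrives
        count += 1
    end = clock()

    ttft = None if first_token_at is None else first_token_at - start
    # Average gap between tokens after the first one, if there are any.
    per_token = (end - first_token_at) / (count - 1) if count > 1 else None
    return {
        "ttft_seconds": ttft,
        "time_per_output_token_seconds": per_token,
        "total_seconds": end - start,
        "tokens": count,
    }


def fake_stream():
    """Simulated response: slow first token (prefill), then faster decode steps."""
    time.sleep(0.05)
    yield "Hello"
    for word in [",", " world", "!"]:
        time.sleep(0.01)
        yield word


print(measure_stream(fake_stream()))
```

TTFT is dominated by queueing and prefill, while the per-token figure reflects decode speed, which is why the two are usually tracked as separate metrics rather than folded into a single request latency.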

Scaling the future of inference together

Collaboration has been at the core of llm-d from its inception. When we announced llm-d last year at Red Hat Summit, the joint efforts of the project's founding contributors, industry leaders, and academic supporters were a point of pride for Red Hat – not only for launching llm-d, but also for establishing a future-ready, collaborative foundation. In the 10 months since, llm-d has been adopted for both private enterprise models-as-a-service (MaaS) deployments and large-scale AI initiatives. More importantly, the project's open roots continue to deepen with a growing ecosystem of contributors and partners. Developers and companies are putting their trust in llm-d, and contributing the project to the CNCF will help keep its future open. The road to successful, open source AI innovation is long, but together we're building the infrastructure to get there.

About the author

Brian Stevens

SVP and AI CTO

Brian Stevens is Red Hat's Senior Vice President and Chief Technology Officer (CTO) for AI, where he drives the company's vision for an open, hybrid AI future. His work empowers enterprises to build and deploy intelligent applications anywhere, from the datacenter to the edge. As Red Hat’s CTO of Engineering (2001-2014), Brian was central to the company’s initial growth and the expansion of its portfolio into cloud, middleware, and virtualization technologies.

After helping scale Google Cloud as its VP and CTO, Brian’s passion for transformative technology led him to become CEO of Neural Magic, a pioneer in software-based AI acceleration. Red Hat’s strategic acquisition of Neural Magic in 2025 brought Brian back to the company, uniting his leadership with Red Hat's mission to make open source the foundation for the AI era.
