In this session, we’ll introduce llm-d, a new Kubernetes-native framework for distributed LLM inference, co-designed with the Inference Gateway (IGW) and built on vLLM. Learn how llm-d simplifies horizontally scaling LLMs across multiple GPUs and nodes, supports efficient model sharding and routing, and enables dynamic workload distribution. We’ll walk through its architecture, how it integrates with vLLM, and what makes it ideal for production-scale AI systems. Join us to explore how llm-d unlocks the next level of LLM serving performance and flexibility.
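For a feel of what serving through such a stack looks like from the client side, here is a minimal sketch. It assumes an llm-d deployment whose vLLM backends are reached through an OpenAI-compatible endpoint exposed by the gateway; the gateway address and model name below are placeholders, not values from this session.

```python
# Minimal sketch: sending a request to a model served behind an llm-d / Inference
# Gateway deployment. The vLLM backends speak the OpenAI-compatible API, so a
# standard OpenAI client pointed at the gateway endpoint is enough.
from openai import OpenAI

client = OpenAI(
    base_url="http://<gateway-address>/v1",  # placeholder: your gateway / llm-d endpoint
    api_key="EMPTY",  # vLLM's OpenAI-compatible server typically ignores the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: whichever model you deployed
    messages=[{"role": "user", "content": "Summarize what llm-d does in one sentence."}],
)
print(response.choices[0].message.content)
```

The point of the sketch is that scaling decisions (sharding, routing, workload distribution) happen behind the gateway, so client code stays unchanged as the deployment grows.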
Agenda
| Time | Session |
|---|---|
| 2:00 - 3:00 | vLLM Office Hours #27: Intro to llm-d for scaling LLM inference on Kubernetes |
Speakers
- Michael Goin, vLLM Committer and Principal Software Engineer, Red Hat
- Robert Shaw, vLLM Committer and Director of Engineering, Red Hat