Your large language model (LLM) proof of concept (PoC) was a success. Now what? The jump from a single server to production-grade, distributed AI inference is where most enterprises hit a wall. The infrastructure that got you this far just can't keep up.
As discussed in a recent episode of the Technically Speaking podcast, most organizations' AI journeys and PoCs begin with deploying a model on a single server, which is a manageable task. The next step, however, often requires a massive leap to distributed, production-grade AI inference. This is not simply a matter of adding more machines. We believe it requires a new kind of intelligence within the infrastructure itself: an AI-aware control plane that can help manage the complexity of these unique and dynamic workloads.
The new challenge: Distributed AI inference
Deploying LLMs at scale introduces a set of challenges that traditional infrastructure isn't designed to handle. A standard web server, for example, processes uniform requests. In contrast, an AI inference request can be unpredictable and resource-intensive, with variable demands on compute, memory, and networking.
Think of it like modern logistics. Moving a small package from one city to another is straightforward. But coordinating a global supply chain requires intelligent logistics management—a system that can track thousands of shipments, dynamically route different types of cargo, and tweak scheduling so everything arrives on time. Without that intelligence and careful coordination, the entire system breaks down. Similarly, without an intelligent infrastructure layer, scaling AI becomes inefficient, costly, and unreliable.
The complexity of these workloads is tied to the prefill and decode phases of LLM inference. The prefill phase processes the entire input prompt at once and is a compute-heavy task, while the decode phase generates the output tokens one at a time and is more dependent on memory bandwidth.
Most single-server deployments colocate these two phases on the same hardware, which can create bottlenecks and lead to poor performance, especially for high-volume workloads with a variety of request patterns. The real challenge is to optimize both time-to-first-token (from the prefill phase) and inter-token latency (from the decode phase) so the system maximizes throughput, handles as many concurrent requests as possible, and, critically for enterprise use, consistently meets defined service level objectives (SLOs).
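To make that trade-off concrete, here is a minimal Python sketch of how time-to-first-token and inter-token latency add up to a request's total latency. The token rates are illustrative assumptions, not benchmarks; real values depend on the model, hardware, and batch size.

```python
# A rough, illustrative latency model for a single LLM inference request.
# The rates below are assumptions for the sake of the example.

def estimate_request_latency(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float = 8000.0,  # prefill is roughly compute-bound
    decode_tokens_per_s: float = 50.0,     # decode is roughly memory-bandwidth-bound
) -> dict:
    # Prefill processes the whole prompt at once, so time-to-first-token (TTFT)
    # grows with prompt length and the accelerator's compute throughput.
    ttft = prompt_tokens / prefill_tokens_per_s

    # Decode emits one token at a time; each step re-reads model weights and the
    # KV cache, so inter-token latency (ITL) is set largely by memory bandwidth.
    itl = 1.0 / decode_tokens_per_s

    return {"ttft_s": ttft, "itl_s": itl, "total_s": ttft + output_tokens * itl}


if __name__ == "__main__":
    # A long prompt with a short answer is prefill-heavy...
    print(estimate_request_latency(prompt_tokens=4000, output_tokens=50))
    # ...while a long generated response is decode-heavy.
    print(estimate_request_latency(prompt_tokens=200, output_tokens=800))
```

Even this toy model shows why a single hardware profile rarely suits both phases: prefill-heavy traffic wants more compute, while decode-heavy traffic wants more memory bandwidth.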
A shared vision for a shared problem
The power of open source is clear in addressing this complex, industry-wide challenge. When a problem is shared by hardware vendors, cloud providers, and platform builders, the most effective solution is usually a collaborative one. Instead of dozens of organizations working independently on the same problem, a shared open source project accelerates innovation and helps establish a common standard.
The llm-d project is a prime example of this collaboration in action. Initiated by Red Hat and IBM Research, the project was quickly joined by a coalition of industry leaders, including Google and NVIDIA, all working toward a collaboratively developed vision.
As a technology, llm-d is designed to provide a "well-lit path"—a clear, proven blueprint for managing AI inference at scale. Instead of building everything from scratch, the community is focused on optimizing and standardizing the operational challenges of running AI workloads at scale.
llm-d: A blueprint for production-grade AI
The llm-d project is developing an open source control plane that enhances Kubernetes with the specific capabilities AI workloads need. It does not replace Kubernetes; it adds a specialized layer of intelligence and extends the high-performance vLLM runtime into a distributed serving layer.
The llm-d community is focused on building features that have a direct impact on AI inference performance and efficiency, including:
- Semantic routing: llm-d's scheduler is aware of the unique resource requirements of each inference request. It can make smarter decisions about where to run a workload, making more efficient use of expensive resources and preventing costly over-provisioning. This goes beyond traditional load balancing by using real-time data, like the utilization of a model's key-value (KV) cache, to route requests to the best-suited instance (a simplified sketch of this idea follows the list).
- Workload disaggregation: llm-d separates complex inference tasks into smaller, manageable parts, specifically the prefill and decode phases. This provides granular control and enables the use of heterogeneous hardware, so the right resource is used for the right task to help reduce overall operational costs. For instance, a prefill pod can be optimized for compute-heavy tasks while a decode pod is tailored for memory-bandwidth efficiency. This enables a level of fine-grained optimization that is impossible with a monolithic approach.
- Support for advanced architectures: llm-d is designed to handle emerging model architectures, like mixture of experts (MoE), which require complex orchestration and parallelism across multiple nodes. By supporting wide parallelism, llm-d allows for the efficient use of these sparse models, which can deliver better performance and cost-efficiency than their dense counterparts but are more difficult to deploy at scale.
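The scoring logic below is a deliberately simplified Python sketch of the KV-cache-aware routing idea from the first item above. The signals, weights, and instance names are assumptions made for illustration; they do not reflect llm-d's actual scheduler implementation or APIs.

```python
# Illustrative only: score candidate serving instances using real-time signals
# (KV-cache utilization, queue depth, prefix-cache hits) and route the request
# to the highest-scoring one. Signal names and weights are assumptions.

from dataclasses import dataclass


@dataclass
class InstanceStats:
    name: str
    kv_cache_utilization: float  # 0.0 (empty) to 1.0 (full)
    queued_requests: int
    prefix_cache_hit: bool       # instance already holds this prompt's prefix


def score(instance: InstanceStats) -> float:
    """Higher is better: free KV-cache memory, short queue, warm prefix cache."""
    s = (1.0 - instance.kv_cache_utilization) * 10.0  # reward free KV-cache memory
    s -= instance.queued_requests * 2.0                # penalize long queues
    if instance.prefix_cache_hit:
        s += 15.0                                      # reusing cached prefill saves compute
    return s


def pick_instance(instances: list[InstanceStats]) -> InstanceStats:
    return max(instances, key=score)


if __name__ == "__main__":
    fleet = [
        InstanceStats("pool-a", kv_cache_utilization=0.9, queued_requests=4, prefix_cache_hit=False),
        InstanceStats("pool-b", kv_cache_utilization=0.4, queued_requests=1, prefix_cache_hit=True),
    ]
    print(pick_instance(fleet).name)  # prints "pool-b"
```

A plain round-robin load balancer ignores all three of these signals; an AI-aware scheduler can use them to avoid, for example, sending a long prompt to an instance whose KV cache is already nearly full.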
The llm-d community is taking the best ideas from fields like high-performance computing (HPC) and large-scale distributed systems, and working to avoid the rigid, specialized setups that can make these hard to use. It's strategically bringing together open technologies—like vLLM for model serving and the inference gateway for scheduling—to create a single, unified framework.
This focus on operationalizability and flexibility is a core design principle, and the project supports multiple hardware accelerators from vendors like NVIDIA, AMD, and Intel. By creating a flexible control plane that works across different hardware and environments, llm-d is working to establish a strong, lasting standard for the future of enterprise AI.
Final thoughts
For IT leaders focused on operationalizing AI today, the value of the llm-d project extends beyond its community. The work being done in this open source coalition—specifically the development of an intelligent, AI-aware control plane—is a direct response to the production challenges many organizations face today.
The advantages of llm-d are clear:
- Move beyond the single server: Scaling LLMs is not just about adding more machines. It's about implementing a strategic layer of infrastructure that can intelligently manage distributed workloads, handle complex hardware, and optimize for cost and performance.
- Leverage open standards: The most robust solutions emerge from collaborative open source efforts, not proprietary silos. Adopting a platform that is aligned with these open standards will prevent vendor lock-in and provide a more flexible, future-proof environment for AI initiatives.
- Operationalize with a trusted partner: You do not have to be an expert in distributed systems or contribute directly to the llm-d project to benefit from its innovation. The value created in the community is integrated into supported enterprise platforms, such as Red Hat AI, which provides a consistent and trusted foundation on which to deploy and manage AI at scale.
The future of enterprise AI depends on a solid infrastructure foundation. The work of the llm-d community is building that foundation, and a platform like Red Hat AI can help you put it into practice.
Resource
The adaptable enterprise: Why AI readiness is disruption readiness
About the author
Chris Wright is senior vice president and chief technology officer (CTO) at Red Hat. Wright leads the Office of the CTO, which is responsible for incubating emerging technologies and developing forward-looking perspectives on innovations such as artificial intelligence, cloud computing, distributed storage, software defined networking and network functions virtualization, containers, automation and continuous delivery, and distributed ledger.
During his more than 20 years as a software engineer, Wright has worked in the telecommunications industry on high availability and distributed systems, and in the Linux industry on security, virtualization, and networking. He has been a Linux developer for more than 15 years, most of that time spent working deep in the Linux kernel. He is passionate about open source software serving as the foundation for next generation IT systems.