What is vLLM?
vLLM is an inference server that speeds up gen AI inference for large language models (LLMs) by making more efficient use of memory and graphics processing units (GPUs).
Using GPUs more efficiently helps LLMs perform calculations faster and at scale. This becomes increasingly important as organizations run real-time applications like chatbots or multimodal workflows.
This article highlights 3 real-world examples of how well-known companies are successfully using vLLM.
Why does vLLM matter for AI inference?
During inference, LLMs rely on key-value pairs to do a lot of math in a short period of time.
For every token (a word or word fragment) the model processes, its attention mechanism computes a pair of vectors: a key and a value. The model looks back over these key-value pairs to understand context and calculate a response.
AI inference uses these key-value pairs during its 2 main phases:
- Prefill is when the model processes the input prompt. The key-value pairs for each prompt token populate the key-value (KV) cache, which serves as the model’s short-term memory.
- Decode is when the model generates new tokens. It reuses the existing KV cache and appends new key-value pairs for each token it generates, as the sketch below illustrates.
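To make this concrete, here is a toy, illustrative sketch in plain Python (not vLLM code, with made-up shapes and numbers) of how a single attention head’s KV cache is filled during prefill and then grows token by token during decode:

```python
# Toy illustration (not vLLM code): how a KV cache grows during prefill and decode.
import numpy as np

d = 8                                       # hypothetical hidden size per attention head
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-head scaled dot-product attention over everything cached so far."""
    scores = q @ K.T / np.sqrt(d)           # (1, n_cached)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # (1, d)

# Prefill: process the whole prompt at once and populate the KV cache.
prompt_len = 5
K_cache = rng.normal(size=(prompt_len, d))  # one key vector per prompt token
V_cache = rng.normal(size=(prompt_len, d))  # one value vector per prompt token

# Decode: generate tokens one at a time, appending each new key-value pair.
for step in range(3):
    q = rng.normal(size=(1, d))             # query vector for the newest token
    _ = attend(q, K_cache, V_cache)         # reuse the cache instead of recomputing
    K_cache = np.vstack([K_cache, rng.normal(size=(1, d))])
    V_cache = np.vstack([V_cache, rng.normal(size=(1, d))])
    print(f"decode step {step}: cache now holds {K_cache.shape[0]} tokens")
```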
LLMs store key-value pairs for every processed token in the KV cache. Because the cache grows with prompt length and output length, it takes up a large share of GPU memory. Traditional LLM serving systems reserve a contiguous chunk of memory for each request’s maximum possible length, which fragments memory, wastes space, and slows inference down.
vLLM uses a memory management technique designed around how the KV cache is actually used during inference. It stores the cache in small, fixed-size blocks that don’t need to be contiguous and can be shared when requests repeat the same tokens. This helps prevent memory fragmentation and reduces redundant work for the LLM, making GPU memory usage more efficient and inference faster.
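Conceptually, this works much like paging in an operating system. The sketch below is a simplified, hypothetical illustration of the idea (not vLLM’s actual internals): a block table maps each request’s token positions to fixed-size physical blocks, new blocks are claimed only when a sequence crosses a block boundary, and finished requests return their blocks to a shared pool.

```python
# Simplified illustration of block-based (paged) KV cache bookkeeping.
BLOCK_SIZE = 16                                      # hypothetical tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # request id -> list of block ids

    def append_token(self, request_id, position):
        """Claim a physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(request_id, [])
        if position % BLOCK_SIZE == 0:               # start of a new block
            table.append(self.free_blocks.pop())
        block_id = table[position // BLOCK_SIZE]
        offset = position % BLOCK_SIZE
        return block_id, offset                      # where this token's key-value pair lives

    def release(self, request_id):
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                                # a 40-token sequence
    cache.append_token("req-0", pos)
print(len(cache.block_tables["req-0"]))              # 3 blocks, not one big contiguous buffer
cache.release("req-0")
```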
How does vLLM decrease GPU storage needs?
vLLM uses several technologies and techniques to use less memory and make inference faster (a brief usage sketch follows this list):
- Continuous batching lets the server add new requests to a batch that’s already running at each generation step, rather than waiting for the whole batch to finish. (vLLM can multitask.)
- PagedAttention is vLLM’s breakthrough technique. It stores the KV cache in fixed-size blocks (pages) that don’t need to be contiguous in GPU memory, which reduces fragmentation and saves GPU memory while still letting the model reuse previously computed tokens.
- Speculative decoding uses a smaller, faster model to draft upcoming tokens that the main model then verifies, which increases the speed and efficiency of the decode stage.
- Quantization is the process of storing model parameters in smaller numeric formats (for example, 8-bit instead of 16-bit) to reduce memory needs with minimal loss of accuracy. There are various quantization methods for model customization.
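As a rough illustration of how these pieces come together in practice, here is what serving a small batch of prompts with vLLM’s offline Python API can look like. The model name is just a placeholder, and exact arguments can vary between vLLM releases:

```python
# A minimal sketch of vLLM's offline batch API (placeholder model; arguments may vary by version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",         # any supported model from Hugging Face
    gpu_memory_utilization=0.90,       # share of GPU memory for weights and the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=64)

# Submitting many prompts at once lets the continuous batching scheduler
# interleave them and keep the GPU busy.
prompts = [
    "Explain what a KV cache is in one sentence.",
    "List three benefits of efficient GPU memory use.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```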
Processing fewer tokens or generating a response a few seconds faster might seem inconsequential. But when enterprises apply these memory-saving techniques across thousands of AI workloads, GPUs, and inference server calculations, the savings in time, money, and resources add up quickly.
This is a game changer for organizations that want to scale AI at the enterprise level.
Why are companies using vLLM?
Organizations are using AI inference in high-volume, highly variable workloads. But deploying LLMs consistently at scale requires a lot of computing power, resources, and specialized operational skills.
vLLM can overcome these challenges by making more efficient use of the hardware needed to support AI inference in the enterprise. This is why vLLM is especially attractive to industries that need flexibility and control in addition to speed.
As an open source solution, vLLM allows companies to:
- Own and manage their GPUs.
- Control their data.
- Experiment with new models as soon as they’re released.
This level of freedom can translate into a lower cost per token and fewer privacy concerns.
vLLM can be deployed across a variety of hardware, including NVIDIA and AMD GPUs, Google TPUs, Intel Gaudi, and AWS Neuron. It also isn’t tied to a single environment: it runs in the cloud, in the data center, or at the edge.
vLLM use cases at the enterprise level
From recruiting efforts to online gaming, scaling inference can become complex quickly.
The following examples show how enterprises are using the open source vLLM project. These companies aren’t Red Hat customers, but they benefit from the broader vLLM community and the technology it produces.
How does Roblox use vLLM?
Roblox is an online gaming platform that hosts millions of users around the world. Users can create their own gaming experience and play games others have created.
Its Assistant feature, an AI chatbot that helps users create content, has pushed the number of tokens Roblox processes to more than 1 billion per week. Additional features, such as real-time AI chat translation and its voice safety model, have added further inference complexity. This multimodality across millions of user interactions means more tokens to process, which requires more resources for inference.
To handle the increasing processing demands, Roblox adopted vLLM as its primary inference engine. Roblox specifically leans on vLLM’s speculative decoding capabilities for language tasks to serve its global customer base. Since adopting vLLM, Roblox has experienced a 50% reduction in latency to serve 4 billion tokens per week.
vLLM allows Roblox to scale and meet user demand as its platform continues to grow. Roblox chose vLLM because it aligns with its commitment to supporting open source technologies.
Listen to Roblox break down how they use vLLM in Red Hat’s vLLM office hours.
How does LinkedIn use vLLM?
LinkedIn adopted vLLM to support its wide range of gen AI use cases that cater to its large and active audience.
As 1 of the world’s largest professional networking sites, LinkedIn hosts more than 1 billion members in more than 200 countries. Now, vLLM allows LinkedIn to support more than 50 gen AI use cases, such as LinkedIn Hiring Assistant.
Using complex classification calculations, LinkedIn Hiring Assistant filters applicant qualifications like years of experience, skills, and previous employment. This helps recruiters match applicants to the best job fit.
But processing these wide-ranging classifications requires a lot of tokens (an average of 1,000 per candidate), and applicant pools can fill up with thousands of candidates.
More than 50% of applications share prefix tokens, because overlapping qualifications produce identical prompt prefixes. This makes LinkedIn Hiring Assistant a perfect use case for vLLM’s PagedAttention technology and continuous batching capabilities, which reduce latency, sustain high throughput, and ease the pressure on GPU memory.
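For illustration only (this is not LinkedIn’s code), the sketch below shows how automatic prefix caching can be enabled in vLLM so a shared prompt prefix is computed once and its KV blocks are reused across requests; the flag name and defaults may differ between vLLM versions:

```python
# Illustrative sketch: reusing a shared prompt prefix across requests with vLLM's
# automatic prefix caching (placeholder model and data; not LinkedIn's system).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "Job requirements: 5+ years of experience, Python, Kubernetes.\n"
candidates = [
    "Candidate A: 6 years of experience, Python, Go.",
    "Candidate B: 7 years of experience, Python, Kubernetes.",
]

# The shared prefix's KV blocks are computed once and reused for every candidate,
# instead of being recalculated request by request.
outputs = llm.generate(
    [shared_prefix + c + "\nIs this candidate a match?" for c in candidates],
    SamplingParams(max_tokens=32),
)
```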
Time Per Output Token (TPOT) reflects the average time it takes a model to generate each output token after the first. So far, vLLM has helped LinkedIn improve its TPOT by 7%.
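As a quick, hypothetical example of the arithmetic (the numbers below are invented for illustration, not LinkedIn’s measurements):

```python
# TPOT is roughly the generation time after the first token, divided by the
# remaining output tokens. All numbers here are made up for illustration.
total_time_s = 2.6            # hypothetical end-to-end response time
time_to_first_token_s = 0.5   # hypothetical prefill (time-to-first-token) latency
output_tokens = 150

tpot_s = (total_time_s - time_to_first_token_s) / (output_tokens - 1)
print(f"TPOT: {tpot_s * 1000:.1f} ms per token")   # ~14.1 ms; a 7% improvement shaves about 1 ms
```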
How does Amazon use vLLM?
Rufus, Amazon’s gen AI shopping assistant, aims to improve customer experience by decreasing decision fatigue. Rufus reportedly served 250 million customers in 2025, and that number continues to grow.
With a high number of customers using the gen AI shopping assistant, inference complexity increased. Amazon realized no single chip or instance had enough memory for Rufus to run smoothly.
Amazon prioritized scalable, multinode inference capabilities that maintain accuracy at faster speeds and lower latency. They achieved this with a multinode architecture that integrates vLLM for smoother, faster inference.
By using vLLM’s continuous batching technique, the multinode architecture was able to intelligently schedule inference processing so token volume didn’t impact latency or performance.
Using vLLM to increase the efficiency and throughput of its LLMs allows Amazon to scale gen AI projects like Rufus that will continue to grow and evolve with its customers.
How will vLLM impact the future of inference?
vLLM continues to be the foundation for the future of AI inference due to its core capabilities:
- Speed: Inference capabilities are constantly improving. Hardware vendors and model providers contribute directly to the project to improve speed and model efficiency.
- Community: vLLM has a large open source community that continues to grow. All of the top 10 model contributors, such as DeepSeek, NVIDIA, Meta, and Google, are building models that run on vLLM out of the box because of its efficiency.
- Flexibility: vLLM can be deployed across most AI hardware, including NVIDIA and AMD GPUs, Google TPUs, Intel Gaudi, AWS Neuron, and other accelerators such as MetaX and Rebellions. This diverse hardware support gives enterprises the flexibility to deliver outcomes with the resources they already have.
- Day-zero support: When popular model builders like Meta or Google release a new model, vLLM typically already supports the underlying architecture, so it can offer day-zero (immediate) support for the new model. This makes vLLM an accessible, out-of-the-box option for enterprises that want to speed up model deployment and lower costs.
The vLLM ecosystem also includes llm-d, a distributed inference framework built on vLLM for managing LLMs at scale in the hybrid cloud.
How Red Hat can help
Red Hat® AI is a suite of AI platforms built on Red Hat’s commitment to open source. As 1 of the largest commercial contributors to vLLM, we have a deep understanding of the technology and how it supports our AI platforms.
Powered by vLLM, Red Hat AI maximizes GPU use and supports faster response times. Its model compression capabilities increase inference efficiency without sacrificing performance, which is helpful in use cases that need an additional layer of data security in a hybrid environment.
Red Hat AI includes Red Hat OpenShift® AI, a platform for building, deploying, and managing open source AI models with vLLM. Red Hat OpenShift AI combines the efficiency of vLLM with additional community-driven open source projects like llm-d, whose modular architecture provides greater control, consistency, and more efficient resource scheduling. Together, they change how LLMs run natively on Kubernetes and how enterprises scale their AI workloads.
Artificial intelligence (AI) at Red Hat
From live events to hands-on product demos to deep technical research, see what we're doing with AI at Red Hat.