What is vLLM?
vLLM is an inference server that speeds up gen AI inference for large language models (LLMs) by making more efficient use of memory and graphics processing units (GPUs).
Using GPUs more efficiently helps LLMs perform calculations faster and at scale. This becomes increasingly important as organizations run real-time applications like chatbots or multimodal workflows.
This article highlights 3 real-world examples of how well-known companies are successfully using vLLM.
Why does vLLM matter for AI inference?
During inference, LLMs rely on key-value pairs to do a lot of math in a short period of time.
For every token (a word or word fragment) the model processes, its attention mechanism computes a pair of vectors: a key and a value. The model looks back over these key-value pairs to understand context and calculate a response.
AI inference uses these key-value pairs during its 2 main phases:
- Prefill is when the model processes the input prompt. The key-value pairs for each prompt token populate the key-value (KV) cache, which serves as the model’s short-term memory.
- Decode is when the model generates new tokens. It reuses the existing KV cache and appends new key-value pairs for each token it generates, as the sketch below illustrates.
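To make this concrete, here is a toy, illustrative sketch in plain Python (not vLLM code, with made-up shapes and numbers) of how a single attention head’s KV cache is filled during prefill and then grows token by token during decode:

```python
# Toy illustration (not vLLM code): how a KV cache grows during prefill and decode.
import numpy as np

d = 8                                       # hypothetical hidden size per attention head
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-head scaled dot-product attention over everything cached so far."""
    scores = q @ K.T / np.sqrt(d)           # (1, n_cached)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # (1, d)

# Prefill: process the whole prompt at once and populate the KV cache.
prompt_len = 5
K_cache = rng.normal(size=(prompt_len, d))  # one key vector per prompt token
V_cache = rng.normal(size=(prompt_len, d))  # one value vector per prompt token

# Decode: generate tokens one at a time, appending each new key-value pair.
for step in range(3):
    q = rng.normal(size=(1, d))             # query vector for the newest token
    _ = attend(q, K_cache, V_cache)         # reuse the cache instead of recomputing
    K_cache = np.vstack([K_cache, rng.normal(size=(1, d))])
    V_cache = np.vstack([V_cache, rng.normal(size=(1, d))])
    print(f"decode step {step}: cache now holds {K_cache.shape[0]} tokens")
```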
LLMs store key-value pairs for every processed token in the KV cache. Because the cache grows with prompt length and output length, it takes up a large share of GPU memory. Traditional LLM serving systems reserve a contiguous chunk of memory for each request’s maximum possible length, which fragments memory, wastes space, and slows inference down.
vLLM uses a memory management technique designed around how the KV cache is actually used during inference. It stores the cache in small, fixed-size blocks that don’t need to be contiguous and can be shared when requests repeat the same tokens. This helps prevent memory fragmentation and reduces redundant work for the LLM, making GPU memory usage more efficient and inference faster.
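Conceptually, this works much like paging in an operating system. The sketch below is a simplified, hypothetical illustration of the idea (not vLLM’s actual internals): a block table maps each request’s token positions to fixed-size physical blocks, new blocks are claimed only when a sequence crosses a block boundary, and finished requests return their blocks to a shared pool.

```python
# Simplified illustration of block-based (paged) KV cache bookkeeping.
BLOCK_SIZE = 16                                      # hypothetical tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # request id -> list of block ids

    def append_token(self, request_id, position):
        """Claim a physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(request_id, [])
        if position % BLOCK_SIZE == 0:               # start of a new block
            table.append(self.free_blocks.pop())
        block_id = table[position // BLOCK_SIZE]
        offset = position % BLOCK_SIZE
        return block_id, offset                      # where this token's key-value pair lives

    def release(self, request_id):
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                                # a 40-token sequence
    cache.append_token("req-0", pos)
print(len(cache.block_tables["req-0"]))              # 3 blocks, not one big contiguous buffer
cache.release("req-0")
```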
How does vLLM decrease GPU storage needs?
vLLM uses several technologies and techniques to use less memory and make inference faster (a brief usage sketch follows this list):
- Continuous batching lets the server add new requests to a batch that’s already running at each generation step, rather than waiting for the whole batch to finish. (vLLM can multitask.)
- PagedAttention is vLLM’s breakthrough technique. It stores the KV cache in fixed-size blocks (pages) that don’t need to be contiguous in GPU memory, which reduces fragmentation and saves GPU memory while still letting the model reuse previously computed tokens.
- Speculative decoding uses a smaller, faster model to draft upcoming tokens that the main model then verifies, which increases the speed and efficiency of the decode stage.
- Quantization is the process of storing model parameters in smaller numeric formats (for example, 8-bit instead of 16-bit) to reduce memory needs with minimal loss of accuracy. There are various quantization methods for model customization.
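As a rough illustration of how these pieces come together in practice, here is what serving a small batch of prompts with vLLM’s offline Python API can look like. The model name is just a placeholder, and exact arguments can vary between vLLM releases:

```python
# A minimal sketch of vLLM's offline batch API (placeholder model; arguments may vary by version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",         # any supported model from Hugging Face
    gpu_memory_utilization=0.90,       # share of GPU memory for weights and the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=64)

# Submitting many prompts at once lets the continuous batching scheduler
# interleave them and keep the GPU busy.
prompts = [
    "Explain what a KV cache is in one sentence.",
    "List three benefits of efficient GPU memory use.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```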
Processing fewer tokens or generating a response a few seconds faster might seem inconsequential. But when enterprises apply these memory-saving techniques across thousands of AI workloads, GPUs, and inference server calculations, the savings in time, money, and resources add up quickly.
This is a game changer for organizations that want to scale AI at the enterprise level.
Why are companies using vLLM?
Organizations are using AI inference in high-volume, highly variable workloads. But deploying LLMs consistently at scale requires a lot of computing power, resources, and specialized operational skills.
vLLM can overcome these challenges by making more efficient use of the hardware needed to support AI inference in the enterprise. This is why vLLM is especially attractive to industries that need flexibility and control in addition to speed.
As an open source solution, vLLM allows companies to:
- Own and manage their GPUs.
- Control their data.
- Experiment with new models as soon as they’re released.
This level of freedom can translate into a lower cost per token and fewer privacy concerns.
vLLM can be deployed across a variety of hardware, including NVIDIA and AMD GPUs, Google TPUs, Intel Gaudi, and AWS Neuron. It also isn’t tied to a single environment: it runs in the cloud, in the data center, or at the edge.
vLLM use cases at the enterprise level
From recruiting efforts to online gaming, scaling inference can become complex quickly.
The following examples show how enterprises are using the open source vLLM project. These companies aren’t Red Hat customers, but they benefit from the broader vLLM community and the technology it produces.
How does Roblox use vLLM?
Roblox is an online gaming platform that hosts millions of users around the world. Users can create their own gaming experience and play games others have created.
Its Assistant feature, an AI chatbot that helps users create content, has pushed the number of tokens Roblox processes to more than 1 billion per week. Additional features, such as real-time AI chat translation and its voice safety model, have added further inference complexity. This multimodality across millions of user interactions means more tokens to process, which requires more resources for inference.
To handle the increasing processing demands, Roblox adopted vLLM as its primary inference engine. Roblox specifically leans on vLLM’s speculative decoding capabilities for language tasks to serve its global customer base. Since adopting vLLM, Roblox has experienced a 50% reduction in latency to serve 4 billion tokens per week.
vLLM allows Roblox to scale and meet user demand as its platform continues to grow. Roblox chose vLLM because it aligns with its commitment to supporting open source technologies.
Listen to Roblox break down how they use vLLM in Red Hat’s vLLM office hours.
How does LinkedIn use vLLM?
LinkedIn adopted vLLM to support its wide range of gen AI use cases that cater to its large and active audience.
As 1 of the world’s largest professional networking sites, LinkedIn hosts more than 1 billion members in more than 200 countries. Now, vLLM allows LinkedIn to support more than 50 gen AI use cases, such as LinkedIn Hiring Assistant.
Using complex classification calculations, LinkedIn Hiring Assistant filters applicant qualifications like years of experience, skills, and previous employment. This helps recruiters match applicants to the best job fit.
But processing these wide-ranging classifications requires a lot of tokens (an average of 1,000 per candidate), and applicant pools can fill up with thousands of candidates.
More than 50% of applications share prefix tokens, because overlapping qualifications produce identical prompt prefixes. This makes LinkedIn Hiring Assistant a perfect use case for vLLM’s PagedAttention technology and continuous batching capabilities, which reduce latency, sustain high throughput, and ease the pressure on GPU memory.
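For illustration only (this is not LinkedIn’s code), the sketch below shows how automatic prefix caching can be enabled in vLLM so a shared prompt prefix is computed once and its KV blocks are reused across requests; the flag name and defaults may differ between vLLM versions:

```python
# Illustrative sketch: reusing a shared prompt prefix across requests with vLLM's
# automatic prefix caching (placeholder model and data; not LinkedIn's system).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "Job requirements: 5+ years of experience, Python, Kubernetes.\n"
candidates = [
    "Candidate A: 6 years of experience, Python, Go.",
    "Candidate B: 7 years of experience, Python, Kubernetes.",
]

# The shared prefix's KV blocks are computed once and reused for every candidate,
# instead of being recalculated request by request.
outputs = llm.generate(
    [shared_prefix + c + "\nIs this candidate a match?" for c in candidates],
    SamplingParams(max_tokens=32),
)
```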
Time Per Output Token (TPOT) reflects the average time it takes a model to generate each output token after the first. So far, vLLM has helped LinkedIn improve its TPOT by 7%.
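As a quick, hypothetical example of the arithmetic (the numbers below are invented for illustration, not LinkedIn’s measurements):

```python
# TPOT is roughly the generation time after the first token, divided by the
# remaining output tokens. All numbers here are made up for illustration.
total_time_s = 2.6            # hypothetical end-to-end response time
time_to_first_token_s = 0.5   # hypothetical prefill (time-to-first-token) latency
output_tokens = 150

tpot_s = (total_time_s - time_to_first_token_s) / (output_tokens - 1)
print(f"TPOT: {tpot_s * 1000:.1f} ms per token")   # ~14.1 ms; a 7% improvement shaves about 1 ms
```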
How does Amazon use vLLM?
Rufus, Amazon’s gen AI shopping assistant, aims to improve customer experience by decreasing decision fatigue. Rufus reportedly served 250 million customers in 2025, and that number continues to grow.
With a high number of customers using the gen AI shopping assistant, inference complexity increased. Amazon realized no single chip or instance had enough memory for Rufus to run smoothly.
Amazon prioritized scalable, multinode inference capabilities that maintain accuracy at faster speeds and lower latency. They achieved this with a multinode architecture that integrates vLLM for smoother, faster inference.
By using vLLM’s continuous batching technique, the multinode architecture was able to intelligently schedule inference processing so token volume didn’t impact latency or performance.
Using vLLM to increase the efficiency and throughput of its LLMs allows Amazon to scale gen AI projects like Rufus that will continue to grow and evolve with its customers.
How will vLLM impact the future of inference?
vLLM continues to be the foundation for the future of AI inference due to its core capabilities:
- Speed: Inference capabilities are constantly improving. Hardware vendors and model providers contribute directly to the project to improve speed and model efficiency.
- Community: vLLM has a large open source community that continues to grow. All of the top 10 model contributors, such as DeepSeek, NVIDIA, Meta, and Google, are building models that run on vLLM out of the box because of its efficiency.
- Flexibility: vLLM can be deployed across most AI hardware, including NVIDIA and AMD GPUs, Google TPUs, Intel Gaudi, AWS Neuron, and other accelerators such as MetaX and Rebellions. This diverse hardware support gives enterprises the flexibility to deliver outcomes with the resources they already have.
- Day-zero support: When popular model builders like Meta or Google release a new model, vLLM typically already supports the underlying architecture, so it can offer day-zero (immediate) support for the new model. This makes vLLM an accessible, out-of-the-box option for enterprises that want to speed up model deployment and lower costs.
The vLLM ecosystem also includes llm-d, a distributed inference framework built on vLLM for managing LLMs at scale in the hybrid cloud.
How Red Hat can help
Red Hat® AI is a suite of AI platforms built on Red Hat’s commitment to open source. As 1 of the largest commercial contributors to vLLM, we have a deep understanding of the technology and how it supports our AI platforms.
Powered by vLLM, Red Hat AI maximizes GPU use and supports faster response times. Its model compression capabilities increase inference efficiency without sacrificing performance, which is helpful in use cases that need an additional layer of data security in a hybrid environment.
Red Hat AI includes Red Hat OpenShift® AI, a platform for building, deploying, and managing open source AI models with vLLM. Red Hat OpenShift AI combines the efficiency of vLLM with additional community-driven open source projects like llm-d, whose modular architecture provides greater control, consistency, and more efficient resource scheduling. Together, they change how LLMs run natively on Kubernetes and how enterprises scale their AI workloads.
Artificial intelligence (AI) at Red Hat
From live events to hands-on product demos to deep technical research, see what we're doing with AI at Red Hat.