What is vLLM?
vLLM, which stands for virtual large language model, is an open source library maintained by the vLLM community. It helps large language models (LLMs) perform calculations more efficiently and at scale.
Specifically, vLLM is an inference server that speeds up the output of generative AI applications by making better use of GPU memory.
How does vLLM work?
To understand the value of vLLM, it’s important to understand what an inference server does, and the baseline mechanics of how an LLM operates. From there, we can better understand how vLLM plays a role in improving the performance of existing language models.
What is an inference server?
An inference server is software that helps an AI model draw new conclusions based on its prior training. Inference servers feed input requests through a machine learning model and return an output.
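To make that request-in, output-out flow concrete, here is a minimal sketch using vLLM's offline Python API. It assumes vLLM is installed, and the small model name shown is only an example; this is an illustration, not a production setup.

```python
# A minimal sketch of inference with vLLM's offline Python API.
# Assumes vLLM is installed; the model name below is just an example.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # load a small example model
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# The inference-server role: feed an input request through the model,
# return the generated output.
outputs = llm.generate(["What is the capital of Ireland?"], sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```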
To infer is to conclude based on evidence. You may see your friend’s living room light on, but you don’t see them. You may infer that they’re home, but you don’t have absolute evidence to prove it.
A language model also doesn’t have absolute evidence about the meaning of a word or phrase (it’s a piece of software), so it uses its training as evidence. In a series of calculations based on data, it generates a conclusion, just as you conclude that if the light is on, your friend is probably home.
LLMs use math to draw conclusions
When an LLM is being trained, it learns via mathematical calculations. When an LLM is generating a response (inference), it does so by performing a series of probability calculations (more math).
In order for an LLM to understand what you’re requesting from it, the model needs to understand how words relate to each other and how to make associations between words. Instead of learning about semantics and reasoning with words like humans do, LLMs “reason” with…you guessed it, math.
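As a toy illustration of one of those probability calculations, the sketch below uses made-up scores (not real model output) to show how a softmax turns per-token scores into the probabilities an LLM uses to choose its next word.

```python
import math

# Toy example: the model assigns a score (logit) to each candidate next
# token, and softmax converts those scores into probabilities.
logits = {"Dublin": 4.2, "Paris": 1.3, "banana": -2.0}   # made-up scores

total = sum(math.exp(score) for score in logits.values())
probabilities = {token: math.exp(score) / total for token, score in logits.items()}

print(probabilities)   # the model samples from (or picks) the most likely tokens
```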
When an LLM responds to millions of users per day, it’s doing a lot of calculations. Processing all these calculations at once while an application is live can be challenging, because (traditionally) the processing power involved in running an LLM can quickly consume significant memory.
vLLM uses PagedAttention to process calculations more efficiently
A landmark study, Efficient Memory Management for Large Language Model Serving with PagedAttention, identified that existing LLM serving systems manage memory inefficiently, wasting much of the GPU memory set aside for attention calculations. PagedAttention is a memory management technique introduced by vLLM that’s inspired by virtual memory and paging systems within operating systems.
This research recognizes that the key value (KV) cache (the short-term memory of an LLM) grows and shrinks as requests are processed, and offers vLLM as a solution for managing that space and computing power in a more stable way.
Essentially, vLLM works as a set of instructions that encourages the KV cache to create shortcuts by continuously “batching” user requests.
Before we move forward, let’s quickly define what KV cache and continuous batching are.
What is KV cache?
KV stands for key value. Key value refers to the way an LLM stores the meaning it computes for a word or phrase. Let’s say you’re processing the key value for an item on a menu: french fries (the key) are charged at $3.99 (the value). So, when a cashier rings up an order of french fries, the computed “value” of that “key” is $3.99. LLMs process KVs similarly in that they hold the corresponding value for each key (or token) in their cache.
Cache refers to short-term memory storage. Think about your personal computer: when things are running slowly, it’s common practice to “clear your cache” to make room for better, faster processing.
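To show how the paging idea applies to that cache, here is a simplified, hypothetical sketch (not vLLM’s actual implementation) in which KV cache space is handed out in small fixed-size blocks, the way an operating system hands out memory pages, instead of one large contiguous reservation per request.

```python
BLOCK_SIZE = 16                      # tokens per cache block (illustrative value)
free_blocks = list(range(100))       # pool of physical cache block ids
block_tables = {}                    # request id -> list of physical block ids
token_counts = {}                    # request id -> number of tokens stored

def append_token(request_id):
    """Reserve KV-cache space for one more generated token of a request."""
    count = token_counts.get(request_id, 0)
    if count % BLOCK_SIZE == 0:      # current block is full (or none allocated yet)
        block_tables.setdefault(request_id, []).append(free_blocks.pop())
    token_counts[request_id] = count + 1

def finish(request_id):
    """Return a finished request's blocks to the pool so others can reuse them."""
    free_blocks.extend(block_tables.pop(request_id, []))
    token_counts.pop(request_id, None)

for _ in range(40):                  # a request that generates 40 tokens
    append_token("req-1")
print(len(block_tables["req-1"]))    # 3 small blocks, not one big reservation
finish("req-1")
```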
What is continuous batching?
Continuous batching is a technique used to process multiple queries simultaneously, with the aim of improving overall processing efficiency.
Consider this: A chatbot is getting thousands of queries each minute, and many of those queries pose similar questions, like “What is the capital of India?” and “What is the capital of Ireland?” Both of these queries include the words “what is the capital of”, a string of tokens (words) that the LLM has to run many calculations on to create meaning from.
With vLLM, the chatbot can hold this string of tokens (“what is the capital of”) in short-term memory (the KV cache) and send a single “translation request” rather than two separate ones.
In other words, instead of generating a brand-new response for each query, vLLM allows the KV cache to hold memory and create shortcuts for new queries that are similar to previously computed calculations. Processing the calculations of similar queries in a batch (rather than individually) improves throughput and optimizes memory allocation.
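A rough sketch of the scheduling idea, with the actual model call stubbed out, might look like the loop below: new requests join the running batch at every generation step and finished requests leave immediately, so batch slots never sit idle waiting for the slowest request to complete.

```python
from collections import deque

# A highly simplified sketch of continuous batching (the model call is stubbed out).
waiting = deque([
    "What is the capital of India?",
    "What is the capital of Ireland?",
])
running = []            # [prompt, tokens_generated] pairs currently in the batch
MAX_BATCH = 8           # how many requests the batch can hold at once
MAX_TOKENS = 4          # tiny limit so this toy example finishes quickly

while waiting or running:
    # Admit waiting requests as soon as there is room in the batch.
    while waiting and len(running) < MAX_BATCH:
        running.append([waiting.popleft(), 0])

    # One decoding step for every request in the batch.
    for request in running:
        request[1] += 1                  # pretend we generated one more token

    # Retire finished requests immediately, freeing their slot for new arrivals.
    running = [r for r in running if r[1] < MAX_TOKENS]

print("all requests served")
```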
How can vLLM help your organization?
vLLM allows organizations to “do more with less” in a market where the hardware needed for LLM-based applications comes with a hefty price tag.
Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges effectively put the benefits of customized, deployment-ready, and more security-conscious AI out of reach for many organizations.
vLLM and PagedAttention, the algorithm it’s built on, aim to address these challenges by making more efficient use of the hardware needed to support AI workloads.
Benefits of vLLM
Using vLLM as an inference server for LLMs has benefits such as:
- Faster response time: By some calculations, vLLM achieves up to 24x higher throughput (how much data an LLM can process) compared to Hugging Face Transformers, a popular open source library for working with LLMs.
- Reduced hardware costs: More efficient use of resources means fewer GPUs are needed to handle the processing of LLMs.
- Scalability: vLLM organizes virtual memory so the GPU can handle more simultaneous requests from users.
- Data privacy: Self-hosting an LLM with vLLM provides you with more control over data privacy and usage compared to using a third-party LLM service or application like ChatGPT.
- Open source innovation: Community involvement in maintaining and sustaining vLLM allows for consistent improvements to code. Transparency in how users can access and modify code provides freedom for developers to use vLLM in whatever way meets their needs.
Why vLLM is becoming a standard for enhancing LLM performance
PagedAttention is the primary algorithm that came out of vLLM. However, PagedAttention is not the only capability that vLLM provides. Additional performance optimizations that vLLM can offer include:
- PyTorch Compile/CUDA Graph - for speeding up how the model executes on the GPU.
- Quantization - for reducing the memory required to run models (see the configuration sketch after this list).
- Tensor parallelism - for breaking up the work of processing among multiple GPUs.
- Speculative decoding - for speeding up text generation by using a smaller model to predict tokens and a larger model to validate that prediction.
- Flash Attention - for improving the efficiency of transformer models.
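As a hedged example of how a couple of these options can be combined, the sketch below uses parameter names from vLLM’s Python API (quantization, tensor_parallel_size); exact values and support depend on your vLLM version, model, and hardware, and the model name is illustrative only.

```python
from vllm import LLM

# Illustrative configuration combining two of the optimizations above.
# Parameter names are real vLLM options; the values are examples only
# and assume an AWQ-quantized model and two available GPUs.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # example of an AWQ-quantized model
    quantization="awq",                # quantization: less memory per weight
    tensor_parallel_size=2,            # tensor parallelism: split work across 2 GPUs
)
```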
Aside from the optimization abilities vLLM offers, its flexible nature has also helped it grow in popularity. vLLM works with both small and large language models and integrates with popular models and frameworks. Finally, the open source nature of vLLM allows for code transparency, customization, and faster bug fixes.
Manage your AI, the open source way
The Red Hat® AI portfolio uses open source innovation to meet the challenges of wide-scale enterprise AI, and vLLM is a critical tool within our toolbox.
vLLM is one of multiple inference serving runtimes offered with Red Hat® OpenShift® AI.
OpenShift AI is a flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. OpenShift AI supports the full lifecycle of AI/ML experiments and models, on-premises and in the public cloud.
Red Hat Completes Acquisition of Neural Magic to Fuel Optimized Generative AI Innovation Across the Hybrid Cloud
Red Hat has completed its acquisition of Neural Magic, a pioneer in software and algorithms that accelerate generative AI (gen AI) inference workloads, furthering its vision of high-performing AI workloads that address customer use cases wherever needed across the hybrid cloud.