What is vLLM?
vLLM, which stands for virtual large language model, is an open source library maintained by the vLLM community. It helps large language models (LLMs) perform calculations more efficiently and at scale.
Specifically, vLLM is an inference server that speeds up the output of generative AI applications by making better use of GPU memory.
How does vLLM work?
To understand the value of vLLM, it’s important to understand what an inference server does, and the baseline mechanics of how an LLM operates. From there, we can better understand how vLLM plays a role in improving the performance of existing language models.
What is an inference server?
An inference server is software that helps an AI model draw new conclusions based on its prior training. Inference servers feed input requests through a machine learning model and return an output.
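To make that request-in, output-out flow concrete, here is a minimal sketch using vLLM's offline Python API. It assumes vLLM is installed, and the small model name shown is only an example; this is an illustration, not a production setup.

```python
# A minimal sketch of inference with vLLM's offline Python API.
# Assumes vLLM is installed; the model name below is just an example.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # load a small example model
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# The inference-server role: feed an input request through the model,
# return the generated output.
outputs = llm.generate(["What is the capital of Ireland?"], sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```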
To infer is to conclude based on evidence. You may see your friend’s living room light on, but you don’t see them. You may infer that they’re home, but you don’t have absolute evidence to prove it.
A language model also doesn’t have absolute evidence about the meaning of a word or phrase (it’s a piece of software), so it uses its training as evidence. In a series of calculations based on data, it generates a conclusion, just as you conclude that if the light is on, your friend is probably home.
LLMs use math to draw conclusions
When an LLM is being trained, it learns via mathematical calculations. When an LLM is generating a response (inference), it does so by performing a series of probability calculations (more math).
In order for an LLM to understand what you’re requesting from it, the model needs to understand how words relate to each other and how to make associations between words. Instead of learning about semantics and reasoning with words like humans do, LLMs “reason” with…you guessed it, math.
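As a toy illustration of one of those probability calculations, the sketch below uses made-up scores (not real model output) to show how a softmax turns per-token scores into the probabilities an LLM uses to choose its next word.

```python
import math

# Toy example: the model assigns a score (logit) to each candidate next
# token, and softmax converts those scores into probabilities.
logits = {"Dublin": 4.2, "Paris": 1.3, "banana": -2.0}   # made-up scores

total = sum(math.exp(score) for score in logits.values())
probabilities = {token: math.exp(score) / total for token, score in logits.items()}

print(probabilities)   # the model samples from (or picks) the most likely tokens
```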
When an LLM responds to millions of users per day, it’s doing a lot of calculations. Processing all these calculations at once while an application is live can be challenging, because (traditionally) the processing power involved in running an LLM can quickly consume significant memory.
vLLM uses PagedAttention to process calculations more efficiently
A landmark study, Efficient Memory Management for Large Language Model Serving with PagedAttention, identified that existing LLM serving systems manage memory inefficiently, wasting much of the GPU memory set aside for attention calculations. PagedAttention is a memory management technique introduced by vLLM that’s inspired by virtual memory and paging systems within operating systems.
This research recognizes that the key value (KV) cache (the short-term memory of an LLM) grows and shrinks as requests are processed, and offers vLLM as a solution for managing that space and computing power in a more stable way.
Essentially, vLLM works as a set of instructions that encourages the KV cache to create shortcuts by continuously “batching” user requests.
Before we move forward, let’s quickly define what KV cache and continuous batching are.
What is KV cache?
KV stands for key value. Key value refers to the way an LLM stores the meaning it computes for a word or phrase. Let’s say you’re processing the key value for an item on a menu: french fries (the key) are charged at $3.99 (the value). So, when a cashier rings up an order of french fries, the computed “value” of that “key” is $3.99. LLMs process KVs similarly in that they hold the corresponding value for each key (or token) in their cache.
Cache refers to short-term memory storage. Think about your personal computer: when things are running slowly, it’s common practice to “clear your cache” to make room for better, faster processing.
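To show how the paging idea applies to that cache, here is a simplified, hypothetical sketch (not vLLM’s actual implementation) in which KV cache space is handed out in small fixed-size blocks, the way an operating system hands out memory pages, instead of one large contiguous reservation per request.

```python
BLOCK_SIZE = 16                      # tokens per cache block (illustrative value)
free_blocks = list(range(100))       # pool of physical cache block ids
block_tables = {}                    # request id -> list of physical block ids
token_counts = {}                    # request id -> number of tokens stored

def append_token(request_id):
    """Reserve KV-cache space for one more generated token of a request."""
    count = token_counts.get(request_id, 0)
    if count % BLOCK_SIZE == 0:      # current block is full (or none allocated yet)
        block_tables.setdefault(request_id, []).append(free_blocks.pop())
    token_counts[request_id] = count + 1

def finish(request_id):
    """Return a finished request's blocks to the pool so others can reuse them."""
    free_blocks.extend(block_tables.pop(request_id, []))
    token_counts.pop(request_id, None)

for _ in range(40):                  # a request that generates 40 tokens
    append_token("req-1")
print(len(block_tables["req-1"]))    # 3 small blocks, not one big reservation
finish("req-1")
```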
What is continuous batching?
Continuous batching is a technique used to process multiple queries simultaneously, with the aim of improving overall processing efficiency.
Consider this: A chatbot is getting thousands of queries each minute, and many of those queries pose similar questions, like “What is the capital of India?” and “What is the capital of Ireland?” Both of these queries include the words “what is the capital of”, a string of tokens (words) that the LLM has to run many calculations on to create meaning from.
With vLLM, the chatbot can hold this string of tokens (“what is the capital of”) in short-term memory (the KV cache) and send a single “translation request” rather than two separate ones.
In other words, instead of generating a brand-new response for each query, vLLM allows the KV cache to hold memory and create shortcuts for new queries that are similar to previously computed calculations. Processing the calculations of similar queries in a batch (rather than individually) improves throughput and optimizes memory allocation.
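A rough sketch of the scheduling idea, with the actual model call stubbed out, might look like the loop below: new requests join the running batch at every generation step and finished requests leave immediately, so batch slots never sit idle waiting for the slowest request to complete.

```python
from collections import deque

# A highly simplified sketch of continuous batching (the model call is stubbed out).
waiting = deque([
    "What is the capital of India?",
    "What is the capital of Ireland?",
])
running = []            # [prompt, tokens_generated] pairs currently in the batch
MAX_BATCH = 8           # how many requests the batch can hold at once
MAX_TOKENS = 4          # tiny limit so this toy example finishes quickly

while waiting or running:
    # Admit waiting requests as soon as there is room in the batch.
    while waiting and len(running) < MAX_BATCH:
        running.append([waiting.popleft(), 0])

    # One decoding step for every request in the batch.
    for request in running:
        request[1] += 1                  # pretend we generated one more token

    # Retire finished requests immediately, freeing their slot for new arrivals.
    running = [r for r in running if r[1] < MAX_TOKENS]

print("all requests served")
```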
How can vLLM help your organization?
vLLM allows organizations to “do more with less” in a market where the hardware needed for LLM-based applications comes with a hefty price tag.
Building cost-efficient and reliable LLM services requires significant computing power, energy resources, and specialized operational skills. These challenges effectively put the benefits of customized, deployment-ready, and more security-conscious AI out of reach for many organizations.
vLLM and PagedAttention, the algorithm it’s built on, aim to address these challenges by making more efficient use of the hardware needed to support AI workloads.
Benefits of vLLM
Using vLLM as an inference server for LLMs has benefits such as:
- Faster response time: By some calculations, vLLM achieves up to 24x higher throughput (how much data an LLM can process) compared to Hugging Face Transformers, a popular open source library for working with LLMs.
- Reduced hardware costs: More efficient use of resources means fewer GPUs are needed to handle the processing of LLMs.
- Scalability: vLLM organizes virtual memory so the GPU can handle more simultaneous requests from users.
- Data privacy: Self-hosting an LLM with vLLM provides you with more control over data privacy and usage compared to using a third-party LLM service or application like ChatGPT.
- Open source innovation: Community involvement in maintaining and sustaining vLLM allows for consistent improvements to code. Transparency in how users can access and modify code provides freedom for developers to use vLLM in whatever way meets their needs.
Why vLLM is becoming a standard for enhancing LLM performance
PagedAttention is the primary algorithm that came out of vLLM. However, PagedAttention is not the only capability that vLLM provides. Additional performance optimizations that vLLM can offer include:
- PyTorch Compile/CUDA Graph - for speeding up how the model executes on the GPU.
- Quantization - for reducing the memory required to run models (see the configuration sketch after this list).
- Tensor parallelism - for breaking up the work of processing among multiple GPUs.
- Speculative decoding - for speeding up text generation by using a smaller model to predict tokens and a larger model to validate that prediction.
- Flash Attention - for improving the efficiency of transformer models.
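As a hedged example of how a couple of these options can be combined, the sketch below uses parameter names from vLLM’s Python API (quantization, tensor_parallel_size); exact values and support depend on your vLLM version, model, and hardware, and the model name is illustrative only.

```python
from vllm import LLM

# Illustrative configuration combining two of the optimizations above.
# Parameter names are real vLLM options; the values are examples only
# and assume an AWQ-quantized model and two available GPUs.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # example of an AWQ-quantized model
    quantization="awq",                # quantization: less memory per weight
    tensor_parallel_size=2,            # tensor parallelism: split work across 2 GPUs
)
```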
Aside from the optimization abilities vLLM offers, its flexible nature has also helped it grow in popularity. vLLM works with both small and large language models and integrates with popular models and frameworks. Finally, the open source nature of vLLM allows for code transparency, customization, and faster bug fixes.
Manage your AI, the open source way
The Red Hat® AI portfolio uses open source innovation to meet the challenges of wide-scale enterprise AI, and vLLM is a critical tool within our toolbox.
vLLM is one of multiple inference serving runtimes offered with Red Hat® OpenShift® AI.
OpenShift AI is a flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. OpenShift AI supports the full lifecycle of AI/ML experiments and models, on-premises and in the public cloud.
Red Hat Completes Acquisition of Neural Magic to Fuel Optimized Generative AI Innovation Across the Hybrid Cloud
Red Hat has completed its acquisition of Neural Magic, a pioneer in software and algorithms that accelerate generative AI (gen AI) inference workloads, furthering its vision of high-performing AI workloads that address customer use cases wherever needed across the hybrid cloud.