What is vLLM?
vLLM is an inference server that speeds up generative AI inference for large language models (LLMs) by making more efficient use of memory and graphics processing units (GPUs).
Using GPUs more efficiently helps LLMs perform calculations faster and at scale. This becomes especially important for organizations running real-time applications like chatbots or multimodal workflows.
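To make this concrete, here is a minimal sketch of offline inference with vLLM's Python API. It assumes vLLM is installed (`pip install vllm`) and a supported GPU is available; the model name and sampling settings are placeholders, not recommendations.

```python
# Minimal sketch of offline inference with vLLM's Python API.
# Assumes `pip install vllm` and a supported GPU; the model name and
# sampling settings below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face model vLLM supports
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["What is vLLM?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)     # the generated completion
```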
This article highlights 3 real-world examples of how well-known companies are successfully using vLLM.
Why does vLLM matter for AI inference?
During inference, LLMs rely on key values to perform a large volume of calculations in a short amount of time.
LLMs use key values to represent tokens (terms or phrases) numerically: every token (key) is associated with a number (value) that the model uses to understand language and calculate a response.
AI inference uses key values during its 2 main phases:
- Prefill is when the model processes the input prompt. The key values computed for each prompt token form the key value (KV) cache, which serves as the model’s short-term memory.
- Decode is when the model generates new tokens. It reuses the existing KV cache rather than recalculating key values for earlier tokens, adding an entry for each newly generated token (a short code sketch follows this list).
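The toy sketch below (plain Python with NumPy, not model code) illustrates the two phases: prefill fills the cache with one entry per prompt token, and decode appends one entry per generated token while reusing everything already cached. The names and sizes (project_kv, NUM_HEADS, HEAD_DIM) are illustrative assumptions.

```python
import numpy as np

NUM_HEADS, HEAD_DIM = 4, 64            # illustrative sizes, not real model dimensions
kv_cache = {"keys": [], "values": []}  # one (key, value) entry per processed token

def project_kv(token_embedding):
    # Stand-in for the model's learned key/value projections.
    k = np.full((NUM_HEADS, HEAD_DIM), token_embedding.mean())
    v = np.full((NUM_HEADS, HEAD_DIM), token_embedding.sum())
    return k, v

def prefill(prompt_embeddings):
    # Prefill: compute and cache key values for every prompt token.
    for emb in prompt_embeddings:
        k, v = project_kv(emb)
        kv_cache["keys"].append(k)
        kv_cache["values"].append(v)

def decode_step(new_token_embedding):
    # Decode: reuse everything already cached and append one new entry
    # for the token that was just generated.
    k, v = project_kv(new_token_embedding)
    kv_cache["keys"].append(k)
    kv_cache["values"].append(v)
    return len(kv_cache["keys"])

prompt = [np.full(16, float(i)) for i in range(8)]    # 8 toy prompt "embeddings"
prefill(prompt)
for step in range(4):                                 # generate 4 toy tokens
    cache_len = decode_step(np.full(16, 100.0 + step))
print("KV cache entries:", cache_len)                 # 12 = 8 prompt + 4 generated
```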
LLMs store key values for every processed token in the KV cache. Because the cache grows with prompt length and output generation, it takes up a large share of LLM memory. Traditional LLM memory management systems don’t organize calculations or allocate memory efficiently, which slows inference.
vLLM uses a memory management technique that understands how the KV cache is used during inference. It retrieves cached data in a way that identifies repetitive key values, helping prevent memory fragmentation and reducing redundant work for the LLM. This makes GPU memory usage more efficient and LLM inference faster.
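vLLM calls this technique PagedAttention: the KV cache is stored in small fixed-size blocks instead of one large contiguous region per request. The toy allocator below sketches that idea only; the block size, pool size, and function names are illustrative assumptions, not vLLM internals.

```python
# Toy sketch of block-based KV cache allocation (not vLLM internals).
BLOCK_SIZE = 16           # tokens per block (illustrative)
NUM_BLOCKS = 8            # size of the toy GPU block pool

free_blocks = list(range(NUM_BLOCKS))  # physical blocks not yet assigned
sequences = {}                          # seq_id -> {"blocks": [...], "num_tokens": int}

def append_token(seq_id):
    """Reserve KV cache space for one more token of a sequence.

    A new physical block is claimed only when the current block fills up,
    so memory grows in small fixed-size steps instead of one large
    contiguous reservation per request.
    """
    seq = sequences.setdefault(seq_id, {"blocks": [], "num_tokens": 0})
    if seq["num_tokens"] % BLOCK_SIZE == 0:      # current block is full (or none yet)
        if not free_blocks:
            raise MemoryError("toy pool exhausted; a real scheduler would wait or preempt")
        seq["blocks"].append(free_blocks.pop())  # grab any free block, wherever it sits
    seq["num_tokens"] += 1
    return seq["blocks"]

def free_sequence(seq_id):
    # When a request finishes, its blocks return to the pool and can be
    # reused immediately by other requests.
    free_blocks.extend(sequences.pop(seq_id)["blocks"])

# Two requests share the same pool; neither reserves memory it does not use.
for _ in range(20):
    append_token("request-a")   # 20 tokens -> 2 blocks
for _ in range(5):
    append_token("request-b")   # 5 tokens -> 1 block
print(sequences["request-a"]["blocks"], sequences["request-b"]["blocks"])
free_sequence("request-a")      # blocks go straight back to the pool
print("free blocks:", len(free_blocks))
```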