What is vLLM?
vLLM is an inference server that speeds up gen AI inference for large language models (LLMs) by making more efficient use of memory and graphics processing units (GPUs).
Using GPUs more efficiently helps LLMs perform calculations faster and at scale. This becomes increasingly important when organizations need real-time applications like chatbots or multimodal workflows.
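To make this concrete, here is a minimal sketch of running a model through vLLM's offline Python API. The model name, prompts, and sampling settings are illustrative, and the exact API surface can vary between vLLM versions, so treat this as an example rather than a reference.

```python
# A minimal sketch of offline inference with vLLM's Python API.
# The model name, prompts, and sampling settings below are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "What is an inference server?",
    "Explain KV caching in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# vLLM loads the model onto the available GPUs and batches requests for you.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```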
This article highlights 3 real-world examples of how well-known companies are successfully using vLLM.
Why does vLLM matter for AI inference?
During inference, LLMs rely on key-value pairs to do a lot of math in a short period of time.
LLMs turn tokens (terms or phrases) into numbers so they can understand language and calculate answers. For every token it processes, the model computes a key and a value, two sets of numbers it reuses later to calculate a response.
AI inference uses these key-value pairs during its 2 main phases:
- Prefill is when the model processes the input prompt. The key-value pairs computed for each prompt token form the key-value (KV) cache, which serves as the model’s short-term memory.
- Decode is when the model generates new tokens. It reuses the existing KV cache to calculate each new token, adding that token’s key-value pair to the cache as it goes, as the sketch below illustrates.
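The toy sketch below walks through these 2 phases with made-up numbers standing in for the key and value tensors a real model would compute; the helper functions are purely illustrative and not part of any library.

```python
# Toy illustration of prefill and decode with a KV cache.
# The "keys" and "values" here are fake numbers, not real attention tensors.

def compute_kv(token: str) -> tuple[float, float]:
    # Stand-in for the attention math: derive a fake key/value pair per token.
    h = sum(ord(c) for c in token)
    return (h % 97 / 97.0, h % 89 / 89.0)

def prefill(prompt_tokens: list[str]) -> list[tuple[float, float]]:
    # Prefill: process the whole prompt once and build the KV cache.
    return [compute_kv(tok) for tok in prompt_tokens]

def decode(kv_cache: list[tuple[float, float]], steps: int) -> list[str]:
    # Decode: generate one token at a time, reusing the cache instead of
    # recomputing key-value pairs for every earlier token.
    generated = []
    for _ in range(steps):
        score = sum(k * v for k, v in kv_cache)       # stand-in for attention
        next_token = f"tok{int(score * 1000) % 100}"  # stand-in for sampling
        generated.append(next_token)
        kv_cache.append(compute_kv(next_token))       # the cache grows each step
    return generated

cache = prefill(["What", "is", "vLLM", "?"])
print(decode(cache, steps=5))
```

Notice that the cache is built once during prefill and then only appended to during decode, which is why its size grows with every generated token.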
LLMs store a key-value pair for every processed token in the KV cache. Because the cache grows with both the prompt length and the generated output, it takes up a large share of GPU memory. Traditional LLM memory management reserves that space in large contiguous chunks, leaving much of it unused and slowing the LLM down.
vLLM uses a memory management technique, known as PagedAttention, that understands how the KV cache is used during inference. It stores the cache in small, fixed-size blocks rather than one large contiguous region, which prevents memory fragmentation, and it can reuse blocks that hold identical key-value pairs (such as a shared prompt prefix) instead of recomputing or duplicating them. This makes GPU memory usage more efficient and LLM inference faster.
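The sketch below shows the core idea of block-based KV cache management under two simplifying assumptions: a fixed block size of 4 tokens and exact-match block reuse. The BlockAllocator class and block size are hypothetical illustrations, not vLLM internals.

```python
# Toy sketch of paged (block-based) KV cache allocation with prefix reuse.
# BLOCK_SIZE and BlockAllocator are illustrative, not vLLM's actual internals.

BLOCK_SIZE = 4  # tokens per cache block (illustrative)

class BlockAllocator:
    def __init__(self) -> None:
        self.blocks: dict[tuple, int] = {}  # block contents -> block id
        self.next_id = 0

    def allocate(self, tokens: list[str]) -> list[int]:
        # Split a sequence into fixed-size blocks. If another sequence already
        # cached a block with exactly the same tokens (a shared prefix),
        # reuse it instead of storing a duplicate.
        block_ids = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            key = tuple(tokens[i:i + BLOCK_SIZE])
            if key not in self.blocks:
                self.blocks[key] = self.next_id
                self.next_id += 1
            block_ids.append(self.blocks[key])
        return block_ids

allocator = BlockAllocator()
a = allocator.allocate("You are a helpful assistant . Summarize this".split())
b = allocator.allocate("You are a helpful assistant . Translate this".split())
print(a, b)                                 # the shared prefix maps to the same block
print("blocks in use:", allocator.next_id)  # 3 blocks instead of 4
```

Because memory is handed out in small blocks and identical blocks are shared, there are no large unused gaps between sequences, and repeated prompt prefixes are stored only once.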