Training large language models (LLMs) is a significant undertaking, but a more pervasive and often overlooked cost challenge is AI inference. Inference is the procedure by which a trained AI model processes new input data and generates an output. As organizations deploy these models in production, the costs can quickly become substantial, especially with high token volumes, long prompts, and growing usage demands. To run LLMs in a cost-effective and high-performing way, a comprehensive strategy is essential.
This approach addresses two critical areas: optimizing the inference runtime and optimizing the model itself.
Optimizing the inference runtime
Basic serving methods often struggle with inefficient GPU memory usage, suboptimal batch processing, and slow token generation. This is where a high-performance inference runtime becomes critical. vLLM is the de facto open source library for running LLM inference efficiently and at scale.
vLLM addresses these runtime challenges with advanced techniques, including:
- Continuous batching: Instead of processing requests one by one, vLLM groups tokens from multiple sequences into batches. This minimizes GPU idle time and significantly improves GPU utilization and inference throughput.
- PagedAttention: This memory management strategy efficiently handles large key-value (KV) caches. By dynamically allocating and managing GPU memory pages, PagedAttention greatly increases the number of concurrent requests and supports longer sequences without memory bottlenecks.
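As a rough illustration of how these techniques surface to users, the sketch below uses vLLM's offline Python API to batch several prompts through a single engine instance. The model name and the gpu_memory_utilization and max_num_seqs values are illustrative assumptions, not tuned recommendations; continuous batching and PagedAttention are applied by the engine automatically.

```python
# Minimal vLLM offline-inference sketch: the engine handles continuous
# batching and PagedAttention internally, so no per-request tuning is needed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model; any compatible checkpoint works
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights plus KV-cache pages
    max_num_seqs=256,                          # cap on sequences batched concurrently
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Summarize the benefit of continuous batching in one sentence.",
    "Explain PagedAttention to a systems engineer.",
]

# All prompts are scheduled together; tokens from different requests share
# GPU batches instead of being processed one sequence at a time.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```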
Optimizing the AI model
In addition to optimizing the runtime, organizations can also compress models to reduce their memory footprint and computational requirements. The two primary techniques are quantization and sparsity.
- Quantization: This technique represents a model’s numerical values, specifically its weights and activations, with fewer bits per value. This significantly reduces the memory needed to store model parameters. For example, a 70-billion parameter Llama model can be shrunk from approximately 140 GB to as small as 40 GB (see the sketch after this list). This means models can run on fewer resources and can double computational throughput without significantly degrading accuracy.
- Sparsity: Sparsity reduces computational demands by setting some of the model’s parameters to zero, allowing systems to bypass unnecessary operations. This can substantially reduce model complexity, decreasing memory usage and computational load, which results in faster inference and lower operational costs.
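To put rough numbers behind the quantization example above, here is a back-of-the-envelope estimate of weight storage for a 70-billion-parameter model at different precisions. It counts parameter bytes only; the KV cache, activations, and runtime overhead are excluded, which is why the 4-bit figure lands a little below the "as small as 40 GB" range cited above.

```python
# Back-of-the-envelope weight-storage estimate for a 70B-parameter model.
# Parameter bytes only; KV cache, activations, and runtime overhead excluded.
PARAMS = 70e9

bytes_per_param = {
    "FP16/BF16 (16-bit)": 2.0,
    "INT8/FP8 (8-bit)": 1.0,
    "INT4 (4-bit)": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision}: ~{PARAMS * nbytes / 1e9:.0f} GB")

# Approximate output:
#   FP16/BF16 (16-bit): ~140 GB
#   INT8/FP8 (8-bit): ~70 GB
#   INT4 (4-bit): ~35 GB
```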
Red Hat AI: Putting the strategy into practice
To help organizations implement this strategic approach, the Red Hat AI portfolio provides a unified set of solutions for achieving high-performance inference at scale.
Red Hat AI addresses both model and runtime optimization through its powerful set of tools and assets:
- Red Hat AI Inference Server: Red Hat provides an enterprise-ready and supported vLLM engine that uses continuous batching and memory-efficient methods. By increasing throughput and reducing GPU usage, the runtime helps organizations maximize the return on their expensive AI hardware.
- Access to validated and optimized models: Red Hat AI provides access to a repository of pre-evaluated and performance-tested models that are ready for use. These models are rigorously benchmarked against multiple evaluation tasks and can be found on the Red Hat AI Hugging Face repository, which allows organizations to achieve rapid time to value.
- Included LLM Compressor: The LLM Compressor toolkit provides a standardized way to apply compression techniques such as quantization. Red Hat uses the same toolkit to produce its optimized models, and customers can use it to compress their own fine-tuned or customized models (a rough sketch follows this list).
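As a rough sketch of what that compression step can look like, the snippet below follows the one-shot quantization pattern from the open source llm-compressor project. The model, calibration dataset, quantization scheme, and exact import paths are assumptions that may differ between releases, so treat this as an outline rather than a drop-in recipe.

```python
# Sketch of a one-shot 4-bit weight quantization run with llm-compressor.
# Import paths, scheme names, and the calibration dataset are assumptions
# and may vary between releases; consult the project documentation.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize linear layers to 4-bit weights (W4A16), leaving the output head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",   # hypothetical base or fine-tuned model
    dataset="open_platypus",                    # small calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W4A16",   # compressed checkpoint
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint is typically loadable directly by vLLM, so the compression and serving steps described above compose into a single workflow.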
By leveraging Red Hat AI, organizations can deploy high-performing, cost-effective models on a wide variety of hardware setups, helping teams meet rising AI demands while controlling costs and complexity.
To learn more about the fundamentals of inference performance engineering and model optimization, download the free e-book, Get started with AI Inference.
About the author
Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.
With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.