Training large language models (LLMs) is a significant undertaking, but a more pervasive and often overlooked cost challenge is AI inference. Inference is the procedure by which a trained AI model processes new input data and generates an output. As organizations deploy these models in production, the costs can quickly become substantial, especially with high token volumes, long prompts, and growing usage demands. To run LLMs in a cost-effective and high-performing way, a comprehensive strategy is essential.
This approach addresses two critical areas: optimizing the inference runtime and optimizing the model itself.
Optimizing the inference runtime
Basic serving methods often struggle with inefficient GPU memory usage, suboptimal batch processing, and slow token generation. This is where a high-performance inference runtime becomes critical. vLLM is the de facto open source library for running LLM inference efficiently and at scale.
vLLM addresses these runtime challenges with advanced techniques, including:
- Continuous batching: Instead of processing requests one by one, vLLM groups tokens from multiple sequences into batches. This minimizes GPU idle time and significantly improves GPU utilization and inference throughput.
- PagedAttention: This memory management strategy efficiently handles large key-value (KV) caches. By dynamically allocating and managing GPU memory pages, PagedAttention greatly increases the number of concurrent requests and supports longer sequences without memory bottlenecks.
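As a rough illustration of how these techniques surface to users, the sketch below uses vLLM's offline Python API to batch several prompts through a single engine instance. The model name and the gpu_memory_utilization and max_num_seqs values are illustrative assumptions, not tuned recommendations; continuous batching and PagedAttention are applied by the engine automatically.

```python
# Minimal vLLM offline-inference sketch: the engine handles continuous
# batching and PagedAttention internally, so no per-request tuning is needed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model; any compatible checkpoint works
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights plus KV-cache pages
    max_num_seqs=256,                          # cap on sequences batched concurrently
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Summarize the benefit of continuous batching in one sentence.",
    "Explain PagedAttention to a systems engineer.",
]

# All prompts are scheduled together; tokens from different requests share
# GPU batches instead of being processed one sequence at a time.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```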
Optimizing the AI model
In addition to optimizing the runtime, organizations can also compress models to reduce their memory footprint and computational requirements. The two primary techniques are quantization and sparsity.
- Quantization: This technique represents a model’s numerical values, specifically its weights and activations, with fewer bits per value. This significantly reduces the memory needed to store model parameters. For example, a 70-billion parameter Llama model can be shrunk from approximately 140 GB to as small as 40 GB (see the sketch after this list). This means models can run on fewer resources and can double computational throughput without significantly degrading accuracy.
- Sparsity: Sparsity reduces computational demands by setting some of the model’s parameters to zero, allowing systems to bypass unnecessary operations. This can substantially reduce model complexity, decreasing memory usage and computational load, which results in faster inference and lower operational costs.
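To put rough numbers behind the quantization example above, here is a back-of-the-envelope estimate of weight storage for a 70-billion-parameter model at different precisions. It counts parameter bytes only; the KV cache, activations, and runtime overhead are excluded, which is why the 4-bit figure lands a little below the "as small as 40 GB" range cited above.

```python
# Back-of-the-envelope weight-storage estimate for a 70B-parameter model.
# Parameter bytes only; KV cache, activations, and runtime overhead excluded.
PARAMS = 70e9

bytes_per_param = {
    "FP16/BF16 (16-bit)": 2.0,
    "INT8/FP8 (8-bit)": 1.0,
    "INT4 (4-bit)": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision}: ~{PARAMS * nbytes / 1e9:.0f} GB")

# Approximate output:
#   FP16/BF16 (16-bit): ~140 GB
#   INT8/FP8 (8-bit): ~70 GB
#   INT4 (4-bit): ~35 GB
```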
Red Hat AI: Putting the strategy into practice
To help organizations implement this strategic approach, the Red Hat AI portfolio provides a unified set of solutions for achieving high-performance inference at scale.
Red Hat AI addresses both model and runtime optimization through its powerful set of tools and assets:
- Red Hat AI Inference Server: Red Hat provides an enterprise-ready and supported vLLM engine that uses continuous batching and memory-efficient methods. By increasing throughput and reducing GPU usage, the runtime helps organizations maximize the return on their expensive AI hardware.
- Access to validated and optimized models: Red Hat AI provides access to a repository of pre-evaluated and performance-tested models that are ready for use. These models are rigorously benchmarked against multiple evaluation tasks and can be found on the Red Hat AI Hugging Face repository, which allows organizations to achieve rapid time to value.
- Included LLM Compressor: The LLM Compressor toolkit provides a standardized way to apply compression techniques such as quantization. Red Hat uses the same toolkit to produce its optimized models, and customers can use it to compress their own fine-tuned or customized models (a rough sketch follows this list).
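As a rough sketch of what that compression step can look like, the snippet below follows the one-shot quantization pattern from the open source llm-compressor project. The model, calibration dataset, quantization scheme, and exact import paths are assumptions that may differ between releases, so treat this as an outline rather than a drop-in recipe.

```python
# Sketch of a one-shot 4-bit weight quantization run with llm-compressor.
# Import paths, scheme names, and the calibration dataset are assumptions
# and may vary between releases; consult the project documentation.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize linear layers to 4-bit weights (W4A16), leaving the output head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",   # hypothetical base or fine-tuned model
    dataset="open_platypus",                    # small calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W4A16",   # compressed checkpoint
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint is typically loadable directly by vLLM, so the compression and serving steps described above compose into a single workflow.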
By leveraging Red Hat AI, organizations can deploy high-performing, cost-effective models on a wide variety of hardware setups, helping teams meet rising AI demands while controlling costs and complexity.
To learn more about the fundamentals of inference performance engineering and model optimization, download the free e-book, Get started with AI Inference.
About the author
Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.
With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.