That GPU you're running? You're most likely not using it to the fullest.
You’ve deployed your large language model (LLM). It’s working, but many production deployments waste significant GPU capacity through suboptimal configurations. Your hardware spends time idle, waiting for data to move, or re-computing work it already did.
The obvious fix? Switch to a smaller model. Except, not really. You’ll sacrifice accuracy for minimal speedup while ignoring the 2–5x performance sitting dormant in your current setup.
What is AI optimization?
AI optimization is the process of refining how LLMs interact with hardware to eliminate waste and maximize performance. Much like streamlining a logistics network so every vehicle carries a full load, techniques like quantization and continuous batching help organizations achieve 3 to 5 times more throughput from their existing graphics processing units (GPUs). By moving beyond default configurations, it's possible to reduce operational costs by 60% to 80% while providing faster, more consistent response times for end users.
Here's what's actually happening
LLM inference has 2 phases with different resource demands. Prefill (processing your prompt) is compute-bound and prefers large batches. Decode (generating tokens) is memory-bound and requires fast memory bandwidth. Default configurations treat them the same, and GPUs end up underused.
Whether you're running vLLM, TensorRT-LLM, or custom infrastructure, the following 7 AI optimizations can help you use your hardware more efficiently. Production deployments using these techniques consistently deliver 2-5x more throughput from the exact same silicon, or reduce costs by 60-80% while providing the same level of performance.
1. Quantization: The low-hanging 2-4x speedup
Running FP16 (16-bit floating point) or BF16 (16-bit brain floating point) weights in production is leaving 50-75% performance on the table. Modern quantization has come a long way. You can compress your model to INT8 (8-bit integer) or even INT4 (4-bit integer) while retaining over 99% of baseline accuracy.
Start with FP8 (8-bit floating point) if you're on NVIDIA H100 or Hopper GPUs. You'll preserve accuracy and see close to 2x throughput improvements. For even more aggressive compression, activation-aware weight quantization (AWQ) or generalized post-training quantization (GPTQ) can take you down to 4-bit weights with 3-4x memory reduction, letting you serve larger models on fewer GPUs. For additional reading, see Optimizing generative AI models with quantization.
The reason quantization works so well comes down to where your bottleneck lives. When you're serving a handful of users in real-time (low batch sizes), your GPU spends most of its time waiting for weights to move from video random access memory (VRAM) to processors. This is memory-bound inference. Cut the weight size in half and you'll get a linear speedup in tokens per second. But when you're processing thousands of prompts at once (high batch sizes), the bottleneck shifts to raw compute power. Here, quantization still helps by unlocking specialized hardware kernels like NVIDIA H100's FP8 Tensor Cores that crunch numbers faster.
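To make the mechanics concrete, here's a toy, framework-free sketch of symmetric per-tensor INT8 quantization. Real quantizers like GPTQ and AWQ are far more sophisticated (per-channel and per-group scales, calibration data), but the core trade is the same: fewer bytes per weight in exchange for a small, bounded rounding error.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one scale maps floats to int8."""
    scale = max(abs(w) for w in weights) / 127.0   # largest magnitude lands on 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.063, 0.9, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# int8 needs 1 byte per weight vs 2 for FP16: half the bytes to move, which
# is a direct tokens/sec win whenever inference is memory-bound.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12   # error bounded by half a quantization step
```

The same logic scaled up to billions of weights is why halving weight size yields a near-linear speedup at low batch sizes: the bottleneck is bytes moved, not math done.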
The catch: Not all quantization is created equal. When quantization is tuned properly, GPTQ generally achieves better results (for more, see the "'Give Me BF16 or Give Me Death'? Accuracy-Performance Trade-Offs in LLM Quantization" paper on ArXiv).
If you’re not sure where to start, you can try a free, open source project for model compression—LLM Compressor. If you’d rather skip the work of quantizing the models yourself, that’s covered, too—Red Hat AI provides a repository of 600+ pre-quantized open source models from leading providers, available in the Red Hat AI Hugging Face repository.
Every model that is compressed and published to our Hugging Face repository comes paired with performance metrics and accuracy results in comparison to the baseline model. You can simply replace your existing baseline model with a quantized version by copying and replacing the model stub.
2. Automatic prefix caching: Stop re-computing the same prompts
If you're running retrieval-augmented generation (RAG), chatbots with system prompts, or any workflow with repeated prefixes, you're often wasting 30-80% of your compute on redundant prefills. Every time someone asks a question about the same PDF, you're reprocessing that entire document from scratch. Prefix caching fixes this issue by computing identical prompt prefixes and then reusing those results when the same prefix pops up in another prompt.
To get the most value out of this technique, lead your prompt with the static content (system instructions, document context) so that content is cached once, and only once. Only the variable content (the actual user question) then needs fresh prefill work on each request.
The impact of this step mostly shows up in time-to-first-token (TTFT). LLMs process prompts in 2 distinct phases—prefill, where the entire prompt gets processed (expensive and compute-intensive), and decode, where tokens generate 1 at a time. Prefix caching eliminates the redundant prefill work entirely. For RAG and agent workloads, you'll see a noticeable speedup in TTFT and dramatically lower costs per query.
Most modern inference engines enable this automatically. vLLM has it built in, but the key here is to make sure that you are structuring your prompts in such a way that you get the full benefit.
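Here's a toy sketch of the idea. Real engines like vLLM cache at the KV-cache block level and match prefixes token by token; this stand-in caches whole prefixes by hash, just to show why putting static content first pays off.

```python
import hashlib

class PrefixCache:
    """Caches 'prefill state' per prompt prefix so it's computed only once."""
    def __init__(self):
        self._store = {}

    def prefill(self, prefix: str):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key in self._store:
            return self._store[key], True            # hit: skip the expensive prefill
        state = f"kv-state:{len(prefix)}-chars"      # stand-in for real KV tensors
        self._store[key] = state
        return state, False

# Put static content (system prompt, document context) first so every
# request shares the same cacheable prefix.
SYSTEM = "You are a support assistant for ACME Corp. Answer from the manual."
cache = PrefixCache()
_, hit_first = cache.prefill(SYSTEM)    # first request pays the prefill cost
_, hit_later = cache.prefill(SYSTEM)    # later requests reuse the cached state
print(hit_first, hit_later)             # -> False True
```

If the variable content came first, no two requests would share a prefix and every call would miss, which is exactly what happens in production when prompts are structured carelessly.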
3. Disaggregated prefill and decode: A smarter way to serve LLMs
Prefill and decode have opposite resource profiles. Prefill is compute-heavy, batch-friendly, and optimized for high throughput. Decode, on the other hand, is memory-bound, latency-sensitive, and performs best with smaller batches. Running them on the same GPU is like using a sports car for grocery shopping: the hardware is capable of so much more, but it's not being used to its fullest potential.
The solution is to split prefill and decode into separate GPU pools. Route long prompts to a prefill cluster optimized for throughput, where large batches can run optimally. Generation requests can then be sent to a decode cluster that's tuned for latency, running small batches with non-volatile memory express (NVMe)-backed key-value (KV) cache offloading for longer contexts.
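A minimal, hypothetical routing sketch follows. The pool names, the phase field, and the load-spreading rule are all illustrative, not any real scheduler's API; a production router (llm-d, for instance) also weighs KV-cache locality and live load.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    phase: str  # "prefill" or "decode"

PREFILL_POOL = ["prefill-gpu-0", "prefill-gpu-1"]   # throughput-tuned, big batches
DECODE_POOL = ["decode-gpu-0", "decode-gpu-1"]      # latency-tuned, small batches

def route(req: Request) -> str:
    """Send compute-heavy prefill and latency-sensitive decode to separate pools."""
    pool = PREFILL_POOL if req.phase == "prefill" else DECODE_POOL
    # Trivial load spreading; a real scheduler also considers KV-cache locality.
    return pool[req.prompt_tokens % len(pool)]

print(route(Request(prompt_tokens=4096, phase="prefill")))   # -> prefill-gpu-0
print(route(Request(prompt_tokens=1, phase="decode")))       # -> decode-gpu-1
```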
If you’re interested in scaling your inference by leveraging disaggregated prefill and decode, check out llm-d, a high-performance distributed inference serving stack optimized for LLM production deployments on Kubernetes.
4. Flash Attention: The kernel upgrade that changed everything
Standard attention is O(N²) in memory, meaning the attention matrix it materializes grows with the square of the sequence length N. This is completely fine for 512 tokens, but catastrophic for 32k. Flash Attention rewrites the computation to be far more memory-efficient while staying mathematically identical to the original operation.
Most modern frameworks like vLLM ship with Flash Attention built in. For custom implementations, you can use xformers, Triton, or framework-native versions. The key is enabling it at compilation time or through inference engine flags. If something isn’t working as expected, more often than not it can be fixed with a config change rather than by rewriting the code.
The Flash Attention performance gain comes from how memory is handled. Instead of writing intermediate attention matrices out to slow high-bandwidth memory (HBM), it fuses operations and uses tiling to keep everything in the fast on-chip static random-access memory (SRAM).
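The tiling trick can be illustrated in miniature with a framework-free "online softmax" for a single query row. This is a teaching sketch, not the fused CUDA kernel: keys and values stream through in tiles, and a running max and running sum let the result stay exact without ever holding the full score row.

```python
import math

def naive_attention_row(q, keys, values):
    """Reference: materialize every score, then softmax-weight the values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def tiled_attention_row(q, keys, values, tile=2):
    """Same math, but keys/values stream in tiles; the full score row never exists."""
    d = len(q)
    run_max, run_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(run_max, max(scores))
        fix = math.exp(run_max - new_max)        # rescale earlier partial sums
        run_sum *= fix
        acc = [a * fix for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]
```

Both functions return identical results; the tiled version just needs O(tile) working memory per query instead of O(N), which is the property the real kernel exploits to stay in SRAM.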
The result is 3-10x faster attention computation and 10-20x more efficient memory usage. Without Flash Attention, running 32k token context windows on consumer GPUs is nearly impossible. With it, long context windows become something we don’t have to worry about.
5. Continuous batching: Never let your GPU idle
Traditional static batching waits for a full batch before starting. For example, if the first request finishes early, the GPU sits idle waiting for the slowest request to complete. That's like waiting for the slowest person to finish eating before anyone can leave the table.
In AI serving, this is called "head-of-line blocking." If request A needs 5 tokens and request B needs 500, request A's GPU "seat" stays unused for 495 cycles. Continuous batching fixes this by letting request C sit down the millisecond request A stands up.
Modern inference engines like vLLM, TensorRT-LLM, and Text Generation Inference (TGI) implement continuous batching out of the box. The key here is tuning your maximum batch size: stop growing it just before the point where adding 1 more request would increase latency by more than 10% while improving throughput by less than 1%.
The result is GPU use staying around the 90-100% mark, instead of dropping to around 50%. In a production environment, this means approximately twice the throughput and around half the latency variance.
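A toy simulation makes the gap visible. The request lengths and slot count are illustrative, but the mechanics match: static batching pays for the slowest request in every batch, while continuous batching reseats a finished slot immediately.

```python
def static_batching(lengths, slots):
    """Decode steps when each batch waits for its slowest request."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])   # idle seats until the longest finishes
    return steps

def continuous_batching(lengths, slots):
    """Decode steps when a finished request is replaced immediately."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < slots:
            active.append(pending.pop(0))    # seat the next request right away
        steps += 1                           # one decode step for every active seat
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [5, 500, 5, 500]                   # tokens each request needs
print(static_batching(lengths, slots=2))     # -> 1000
print(continuous_batching(lengths, slots=2)) # -> 510
```

The short requests ride along almost for free once the seats refill continuously, which is where the "roughly twice the throughput" figure comes from in mixed workloads.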
6. Aggressive KV cache management: The hidden bottleneck
Everyone focuses on model weights, but the KV cache is where requests live or die. Poor KV cache management means out-of-memory (OOM) errors, swapping to CPU (10-50x slower), or rejected requests. This is really one of the hidden bottlenecks that can wreak havoc on your deployment.
Each token you generate creates a KV cache entry. A 32k context serving 16 concurrent requests means 512k cache entries all sitting on GPU memory (VRAM). Without an intelligent system to manage this memory, an OOM error is imminent. The solution is PagedAttention (vLLM) or equivalent memory paging for treating KV cache like virtual memory.
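The paging idea can be sketched in a few lines. This is a stand-in allocator, not vLLM's actual block manager: KV memory is carved into fixed-size blocks, sequences receive non-contiguous blocks only as they grow, and finished requests return their blocks to the pool.

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: KV memory as fixed-size blocks."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free-block pool, like a page table
        self.tables = {}                      # request id -> list of block ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:          # current block is full: allocate one
            if not self.free:
                raise MemoryError("KV cache exhausted: swap or preempt")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id, []))  # blocks return to the pool
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(40):                   # a 40-token sequence spans only 3 blocks
    cache.append_token("req-a")
print(len(cache.tables["req-a"]))     # -> 3
```

Contrast this with naive pre-allocation, which would reserve max-context memory for every request up front: paging means memory is committed only as sequences actually grow, so far more requests fit before an OOM.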
Configure your gpu-memory-utilization setting to leave headroom for KV cache growth. You don’t really want to set the utilization to 99% and then have to debug failing requests. For larger contexts, you can use KV cache quantization (FP8) to cut memory usage in half, or just offload older cache entries to NVMe storage when they aren’t actively being used.
Monitor your KV cache hit rate and swap events closely. If you're using vLLM, check the /metrics endpoint. If you see frequent swapping between GPU and CPU memory, you're leaving massive performance on the table. Fixing KV cache management often unlocks more headroom than any other single AI optimization.
7. Speculative decoding: Generate 2-3 tokens for the price of 1
LLMs generate 1 token at a time, but what if you could guess the next 3-5 tokens and verify them in 1 pass?
This is called speculative decoding. Use a small draft model (0.5-2B parameters) to speculatively generate tokens. The Red Hat AI Hugging Face repository has a few EAGLE (extrapolation algorithm for greater language-model efficiency) models to pick from.
Those tokens are then verified with your target model in a single forward pass. You simply accept the correct tokens, reject the wrong ones, and continue onward. The draft model is much faster, but also less accurate. When it gets the prediction right, you effectively get 2-4 tokens for the price of 1. When it's wrong, you only lose the cheap draft time.
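Here's the accept/verify loop in miniature, with stub functions standing in for the draft and target models. A real implementation scores all drafted positions in one batched target forward pass; this sketch re-checks them one by one to keep the logic visible.

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens cheaply, then keep them until the first mismatch."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):                      # cheap: k small-model steps
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    # In a real engine, one target forward pass verifies all k positions at once.
    for t in drafted:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)              # draft guessed right: a nearly free token
            ctx.append(t)
        else:
            accepted.append(expected)       # first miss: take the target's token, stop
            break
    return accepted

# Stubs: the draft agrees with the target on "a", "b", "c", then diverges.
target = lambda ctx: "abcdx"[len(ctx)]
draft = lambda ctx: "abczz"[len(ctx)]

print(speculative_step([], draft, target, k=4))   # -> ['a', 'b', 'c', 'd']
```

Note that the accept rule always emits exactly what the target model would have chosen, which is why the output distribution is unchanged; only the wall-clock time improves.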
This technique works extremely well for code generation, structured output, and other predictable generation tasks. It's less effective for open-ended, creative generation. The speedup can reach upwards of 3x for the right workloads, and the best part is that there isn't any variance in the output between the baseline model alone and the baseline model paired with a draft model; you're just getting there a whole lot faster.
The AI optimization cheat sheet
| Your pain point | Fix this first | This too! |
| --- | --- | --- |
| High latency (TTFT) | Prefix caching | Disaggregated prefill |
| Low throughput | Quantization | Continuous batching |
| OOM errors | KV cache management | Quantization |
| Long context (>8k) | Flash Attention | KV cache quantization |
| High $/token | Quantization | Speculative decoding |
Every AI optimization you skip is a performance gain you're leaving on the table. The difference between "good enough" and "production-grade" AI infrastructure is 3-5x more performance at half the cost. The best thing about all of these is that they're free to configure!
Default configurations are built for demos, not production. Tune your stack carefully and deliver the best-in-class performance, efficiently, that your end users appreciate.
Resource
Getting started with AI inference
About the author
Sawyer Bowerman is an AI Developer Advocate on Red Hat’s AI team based in Boston, MA. He specializes in high-performance model serving and inference, focusing on scaling open source ecosystems like vLLM and llm-d to make large language models more efficient and accessible for developers. He is dedicated to bridging the gap between raw model performance and real-world developer productivity through open-source innovation.