Use case

Fast, efficient inference with Red Hat AI

When you optimize inference, your models get faster, smarter, and more reliable.

Inference is at the core of generative AI. But as models grow more complex, inference becomes slower and more resource intensive.

To run inference at scale, models need a lot of storage, memory, and compute power, which can consume the majority of your budget. And the rapid adoption of agentic AI intensifies compute workloads even further.

Red Hat® AI optimizes inference to help you stay cost effective, allow your teams to scale, and reliably support agentic AI.

Choose any model, on any accelerator, in any cloud environment. 

Maximize existing infrastructure to reduce cost-per-token and increase throughput.

Scale dynamically with intelligent distributed inference and insight into unpredictable demand.

What you can do

Red Hat AI supports fast, consistent, and cost-effective inference at scale. Driven by open source technologies like vLLM and llm-d, it offers the flexibility to scale across the hybrid cloud with the model and accelerator of your choice. 

Deploy and scale across the hybrid cloud

Maintain operational consistency across different hardware accelerators (GPUs, TPUs) and run models on premises, in the cloud, or at the edge.

Choose your models and accelerators

Choose any combination of models and hardware accelerators with a consistent operational experience. Build a unified Model-as-a-Service architecture without rebuilding your whole stack.

Compress and quantize models of any size

Reduce compute utilization and its related costs while maintaining high model response accuracy. 

Increase throughput by reducing cost-per-token

Maximize your existing infrastructure using vLLM and llm-d. Optimizing available resources delivers the low latency and high throughput you need to run cost-effective inference and agents at scale.
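To make cost-per-token concrete, here is a back-of-the-envelope calculation. Every number below is hypothetical, chosen only to show the arithmetic; none is a Red Hat benchmark.

```python
# Illustrative cost-per-token math (all figures are assumptions, not benchmarks).
GPU_COST_PER_HOUR = 4.00   # $/hour for one accelerator (assumed)
BASELINE_TPS = 1_000       # tokens/second before optimization (assumed)
OPTIMIZED_TPS = 2_500      # tokens/second with better batching/utilization (assumed)

def cost_per_million_tokens(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Dollars spent to generate one million tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(BASELINE_TPS, GPU_COST_PER_HOUR)
optimized = cost_per_million_tokens(OPTIMIZED_TPS, GPU_COST_PER_HOUR)
print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
```

The point of the sketch: throughput and cost-per-token are two views of the same number, so anything that raises tokens per second on hardware you already own lowers the cost of every token you serve.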

Manage the end-to-end model lifecycle

Build with familiar tools and frameworks on a single, centralized platform with a Kubernetes core.

Ensure reliable operation at scale

All inference workloads are governed through controlled access, policy enforcement, and observability. 

Models-as-a-Service with Red Hat AI

Find out more about Models-as-a-Service offerings that are scalable, open, and cost-efficient by design.

233% ROI with Red Hat AI

A Forrester Consulting study, commissioned by Red Hat, found that a composite organization—based on current Red Hat AI customers—realized an ROI of 233% by deploying Red Hat AI.1

Learn how it works

Red Hat AI offers flexible, open source-powered deployment options to deliver efficient, cost-effective, and controlled inference across models, agents, and applications. 

AI model inference with Red Hat AI | Red Hat Explains. Video duration: 4:19

Features

Red Hat AI offers exceptional control over models, agents, and hardware to improve inference at scale. 

vLLM

Maximize throughput and GPU utilization

vLLM is an inference engine designed to maximize throughput and accelerate response times across hardware accelerators. It uses the PagedAttention algorithm to optimize GPU utilization and speed up the output of generative AI applications.

Use vLLM to optimize the deployment of any gen AI model, on any AI accelerator, while maintaining controlled and predictable inference behavior in production environments.
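To give a feel for what PagedAttention does, the toy sketch below performs the block-table bookkeeping in plain Python: KV-cache memory is carved into fixed-size blocks that sequences claim on demand and return when finished, so no memory is stranded by over-allocation. The class and its methods are invented for illustration; real vLLM does this over GPU memory pages.

```python
# Toy sketch of paged KV-cache allocation, the idea behind vLLM's PagedAttention.
# Illustration only: real vLLM manages GPU memory, not Python lists.
BLOCK_SIZE = 16  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical blocks

    def append_token(self, seq_id: str, position: int) -> int:
        """Return the physical block holding this token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):  # current blocks full? grab another
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: str) -> None:
        """Finished sequences hand their blocks back for other requests to reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                    # a 40-token sequence...
    cache.append_token("req-1", pos)
blocks_used = len(cache.block_tables["req-1"])
print(blocks_used)                       # ...occupies ceil(40/16) = 3 blocks
cache.free("req-1")
```

Because each sequence holds only the blocks it actually fills, many more concurrent requests fit in the same memory than with contiguous per-request allocation, which is where the throughput gains come from.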

llm-d

Speed up distributed inference at scale

llm-d is a Kubernetes-native, open source framework that speeds up distributed LLM inference at scale. 

When an AI model receives complicated queries with a lot of data, llm-d makes processing faster. Its accessible, modular architecture makes it an ideal platform for distributed LLM inference, supporting scalable deployments while maintaining consistency, control, and governance across distributed workloads.

GenAI-specific telemetry

Get insights to meet strict service level objectives (SLOs)

Use metrics and insights from models in production to find out where and how your models can improve. See model-specific performance metrics like time-to-first-token (TTFT), KV-cache hit rate, and GPU utilization. Use these metrics to monitor performance, detect anomalies, and help inference meet operational, security, and policy requirements.
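As a sketch of how two of those metrics are derived, the snippet below computes mean TTFT and overall KV-cache hit rate from a hypothetical per-request trace. The field names and values are invented; in a real deployment these events would come from the inference server's telemetry endpoint.

```python
# Hypothetical per-request trace (all values invented for illustration).
requests = [
    {"arrived": 0.00, "first_token": 0.18, "cache_hits": 950, "cache_lookups": 1000},
    {"arrived": 0.05, "first_token": 0.31, "cache_hits": 400, "cache_lookups": 1000},
    {"arrived": 0.10, "first_token": 0.95, "cache_hits": 0,   "cache_lookups": 800},
]

# Time-to-first-token: how long each caller waited before streaming began.
ttfts = [r["first_token"] - r["arrived"] for r in requests]
mean_ttft = sum(ttfts) / len(ttfts)

# KV-cache hit rate: fraction of lookups served from already-cached prefixes.
hit_rate = (sum(r["cache_hits"] for r in requests)
            / sum(r["cache_lookups"] for r in requests))

print(f"mean TTFT: {mean_ttft * 1000:.0f} ms")   # compare against the latency SLO
print(f"KV-cache hit rate: {hit_rate:.1%}")
```

Tracking these per model and per route is what lets a platform team notice, for example, that one request class is missing the cache and blowing its latency SLO while aggregate GPU utilization still looks healthy.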

Model optimization toolkit

Compress and quantize models to reduce resource constraints

Optimize your choice of foundation or custom models with a diverse model toolkit. Use techniques like quantization or sparsity to reduce hardware requirements and lower inference costs.
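To show what quantization actually does to a weight tensor, here is a minimal symmetric int8 sketch in plain Python. This is the kind of transformation tools like LLM Compressor automate at model scale; it is not that tool's API, and the weight values are made up.

```python
# Minimal sketch of symmetric int8 weight quantization (illustration only).
weights = [0.82, -1.37, 0.05, 2.41, -0.66]   # a toy slice of an fp32 weight tensor

scale = max(abs(w) for w in weights) / 127   # map the largest magnitude to int8 range
quantized = [round(w / scale) for w in weights]   # stored as 1 byte each, not 4
dequantized = [q * scale for q in quantized]      # what inference effectively sees

max_err = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized)
print(f"max round-trip error: {max_err:.4f}")
```

Each weight shrinks from 4 bytes to 1 (plus one shared scale per tensor or group), which is where the memory and bandwidth savings come from; the round-trip error stays small, which is why accuracy can be preserved.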

The toolkit includes tools like LLM Compressor, which applies the latest model compression research to make LLMs smaller, more energy efficient, and faster. This reduces hardware requirements and improves efficiency without sacrificing accuracy.

Beyond its core functionality, LLM Compressor integrates broadly with other tools and platforms. It supports inference within the Hugging Face Transformers ecosystem, enabling accuracy validation pre-deployment. It also interfaces with fine-tuning frameworks, allowing users to maintain sparsity during supervised training.

It helps achieve all of the above while maintaining validation, reproducibility, and control over model behavior before deployment.

Learn more about LLM Compressor

Models-as-a-Service

Manage internal model access with an open, portable strategy

Red Hat AI integrates a managed API gateway that allows AI platform engineers to set up internal Models-as-a-Service (MaaS) capabilities. It provides an open, modular, and vendor-neutral way to deploy and operate models across hybrid cloud environments.

Governed access to models through a centralized MaaS architecture lets you control who can access specific models, enforce policies, and monitor usage across users, applications, and agents. This supports reliable, auditable, and policy-driven model consumption at scale.

With easier ways to consume AI models and GPU resources, developers can streamline access to API endpoints, and platform engineers can control, govern, and monitor access and consumption for their high-performing, self-hosted models.
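From a developer's seat, consuming an internal MaaS model can look like any OpenAI-compatible chat call, with the gateway handling authentication, policy, and metering in between. The endpoint URL, model name, and token below are placeholders; the /v1/chat/completions shape is the common OpenAI-compatible convention that vLLM-backed services typically expose.

```python
# Hypothetical client call against an internal MaaS endpoint (URL, model name,
# and token are placeholders, not real Red Hat values).
import json
from urllib.request import Request

ENDPOINT = "https://maas.example.internal/v1/chat/completions"  # placeholder URL
payload = {
    "model": "granite-3.1-8b-instruct",  # whichever model the platform team published
    "messages": [{"role": "user", "content": "Summarize our returns policy."}],
    "max_tokens": 256,
}
req = Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer <team-api-key>",  # the gateway checks access here
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; the gateway authenticates the
# caller, applies policy, and meters usage before routing to the model.
```

The design point is that the application never knows (or cares) which GPU pool or cluster serves the request; access control and usage accounting live at the gateway, not in every client.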

When paired with an inference stack that supports the unpredictable demand and scale of models and agents, an open strategy to manage model access provides a strong foundation for agentic AI, fine-tuning, and AI at scale.

Red Hat AI model catalog

Choose a gen AI model from our validated collection

Use any gen AI model or choose from our optimized collection of open source, third-party models, validated to run efficiently across the Red Hat AI platform.

Red Hat AI model validation is done using open source tooling such as GuideLLM, Language Model Evaluation Harness, and vLLM. This supports reproducibility for customers and ensures models are validated, trusted, and consistently deployed across environments.

Your vendors are your choice

We work with software and hardware vendors and open source communities to offer a holistic AI solution. 

Access partner products and services that are tested, supported, and certified to perform with our technologies.

Dell Technologies
Lenovo
Intel
Nvidia
AMD

What's next?

Try it


Buy it


Get up and running


Talk to a Red Hatter

1. Forrester Consulting study, commissioned by Red Hat. "Forrester Total Economic Impact™ Of Red Hat AI." February 2026.