Red Hat AI solutions and documentation
A platform of solutions and services for developing and deploying AI across the hybrid cloud.
Red Hat AI Enterprise
Create, develop, and deploy AI-powered applications across the hybrid cloud.
Red Hat AI Inference Server
Optimize model performance with vLLM and run inference faster, more cost-effectively, and at scale.
Red Hat Enterprise Linux AI
Develop, test, and run generative AI models with optimized inference capabilities.
Red Hat OpenShift AI
Create and deploy AI-enabled models and applications at scale across hybrid environments.
Use case
Fast, efficient inference with Red Hat AI
When you optimize inference, your models get faster, smarter, and more reliable.
Inference is at the core of generative AI. But as models grow more complex, inference gets slower and harder to operate.
To run inference at scale, models need a lot of storage, memory, and compute power, which can consume the majority of your budget. And the rapid adoption of agentic AI pushes compute demands even higher.
Red Hat® AI optimizes inference to help you stay cost effective, allow your teams to scale, and reliably support agentic AI.
Choose any model, on any accelerator, in any cloud environment.
Maximize existing infrastructure to reduce cost-per-token and increase throughput.
Scale dynamically with intelligent distributed inference and insight into unpredictable demand.
What you can do
Red Hat AI supports fast, consistent, and cost-effective inference at scale. Driven by open source technologies like vLLM and llm-d, it offers the flexibility to scale across the hybrid cloud with the model and accelerator of your choice.
Deploy and scale across the hybrid cloud
Maintain operational consistency across different hardware accelerators (GPUs, TPUs) and run models on premises, in the cloud, or at the edge.
Choose your models and accelerators
Choose any combination of models and hardware accelerators with a consistent operational experience. Build a unified Model-as-a-Service architecture without rebuilding your whole stack.
Compress and quantize models of any size
Reduce compute utilization and its related costs while maintaining high model response accuracy.
Increase throughput and reduce cost-per-token
Maximize your existing infrastructure with vLLM and llm-d. By making better use of available resources, you get the low latency and high throughput needed to run cost-effective inference and agents at scale.
Manage the end-to-end model lifecycle
Build with familiar tools and frameworks on a single, centralized platform with a Kubernetes core.
Ensure reliable operation at scale
All inference workloads are governed through controlled access, policy enforcement, and observability.
Models-as-a-Service with Red Hat AI
Find out more about Models-as-a-Service deployments that are scalable, open, and cost-efficient by design.
233% ROI with Red Hat AI
A Forrester Consulting study, commissioned by Red Hat, found that a composite organization—based on current Red Hat AI customers—realized an ROI of 233% by deploying Red Hat AI.1
Learn how it works
Red Hat AI offers flexible, open source-powered deployment options to deliver efficient, cost-effective, and controlled inference across models, agents, and applications.
Video: AI model inference with Red Hat AI | Red Hat Explains (duration 4:19).
Features
Red Hat AI offers exceptional control over models, agents, and hardware to improve inference at scale.
Maximize throughput and GPU utilization
vLLM is an inference engine designed to maximize throughput and accelerate response times across hardware accelerators. It uses the PagedAttention algorithm to optimize GPU utilization and speed up the output of generative AI applications.
Use vLLM to optimize the deployment of any gen AI model, on any AI accelerator, while maintaining controlled and predictable inference behavior in production environments.
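For a concrete picture of what this looks like, here is a minimal sketch using vLLM's offline Python API. The model name and sampling settings are placeholders rather than a Red Hat recommendation, and the snippet assumes a machine with a supported accelerator and vLLM installed.

```python
# Minimal vLLM sketch (assumes `pip install vllm` and a supported accelerator).
# The model name and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of optimized inference in one sentence.",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches requests and manages KV-cache memory with PagedAttention
# under the hood; no extra configuration is needed for that.
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

The same model can also be exposed over an OpenAI-compatible HTTP API with the vllm serve command for applications that consume it over the network.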
Speed up distributed inference at scale
llm-d is a Kubernetes-native, open source framework that speeds up distributed LLM inference at scale.
When a model receives complex queries over large amounts of data, llm-d distributes the work so processing stays fast. Its accessible, modular architecture supports this scaling while maintaining consistency, control, and governance across distributed workloads.
Get insights to meet strict service level objectives (SLOs)
Use metrics and insights from models in production to find out where and how your models can improve. See model-specific performance metrics like time-to-first-token (TTFT), KV-cache hit rate, and GPU utilization. Use these metrics to monitor performance, detect anomalies, and help inference meet operational, security, and policy requirements.
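As a rough illustration of how such metrics can be inspected, the sketch below reads a Prometheus-style /metrics endpoint like the one a vLLM server exposes. The URL and metric names are assumptions that vary by deployment; in practice these values are usually scraped by Prometheus and visualized or alerted on rather than polled by hand.

```python
# Hypothetical SLO spot-check against a Prometheus-style /metrics endpoint.
# The URL and metric names are illustrative; real deployments typically
# scrape these with Prometheus and alert on them instead.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # placeholder endpoint
WATCHED_PREFIXES = (
    "vllm:time_to_first_token_seconds",  # TTFT histogram
    "vllm:gpu_cache_usage_perc",         # KV-cache usage
)

with urllib.request.urlopen(METRICS_URL) as response:
    text = response.read().decode("utf-8")

# Print only the metric lines relevant to latency and cache-related SLOs.
for line in text.splitlines():
    if line.startswith(WATCHED_PREFIXES):
        print(line)
```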
Compress and quantize models to reduce resource constraints
Optimize your choice of foundational or custom models with a diverse model toolkit. Use techniques like quantization or sparsity to reduce hardware requirements and lower inference costs.
The toolkit includes tools like LLM Compressor, which applies the latest model compression research to make LLMs smaller, more energy efficient, and faster. This reduces hardware requirements and improves efficiency without sacrificing accuracy.
Beyond its core functionality, LLM Compressor integrates broadly with other tools and platforms. It supports inference within the Hugging Face Transformers ecosystem, so accuracy can be validated before deployment, and it interfaces with fine-tuning frameworks so sparsity can be maintained during supervised training.
It helps achieve all of the above while maintaining validation, reproducibility, and control over model behavior before deployment.
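As an illustration of the kind of workflow involved, the sketch below applies one-shot 4-bit weight quantization in the style of the LLM Compressor examples. The model name, calibration dataset, and recipe are placeholders, and the exact API can differ between llmcompressor versions.

```python
# Sketch of one-shot 4-bit weight quantization with LLM Compressor
# (pip install llmcompressor). Model name, dataset, and recipe are
# illustrative, and the API may differ between llmcompressor versions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",      # quantize all Linear layers...
    scheme="W4A16",        # ...to 4-bit weights, 16-bit activations
    ignore=["lm_head"],    # keep the output head at full precision
)

oneshot(
    model="ibm-granite/granite-3.1-8b-instruct",  # placeholder model
    dataset="open_platypus",                      # placeholder calibration set
    recipe=recipe,
    output_dir="granite-3.1-8b-instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint can then be loaded by vLLM like any other model, with lower memory requirements.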
Manage internal model access with an open, portable strategy
Red Hat AI integrates a managed API gateway that lets AI platform engineers set up internal models-as-a-service (MaaS) capabilities. It provides an open, modular, and vendor-neutral way to deploy and operate models across hybrid cloud environments.
Governed access to models through a centralized MaaS architecture lets you control who can access specific models, enforce policies, and monitor usage across users, applications, and agents. This supports reliable, auditable, and policy-driven model consumption at scale.
With easier ways to consume AI models and GPU resources, developers can streamline access to API endpoints, and platform engineers can control, govern, and monitor access and consumption for their high-performing, self-hosted models.
When paired with an inference stack that supports the unpredictable demand and scale of models and agents, an open strategy to manage model access provides a strong foundation for agentic AI, fine-tuning, and AI at scale.
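To make the consumption side concrete, the sketch below shows a hypothetical client call to an internal, OpenAI-compatible endpoint published behind such a gateway. The gateway URL, credential variable, and model alias are illustrative assumptions; authentication, quotas, and usage tracking would be enforced by the gateway itself.

```python
# Hypothetical client call to an internal, OpenAI-compatible MaaS endpoint
# (pip install openai). The gateway URL, API key variable, and model alias
# are placeholders; access control and usage tracking happen at the gateway.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://maas.example.internal/v1",  # internal gateway (placeholder)
    api_key=os.environ["INTERNAL_MAAS_API_KEY"],  # per-team credential (placeholder)
)

response = client.chat.completions.create(
    model="granite-3-1-8b-instruct",  # model alias published by the gateway
    messages=[{"role": "user", "content": "Draft a one-line status update."}],
)
print(response.choices[0].message.content)
```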
Choose a gen AI model from our validated collection
Use any gen AI model or choose from our optimized collection of open source, third-party models, validated to run efficiently across the Red Hat AI platform.
Red Hat AI model validation is done using open-source tooling such as GuideLLM, Language Model Evaluation Harness, and vLLM. This supports reproducibility for customers and ensures models are validated, trusted, and consistently deployed across environments.
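As a sketch of what such an evaluation step can look like, the snippet below runs a small accuracy check with the Language Model Evaluation Harness Python API, using vLLM as the inference backend. The model, task, and arguments are illustrative, and the API can vary between lm-eval versions.

```python
# Illustrative accuracy check with the Language Model Evaluation Harness
# (pip install lm-eval). Model, task, and arguments are placeholders,
# and the exact API can vary between lm-eval versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # use vLLM as the inference backend
    model_args="pretrained=ibm-granite/granite-3.1-8b-instruct,dtype=auto",
    tasks=["gsm8k"],
    num_fewshot=5,
)

# Print per-task metrics (e.g. exact-match accuracy for gsm8k).
for task, metrics in results["results"].items():
    print(task, metrics)
```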
Your vendors are your choice
We work with software and hardware vendors and open source communities to offer a holistic AI solution.
Access partner products and services that are tested, supported, and certified to perform with our technologies.
What's next?
Try it
Buy it
Get up and running
Talk to a Red Hatter
1. Forrester Consulting study, commissioned by Red Hat. "Forrester Total Economic Impact™ Of Red Hat AI." February 2026.