Skip to content AI
  • Overview

    • AI news
    • Technical blog
    • Live AI events
    • Inference explained
    • See our approach
  • Products

    • Red Hat AI Enterprise
    • Red Hat AI Inference
    • Red Hat Enterprise Linux AI
    • Red Hat OpenShift AI
    • Explore Red Hat AI
  • Engage & learn

    • Learning hub
    • AI topics
    • AI partners
    • Services for AI
Hybrid cloud
  • Platform solutions

    • Artificial intelligence

      Build, deploy, and monitor AI models and apps.

    • Linux standardization

      Get consistency across operating environments.

    • Application development

      Simplify the way you build, deploy, and manage apps.

    • Automation

      Scale automation and unite tech, teams, and environments.

  • Use cases

    • Virtualization

      Modernize operations for virtualized and containerized workloads.

    • Digital sovereignty

      Control and protect critical infrastructure.

    • Security

      Code, build, deploy, and monitor security-focused software.

    • Edge computing

      Deploy workloads closer to the source with edge technology.

  • Explore solutions
  • Solutions by industry

    • Automotive
    • Financial services
    • Healthcare
    • Industrial sector
    • Media and entertainment
    • Public sector (Global)
    • Public sector (U.S.)
    • Telecommunications

Discover cloud technologies

Learn how to use our cloud products and solutions at your own pace in the Red Hat® Hybrid Cloud Console.

Products
  • Platforms

    • Red Hat AI iconartificial intelligence, Red Hat Enterprise Linux AI, Red Hat OpenShift AI, RHEL AI, machine learning38382025-03-12T19:43:40.963Zimage/svg+xmlRed Hat AI iconartificial intelligence, Red Hat Enterprise Linux AI, Red Hat OpenShift AI, RHEL AI, machine learningIconno2025-03-12T19:39:59.817ZTechnology iconStandardRed Hat AI

      Develop and deploy AI solutions across the hybrid cloud.

    • Red Hat Enterprise Linux iconRHEL, Linux platforms, CentOS2024-03-01T15:26:42.958ZpendingTRA3b65dd25-844d-49bb-93c1-30f5b34684f1Icon2024-03-01T15:26:42.958Ztruepending2024-03-21T00:40:29.326Zrhcc-audience:internalnoTechnology iconDER3b65dd25-844d-49bb-93c1-30f5b34684f1Standardyesrhcc-product:red-hat-enterprise-linuxTechnology iconimage/svg+xml2024-05-10T14:11:29.114ZRed Hat Enterprise Linux iconRHEL, Linux platforms, CentOSActivateActivate2024-05-10T14:11:29.836Zworkflow-process-serviceActivateworkflow-process-servicefalse2024-05-10T14:11:29.836Zworkflow-process-service2024-05-10T14:11:29.836ZUse technology icons to represent Red Hat products and components. Do not remove the icon from the bounding shape.Red Hat Enterprise Linux

      Support hybrid cloud innovation on a flexible operating system.

    • Red Hat OpenShift iconCloud, Containers, Kubernetes2024-03-01T15:26:53.684ZpendingTRA9ec76aa9-ef09-4c49-8816-01dd13970ca7Icon2024-03-01T15:26:53.684Ztruepending2024-03-21T00:39:44.126Zrhcc-audience:internalnoTechnology iconDER9ec76aa9-ef09-4c49-8816-01dd13970ca7Standardyesrhcc-product:red-hat-openshiftrhcc-product:red-hat-openshift-on-ibm-cloudrhcc-product:microsoft-azure-red-hat-openshiftrhcc-product:red-hat-openshift-service-on-awsrhcc-product:red-hat-openshift-container-platformrhcc-product:red-hat-openshift-platform-plusTechnology iconimage/svg+xml2024-05-10T14:18:23.703ZRed Hat OpenShift iconCloud, Containers, KubernetesActivateActivate2024-05-10T14:18:25.221Zworkflow-process-serviceActivateworkflow-process-servicefalse2024-05-10T14:18:25.221Zworkflow-process-service2024-05-10T14:18:25.221ZUse technology icons to represent Red Hat products and components. Do not remove the icon from the bounding shape.Red Hat OpenShift

      Build, modernize, and deploy apps at scale.

    • Red Hat Ansible Automation Platform iconManagement, edge2024-03-01T15:26:35.068ZpendingTRA759b57c4-760b-45a0-a939-821f47181964Icon2024-03-01T15:26:35.068Ztruepending2024-03-21T00:39:55.923Zrhcc-audience:internalnoTechnology iconDER759b57c4-760b-45a0-a939-821f47181964Standardyesrhcc-product:red-hat-ansible-automation-platformTechnology iconimage/svg+xml2024-05-10T14:04:00.014ZRed Hat Ansible Automation Platform iconManagement, edgeActivateActivate2024-05-10T14:04:01.784Zworkflow-process-serviceActivateworkflow-process-servicefalse2024-05-10T14:04:01.784Zworkflow-process-service2024-05-10T14:04:01.784ZUse technology icons to represent Red Hat products and components. Do not remove the icon from the bounding shape.Red Hat Ansible Automation Platform

      Implement enterprise-wide automation.

      New version
  • Featured

    • Red Hat AI Enterprise
    • Red Hat OpenShift Virtualization Engine
    • Red Hat Desktop
    • See all products
  • Try & buy

    • Start a trial
    • Buy online
    • Integrate with major cloud providers
  • Services & support

    • Consulting
    • Product support
    • Services for AI
    • Technical Account Management
    • Explore services
Training
  • Training & certification

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Featured

    • Red Hat Certified System Administrator exam
    • Red Hat System Administration I
    • Red Hat Learning Subscription trial (No cost)
    • Red Hat Certified Engineer exam
    • Red Hat Certified OpenShift Administrator exam
  • Services

    • Consulting
    • Partner training
    • Product support
    • Services for AI
    • Technical Account Management
Learn
  • Build your skills

    • Documentation
    • Hands-on labs
    • Hybrid cloud learning hub
    • Interactive demos
    • Training and certification
  • More ways to learn

    • Blog
    • Events and webinars
    • Podcasts and video series
    • Red Hat TV
    • Resource library

For developers

Discover resources and tools to help you build, deliver, and manage cloud-native applications and services.

Partners
  • For customers

    • Our partners
    • Red Hat Ecosystem Catalog
    • Find a partner
  • For partners

    • Partner Connect
    • Become a partner
    • Training
    • Support
    • Access the partner portal

Build solutions powered by trusted partners

Find solutions from our collaborative community of experts and technologies in the Red Hat® Ecosystem Catalog.

ConsoleDocsSupport Search

I'd like to:

  • Start a trial
  • Buy a learning subscription
  • Manage subscriptions
  • Contact sales
  • Contact customer service
  • See Red Hat jobs

Help me find:

  • Documentation
  • Developer resources
  • Tech topics
  • Architecture center
  • Security updates
  • Customer support

I want to learn more about:

  • AI
  • Application modernization
  • Automation
  • Cloud-native applications
  • Linux
  • Virtualization
New For you

Recommended

We'll recommend resources you may like as you browse. Try these suggestions for now.

  • Product trial center
  • Courses and exams
  • All products
  • Tech topics
  • Resource library
Log in

Get more with a Red Hat account

  • Console access
  • Event registration
  • Training & trials
  • World-class support

A subscription may be required for some services.

Log in or register
Contact us
Red Hat logo
  • Home
  • Resources
  • Red Hat AI Inference for production AI and agentic workloads

Red Hat AI Inference for production AI and agentic workloads

May 12, 2026•
Resource type: Overview
Download PDF

The operational reality of scaling AI

AI inference serves as the essential foundation for agentic AI. Autonomous agents perform multiple inference runs to plan, use tools, reason, and execute complex workflows in real time. Generative AI (gen AI) and these sophisticated workloads are fundamentally shifting enterprise infrastructure requirements. Unlike traditional cloud-native applications, large language model (LLM) inference is highly stateful and relies on nonuniform prompts. As organizations continue to turn AI-powered capabilities and agents into products, the volume of these compute requests scales exponentially. This compounding challenge often leads to unpredictable infrastructure costs and severe operational bottlenecks. According to industry reports, inference now represents the majority of AI operating spend, with industry forecasts projecting it will account for 75% of AI compute demand by 2030.1

To build a sustainable AI strategy, enterprises require the flexibility to collaborate with a diverse ecosystem of hardware, models, and cloud partners. Limiting deployment options restricts the ability to effectively maximize inference capacity, while simultaneously constraining the flexibility to deploy varied types of models and benefit from different graphics processing unit (GPU) tiers.

As user demand grows and agentic workflows multiply, organizations struggle to meet performance service-level agreements (SLAs) or deploy AI where it is needed most. Enterprises increasingly require the flexibility to run inference close to users for low-latency interactions, supporting real-time agentic reasoning, or within highly constrained environments. This local execution is vital for processing data tied to strict regulations or protecting proprietary information that preserves a company's domain expertise. Without this architectural freedom, the resulting fragmented infrastructure landscape limits the ability to efficiently support growing gen AI and agentic capabilities. 

Overview highlights

Optimize your existing infrastructure, control escalating costs, and serve AI models as a shared, centrally managed utility with the low latency required for agentic architectures.

Use advanced compute efficiency with distributed inference and optimization techniques to maximize accelerators.

Gain hybrid cloud flexibility by decoupling AI applications from specific infrastructure, allowing operational consistency between varied hardware, models and environments.

What organizations need now

Organizations are increasingly adopting an internal Model-as-a-Service (MaaS) pattern to solve these operational challenges and implement a reliable foundation for agentic AI. This operational model allows central IT to host, manage, and serve optimized AI models through standardized application programming interfaces (APIs). By treating models as shared, governed utilities, teams can standardize their AI consumption, reduce infrastructure costs, and gain the flexibility to meet a wide variety of gen AI and agentic use cases with control.

To implement this pattern, organizations need a highly scalable and flexible inference engine to run AI profitably and efficiently. Solutions should provide capabilities to make the most of both individual accelerators and the available infrastructure. They also require model optimization tools to reduce compute requirements and extract even more value from their current resources.

Targeted observability into gen AI-specific metrics helps teams maintain performance standards and track resource use. Furthermore, access to preoptimized, validated models helps accelerate deployment timelines and empowers developers to build faster.

Enterprise AI inference engine

Red Hat® AI Inference is an enterprise AI inference engine designed to power models across diverse environments. It provides a unified, hardware-agnostic platform to manage, orchestrate, and optimize AI workloads, acting as the core engine to deliver a flexible, private MaaS experience and a reliable foundation for agentic AI.

Solution benefits

  • Reduce AI infrastructure costs by boosting inference capacity and sharing resources efficiently across development teams.
  • Deliver hardware-agnostic AI performance across a wide ecosystem of accelerators, private datacenters, and public clouds.
  • Scale AI inference with efficient, distributed routing across broader infrastructure and targeted gen AI telemetry.

Improve efficiency

Enterprises can get more from hardware investments and reduce AI infrastructure costs by treating models as a shared, on-demand utility. Adopting a centralized MaaS strategy allows IT teams to serve models efficiently, decreasing fragmented and underutilized hardware. This approach helps organizations make the most of their existing compute resources.

The platform uses an optimized vLLM enterprise architecture to deliver fast and cost-effective inference and offers a model optimization toolkit to compress both foundation and customized models using techniques like quantization and sparsity. This combined approach lowers the underlying compute requirements while maintaining response accuracy for complex tasks. Organizations optimizing models with these methods have observed significant reductions in compute hours, with customers seeing up to 40% in cost savings while preserving baseline accuracy.2 These capabilities boost the performance of individual accelerators, and llm-d’s distributed inference compounds these benefits by efficiently distributing the inference load across the available fleet of GPUs.

Scale with control

Organizations can scale AI inference operations confidently by establishing an efficient foundation for their MaaS strategy and agentic architecture. By optimizing individual accelerators and their broader infrastructure, the engine helps run models at scale. It integrates llm-d’s inference-aware routing and disaggregated serving to balance traffic, manage capacity, and orchestrate reasoning models efficiently. This fleet-wide orchestration helps manage the rapid, continuous compute requests generated by agentic AI loops, with testing showing the ability to sustain up to twice the baseline of queries per second (QPS) under service-level objective constraints.3

Platform engineers gain operational insights through gen AI specific telemetry. Teams can track time-to-first-token, key-value (KV)-cache hit rates, and overall inference capacity alongside traditional central processing unit (CPU) and memory usage. These gen AI specific metrics can integrate with existing tools like Prometheus and Grafana, providing the data IT needs to monitor usage and manage capacity effectively.

Run anywhere

Organizations can maintain deployment flexibility with hardware-agnostic AI capabilities that operate across hybrid cloud environments. They can execute workloads on various platforms, spanning datacenters, edge locations, and major public clouds. This architectural freedom supports strong collaboration with a diverse ecosystem of hardware and cloud partners, equipping enterprises to flexibly meet a wide variety of business requirements.

The engine is designed to offer operational consistency across models, accelerators, and environments. By decoupling AI applications from the underlying infrastructure, enterprises can transition between different accelerators, models, and hardware configurations as their needs evolve, while optimizing inference and serving models with a common set of capabilities. Organizations can dynamically adapt their hybrid cloud AI strategy based on resource availability, hardware and model advancements, and pricing variations.

Build on an open foundation

Organizations can achieve goals faster with an enterprise AI inference platform built on trusted open source innovation. The solution includes a curated catalog of validated, containerized, and versioned open models ready for immediate deployment. AI and machine learning (ML) developers and engineers can bypass lengthy model optimization and validation cycles and begin building applications in less time.

The inference platform natively integrates with Red Hat OpenShift® and supports third-party Kubernetes environments to fit existing operational workflows. By relying on established open standards, teams can benefit from continuous, community-driven performance enhancements. This open approach provides the stability enterprise IT requires without sacrificing the pace of open source AI innovation.

Red Hat AI Inference capabilities

The platform provides a suite of capabilities to operationalize and manage complex gen AI and agentic workloads effectively.

  • Hardware and model agnostic execution: Supports diverse models and accelerators, including various GPUs, tensor processing units (TPUs), and specialized neural processors through optimized vLLM integration.
  • Fleet-wide orchestration: Uses llm-d to distribute inference requests, balance loads, and manage multinode scaling across Kubernetes clusters to handle bursts of agentic traffic.
  • Gen AI observability: Monitors throughput, latency, and hardware utilization, integrating natively with an existing monitoring infrastructure.
  • Model optimization toolkit: Offers model optimization techniques like quantization and sparsity to help models run more efficiently, use fewer resources, and lower operational costs.
  • Curated model repository: Offers access to third-party validated and optimized open source models to accelerate development.
  • Flexible deployment: Operates across Red Hat OpenShift and other enterprise Kubernetes environments for broad architectural compatibility.

Proof and credibility

The underlying architecture, grounded on vLLM and llm-d, consistently demonstrates strong performance in rigorous industry benchmarking. In MLPerf Inference v6.0 testing, a well-recognized performance benchmark in the AI field, Red Hat AI received the number 1 global throughput ranking for complex speech-to-text and vision workloads. The optimized engine:4 

  • Delivered 13% faster speech-to-text responses compared to competing setups using identical hardware. 
  • Outperformed newer B300 benchmarks by 50% using an optimized stack on B200 accelerators for vision tasks. 
  • Orchestrated 120B+ parameter reasoning models via intelligent llm-d routing, maintaining sub-3-second latency for live-agent interactions.

Experience high performance

Scaling agentic and gen AI into cost-effective production requires a strong operational foundation. Explore how this inference engine helps organizations regain control over their infrastructure and increase flexibility.

Explore the path to high-performance inference at scale

  • Visit the Red Hat AI Inference page

Test run Red Hat AI inference capabilities

  • Start a 60-day, no-cost trial to experience high-performance model serving
  1. “Artificial Intelligence Index Report 2025.” Stanford University Institute for Human-Centered Artificial Intelligence (HAI), 2025.

  2. Red Hat Blog. “Unleash the full potential of LLMs: Optimize for performance with vLLM,” February 2025.

  3. Red Hat. “llm-d: Kubernetes-native distributed inferencing,” May 2025.

  4. Red Hat press release. “Red Hat AI tops MLPerf Inference v6.0 with vLLM on Qwen3-VL, Whisper, and GPT-OSS-120B,” 1 April 2026.

Tags:Artificial intelligence

Red Hat logo

About Red Hat

Red Hat is the open hybrid cloud technology leader, delivering a trusted, consistent and comprehensive foundation for transformative IT innovation and AI applications. Its portfolio of cloud, developer, AI, Linux, automation and application platform technologies enables any application, anywhere—from the datacenter to the edge. As the world's leading provider of enterprise open source software solutions, Red Hat invests in open ecosystems and communities to solve tomorrow's IT challenges. Collaborating with partners and customers, Red Hat helps them build, connect, automate, secure, and manage their IT environments, supported by consulting services and award-winning training and certification offerings.

  • North America
  • Asia Pacific
  • Latin America
  • Europe, Middle East, and Africa
  • 888-REDHAT1
  • +6564904200
  • +5443297300
  • +0080073342835
  • www.redhat.com
  • apace@redhat.com
  • info-latam@redhat.com
  • europe@redhat.com
  • @red-hat
  • @redhat
  • @redhat
  • @red_hat

Copyright © 2026 Red Hat. Red Hat, the Red Hat logo, Ansible, and OpenShift are trademarks or registered trademarks of Red Hat, LLC or its subsidiaries in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. The OPENSTACK logo and word mark are trademarks or registered trademarks of OpenInfra Foundation, used under license. All other trademarks are the property of their respective owners.

Red Hat logoLinkedInYouTubeFacebookXInstagram

Platforms

  • Red Hat AI
  • Red Hat Enterprise Linux
  • Red Hat OpenShift
  • Red Hat Ansible Automation Platform
  • See all products

Tools

  • Training and certification
  • My account
  • Customer support
  • Developer resources
  • Find a partner
  • Red Hat Ecosystem Catalog
  • Documentation

Try, buy, & sell

  • Product trial center
  • Red Hat Store
  • Buy online (Japan)
  • Console

Communicate

  • Contact sales
  • Contact customer service
  • Contact training
  • Social

About Red Hat

Red Hat is an open hybrid cloud technology leader, delivering a consistent, comprehensive foundation for transformative IT and artificial intelligence (AI) applications in the enterprise. As a trusted adviser to the Fortune 500, Red Hat offers cloud, developer, Linux, automation, and application platform technologies, as well as award-winning services.

  • Our company
  • How we work
  • Customer success stories
  • Analyst relations
  • Newsroom
  • Open source commitments
  • Our social impact
  • Jobs

Change page language

Red Hat legal and privacy links

  • About Red Hat
  • Jobs
  • Events
  • Locations
  • Contact Red Hat
  • Red Hat Blog
  • Inclusion at Red Hat
  • Cool Stuff Store
  • Red Hat Summit
© 2026 Red Hat

Red Hat legal and privacy links

  • Privacy statement
  • Terms of use
  • All policies and guidelines
  • Digital accessibility