Agentic AI doesn’t just move AI forward; it upends the infrastructure built for traditional inference. Agentic AI systems, which reason, plan, use tools, and execute multistep tasks autonomously, are rapidly moving from research into production.

This shift is not just about more compute. It fundamentally changes how infrastructure must perform, scale, and be optimized for continuous, multistage reasoning workflows.

Agents are fundamentally different workloads: they call models repeatedly, fan out across tools and data sources, and need to run continuously and cost-effectively.

That demands a new kind of infrastructure, one that spans high-performance GPUs for complex reasoning, CPUs for lighter-weight inference tasks, and orchestration and intelligent scheduling to tie it all together.

Together, AMD and Red Hat are enabling this transition, delivering an open, high-performance infrastructure foundation designed to help enterprises operationalize agentic AI at scale. 

High-throughput AI inference with AMD Instinct MI355X and Red Hat AI 3.4

Agentic workloads are inference-intensive by nature. Every reasoning step, every tool call, every decision an agent makes is a model inference. When agents orchestrate complex workflows (retrieving documents, generating code, validating outputs), they can issue dozens of inference calls per task. That makes GPU throughput and memory capacity critical.

Red Hat AI 3.4 now supports inference on AMD Instinct™ MI355X, AMD’s flagship data center GPU built on the AMD CDNA™ 4 architecture. With 288 GB of HBM3E memory and 8 TB/s of bandwidth, it can serve today’s largest open source models, the frontier models behind advanced agent reasoning, using fewer accelerators and lowering total cost of ownership. Built on TSMC 3nm process technology and delivering up to 5 PF of FP16 performance, it provides the throughput needed to scale agentic workloads efficiently. 

With ROCm 7 fully integrated into Red Hat AI, organizations can deploy AMD Instinct MI355X-accelerated inference on Red Hat OpenShift using the AMD GPU Operator for streamlined Day-0 configuration. Expanded datatype support for MXFP6 and MXFP4 enables more efficient quantized model serving, helping teams run more agents per GPU without sacrificing the reasoning quality that makes agents useful.

In practice, this means enterprises can run larger models with fewer accelerators, reducing infrastructure footprint, lowering power consumption, and improving cost per inference. For agentic AI, where workflows can generate dozens of model calls per task, this efficiency directly translates into faster response times and lower cost per interaction. 
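As a rough illustration of why memory capacity and low-precision datatypes matter, the sketch below estimates the weight-only footprint of a model at different precisions and the minimum number of accelerators needed to hold it. The 288 GB capacity comes from the text above; the 405-billion-parameter model size is a hypothetical example, the MXFP bit widths assume the OCP Microscaling layout (one 8-bit shared scale per 32-element block), and KV cache, activations, and runtime overhead are deliberately ignored, so real deployments need more headroom.

```python
import math

# Effective bits per weight. MXFP formats are assumed to follow the OCP
# Microscaling layout: 32 elements share one 8-bit scale (+0.25 bit each).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "MXFP6": 6.0 + 8 / 32,
    "MXFP4": 4.0 + 8 / 32,
}

def weight_footprint_gb(params_billion: float, fmt: str) -> float:
    """Weight-only footprint in GB (10^9 bytes); ignores KV cache and activations."""
    # (params_billion * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * BITS_PER_WEIGHT[fmt] / 8

def min_accelerators(params_billion: float, fmt: str, hbm_gb: float = 288.0) -> int:
    """Smallest accelerator count whose combined HBM holds the weights."""
    return math.ceil(weight_footprint_gb(params_billion, fmt) / hbm_gb)

if __name__ == "__main__":
    # 405 is a hypothetical model size in billions of parameters.
    for fmt in BITS_PER_WEIGHT:
        gb = weight_footprint_gb(405, fmt)
        print(f"{fmt:>5}: {gb:7.1f} GB -> {min_accelerators(405, fmt)} x 288 GB")
```

Under these assumptions, the weights of the hypothetical 405-billion-parameter model need three 288 GB accelerators at FP16 but fit on one at MXFP4.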

Expanding AI deployment flexibility with AMD Instinct MI350P PCIe Card (preview)

Red Hat AI adds preview support for the AMD Instinct MI350P PCIe® card accelerator, bringing AMD CDNA™ 4 architecture to standard data center server infrastructure. Designed in a dual-slot PCIe form factor with a configurable 600W total board power (TBP) and passive air cooling, these cards integrate into enterprise data centers without a major overhaul of power and cooling.

The AMD Instinct MI350P PCIe card supports traditional 16-bit and 8-bit formats as well as MXFP6 and MXFP4 for efficient model serving. With 144 GB of HBM3E memory and up to 4.0 TB/s of peak memory bandwidth, it enables high-performance inference in environments where open accelerator module (OAM)-based systems aren't practical. Enterprises of any size can add cutting-edge GPU technology to their infrastructure and efficiently run workloads based on small to large models.

The AMD Instinct MI350P PCIe cards enable enterprises to expand their AI adoption across multiple workloads in a manageable, cost-effective, and scalable way. These accelerator cards add deployment flexibility for agentic AI, allowing organizations to run inference closer to data, reduce latency, and support real-time, distributed agent workflows without rearchitecting existing infrastructure.

AMD Instinct MI350P PCIe Card


Distributed inference and workload placement for agentic AI

Agentic AI involves coordinating the right infrastructure for each part of the workflow. Red Hat AI provides Kubernetes-native capabilities for deploying, scaling, monitoring, and managing AI workloads across accelerator-backed and CPU-based infrastructure. Platform teams can configure accelerator profiles, enable AMD GPU-backed model serving, define project-level hardware resources, and use OpenShift scheduling, queue quota, and workload-management capabilities to place workloads on the appropriate infrastructure.

For model serving, Red Hat AI supports KServe and vLLM-based runtimes on AMD GPUs, including AMD Instinct accelerators, while CPU-based infrastructure can be used for orchestration, retrieval, preprocessing, routing, and lighter-weight inference tasks. OpenShift AI also includes llm-d for distributed AI inference at scale. llm-d extends vLLM-based serving with Kubernetes-native distributed inference capabilities such as prefill and decode disaggregation and KV-cache-aware routing. Together with AMD Instinct GPUs and AMD EPYC™ CPUs, OpenShift AI and llm-d give enterprises a practical foundation for agentic AI platforms that can pair GPU-accelerated reasoning with efficient CPU-based workflow execution.
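To make the idea of KV-cache-aware routing concrete, here is a toy sketch that sends each request to the replica whose cached prompts share the longest token prefix with it, and otherwise falls back to the least-loaded replica. It illustrates the general technique only and is not llm-d's actual scheduler; the `Replica` class, the replica names, and the token lists are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """Hypothetical model-server replica tracked by the router."""
    name: str
    active_requests: int = 0
    cached_prompts: list[list[int]] = field(default_factory=list)

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_cache_hit(replica: Replica, prompt: list[int]) -> int:
    """Longest prefix of `prompt` already cached on this replica."""
    return max((shared_prefix_len(p, prompt) for p in replica.cached_prompts), default=0)

def route(replicas: list[Replica], prompt: list[int]) -> Replica:
    """Prefer the longest cached prefix; break ties by lowest load."""
    chosen = max(replicas, key=lambda r: (best_cache_hit(r, prompt), -r.active_requests))
    chosen.active_requests += 1
    chosen.cached_prompts.append(prompt)
    return chosen

if __name__ == "__main__":
    pool = [Replica("replica-a"), Replica("replica-b")]
    system_prompt = [1, 2, 3, 4]  # stands in for a shared agent system prompt
    first = route(pool, system_prompt + [10, 11])
    second = route(pool, system_prompt + [20, 21])  # reuses the cached system prompt
    third = route(pool, [9, 9, 9])                  # no cache hit: least-loaded replica
    print(first.name, second.name, third.name)
```

Agent workflows tend to resend the same system prompt and tool descriptions on every call, which is why reusing a cached prefix can save a meaningful share of prefill work.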

The result is a system that uses resources more efficiently, reduces unnecessary GPU consumption, and scales AI services without linear cost increases.
 

High-performance vLLM CPU inference with AMD EPYC and AMD ZenDNN

Not every agent call needs a GPU. Agentic systems are inherently composable: a single workflow might route complex reasoning to a large model on a GPU while dispatching simpler tasks, such as classification, extraction, or routing, to smaller models. Running every call through a GPU is inefficient when CPUs can handle these lighter-weight steps effectively. In modern AI systems, CPUs are no longer just supporting infrastructure; they are a critical engine for scalable inference.
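The composable routing described above can be sketched as a small dispatcher that maps each agent step to a GPU-backed or CPU-backed model endpoint. The task names, the endpoint URLs, and the rule that unknown tasks default to the GPU tier are all hypothetical choices for illustration, not part of Red Hat AI.

```python
from enum import Enum

class Tier(Enum):
    GPU = "gpu"
    CPU = "cpu"

# Hypothetical endpoints; substitute the model-serving URLs of your own cluster.
ENDPOINTS = {
    Tier.GPU: "http://large-model.example.internal/v1",
    Tier.CPU: "http://small-model.example.internal/v1",
}

# Lighter-weight step types named in the text go to the CPU tier.
CPU_TASKS = {"classification", "extraction", "routing"}

def pick_tier(task_type: str) -> Tier:
    """Send known lightweight steps to CPU; default everything else to GPU."""
    return Tier.CPU if task_type.lower() in CPU_TASKS else Tier.GPU

def plan(workflow: list[str]) -> list[tuple[str, str]]:
    """Return (step, endpoint) pairs for a multistep agent workflow."""
    return [(step, ENDPOINTS[pick_tier(step)]) for step in workflow]

if __name__ == "__main__":
    for step, url in plan(["classification", "reasoning", "extraction"]):
        print(f"{step:>14} -> {url}")
```

Defaulting unknown steps to the larger model is a conservative choice: it trades some cost for answer quality until a step has been shown to work well on the smaller model.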

Red Hat AI 3.4 introduces vLLM-CPU with the ZenDNN backend, bringing high-performance CPU inference to AMD EPYC processors. ZenDNN delivers tuned kernels and optimized primitives that enable frameworks like PyTorch to run efficiently on AMD EPYC CPUs—unlocking a powerful engine for scalable AI workloads. 

By extending native frameworks with AMD EPYC-optimized graph and operator enhancements, including fused patterns, vectorized execution, and AMD Optimizing CPU Libraries (AOCL) DLP microkernels, ZenDNN, now upstreamed to vLLM, enables zero-code-change acceleration. The result is a plug-and-play path to high-throughput inference on existing infrastructure, with broad compatibility via PyTorch torch.compile and upstream support in vLLM 0.18.0.

[Alt text: Diagram of an AMD EPYC software stack showing vLLM and PyTorch at the top, followed by ZenTorch for graph optimizations and fusions tuned for AMD EPYC, ZenDNN as a library tuned for AMD EPYC, and an AMD EPYC processor image at the bottom.]

With INT8/INT4 quantization and optimized vLLM integration, generative AI (gen AI) models can run efficiently on CPU infrastructure, enabling low-cost, low-power solutions for hybrid workloads that are not strictly latency constrained. This allows AI inference to run alongside general-purpose computing on existing AMD EPYC CPU fleets, improving infrastructure use and reducing the need for dedicated accelerators in every deployment.
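For readers unfamiliar with weight quantization, the following minimal sketch shows symmetric per-tensor INT8 quantization in plain Python: values are divided by a single scale, rounded to 8-bit integer codes, and multiplied back by the scale when used. It illustrates the general idea only and is not the scheme that ZenDNN or vLLM implement; production schemes generally use finer-grained (per-channel or per-group) scales and optimized kernels, and the sample weights here are made up.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the INT8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

if __name__ == "__main__":
    weights = [0.5, -1.27, 0.02, 1.0]  # made-up example values
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, f"scale={scale:.4f}", f"max error={worst:.4f}")
```

Each stored value shrinks from 16 bits to 8 (or 4 for INT4), and the rounding error per value is bounded by half the scale, which is the trade-off that makes quantized models attractive on memory-bandwidth-limited CPU inference.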

CPU inference is especially valuable for enterprise scenarios such as off-peak batch processing, hybrid workloads where AI is only a portion of total compute, and low-barrier AI adoption using existing infrastructure, skills, and air-cooled environments.

 

[Alt text: Infographic table showing five AI workload categories with icons: Opportunistic Bulk Processing for large batch processing with spare cycles; Opportunistic Real Time for latency-sensitive, small-batch inference; Performance for high-performance, cost-efficient compute; Hybrid Workloads for general-purpose workloads with integrated AI capability; and Software Incumbency for deep learning workloads leveraging CPU inference.]

AMD EPYC 9005 Series processors are purpose-built for the kind of parallel, always-on compute that agentic AI demands. With up to 192 cores (384 SMT threads), high memory bandwidth, large cache capacity, stronger core performance, and expansive I/O, they provide the CPU horsepower needed to handle orchestration, retrieval, preprocessing, tool calls, routing, and model inference at scale.

The upcoming next-generation AMD EPYC processor, codenamed "Venice," will extend these capabilities even further with up to 256 cores (512 threads), support for MRDIMMs delivering memory speeds up to 12,800 MT/s, and bandwidth up to 1.64 TB/s, along with PCIe Gen 6 for high-throughput I/O.
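The memory figures quoted for "Venice" can be cross-checked with simple arithmetic. The sketch below reads the memory speed as 12,800 MT/s (12.8 GT/s) and assumes a 64-bit (8-byte) data path per memory channel; under those assumptions, the quoted 1.64 TB/s corresponds to roughly 16 channels. The channel width and the resulting channel count are inferences for illustration, not figures from the announcement.

```python
def channel_bandwidth_gbs(transfer_rate_mts: float, bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth of one memory channel in GB/s (assumes a 64-bit data path)."""
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9

def implied_channels(total_tbs: float, transfer_rate_mts: float) -> float:
    """Channel count implied by a total bandwidth and a per-channel transfer rate."""
    return total_tbs * 1000 / channel_bandwidth_gbs(transfer_rate_mts)

if __name__ == "__main__":
    per_channel = channel_bandwidth_gbs(12_800)   # 102.4 GB/s per channel
    channels = implied_channels(1.64, 12_800)     # roughly 16 channels
    print(f"{per_channel:.1f} GB/s per channel, ~{channels:.1f} channels")
```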

In practice, this lets GPUs stay focused on the complex reasoning while AMD EPYC CPUs efficiently manage the surrounding agent workflow, serving many concurrent requests, maintaining low latency, and improving overall infrastructure utilization. 

Performance: AMD EPYC 9R45 in action

To demonstrate the capabilities of the ZenDNN backend, we conducted performance benchmarking using the AMD EPYC 9R45 96-Core Processor. The evaluation focused on a "chat_lite" CPU inference benchmark with a short-context 128:128 workload.

Test configuration

The environment used AWS m8a.metal-48xl instances. The setup ran five distinct Red Hat AI Inference 3.4 instances, each allocated 32 cores, with prefix caching enabled to optimize throughput. GuideLLM served as the primary evaluation tool, with its requests routed through an NGINX load balancer to the vLLM-CPU instances.

[Alt text: Line chart titled “Throughput (tokens/sec) vs Concurrency” comparing Llama 3.1 8B Instruct and quantized Meta-Llama 3.1 8B Instruct W8A8 on AMD EPYC without SMT. Throughput increases as concurrency rises from about 32 to 160. The quantized model shows higher mean and P95 throughput than the non-quantized model across all concurrency levels, reaching about 2,400 tokens/sec mean and over 3,200 tokens/sec P95 at concurrency 160.]

[Alt text: Line chart titled “TTFT (ms) vs Concurrency” comparing time to first token for Llama 3.1 8B Instruct and quantized Meta-Llama 3.1 8B Instruct W8A8 on AMD EPYC without SMT. Mean TTFT increases gradually with concurrency for both models, while P95 TTFT for the non-quantized model rises sharply at concurrency 160 to about 8,000 ms. The quantized model maintains lower mean and P95 TTFT across the tested concurrency levels.]

[Alt text: Line chart titled “ITL (ms) vs Concurrency” comparing inter-token latency for Llama 3.1 8B Instruct and quantized Meta-Llama 3.1 8B Instruct W8A8 on AMD EPYC without SMT. ITL increases as concurrency rises from about 32 to 160 for both models. The quantized model has lower mean and P95 inter-token latency across all concurrency levels, reaching about 130 ms mean and 165 ms P95 at concurrency 160, compared with about 210 ms mean and 260 ms P95 for the non-quantized model.]

High-throughput scaling

As concurrency increases, AMD EPYC demonstrates impressive scaling: 

  • Quantized efficiency: The quantized W8A8 model achieved a mean throughput of approximately 2,421 tokens/sec at a concurrency of 160.
  • Peak performance: At the same concurrency level, the P95 throughput for the quantized model climbed to over 3,260 tokens/sec, showcasing the processor's ability to handle bursty agentic workloads.
  • Base performance: Even the standard FP16 Llama-3.1-8B model maintained a steady mean throughput of roughly 1,500 tokens/sec at 160 concurrency.

Predictable latency for agentic workflows

For AI agents to feel responsive, time to first token (TTFT) and inter-token latency (ITL) are critical:

  • Responsiveness: The quantized model maintains a low mean TTFT, staying well under the 2-second threshold even at 160 concurrency.
  • Fluidity: The mean ITL for the quantized model remains consistent, staying between roughly 100 ms and 130 ms across the tested concurrency levels.
  • End-to-end efficiency: The mean end-to-end (E2E) latency for a full generation on the quantized model was approximately 16.8 seconds at max concurrency, compared to over 27 seconds for the non-quantized version. 
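
The latency and throughput figures above are consistent with one another, which a short back-of-the-envelope check makes visible. The sketch assumes that end-to-end latency is roughly TTFT plus one ITL per remaining output token, and that the reported throughput counts both input and output tokens across all concurrent requests. Neither assumption is stated in the benchmark description, the ITL values are the approximate chart readings at concurrency 160 (about 130 ms quantized, about 210 ms non-quantized), and 27.0 seconds is used as a stand-in for "over 27 seconds."

```python
def decode_time_s(output_tokens: int, itl_ms: float) -> float:
    """Time spent generating the tokens that follow the first one."""
    return (output_tokens - 1) * itl_ms / 1000

def steady_state_throughput(concurrency: int, tokens_per_request: int, e2e_s: float) -> float:
    """Little's-law estimate: requests in flight * tokens each / time per request."""
    return concurrency * tokens_per_request / e2e_s

if __name__ == "__main__":
    # 128 input + 128 output tokens per request, 160 concurrent requests.
    for label, itl_ms, e2e_s in [("quantized", 130, 16.8), ("non-quantized", 210, 27.0)]:
        decode = decode_time_s(128, itl_ms)
        tput = steady_state_throughput(160, 128 + 128, e2e_s)
        print(f"{label}: decode ~{decode:.1f} s of {e2e_s} s E2E, ~{tput:.0f} tokens/sec")
```

Under these assumptions, the estimates come out at about 2,438 and 1,517 tokens/sec, close to the reported means of roughly 2,421 and 1,500 tokens/sec, and decode time accounts for nearly all of the end-to-end latency.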

We are continuing to collaborate closely with AMD on future releases to further tune and optimize performance on AMD EPYC processors, with vLLM and ZenDNN integrated into Red Hat AI Inference. Stay tuned for additional updates as these optimizations become available. 

The infrastructure agentic AI demands

At Red Hat Summit 2026, AMD and Red Hat announced an integrated, enterprise-ready foundation for agentic AI. Building on a strategic collaboration that has already brought AMD Instinct GPUs and AMD EPYC CPUs into the heart of Red Hat AI, together we’ve formed the compute foundation for enterprise agentic AI.

The shift from inference to agents isn't just a software change; it's an infrastructure inflection point. Agents need high-throughput GPUs for complex reasoning, CPUs for lightweight tasks, flexible form factors for diverse deployments, and intelligent scheduling to match compute to demand in real time.

AMD and Red Hat now deliver that foundation: the MI355X for frontier model inference, AMD EPYC vLLM-CPU for cost-effective lightweight calls, and the MI350P for bringing GPU acceleration to new environments. All built on open source. All enterprise-ready. All available on Red Hat AI.

Agentic AI is redefining how enterprises operate, and infrastructure is now a strategic differentiator. With AMD Instinct accelerators and AMD EPYC processors deeply integrated into Red Hat AI, organizations can deploy scalable, efficient, and open AI platforms designed for real-world impact. Together, AMD and Red Hat are not just supporting the agentic era; they’re helping enterprises operationalize it.


About the authors

Erwan Gallen is Senior Principal Product Manager, Generative AI, at Red Hat, where he follows the Red Hat AI Inference Server product and manages hardware-accelerator enablement across OpenShift, RHEL AI, and OpenShift AI. His remit covers strategy, roadmap, and lifecycle management for GPUs, NPUs, and emerging silicon, ensuring customers can run state-of-the-art generative workloads seamlessly in hybrid clouds.

Before joining Red Hat, Erwan was CTO and Director of Engineering at a media firm, guiding distributed teams that built and operated 100% open source platforms serving more than 60 million monthly visitors. The experience sharpened his skills in hyperscale infrastructure, real-time content delivery, and data-driven decision-making.

Since moving to Red Hat, he has launched foundational accelerator plugins, expanded the company’s AI partner ecosystem, and advised Fortune 500 global enterprises on production AI adoption. An active voice in the community, he speaks regularly at NVIDIA GTC, Red Hat Summit, OpenShift Commons, CERN, and the Open Infra Summit.

Priya Vasudevan is a Senior AI Product Manager focused on AI solutions, CPU-based AI inference, and Agentic AI. She works at the intersection of AI infrastructure, hardware, and enterprise software, helping bring scalable and efficient AI capabilities to real-world deployments.
