Red Hat is proud to announce our strong results from the latest industry-standard MLPerf Inference v6.0 benchmark. Our submission includes four AI workloads (Whisper-Large-v3, GPT-OSS-120B, Qwen3-VL-235B-A22B, and Llama-2-70b) on NVIDIA (H200, B200, L40S) and AMD (MI350X) GPUs, running on Red Hat Enterprise Linux (RHEL) and Red Hat OpenShift AI with our open source inference stack: vLLM and llm-d.

We achieved top scores across several configurations, including the highest offline throughput on B200 for GPT-OSS-120B, the leading H200 result on Whisper, and the top B200 submission on Qwen3-VL, which exceeded the top B300 performer by 50% in the Server scenario.

Enterprises use MLPerf to evaluate AI workload performance by comparing hardware and software stacks in a standardized environment. These results illustrate Red Hat AI’s ability to match or outperform other inference engines on the same hardware, scale distributed inference on OpenShift AI, and run across multiple GPU vendors without changing the software layer.

Our results

Across language, vision, and speech models, Red Hat’s stack delivered top-tier throughput and latency results on NVIDIA GPUs.

Red Hat MLPerf Inference Results v6.0 

| Model category | Model name | Hardware | Offline | Server | Software stack | Notes & comparisons |
|---|---|---|---|---|---|---|
| Vision | Qwen3-VL-235B-A22B | 8x B200 | 79.04 samples/sec | 67.86 samples/sec | RHEL, vLLM | Top in the leaderboard compared to all results |
| Vision | Qwen3-VL-235B-A22B | 8x H200 | 18.02 samples/sec | 11.05 samples/sec | RHEL, vLLM | Unique and non-comparable |
| Reasoning | gpt-oss-120b | 8x B200 | 93,070.70 tok/sec | 71,588.13 tok/sec | OpenShift, llm-d, vLLM | Offline B200 8% better than closest competitor |
| Reasoning | gpt-oss-120b | 8x H200 | 28,680.00 tok/sec | 24,103.19 tok/sec | OpenShift, llm-d, vLLM | Unique and non-comparable |
| Reasoning | gpt-oss-120b | 8x MI350X (w/ Supermicro) | 64,293.30 tok/sec | 58,373.27 tok/sec | vLLM, RHEL | Open division submission; improved perf by 20% post-submission |
| Audio | Whisper | 8x H200 | 36,395.70 tok/sec | N/A | RHEL, vLLM | 13% better than closest competitor |
| Audio | Whisper | 2x L40S | 3,646.91 tok/sec | N/A | RHEL, vLLM | Unique and non-comparable |
| Dense | llama-2-70b | 8x MI350X (w/ Supermicro) | 91,933.10 tok/sec | 89,019.65 tok/sec | vLLM, RHEL | Competitive with other MI350X submissions |

Qwen3-VL-235B (multimodal vision model)

Qwen3-VL-235B-A22B-Instruct is a 235-billion-parameter MoE vision-language model with 22 billion active parameters. The MLPerf benchmark uses a Shopify product catalog dataset with 48,289 multimodal samples (real e-commerce product images paired with text) where the model must classify products into structured JSON output. Image resolution varies by up to 20x, which stresses the vision encoder and creates highly uneven request sizes that can disrupt scheduling efficiency.

We submitted results on both H200 and B200 using RHEL and vLLM with CentML optimizations.

  • 8x B200 server: 67.86 samples per second, the top result among all B200 submissions, 50% faster than the top B300 submission
  • 8x B200 offline: 79.04 samples per second, also the top B200 result and on par with the top B300 result
  • 8x H200: 18.22 samples per second (offline), 11.05 samples per second (server), the only H200 submission for this model
Our optimized vLLM configuration on RHEL outperformed all other B200 submissions on this workload. Optimizations that made the difference included FlashInfer MoE kernels, Shortest Job First scheduling (which handles the variable image sizes well in the server scenario), FP8 multimodal attention, and Triton-based vision encoder improvements that gave us 30-40% faster ViT processing.
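
To illustrate why Shortest Job First helps with this workload, here is a minimal sketch of the idea (our illustration, not vLLM's actual scheduler code): pending multimodal requests are ordered by an estimated cost derived from image resolution and prompt length, so a few very large images cannot stall a queue of small ones. The 14x14 patch size used in the cost estimate is a typical ViT setting and an assumption here.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingRequest:
    # Only estimated_cost participates in ordering, so the heap pops the smallest job first.
    estimated_cost: int
    request_id: str = field(compare=False)
    image_pixels: int = field(compare=False)
    prompt_tokens: int = field(compare=False)


def estimate_cost(image_pixels: int, prompt_tokens: int) -> int:
    """Rough proxy for per-request work: vision-encoder patches plus text tokens."""
    vit_patches = image_pixels // (14 * 14)  # assumed 14x14 ViT patch size
    return vit_patches + prompt_tokens


class SJFQueue:
    """Shortest Job First queue: always serve the request with the lowest estimated cost."""

    def __init__(self) -> None:
        self._heap: list[PendingRequest] = []

    def push(self, request_id: str, image_pixels: int, prompt_tokens: int) -> None:
        cost = estimate_cost(image_pixels, prompt_tokens)
        heapq.heappush(self._heap, PendingRequest(cost, request_id, image_pixels, prompt_tokens))

    def pop(self) -> PendingRequest:
        return heapq.heappop(self._heap)


# With image sizes varying by ~20x, small requests are served promptly instead of
# queuing behind the largest images, which lowers tail latency in the server scenario.
queue = SJFQueue()
queue.push("small-thumbnail", image_pixels=224 * 224, prompt_tokens=150)
queue.push("large-catalog-photo", image_pixels=4096 * 3072, prompt_tokens=150)
print(queue.pop().request_id)  # -> "small-thumbnail"
```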

GPT-OSS-120B (reasoning model)

GPT-OSS-120B is a 117-billion-parameter mixture-of-experts (MoE) model designed for reasoning, agentic workflows, and code generation. This was a new model in MLPerf Inference v6.0, and the workload is demanding – highly variable input lengths (600 to 15,000 tokens) with a long tail, and strict latency requirements for the server scenario (P99 time-to-first-token under 3 seconds, P99 time-per-output-token under 80 milliseconds).
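
For reference, the two server-scenario constraints are per-request percentile metrics. The sketch below (ours, not the MLPerf harness code) shows one common way to derive TTFT and TPOT from request timestamps and check the P99 targets; the example timings are made up.

```python
import numpy as np


def p99_latencies(request_start, first_token_time, last_token_time, output_tokens):
    """Compute P99 time-to-first-token and P99 time-per-output-token.

    All inputs are parallel lists, one entry per request; times are in seconds.
    """
    ttft = np.asarray(first_token_time) - np.asarray(request_start)
    # TPOT: decode time spread over the tokens generated after the first one.
    decode_time = np.asarray(last_token_time) - np.asarray(first_token_time)
    tpot = decode_time / np.maximum(np.asarray(output_tokens) - 1, 1)
    return np.percentile(ttft, 99), np.percentile(tpot, 99)


# Illustrative check against the GPT-OSS-120B server-scenario limits.
p99_ttft, p99_tpot = p99_latencies(
    request_start=[0.0, 0.1, 0.2],
    first_token_time=[1.9, 2.4, 1.5],
    last_token_time=[9.0, 12.0, 7.5],
    output_tokens=[200, 350, 120],
)
assert p99_ttft < 3.0      # P99 TTFT must stay under 3 seconds
assert p99_tpot < 0.080    # P99 TPOT must stay under 80 milliseconds
```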

We submitted on H200, B200 and MI350X using OpenShift AI, llm-d, and vLLM. This makes our GPT-OSS-120B submission the first for this model on Kubernetes infrastructure.

  • 8x B200 offline: 93,071 tokens per second, 8% higher than the next comparable B200 submission
  • 8x B200 server: 71,588 tokens per second, a strong result given this was our first GPT-OSS-120B submission on Kubernetes infrastructure. Preliminary testing with tensor parallelism greater than 1 is already showing a 9% improvement.
  • 8x H200: 28,680 tokens per second (offline), 24,103 tokens per second (server), the only H200 submission for this model in this round.
  • 8x MI350X: 64,293 tokens per second (offline), 58,373 tokens per second (server), the only MI350X submission for this model this round. We achieved a 20% gain in offline performance and a 24% gain in server performance after the submission date.

We adopted a two-pronged strategy to optimize inference performance. First, our Bayesian optimization–based hyperparameter tuning pipeline on OpenShift identified an optimal configuration for a single replica that reduced P99 time-to-first-token (TTFT) from 3.4 seconds to 2.1 seconds (~38% improvement), meeting the sub-3s target.
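
Our tuning pipeline itself is internal, but the sketch below shows the general shape of such a search using Optuna as a stand-in optimizer (its default TPE sampler is a Bayesian method). The parameter ranges are illustrative and the `measure_p99_ttft` helper is hypothetical; a real pipeline would deploy a replica and replay benchmark traffic instead of the synthetic stand-in used here so the sketch runs end to end.

```python
import optuna


def measure_p99_ttft(max_num_seqs: int, max_num_batched_tokens: int,
                     gpu_memory_utilization: float) -> float:
    """Hypothetical helper: in a real pipeline this would deploy one replica with the
    given configuration, replay benchmark traffic, and return measured P99 TTFT (seconds).
    A synthetic stand-in is returned here so the sketch is runnable."""
    return (3.4
            - 0.9 * gpu_memory_utilization
            - 1e-5 * max_num_batched_tokens
            + 1e-4 * abs(max_num_seqs - 512))


def objective(trial: optuna.Trial) -> float:
    # Candidate configuration drawn by the Bayesian (TPE) sampler.
    max_num_seqs = trial.suggest_int("max_num_seqs", 64, 1024, step=64)
    max_num_batched_tokens = trial.suggest_int("max_num_batched_tokens", 4096, 65536, step=4096)
    gpu_memory_utilization = trial.suggest_float("gpu_memory_utilization", 0.80, 0.95)
    return measure_p99_ttft(max_num_seqs, max_num_batched_tokens, gpu_memory_utilization)


study = optuna.create_study(direction="minimize")  # lower P99 TTFT is better
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```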

Second, we optimized multi-replica performance by refining our load balancing and scoring strategy. By analyzing request distribution across replicas, we improved utilization and minimized tail latency, enabling more consistent scaling under load.
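
The two signals our scheduler scored on (KV cache utilization and queue depth, as described in the takeaways below) can be combined into a simple per-replica score. The sketch below is illustrative of the approach only; the weights are made up and this is not llm-d's actual scorer.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    kv_cache_utilization: float  # fraction of KV cache blocks in use, 0.0 to 1.0
    queue_depth: int             # requests waiting at this replica


def score(replica: ReplicaState, max_queue_depth: int = 32,
          kv_weight: float = 0.6, queue_weight: float = 0.4) -> float:
    """Lower is better: prefer replicas with free KV cache and short queues.
    Weights and normalization are illustrative, not values from the submission."""
    queue_load = min(replica.queue_depth / max_queue_depth, 1.0)
    return kv_weight * replica.kv_cache_utilization + queue_weight * queue_load


def pick_replica(replicas: list[ReplicaState]) -> ReplicaState:
    return min(replicas, key=score)


replicas = [
    ReplicaState("replica-0", kv_cache_utilization=0.92, queue_depth=12),
    ReplicaState("replica-1", kv_cache_utilization=0.40, queue_depth=3),
]
print(pick_replica(replicas).name)  # -> "replica-1"
```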

llm-d is an open source distributed inference framework built on vLLM and designed for Kubernetes-native deployments. While our MLPerf GPT-OSS-120B submission used load balancing, the architecture extends to more advanced techniques such as prefix-aware routing and disaggregating prefill and decode for scalable distributed inference. Image source: https://llm-d.ai/docs/architecture

Whisper Large-V3 (speech-to-text)

Whisper is OpenAI's automatic speech recognition model that converts spoken audio into text, supporting multilingual transcription and translation across 100+ languages. The benchmark evaluates the model on 1,633 audio samples (~30 hours) from the LibriSpeech dataset, a corpus of read English speech derived from LibriVox public domain audiobooks, measuring throughput (tokens per second) and transcription accuracy (word error rate).
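
Word error rate is a standard word-level edit-distance metric; a minimal reference implementation (not the MLPerf harness) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the quick brown fox", "the quick brown box"))  # -> 0.25
```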

We submitted Whisper-large-v3 results on H200 and L40S, both running RHEL and vLLM.

  • 8x H200 offline: 36,396 tokens per second, the leading H200 result, 13% faster than the next closest submission                                             
  • 2x L40S offline: 3,647 tokens per second, the first and only L40S submission for Whisper in MLPerf Inference v6.0

These results were driven by a systematic ablation study across config parameters to identify the optimizations that matter most for Whisper inference. Batch size tuning delivered a 40% throughput gain by maximizing GPU utilization, asynchronous scheduling contributed a further 12.8% by eliminating CPU-GPU synchronization stalls, and CUDA Graphs provided an additional 6%. With L40S widely deployed in cost-sensitive environments, our results show that an open-source inference stack delivers world-class speech recognition performance across both high-end and cost-efficient hardware.
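
A compact way to run that kind of ablation is to sweep one knob at a time against a fixed baseline so each measured delta can be attributed to a single change. The harness below is a generic sketch, not our internal tooling; the `run_whisper_benchmark` helper is hypothetical, and its synthetic return values simply echo the gains reported above so the sketch is runnable.

```python
def run_whisper_benchmark(max_batch_size: int, async_scheduling: bool, cuda_graphs: bool) -> float:
    """Hypothetical helper: in a real study this would serve the model with the given
    configuration, replay the LibriSpeech samples, and return tokens per second.
    A synthetic stand-in is returned here so the sketch runs end to end."""
    throughput = 20_000.0 * min(max_batch_size / 256, 1.0)
    if async_scheduling:
        throughput *= 1.128  # illustrative, mirrors the ~12.8% gain reported above
    if cuda_graphs:
        throughput *= 1.06   # illustrative, mirrors the ~6% gain reported above
    return throughput


baseline = {"max_batch_size": 64, "async_scheduling": False, "cuda_graphs": False}
candidates = {
    "max_batch_size": [64, 128, 256, 512],
    "async_scheduling": [False, True],
    "cuda_graphs": [False, True],
}

# One-factor-at-a-time ablation: vary a single knob against the baseline.
results = {}
for knob, values in candidates.items():
    for value in values:
        config = dict(baseline, **{knob: value})
        results[(knob, value)] = run_whisper_benchmark(**config)

for (knob, value), tokens_per_sec in sorted(results.items(), key=str):
    print(f"{knob}={value}: {tokens_per_sec:.1f} tok/s")
```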

Llama-2-70B on AMD MI350X

In partnership with Supermicro, we submitted Llama-2-70B results on 8x AMD Instinct MI350X GPUs running RHEL 9.6 and vLLM with ROCm 7.0. This was Red Hat's first MLPerf submission on AMD hardware.

  • Offline: 80,478 tokens per second
  • Server: 76,393 tokens per second

These results are competitive with other MI350X submitters. The MI350X is an air-cooled variant of the MI355X, and our results align with the expected ~80% performance ratio relative to the liquid-cooled MI355X.

Key takeaways

vLLM is a competitive inference engine. Across all workloads, vLLM delivered strong results. Notably, 20 submitters in MLPerf v6.0 used vLLM (up from 5 in v5.1), confirming its position as the standard open source inference engine.

llm-d and OpenShift AI work for distributed inference at scale. Our GPT-OSS-120B submission ran on OpenShift AI with llm-d coordinating 8 GPU replicas. The llm-d scheduler used KV cache utilization scoring and queue depth scoring to route requests intelligently across replicas. This is the first MLPerf submission for a model of this scale running on Kubernetes infrastructure.

The stack is hardware-portable. We submitted on NVIDIA H200, NVIDIA B200, NVIDIA L40S, and AMD MI350X. The same core stack (vLLM on RHEL) ran across all of them. For enterprises evaluating hardware options, this means the software investment carries across GPU generations and vendors.

What comes next

With MLPerf launching a multi-turn task force for agentic workloads, benchmarks are moving closer to real-world, interactive AI systems. Red Hat is equipped for this shift, with an inference stack built for such scenarios. We plan to showcase llm-d, featuring prefix-aware scoring and distributed inference across replicas to efficiently support multi-turn, agentic workloads.

If you want to replicate our results, see our GitHub repository containing our MLPerf Inference v6.0 benchmark results and setup documentation. Check out the full MLPerf Inference v6.0 results at mlcommons.org and learn more about Red Hat AI.

About the authors

Ashish Kamra is an accomplished engineering leader with over 15 years of experience managing high-performing teams in AI, machine learning, and cloud computing. He joined Red Hat in March 2017 and currently serves as the Senior Manager of AI Performance. In this role, Ashish heads up initiatives to optimize the performance and scale of Red Hat OpenShift AI, an end-to-end platform for MLOps, with a specific focus on large language model inference and training performance.

Prior to Red Hat, Ashish held leadership positions at Dell EMC, where he drove the development and integration of enterprise and cloud storage solutions and containerized data services. He also has a strong academic background, having earned a Ph.D. in Computer Engineering from Purdue University in 2010. His research focused on database intrusion detection and response, and he has published several papers in renowned journals and conferences.

Passionate about leveraging technology to drive business impact, Ashish is pursuing a Part-time Global Online MBA at Warwick Business School to complement his technical expertise. In his free time, he enjoys playing table tennis, exploring global cuisines, and traveling the world.

Diane Feddema is a Principal Software Engineer at Red Hat on the Performance and Scale team with a focus on AI/ML applications. She has submitted official results in multiple rounds of MLCommons MLPerf Inference and Training, dating back to the initial MLPerf rounds. Diane leads performance analysis and visualization for MLPerf benchmark submissions and collaborates with Red Hat hardware partners on joint MLPerf benchmark submissions.

Diane has a BS and MS in Computer Science and is presently co-chair of the Best Practices group of the MLPerf consortium.

Michael Goin is a lead maintainer of vLLM, the high-performance open-source engine for LLM inference. His contributions span core areas including kernel optimization, system scheduling, new model architectures, and hardware efficiency across GPUs and emerging accelerators. Michael led performance engineering at Neural Magic, which was acquired by Red Hat, and brings experience in inference systems to the open source AI community.

Michey is a member of the Red Hat Performance Engineering team and works on bare metal/virtualization performance and machine learning performance. His areas of expertise include storage performance, Linux kernel performance, and performance tooling.

Performance engineer working on MLPerf, LLM inference performance, and profiling. Previous experience in hardware performance modeling.

Nikhil Palaskar is a senior software engineer and a member of Performance and Scale Engineering at Red Hat. Nikhil's current focus is primarily on serving LLMs for inference and optimizing the performance of their deployments in Red Hat OpenShift AI and Red Hat Enterprise Linux AI environments. Nikhil is also actively engaged in performance experimentation and tuning of large language models on AMD hardware with the ROCm software stack. Previously, he worked on building a Benchmarking and Performance Analysis Framework (Pbench).

Nikhil's professional interests revolve around AI/ML and deep learning, statistics, performance engineering, and application profiling. When he is not working he likes to go on hikes.

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he led as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.

Aanya has been a Software Engineer at Red Hat since 2025, specializing in optimizing the performance and scale of OpenShift AI for enterprise-grade LLM workloads. She brings over two years of specialized experience developing AI applications at FPrime.ai (Carnegie Mellon University), Microsoft, and Scotiabank. Leveraging a background in Computer Science and Mathematics from UC San Diego, Aanya is dedicated to advancing scalable AI solutions built on open principles. In her free time, she is a racquet sports enthusiast and a passionate coffee connoisseur.

Alberto Perdomo is an AI Systems Performance Engineer on the PSAP team at Red Hat, working at the intersection of distributed systems, AI inference, and open source. His main focus is on optimizing distributed LLM inference with vLLM and llm-d for high-performance production deployments, alongside other work in distributed AI systems.

Harika Pothina is an AI Systems Performance Engineer on the PSAP team at Red Hat, working at the intersection of distributed systems, AI inference, and performance engineering. Her work focuses on optimizing large-scale VLM inference with vLLM, MLPerf, and OpenShift for high-performance production deployments, while also driving benchmarking, profiling, and system-level tuning across modern GPU platforms.

Samuel Monson is a Senior Software Engineer and a member of the Performance and Scale for AI Platforms team at Red Hat. Samuel's current role focuses on contributing to Red Hat's mission of fast, cost-effective AI by developing tooling for measuring and analyzing performance.
