Red Hat is proud to announce our strong results from the latest industry-standard MLPerf Inference v6.0 benchmark. Our submission includes four AI workloads (Whisper-Large-v3, GPT-OSS-120B, Qwen3-VL-235B-A22B, and Llama-2-70B) on NVIDIA (H200, B200, L40S) and AMD (MI350X) GPUs, running on Red Hat Enterprise Linux (RHEL) and Red Hat OpenShift AI with our open source inference stack: vLLM and llm-d.
We achieved top scores across several configurations, including the highest offline throughput on B200 for GPT-OSS-120B, the leading H200 result on Whisper, and the top B200 submission on Qwen3-VL, which exceeded the top B300 performer by 50% in the Server scenario.
Enterprises use MLPerf to evaluate AI workload performance by comparing hardware and software stacks in a standardized environment. These results illustrate Red Hat AI’s ability to match or outperform other inference engines on the same hardware, scale distributed inference on OpenShift AI, and run across multiple GPU vendors without changing the software layer.
Our results
Across language, vision, and speech models, Red Hat’s stack delivered top-tier throughput and latency results on both NVIDIA and AMD hardware.
Red Hat MLPerf Inference Results v6.0
| Model category | Model name | Hardware | Offline throughput | Server throughput | Software stack | Notes & comparisons |
| --- | --- | --- | --- | --- | --- | --- |
| Vision | Qwen3-VL-235B-A22B | 8x B200 | 79.04 samples/sec | 67.86 samples/sec | RHEL, vLLM | Top of the leaderboard across all results |
| Vision | Qwen3-VL-235B-A22B | 8x H200 | 18.02 samples/sec | 11.05 samples/sec | RHEL, vLLM | Unique and non-comparable |
| Reasoning | gpt-oss-120b | 8x B200 | 93,070.70 tok/sec | 71,588.13 tok/sec | OpenShift, llm-d, vLLM | Offline B200 result 8% better than closest competitor |
| Reasoning | gpt-oss-120b | 8x H200 | 28,680.00 tok/sec | 24,103.19 tok/sec | OpenShift, llm-d, vLLM | Unique and non-comparable |
| Reasoning | gpt-oss-120b | 8x MI350X (w/ Supermicro) | 64,293.30 tok/sec | 58,373.27 tok/sec | RHEL, vLLM | Open division submission; performance improved 20% post-submission |
| Audio | Whisper | 8x H200 | 36,395.70 tok/sec | N/A | RHEL, vLLM | 13% better than closest competitor |
| Audio | Whisper | 2x L40S | 3,646.91 tok/sec | N/A | RHEL, vLLM | Unique and non-comparable |
| Dense | llama-2-70b | 8x MI350X (w/ Supermicro) | 91,933.10 tok/sec | 89,019.65 tok/sec | RHEL, vLLM | Competitive with other MI350X submissions |
Qwen3-VL-235B (multimodal vision model)
Qwen3-VL-235B-A22B-Instruct is a 235-billion-parameter MoE vision-language model with 22 billion active parameters. The MLPerf benchmark uses a Shopify product catalog dataset with 48,289 multimodal samples (real e-commerce product images paired with text) where the model must classify products into structured JSON output. Image resolution varies by up to 20x, which stresses the vision encoder and creates highly uneven request sizes that can disrupt scheduling efficiency.
We submitted results on both H200 and B200, using RHEL and vLLM with CentML optimizations.
- 8x B200 server: 67.86 samples per second, the top result among all B200 submissions, 50% faster than the top B300 submission
- 8x B200 offline: 79.04 samples per second, also the top B200 result and on par with the top B300 result
- 8x H200: 18.22 samples per second (offline), 11.05 samples per second (server), the only H200 submission for this model
Our optimized vLLM configuration on RHEL outperformed all other B200 submissions on this workload. Optimizations that made the difference included FlashInfer MoE kernels, Shortest Job First scheduling (which handles the variable image sizes well in the server scenario), FP8 multimodal attention, and Triton-based vision encoder improvements that gave us 30-40% faster ViT processing.
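To illustrate why Shortest Job First scheduling helps with highly uneven multimodal requests, here is a minimal SJF ordering sketch. The request structure and token-cost estimates are hypothetical stand-ins for image-size-driven prefill cost; this is not vLLM's actual scheduler implementation.

```python
import heapq

def sjf_order(requests):
    """Order requests shortest-job-first by estimated cost.

    `requests` is a list of (request_id, estimated_tokens) pairs;
    the token estimate stands in for image-size-driven prefill cost
    (hypothetical structure, for illustration only).
    """
    heap = [(est, rid) for rid, est in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        est, rid = heapq.heappop(heap)
        order.append(rid)
    return order

# Small requests are served first, so a few huge image-heavy requests
# cannot stall many small ones queued behind them.
print(sjf_order([("a", 4000), ("b", 300), ("c", 1200)]))  # → ['b', 'c', 'a']
```

With request sizes varying by up to 20x, first-come-first-served ordering can trap many cheap requests behind one expensive one, which is exactly the tail-latency problem SJF mitigates in the server scenario.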
GPT-OSS-120B (reasoning model)
GPT-OSS-120B is a 117-billion-parameter mixture-of-experts (MoE) model designed for reasoning, agentic workflows, and code generation. This was a new model in MLPerf Inference v6.0, and the workload is demanding – highly variable input lengths (600 to 15,000 tokens) with a long tail, and strict latency requirements for the server scenario (P99 time-to-first-token under 3 seconds, P99 time-per-output-token under 80 milliseconds).
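The server-scenario constraints can be made concrete with a small sketch of how TTFT and TPOT are typically derived from per-request timestamps (a common convention; the exact MLPerf harness accounting may differ).

```python
import math

def p99(values):
    """Nearest-rank P99: the smallest value that at least 99% of samples do not exceed."""
    s = sorted(values)
    idx = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[idx]

def request_metrics(t_submit, t_first_token, t_last_token, n_output_tokens):
    """Per-request latency metrics behind the server-scenario constraints:
    TTFT = time to first token; TPOT = time per output token after the first."""
    ttft = t_first_token - t_submit
    tpot = (t_last_token - t_first_token) / max(n_output_tokens - 1, 1)
    return ttft, tpot

# A request submitted at t=0 whose first token arrives at 0.5 s and whose
# 21st (final) token arrives at 2.1 s yields a 0.5 s TTFT and 80 ms TPOT,
# right at the 80 ms P99 TPOT limit.
ttft, tpot = request_metrics(0.0, 0.5, 2.1, 21)
```

A submission passes the server scenario only if the P99 of these two metrics across all requests stays under 3 seconds and 80 milliseconds respectively.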
We submitted on H200, B200 and MI350X using OpenShift AI, llm-d, and vLLM. This makes our GPT-OSS-120B submission the first for this model on Kubernetes infrastructure.
- 8x B200 offline: 93,071 tokens per second, 8% higher than the next comparable B200 submission
- 8x B200 server: 71,588 tokens per second, a strong result given this was our first GPT-OSS-120B submission on Kubernetes infrastructure. Preliminary testing with tensor parallelism greater than 1 is already showing a 9% improvement.
- 8x H200: 28,680 tokens per second (offline), 24,103 tokens per second (server), the only H200 submission for this model in this round.
- 8x MI350X: 64,293 tokens per second (offline), 58,373 tokens per second (server), the only MI350X submission for this model this round. We achieved a 20% gain in offline performance and a 24% gain in server performance after the submission date.
We adopted a two-pronged strategy to optimize inference performance. First, our Bayesian optimization–based hyperparameter tuning pipeline on OpenShift identified an optimal configuration for a single replica that reduced P99 time-to-first-token (TTFT) from 3.4 seconds to 2.1 seconds (~38% improvement), meeting the sub-3s target.
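A minimal sketch of such a tuning loop is shown below, with random search standing in for the Bayesian optimizer and a synthetic objective standing in for a real benchmark run. The knob names mirror common vLLM serving options, but the search space, objective surface, and values are illustrative only.

```python
import random

def measure_p99_ttft(config):
    """Stand-in objective: returns P99 TTFT (seconds) for a config.
    A real pipeline would deploy the config and replay benchmark
    traffic; this synthetic surface just rewards larger batch and
    token budgets, for illustration."""
    bs = config["max_num_seqs"]
    tokens = config["max_num_batched_tokens"]
    return 3.4 - 1.5 * min(bs / 256, 1.0) * min(tokens / 8192, 1.0)

def tune(n_trials=50, seed=0):
    """Search the knob space for the lowest P99 TTFT."""
    rng = random.Random(seed)
    best_cfg, best_ttft = None, float("inf")
    for _ in range(n_trials):
        cfg = {
            "max_num_seqs": rng.choice([64, 128, 256, 512]),
            "max_num_batched_tokens": rng.choice([2048, 4096, 8192, 16384]),
        }
        ttft = measure_p99_ttft(cfg)
        if ttft < best_ttft:
            best_cfg, best_ttft = cfg, ttft
    return best_cfg, best_ttft

cfg, ttft = tune()
print(cfg, f"{ttft:.2f}s")  # best config meets the sub-3s P99 TTFT target
```

A Bayesian optimizer replaces the random sampling with a surrogate model that proposes promising configurations, which matters when each trial is an expensive benchmark run.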
Second, we optimized multi-replica performance by refining our load balancing and scoring strategy. By analyzing request distribution across replicas, we improved utilization and minimized tail latency, enabling more consistent scaling under load.
llm-d is an open source distributed inference framework built on vLLM and designed for Kubernetes-native deployments. While our MLPerf GPT-OSS-120B submission used load balancing, the architecture extends to more advanced techniques such as prefix-aware routing and disaggregating prefill and decode for scalable distributed inference. Image source: https://llm-d.ai/docs/architecture
Whisper Large-V3 (speech-to-text)
Whisper is OpenAI's automatic speech recognition model that converts spoken audio into text, supporting multilingual transcription and translation across 100+ languages. The benchmark evaluates the model on 1,633 audio samples (~30 hours) from the LibriSpeech dataset, a corpus of read English speech derived from LibriVox public domain audiobooks, evaluated by throughput (tokens per second) and transcription accuracy (word error rate).
We submitted Whisper-large-v3 results on H200 and L40S, both running RHEL and vLLM.
- 8x H200 offline: 36,396 tokens per second, the leading H200 result, 13% faster than the next closest submission
- 2x L40S offline: 3,647 tokens per second, the first and only L40S submission for Whisper in MLPerf Inference v6.0
These results were driven by a systematic ablation study across configuration parameters to identify the optimizations that matter most for Whisper inference. Batch size tuning delivered a 40% throughput gain by maximizing GPU utilization, asynchronous scheduling contributed a further 12.8% by eliminating CPU-GPU synchronization stalls, and CUDA Graphs provided an additional 6%. With L40S widely deployed in cost-sensitive environments, our results show that an open source inference stack delivers world-class speech recognition performance across both high-end and cost-efficient hardware.
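If the three gains compound multiplicatively (an assumption, since ablation gains are rarely perfectly independent), the combined effect works out to roughly:

```python
# Reported per-optimization throughput gains from the Whisper ablation study
gains = {
    "batch size tuning": 0.40,
    "async scheduling": 0.128,
    "CUDA Graphs": 0.06,
}

# Compound the gains multiplicatively: (1 + g1) * (1 + g2) * (1 + g3)
speedup = 1.0
for name, gain in gains.items():
    speedup *= 1.0 + gain

print(f"combined speedup: {speedup:.2f}x")  # → combined speedup: 1.67x
```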
Llama-2-70B on AMD MI350X
In partnership with Supermicro, we submitted Llama-2-70B results on 8x AMD Instinct MI350X GPUs running RHEL 9.6 and vLLM with ROCm 7.0. This was Red Hat's first MLPerf submission on AMD hardware.
- Offline: 80,478 tokens per second
- Server: 76,393 tokens per second
These results are competitive with other MI350X submitters. The MI350X is an air-cooled variant of the MI355X, and our results align with the expected ~80% performance ratio relative to the liquid-cooled MI355X.
Key takeaways
vLLM is a competitive inference engine. Across all workloads, vLLM delivered strong results. Notably, 20 submitters in MLPerf v6.0 used vLLM (up from 5 in v5.1), confirming its position as the standard open source inference engine.
llm-d and OpenShift AI work for distributed inference at scale. Our GPT-OSS-120B submission ran on OpenShift AI with llm-d coordinating 8 GPU replicas. The llm-d scheduler used KV cache utilization scoring and queue depth scoring to route requests intelligently across replicas. This is the first MLPerf submission for a model of this scale running on Kubernetes infrastructure.
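The routing idea behind that scheduler can be sketched with a simplified scoring function. The weights, normalization, and data layout here are illustrative, not llm-d's actual implementation.

```python
def score_replica(kv_cache_util, queue_depth, max_queue=64,
                  w_cache=0.5, w_queue=0.5):
    """Combine KV-cache headroom and queue headroom into one score.

    kv_cache_util: fraction of KV cache in use (0.0-1.0).
    queue_depth:   number of requests waiting on the replica.
    Weights and normalization are hypothetical, for illustration.
    """
    cache_score = 1.0 - kv_cache_util
    queue_score = 1.0 - min(queue_depth / max_queue, 1.0)
    return w_cache * cache_score + w_queue * queue_score

def pick_replica(replicas):
    """replicas: list of (name, kv_cache_util, queue_depth) tuples."""
    return max(replicas, key=lambda r: score_replica(r[1], r[2]))[0]

replicas = [("r0", 0.90, 10), ("r1", 0.40, 3), ("r2", 0.55, 30)]
print(pick_replica(replicas))  # → r1, the replica with the most headroom
```

Routing on live signals like these, rather than round-robin, is what keeps tail latency down when replicas drift out of balance under bursty load.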
The stack is hardware-portable. We submitted on NVIDIA H200, NVIDIA B200, NVIDIA L40S, and AMD MI350X. The same core stack (vLLM on RHEL) ran across all of them. For enterprises evaluating hardware options, this means the software investment carries across GPU generations and vendors.
What comes next
With MLPerf launching a multi-turn task force for agentic workloads, benchmarks are moving closer to real-world, interactive AI systems. Red Hat is equipped for this shift, with an inference stack built for such scenarios. We plan to showcase llm-d, featuring prefix-aware scoring and distributed inference across replicas to efficiently support multi-turn, agentic workloads.
If you want to replicate our results, see our GitHub repository containing our MLPerf Inference v6.0 benchmark results and setup documentation. Check out the full MLPerf Inference v6.0 results at mlcommons.org and learn more about Red Hat AI.
About the authors
Ashish Kamra is an accomplished engineering leader with over 15 years of experience managing high-performing teams in AI, machine learning, and cloud computing. He joined Red Hat in March 2017 and currently serves as Senior Manager of AI Performance. In this role, Ashish heads up initiatives to optimize the performance and scale of Red Hat OpenShift AI, an end-to-end platform for MLOps, focusing specifically on large language model inference and training performance.
Prior to Red Hat, Ashish held leadership positions at Dell EMC, where he drove the development and integration of enterprise and cloud storage solutions and containerized data services. He also has a strong academic background, having earned a Ph.D. in Computer Engineering from Purdue University in 2010. His research focused on database intrusion detection and response, and he has published several papers in renowned journals and conferences.
Passionate about leveraging technology to drive business impact, Ashish is pursuing a Part-time Global Online MBA at Warwick Business School to complement his technical expertise. In his free time, he enjoys playing table tennis, exploring global cuisines, and traveling the world.
Diane Feddema is a Principal Software Engineer at Red Hat Inc on the Performance and Scale Team, with a focus on AI/ML applications. She has submitted official results in multiple rounds of MLCommons MLPerf Inference and Training, dating back to the initial MLPerf rounds. Diane leads performance analysis and visualization for MLPerf benchmark submissions and collaborates with Red Hat hardware partners on joint MLPerf benchmark submissions.
Diane has a BS and MS in Computer Science and is presently co-chair of the Best Practices group of the MLPerf consortium.
Michael Goin is a lead maintainer of vLLM, the high-performance open-source engine for LLM inference. His contributions span core areas including kernel optimization, system scheduling, new model architectures, and hardware efficiency across GPUs and emerging accelerators. Michael led performance engineering at Neural Magic, which was acquired by Red Hat, and brings experience in inference systems to the open source AI community.
Michey is a member of the Red Hat Performance Engineering team and works on bare metal/virtualization performance and machine learning performance. His areas of expertise include storage performance, Linux kernel performance, and performance tooling.
Performance engineer working on MLPerf, LLM inference performance, and profiling, with previous experience in hardware performance modelling.
Nikhil Palaskar is a senior software engineer on the Performance and Scale Engineering team at Red Hat. Nikhil's current focus is serving LLMs for inference and optimizing the performance of their deployments in Red Hat OpenShift AI and Red Hat Enterprise Linux AI environments. He is also actively engaged in performance experimentation and tuning of large language models on AMD hardware with the ROCm software stack. Previously, he worked on building a benchmarking and performance analysis framework (Pbench).
Nikhil's professional interests revolve around AI/ML and deep learning, statistics, performance engineering, and application profiling. When he is not working he likes to go on hikes.
Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he led as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
Aanya has been a Software Engineer at Red Hat since 2025, specializing in optimizing the performance and scale of OpenShift AI for enterprise-grade LLM workloads. She brings over two years of specialized experience developing AI applications at FPrime.ai (Carnegie Mellon University), Microsoft, and Scotiabank. Leveraging a background in Computer Science and Mathematics from UC San Diego, Aanya is dedicated to advancing scalable AI solutions built on open principles. In her free time, she is a racquet sports enthusiast and a passionate coffee connoisseur.
Alberto Perdomo is an AI Systems Performance Engineer on the PSAP team at Red Hat, working at the intersection of distributed systems, AI inference, and open source. His main focus is optimizing distributed LLM inference with vLLM and llm-d for high-performance production deployments, among other work in distributed AI systems.
Harika Pothina is an AI Systems Performance Engineer on the PSAP team at Red Hat, working at the intersection of distributed systems, AI inference, and performance engineering. Her work focuses on optimizing large-scale VLM inference with vLLM, MLPerf, and OpenShift for high-performance production deployments, while also driving benchmarking, profiling, and system-level tuning across modern GPU platforms.
Samuel Monson is a Senior Software Engineer and a member of the Performance and Scale for AI Platforms team at Red Hat. Samuel's current role focuses on contributing to Red Hat's mission of fast, cost-effective AI by developing tooling for measuring and analyzing performance.