Red Hat is proud to announce industry-leading results from the latest MLPerf Inference v6.0 benchmarks, achieved through deep engineering co-design with NVIDIA. These results demonstrate that combining Red Hat's open-source software with NVIDIA's AI infrastructure produces a versatile, proven platform ready for any enterprise inference workload, from vision and speech to complex reasoning.
Our latest submissions focused on maximizing the potential of the NVIDIA HGX H200 and NVIDIA HGX B200 systems, proving that software optimization is just as critical as raw horsepower for achieving peak ROI.
Results at a glance
Across language, vision, and speech models, Red Hat’s stack delivered top-tier throughput and latency results on NVIDIA AI infrastructure.
| Model Category | Model | GPU Configuration | Scenario | Leading Result |
|---|---|---|---|---|
| Vision | Qwen3-VL-235B | 8× NVIDIA B200 | Server | 67.9 samples/sec |
| Reasoning | GPT-OSS-120B | 8× NVIDIA B200 | Offline | 93,071 tokens/sec |
| Speech | Whisper-large-v3 | 8× NVIDIA H200 | Offline | 36,396 tokens/sec |
Qwen3-VL-235B (multimodal vision model)
The Qwen3-VL-235B model, a 235-billion-parameter multimodal vision-language model, poses a significant challenge for inference engines because of its highly variable input image resolutions. Using NVIDIA Blackwell GPUs running Red Hat Enterprise Linux (RHEL) with vLLM and NVIDIA Dynamo, we achieved the highest offline throughput in our class, and our Server-scenario result exceeded the next top performer by 50%.
Key engineering wins:
- Triton-based improvements: Optimizations to the vision encoder yielded 30-40% faster ViT processing.
- FlashInfer Mixture-of-Experts (MoE) kernels: These specialized kernels handled the MoE architecture with extreme efficiency.
- FP8 multimodal attention: Leveraging NVIDIA's advanced data formats lowered cost per token without sacrificing accuracy (see the sketch after this list).
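To make this concrete, below is a minimal sketch of serving a large vision-language model with vLLM using FP8 KV-cache quantization, one application of these data formats. It is not our submission configuration: the model id and option values are assumptions, and the FP8 attention kernels used in the benchmark are selected inside the engine.

```python
# Minimal sketch: serving a large vision-language model with vLLM and an
# FP8 KV cache. Illustrative only; the model id is an assumption, not the
# benchmark configuration.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",  # assumed Hugging Face id
    tensor_parallel_size=8,    # one replica spanning 8 GPUs
    kv_cache_dtype="fp8",      # FP8 KV cache lowers memory cost per token
    enforce_eager=False,       # permit CUDA graph capture
)

image = Image.new("RGB", (1280, 720))  # stand-in for a real input image
outputs = llm.generate(
    {
        # Prompt format is model-specific; Qwen-VL models expect image
        # placeholder tokens inserted by their chat template.
        "prompt": "Describe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```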
GPT-OSS-120B (reasoning)
Our submission for GPT-OSS-120B marks the first time a model of this scale has been benchmarked on Kubernetes infrastructure for MLPerf. By using Red Hat OpenShift AI and the llm-d scheduler, we demonstrated that distributed inference can scale effectively on NVIDIA AI infrastructure (H200 and B200 GPUs) while maintaining strict latency requirements.
We adopted a two-pronged strategy to optimize inference performance. First, our Bayesian optimization–based hyperparameter tuning pipeline on OpenShift identified an optimal configuration for a single replica that reduced P99 time-to-first-token (TTFT) from 3.4 seconds to 2.1 seconds (~38% improvement), meeting the sub-3s target.
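To illustrate the idea, here is a minimal sketch of this kind of tuning loop. It assumes Optuna (whose default TPE sampler is a Bayesian-style optimizer) as a stand-in for our internal pipeline on OpenShift, and a hypothetical measure_p99_ttft() benchmark helper:

```python
# Minimal sketch of Bayesian hyperparameter tuning for P99 TTFT.
# Optuna stands in for our internal pipeline; measure_p99_ttft() is a
# hypothetical helper, not part of our published tooling.
import optuna

def measure_p99_ttft(max_num_seqs: int,
                     max_num_batched_tokens: int,
                     gpu_memory_utilization: float) -> float:
    """Hypothetical helper: launch one vLLM replica with these settings,
    replay a fixed request trace, and return measured P99 TTFT (seconds)."""
    raise NotImplementedError  # environment-specific benchmark harness

def objective(trial: optuna.Trial) -> float:
    return measure_p99_ttft(
        max_num_seqs=trial.suggest_int("max_num_seqs", 64, 1024, step=64),
        max_num_batched_tokens=trial.suggest_int(
            "max_num_batched_tokens", 2048, 32768, step=2048),
        gpu_memory_utilization=trial.suggest_float(
            "gpu_memory_utilization", 0.80, 0.95),
    )

study = optuna.create_study(direction="minimize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```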
Second, we optimized multi-replica performance by refining our load balancing and scoring strategy. By analyzing request distribution across replicas, we improved utilization and minimized tail latency, enabling more consistent scaling under load.
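The scoring idea can be illustrated with a small sketch: each replica reports load signals, and the router sends the next request to the lowest-scoring one. The metrics and weights below are hypothetical stand-ins, not llm-d's actual scorer.

```python
# Illustrative replica-scoring sketch. Signal names and weights are
# hypothetical; llm-d's real scheduler uses its own signals and logic.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queued_requests: int    # requests waiting to be scheduled
    kv_cache_usage: float   # fraction of KV-cache blocks in use, 0..1

def score(r: Replica) -> float:
    # Weighted blend: queue depth dominates, KV pressure breaks ties.
    return 1.0 * r.queued_requests + 10.0 * r.kv_cache_usage

def pick_replica(replicas: list[Replica]) -> Replica:
    return min(replicas, key=score)

replicas = [
    Replica("pod-a", queued_requests=3, kv_cache_usage=0.62),
    Replica("pod-b", queued_requests=1, kv_cache_usage=0.91),
    Replica("pod-c", queued_requests=2, kv_cache_usage=0.40),
]
print(pick_replica(replicas).name)  # -> pod-c
```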
Whisper-large-v3 (speech-to-text)
We submitted Whisper-large-v3 results on NVIDIA H200 and NVIDIA L40S GPUs, both running Red Hat Enterprise Linux (RHEL) and vLLM.
- 8x H200 offline: 36,396 tokens per second, the leading H200 result, 13% faster than the next closest submission
- 2x L40S offline: 3,647 tokens per second, the first and only L40S submission for Whisper in MLPerf Inference v6.0
These results were driven by a systematic ablation study across configuration parameters to identify the optimizations that matter most for Whisper inference. Batch size tuning delivered a 40% throughput gain by maximizing GPU utilization, asynchronous scheduling contributed a further 12.8% by eliminating CPU-GPU synchronization stalls, and CUDA Graphs provided an additional 6%. With the L40S widely deployed in cost-sensitive environments, our results show that an open-source inference stack delivers world-class speech recognition performance across both high-end and cost-efficient hardware.
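For orientation, the three knobs from the ablation map onto vLLM engine options roughly as follows. This is a hedged sketch, not the tuned submission settings: the values are illustrative, and the async-scheduling option is experimental, so its name and availability vary across vLLM versions.

```python
# Sketch of the three ablation knobs via vLLM's offline API.
# Illustrative values only; option names may differ by vLLM version.
from vllm import LLM

llm = LLM(
    model="openai/whisper-large-v3",
    max_num_seqs=256,       # batch size: the biggest single lever (+40%)
    enforce_eager=False,    # keep CUDA graphs enabled (+6%)
    async_scheduling=True,  # overlap CPU scheduling with GPU work (+12.8%)
)
```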
Delivering greater efficiency and ROI
Red Hat's software stack combines NVIDIA Dynamo inference software with Red Hat AI's vLLM and llm-d to deliver significant efficiency gains on NVIDIA accelerated computing infrastructure. By optimizing every layer of the stack, from the RHEL kernel to the inference engines, we help enterprises lower their cost per token and improve overall ROI on their NVIDIA investments. Whether you are deploying on premises or in the cloud, Red Hat provides a proven, high-performance foundation for the next generation of agentic and multimodal AI.
Want to replicate our results? Here’s how… Repo
Check out the full MLPerf Inference v6.0 results at mlcommons.org and learn more about Red Hat AI.
About the author
Ashish Kamra is an accomplished engineering leader with over 15 years of experience managing high-performing teams in AI, machine learning, and cloud computing. He joined Red Hat in March 2017 and currently serves as Senior Manager of AI Performance. In this role, he heads initiatives to optimize the performance and scale of Red Hat OpenShift AI, an end-to-end platform for MLOps, with a particular focus on large language model inference and training performance.
Prior to Red Hat, Ashish held leadership positions at Dell EMC, where he drove the development and integration of enterprise and cloud storage solutions and containerized data services. He also has a strong academic background, having earned a Ph.D. in Computer Engineering from Purdue University in 2010. His research focused on database intrusion detection and response, and he has published several papers in renowned journals and conferences.
Passionate about leveraging technology to drive business impact, Ashish is pursuing a Part-time Global Online MBA at Warwick Business School to complement his technical expertise. In his free time, he enjoys playing table tennis, exploring global cuisines, and traveling the world.