Welcome to Neural Magic's monthly vLLM roundup! We are excited to announce that we have signed an agreement to be acquired by Red Hat. Joining forces with the industry's open source leader will enable us to bring our cutting-edge AI model optimization and accelerated inference technology to a worldwide audience of enterprises adopting open LLM capabilities.

Keep scrolling for exciting vLLM updates and opportunities to engage with the community!

Bi-Weekly vLLM Office Hours

Recent Recordings

vLLM Project Update: 2024 Retrospective and 2025 Roadmap | Watch Now

Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs | Watch Now

Disaggregated Prefill and KV Cache Storage in vLLM | Watch Now

SOTA Tool-Calling Implementation in vLLM | Watch Now

Take Your AI Performance to the Next Level

2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets.

Keep Reading

We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference.

Keep Reading

Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks.

Keep Reading
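
To make the mixed-input idea above concrete, here is a minimal pure-Python sketch, not Machete's actual implementation. It assumes symmetric per-row 4-bit weight quantization and floating-point activations; the function names `quantize_int4` and `mixed_input_matvec` are hypothetical and chosen only for illustration.

```python
def quantize_int4(row):
    """Symmetric per-row quantization to the signed 4-bit range [-8, 7].

    Assumption for illustration: one scale per weight row.
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in row]
    return q, scale


def mixed_input_matvec(q_weights, scales, activations):
    """Multiply 4-bit weights by higher-precision activations.

    Each weight is upconverted (dequantized) right before the multiply,
    so the two operands enter the product at different precisions.
    """
    out = []
    for q_row, scale in zip(q_weights, scales):
        acc = 0.0
        for q, x in zip(q_row, activations):
            acc += (q * scale) * x
        out.append(acc)
    return out


# Toy data: a 2x3 weight matrix and a length-3 activation vector.
weights = [[0.7, -0.3, 0.0], [0.14, 0.06, -0.14]]
activations = [1.0, 2.0, 3.0]

quantized = [quantize_int4(row) for row in weights]
q_weights = [q for q, _ in quantized]
scales = [s for _, s in quantized]
result = mixed_input_matvec(q_weights, scales, activations)
```

The weights are stored as small integers plus one scale per row, while the activations keep their original precision. A real GPU kernel would fuse the upconversion into the matrix multiply rather than loop in Python.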

Research From Our Labs 🧪

1️⃣ "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization | Read Here

2️⃣ PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression | Read Here

3️⃣ QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Read Here

4️⃣ The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information | Read Here

5️⃣ MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | Read Here

vLLM has surpassed 32,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
