Welcome to Neural Magic's monthly vLLM roundup! We are excited to announce our agreement to be acquired by Red Hat. Joining forces with the industry's open source leader will enable us to bring our cutting-edge AI model optimization and accelerated inference technology to a worldwide audience of enterprises adopting open LLM capabilities.

Keep scrolling for exciting vLLM updates and opportunities to engage with the community!

Bi-Weekly vLLM Office Hours

Recent Recordings

vLLM Project Update: 2024 Retrospective and 2025 Roadmap | Watch Now

Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs | Watch Now

Disaggregated Prefill and KV Cache Storage in vLLM | Watch Now

SOTA Tool-Calling Implementation in vLLM | Watch Now

Take Your AI Performance to the Next Level

2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets.
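As a rough illustration of the 2:4 ("two out of four") pattern the post's title refers to, here is a minimal NumPy sketch (not Neural Magic's actual pruning pipeline) that keeps the two largest-magnitude weights in every contiguous group of four and zeroes the rest:

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Enforce a 2:4 sparsity pattern along the last axis (toy example)."""
    flat = weights.reshape(-1, 4)          # groups of four consecutive weights
    pruned = np.zeros_like(flat)
    # indices of the two largest-magnitude entries in each group
    keep = np.argsort(np.abs(flat), axis=1)[:, -2:]
    rows = np.arange(flat.shape[0])[:, None]
    pruned[rows, keep] = flat[rows, keep]  # copy survivors, leave zeros elsewhere
    return pruned.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.02, 0.6])
print(prune_2_of_4(w))  # exactly two nonzeros survive in each group of four
```

This fixed pattern is what lets NVIDIA GPUs accelerate the sparse matrix multiply in hardware, rather than relying on unstructured sparsity.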

Keep Reading

We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference.
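For readers new to the idea, here is a minimal sketch of symmetric per-tensor 8-bit quantization (an illustrative recipe, not necessarily the exact scheme used in the evaluations): floats are mapped to int8 with a single scale, then dequantized to measure the error introduced.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization (toy example)."""
    scale = np.abs(x).max() / 127.0                 # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = q.astype(np.float32) * scale                # dequantize for comparison
print("max abs error:", np.abs(x - x_hat).max())    # bounded by scale / 2
```

Storing int8 instead of float32 cuts the memory footprint 4x; the half-million evaluations in the post quantify how much (or how little) accuracy this actually costs across models and formats.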

Keep Reading

Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks.
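To make the data flow concrete, here is a toy NumPy sketch of a mixed-input matrix multiply (the real Machete kernel fuses dequantization into a Hopper tensor-core GEMM; this only shows the arithmetic): int4-range weights are upcast and rescaled to the activation precision immediately before the multiply.

```python
import numpy as np

def mixed_input_matmul(activations: np.ndarray, w_q: np.ndarray, scale: float):
    """Toy mixed-input GEMM: low-precision weights, higher-precision activations."""
    w = w_q.astype(activations.dtype) * scale  # dequantize weights on the fly
    return activations @ w

A = np.random.randn(2, 8).astype(np.float32)      # activations kept in full precision
W = np.random.randn(8, 4).astype(np.float32)
scale = np.abs(W).max() / 7.0                     # symmetric int4 range is [-8, 7]
W_q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)

out = mixed_input_matmul(A, W_q, scale)           # shape (2, 4)
```

Keeping weights in 4 bits shrinks memory traffic, while doing the multiply at activation precision preserves accuracy; Machete's contribution is doing this conversion inside the kernel at tensor-core speed.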

Keep Reading

Research From Our Labs 🧪

1️⃣ "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization | Read Here

2️⃣ PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression | Read Here

3️⃣ QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Read Here

4️⃣ The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information | Read Here

5️⃣ MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | Read Here

vLLM has surpassed 32,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the Author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he served as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
