Welcome to Neural Magic's monthly vLLM roundup! We are excited to announce that we have signed an agreement to be acquired by Red Hat. Joining forces with the industry's open source leader will enable us to bring our cutting-edge AI model optimization and accelerated inference technology to a worldwide audience of enterprises adopting open LLM capabilities.
Keep scrolling for exciting vLLM updates and opportunities to engage with the community!
Bi-Weekly vLLM Office Hours
Recent Recordings
vLLM Project Update: 2024 Retrospective and 2025 Roadmap | Watch Now
Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs | Watch Now
Disaggregated Prefill and KV Cache Storage in vLLM | Watch Now
SOTA Tool-Calling Implementation in vLLM | Watch Now
Take Your AI Performance to the Next Level
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
Large language models (LLMs) are approaching their limits in terms of traditional scaling, with billions of parameters added for relatively small accuracy gains and advanced quantization techniques squeezing out the last possible bits before accuracy plummets.
We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found
Quantizing models to lower precision formats, such as 8-bit or 4-bit, significantly reduces computational costs and accelerates inference.
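To make the idea concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is an illustration only, not the method used in the evaluations; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto 127 (assumed scheme).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight is stored as a small integer plus one shared scale, which is where the memory and bandwidth savings come from. The cost is a rounding error of at most half a quantization step per value.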
Introducing Machete, a Mixed-Input GEMM Kernel Optimized for NVIDIA Hopper GPUs
Mixed-input quantization is a technique that processes weights and activations at different precisions in neural networks.
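A common instance is 4-bit weights multiplied by 16-bit activations. The sketch below shows the arithmetic in plain Python, assuming symmetric per-row 4-bit weights and float activations. It is not the Machete kernel, which is a GPU implementation; the names `quantize_int4` and `mixed_input_matvec` and the sample data are hypothetical.

```python
def quantize_int4(row):
    """Symmetric per-row quantization of floats to integers in [-7, 7]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [max(-7, min(7, round(w / scale))) for w in row], scale


def mixed_input_matvec(q_rows, scales, activations):
    """Multiply low-precision weights by higher-precision activations.

    Each stored integer weight is upconverted (q * scale) right before the
    multiply-accumulate, so the activations are never quantized.
    """
    out = []
    for q_row, scale in zip(q_rows, scales):
        acc = 0.0
        for q, a in zip(q_row, activations):
            acc += (q * scale) * a
        out.append(acc)
    return out


# Hypothetical 2x3 weight matrix and activation vector.
weights = [[0.7, -0.3, 0.0], [0.1, 0.2, -0.7]]
activations = [1.0, 2.0, 3.0]

packed = [quantize_int4(row) for row in weights]
q_rows = [q for q, _ in packed]
scales = [s for _, s in packed]

result = mixed_input_matvec(q_rows, scales, activations)
print(q_rows, result)
```

The weights stay compressed in memory and are expanded only at compute time. An optimized kernel would do that expansion on the GPU rather than in a Python loop.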
Research From Our Labs 🧪
1️⃣ "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization | Read Here
2️⃣ PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression | Read Here
3️⃣ QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Read Here
4️⃣ The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information | Read Here
5️⃣ MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | Read Here
vLLM has surpassed 32,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.
About the author
Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he served as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.