
We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now part of Red Hat) will be hosting the event on March 11, 2025 at the Google offices in Cambridge, Massachusetts.

This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!

Learn more and register here!

Bi-weekly vLLM Office Hours

Upcoming

Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT


Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 improves flexibility, scalability, and performance while maintaining seamless compatibility. We'll take a deep dive into key design improvements, state-of-the-art performance gains, and the roadmap for making V1 the default engine.

Register Here

vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT

Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.

Register Here

Recent Recordings

Multimodal LLMs With vLLM V1 | Slides

Distributed Inference with vLLM | Slides | Blog

Blogs

Introducing vLLM Inference Provider in Llama Stack

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. These building blocks are exposed as interoperable APIs, with a broad range of service providers supplying implementations.

Keep Reading


Introducing Compressed Granite 3.1: Powerful Performance in a Small Package

Our new compressed Granite 3.1 models are designed for enterprise deployments, achieving 3.3X smaller models, up to 2.8X better performance, and 99% accuracy recovery. Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.

Keep Reading


How Well Do Quantized Models Handle Long-Context Tasks?

We evaluated quantized Llama 3.1 models up to 128k sequence length and found 99%+ accuracy recovery for most quantization formats. See the details in the blog.

Keep Reading


Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM

The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.

Keep Reading


Multimodal Model Quantization Support Through LLM Compressor

LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.

Keep Reading

Research From Our Labs 🧪

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here

Activation-Informed Merging of Large Language Models | Read Here

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here

Events

Stay engaged with the vLLM community

vLLM is nearing 39,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

Resource

Introduction to artificial intelligence for business: A beginner's guide

Accelerate your AI adoption journey with Red Hat OpenShift AI and Red Hat Enterprise Linux AI. Learn more in this beginner's guide.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.

Read full bio
