
We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now part of Red Hat) will host the event on March 11, 2025, at the Google offices in Cambridge, Massachusetts.

This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!

Learn more and register here!

Bi-weekly vLLM Office Hours

Upcoming

Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT

Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 improves flexibility, scalability, and performance while maintaining seamless compatibility. We’ll take a deep dive into key design improvements, state-of-the-art performance gains, and our roadmap for making V1 the default engine. (A quick-start sketch for trying the alpha follows below.)

Register Here
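
If you’d like to poke at the alpha before the session, here is a minimal sketch. It assumes a recent vLLM release (0.7.x), where the V1 engine is opt-in via the VLLM_USE_V1 environment variable; the model id is just a small placeholder.

```python
# Minimal sketch for trying the vLLM V1 alpha. Assumes vLLM 0.7.x,
# where V1 is opt-in via the VLLM_USE_V1 environment variable.
import os
os.environ["VLLM_USE_V1"] = "1"  # set before initializing the engine

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model id
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does vLLM V1 change?"], params)
print(outputs[0].outputs[0].text)
```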

vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT

Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability. (A toy routing sketch follows below.)

Register Here
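
To give a flavor of what prefix-aware routing does, here is a toy sketch with hypothetical names, not the production-stack implementation: requests that share a prompt prefix, such as a common system prompt, are pinned to the same vLLM replica so its KV cache for that prefix can be reused.

```python
# Toy illustration of prefix-aware routing (hypothetical, not the
# production-stack code): hash the leading chunk of the prompt so that
# requests sharing a prefix land on the same vLLM replica, letting that
# replica reuse its KV cache for the shared tokens.
import hashlib

REPLICAS = ["http://vllm-0:8000", "http://vllm-1:8000", "http://vllm-2:8000"]

def route(prompt: str, prefix_len: int = 64) -> str:
    prefix = prompt[:prefix_len].encode("utf-8")
    bucket = int(hashlib.sha256(prefix).hexdigest(), 16) % len(REPLICAS)
    return REPLICAS[bucket]

# Two requests with the same (>= 64-char) system prompt hit one replica.
sys_prompt = "You are a helpful assistant. Answer concisely and cite sources.\n\n"
assert route(sys_prompt + "Question A") == route(sys_prompt + "Question B")
```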

Recent Recordings

Multimodal LLMs With vLLM V1 | Slides

Distributed Inference with vLLM | Slides | Blog

Blogs

Introducing vLLM Inference Provider in Llama Stack

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs, with a broad set of Service Providers supplying their implementations.

Keep Reading


Introducing Compressed Granite 3.1: Powerful Performance in a Small Package

Our new compressed Granite 3.1 models are designed for enterprise deployments, achieving 3.3X smaller models, up to 2.8X better performance, and 99% accuracy recovery. Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.

Keep Reading
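
Serving one of the compressed checkpoints with vLLM is a one-liner. A minimal sketch, assuming a recent vLLM install; the model id below is illustrative, so check the Hugging Face collection for the published names:

```python
from vllm import LLM, SamplingParams

# Illustrative model id; see the Hugging Face collection for the
# actual published compressed Granite 3.1 checkpoints.
llm = LLM(model="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16")
out = llm.generate(["Summarize vLLM in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```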


How Well Do Quantized Models Handle Long-Context Tasks?

We evaluated quantized Llama 3.1 models up to 128k sequence length and found 99%+ accuracy recovery for most quantization formats. See the details in the blog.

Keep Reading


Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM

The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.

Keep Reading
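
A minimal sketch of what picking these up looks like, assuming a recent vLLM build, which selects MLA and the block-FP8 kernels automatically for supported DeepSeek checkpoints; the parallelism and context settings below are illustrative:

```python
from vllm import LLM, SamplingParams

# Sketch: on a recent vLLM build, MLA and the CUTLASS block-FP8 kernels
# are selected automatically for supported DeepSeek checkpoints.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # ships FP8 weights
    tensor_parallel_size=8,           # illustrative; size to your GPUs
    trust_remote_code=True,
    max_model_len=32768,              # long context; adjust to fit memory
)
out = llm.generate(["Explain Multi-Head Latent Attention in two sentences."],
                   SamplingParams(max_tokens=96))
print(out[0].outputs[0].text)
```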


Multimodal Model Quantization Support Through LLM Compressor

LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.

Keep Reading
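
A minimal one-shot quantization sketch using LLM Compressor’s oneshot API, as of v0.4.0; the model id, calibration dataset, and ignore patterns below are illustrative:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Illustrative sketch: quantize a vision-language model's linear layers
# to W4A16, keeping the vision tower and lm_head in full precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*visual.*"],  # illustrative ignore patterns
)

oneshot(
    model="Qwen/Qwen2-VL-2B-Instruct",  # illustrative model id
    dataset="flickr30k",                 # illustrative calibration data
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen2-VL-2B-Instruct-W4A16",
)
```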

Research From Our Labs 🧪

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here

Activation-Informed Merging of Large Language Models | Read Here

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here

Events

Stay engaged with the vLLM community

vLLM is nearing 39,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
