We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now part of Red Hat) will be hosting the event on March 11, 2025 at the Google offices in Cambridge, Massachusetts.
This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!
Bi-weekly vLLM Office Hours
Upcoming
Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT

Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 enhances flexibility, scalability, and performance while maintaining seamless compatibility. We'll take a deep dive into the key design improvements, state-of-the-art performance gains, and our roadmap for making V1 the default engine.
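If you want to kick the tires before the session, here is a minimal sketch of opting in to the V1 alpha. At release, V1 was enabled via the VLLM_USE_V1=1 environment variable, and the existing Python API is unchanged; the model ID below is illustrative.

```python
import os

# Opt in to the V1 engine (alpha); set this before importing vllm.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Illustrative model ID; any supported checkpoint works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is new in vLLM V1?"], params)
print(outputs[0].outputs[0].text)
```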
vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT
Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.
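As a taste of what the stack exposes, here is a hedged sketch of querying a deployed production stack through its router, which serves an OpenAI-compatible API. The base URL and served model name are placeholder assumptions for your own deployment.

```python
from openai import OpenAI

# Assumption: the production-stack router is reachable at this address
# (e.g. via a port-forward to the router service in your cluster).
client = OpenAI(
    base_url="http://localhost:30080/v1",  # placeholder router endpoint
    api_key="EMPTY",                       # vLLM accepts any key by default
)

# The router forwards the request to a backend vLLM replica; with the
# prefix-aware router, requests sharing a prefix land on the same replica
# to maximize KV-cache reuse.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder served model
    messages=[{"role": "user", "content": "Summarize KV cache offloading."}],
)
print(resp.choices[0].message.content)
```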
Recent Recordings
Multimodal LLMs With vLLM V1 | Slides
Distributed Inference with vLLM | Slides | Blog
Blogs
Introducing vLLM Inference Provider in Llama Stack
Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs, with implementations from a broad set of service providers.
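As a rough sketch of what this looks like in practice, assuming a Llama Stack server configured with the remote vLLM inference provider (the base URL and model ID below are placeholders for your setup):

```python
from llama_stack_client import LlamaStackClient

# Assumption: a Llama Stack server backed by the vLLM provider is
# running locally; adjust the URL and model ID for your deployment.
client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Hello from Llama Stack on vLLM!"}],
)
print(response.completion_message.content)
```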
Introducing Compressed Granite 3.1: Powerful Performance in a Small Package
Our new compressed Granite 3.1 models are designed for enterprise deployments, achieving 3.3X smaller models, up to 2.8X better performance, and 99% accuracy recovery. Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.
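Since the checkpoints are deployment-ready with vLLM, loading one is a one-liner. A minimal sketch, assuming a quantized Granite checkpoint on Hugging Face; the exact model ID below is illustrative:

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint name; see the published compressed Granite 3.1
# collection on Hugging Face for the actual model IDs and recipes.
llm = LLM(model="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16")

out = llm.generate(
    ["Write a one-line summary of model compression."],
    SamplingParams(max_tokens=48),
)
print(out[0].outputs[0].text)
```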
How Well Do Quantized Models Handle Long-Context Tasks?
We evaluated quantized Llama 3.1 models at sequence lengths up to 128K and found 99%+ accuracy recovery for most quantization formats. See the blog for the full details.
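If you want to probe long-context behavior yourself, here is a minimal sketch of running a long prompt through a quantized checkpoint in vLLM. The model ID is illustrative, and the 128K context window assumes sufficient GPU memory.

```python
from vllm import LLM, SamplingParams

# Illustrative quantized Llama 3.1 checkpoint; max_model_len=131072
# enables the full 128K context, memory permitting.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
    max_model_len=131072,
)

long_prompt = "..."  # e.g. a long document with a buried question
result = llm.generate([long_prompt], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)
```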
Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM
The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.
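For reference, a sketch of serving a DeepSeek model where these kernels apply. Recent vLLM versions select MLA and the block-FP8 path automatically for this architecture; the tensor-parallel size and model choice below are assumptions for your hardware.

```python
from vllm import LLM, SamplingParams

# DeepSeek-V3/R1 checkpoints ship with block-wise FP8 weights; vLLM's
# CUTLASS block-FP8 kernels and MLA attention are used automatically
# for this architecture on supported GPUs.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumption: a node with enough GPUs
    tensor_parallel_size=8,
    trust_remote_code=True,
)

out = llm.generate(
    ["Explain multi-head latent attention briefly."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```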
Multimodal Model Quantization Support Through LLM Compressor
LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.
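A hedged sketch of what this enables, using the oneshot API as of v0.4.0. The model ID, ignore patterns, and calibration dataset below are illustrative assumptions; vision-tower module names differ per model, and the multimodal examples in the LLM Compressor repo show model-specific calibration setups.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Illustrative vision-language model; audio models follow the same pattern.
MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

# Quantize the language decoder's Linear layers to W4A16, keeping the
# lm_head and vision tower in full precision (the regex is model-specific).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:visual.*"],
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    dataset="open_platypus",        # assumption: a small calibration set
    num_calibration_samples=256,
    max_seq_length=2048,
    output_dir="Qwen2-VL-7B-Instruct-W4A16",
)
```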
Research From Our Labs 🧪
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here
Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here
Activation-Informed Merging of Large Language Models | Read Here
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here
Events
- vLLM x Meta Meetup | February 27, 2025 | Register Here
- East Coast vLLM Meetup | March 11, 2025 | Register Here
Stay engaged with the vLLM community
vLLM is nearing 39,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.
About the author
Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.