
We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now part of Red Hat) will host the event on March 11, 2025, at the Google offices in Cambridge, Massachusetts.

This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!

Learn more and register here!

Bi-weekly vLLM Office Hours

Upcoming

Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT

Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 enhances flexibility, scalability, and performance while maintaining seamless compatibility. We'll cover key design improvements, state-of-the-art performance gains, and our roadmap for making V1 the default engine.

Register Here

vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT

Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.

Register Here

Recent Recordings

Multimodal LLMs With vLLM V1 | Slides

Distributed Inference with vLLM | Slides | Blog

Blogs

Introducing vLLM Inference Provider in Llama Stack

Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.

Keep Reading


Introducing Compressed Granite 3.1: Powerful Performance in a Small Package

Our new compressed Granite 3.1 models are designed for enterprise deployments, achieving 3.3X smaller models, up to 2.8X better performance, and 99% accuracy recovery. Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.

Keep Reading


How Well Do Quantized Models Handle Long-Context Tasks?

We evaluated quantized Llama 3.1 models up to 128k sequence length and found 99%+ accuracy recovery for most quantization formats. See the details in the blog.

Keep Reading


Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM

The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.

Keep Reading


Multimodal Model Quantization Support Through LLM Compressor

LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.

Keep Reading

Research From Our Labs 🧪

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here

Activation-Informed Merging of Large Language Models | Read Here

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here

Events

Stay engaged with the vLLM community

vLLM is nearing 39,000 GitHub stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
