
We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now a part of Red Hat) will host the event on March 11, 2025, at the Google offices in Cambridge, Massachusetts.

This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!

Learn more and register here!

Bi-weekly vLLM Office Hours

Upcoming

Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT


Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 enhances flexibility, scalability, and performance while maintaining seamless compatibility. We'll take a deep dive into key design improvements, state-of-the-art performance gains, and the roadmap for making V1 the default engine.
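
If you'd like to try it before the session, here is a minimal sketch of opting into the V1 alpha, assuming a recent vLLM release where the alpha is gated behind the VLLM_USE_V1 environment variable (the model name is just an example):

    import os
    os.environ["VLLM_USE_V1"] = "1"  # opt in to the V1 alpha engine; set before importing vllm

    from vllm import LLM, SamplingParams

    # The familiar offline-inference API is unchanged under V1.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["What does vLLM V1 change under the hood?"], params)
    print(outputs[0].outputs[0].text)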

Register Here

vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT

Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.
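
Because the router exposes an OpenAI-compatible endpoint, existing clients work unchanged. Here is a minimal sketch of sending a request through the stack, assuming the router is reachable at localhost:30080 (the URL and model name are placeholders for your deployment):

    from openai import OpenAI

    # Point the standard OpenAI client at the production-stack router;
    # the prefix-aware router picks a vLLM replica for each request.
    client = OpenAI(base_url="http://localhost:30080/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model your stack serves
        messages=[{"role": "user", "content": "Explain KV cache offloading in one sentence."}],
    )
    print(resp.choices[0].message.content)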

Register Here

Recent Recordings

Multimodal LLMs With vLLM V1 | Slides

Distributed Inference with vLLM | Slides | Blog

Blogs

Introducing vLLM Inference Provider in Llama Stack

Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs, with a broad range of service providers supplying their implementations.
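
As a rough sketch of what consuming the vLLM provider looks like from the client side, assuming the llama-stack-client Python package and a Llama Stack server already configured with the remote vLLM provider (the URL, port, and model ID are placeholders, and method names may differ across releases):

    from llama_stack_client import LlamaStackClient

    # Talk to a Llama Stack server whose inference API is backed by vLLM.
    client = LlamaStackClient(base_url="http://localhost:8321")

    response = client.inference.chat_completion(
        model_id="meta-llama/Llama-3.1-8B-Instruct",  # model registered with the vLLM provider
        messages=[{"role": "user", "content": "What is Llama Stack?"}],
    )
    print(response.completion_message.content)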

Keep Reading


Introducing Compressed Granite 3.1: Powerful Performance in a Small Package

Our new compressed Granite 3.1 models are designed for enterprise deployments: they are 3.3X smaller, deliver up to 2.8X better performance, and achieve 99% accuracy recovery. Models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor, as sketched below.
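
To give a flavor of that extensibility, here is a minimal one-shot quantization sketch with LLM Compressor; the recipe, calibration dataset, and model ID are illustrative and not the exact recipe behind the Granite releases:

    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.transformers import oneshot

    # Illustrative W4A16 recipe: GPTQ-quantize all Linear layers except the LM head.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model="ibm-granite/granite-3.1-8b-instruct",  # example model ID
        dataset="open_platypus",                      # small calibration dataset
        recipe=recipe,
        output_dir="granite-3.1-8b-instruct-W4A16",
        max_seq_length=2048,
        num_calibration_samples=512,
    )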

Keep Reading


How Well Do Quantized Models Handle Long-Context Tasks?

We evaluated quantized Llama 3.1 models up to 128k sequence length and found 99%+ accuracy recovery for most quantization formats. See the details in the blog.
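
The serving side of that setup is straightforward: a pre-quantized checkpoint loads into vLLM with an extended context window. A minimal sketch, with the model ID and context length as illustrative placeholders:

    from vllm import LLM

    # Load a pre-quantized checkpoint (example ID) with a ~128k-token window.
    llm = LLM(
        model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
        max_model_len=131072,
    )
    print(llm.generate(["<your long document here> ..."])[0].outputs[0].text)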

Keep Reading


Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM

The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.
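
These optimizations require no code changes on the user's side. Here is a minimal sketch of loading a DeepSeek checkpoint, assuming a multi-GPU node and a recent vLLM build (the model ID and parallelism degree are placeholders):

    from vllm import LLM, SamplingParams

    # vLLM selects the MLA attention path and the CUTLASS block-FP8 kernels
    # automatically when the checkpoint and hardware support them.
    llm = LLM(
        model="deepseek-ai/DeepSeek-V3",  # FP8 block-quantized checkpoint
        tensor_parallel_size=8,           # example: one 8-GPU node
        trust_remote_code=True,
    )
    print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)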

Keep Reading


Multimodal Model Quantization Support Through LLM Compressor

LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.

Keep Reading

Research From Our Labs 🧪

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here

Activation-Informed Merging of Large Language Models | Read Here

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here


Stay engaged with the vLLM community

vLLM is nearing 39,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.


About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.

Read full bio
