We hope you’re as excited as we are for the first-ever East Coast vLLM meetup! Neural Magic (now a part of Red Hat) will be hosting the event on March 11, 2025, at the Google offices in Cambridge, Massachusetts.

This is a time to connect with a growing community of vLLM users, developers, maintainers, and engineers from leading companies! Whether you’re a seasoned expert or new to the field, come join us as we dive into exciting technical talks, exchange insights, and discuss the latest innovations in optimizing LLM inference for both performance and efficiency. We hope to see you there!

Learn more and register here!

Bi-weekly vLLM Office Hours

Upcoming

Exploring vLLM V1 Alpha | February 27, 2025 - 2:00PM ET / 11:00AM PT

Join Robert Shaw, a vLLM core committer and Director of Engineering at Red Hat, as he dives into the alpha release of vLLM V1, a transformative upgrade to vLLM’s architecture. Built on 1.5 years of insights, V1 improves flexibility, scalability, and performance while maintaining seamless compatibility. We’ll take a deep dive into the key design improvements, state-of-the-art performance gains, and the roadmap for making V1 the default engine.

Register Here
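
If you want to try the alpha ahead of the session: per the V1 alpha announcement, the new engine is opt-in via the VLLM_USE_V1 environment variable, and the Python API is otherwise unchanged. A minimal sketch, assuming a recent vLLM install (the model name is just an example):

```python
# Opt into the V1 alpha engine. The flag must be set before vllm is imported.
import os
os.environ["VLLM_USE_V1"] = "1"  # unset (or "0") to fall back to the V0 engine

from vllm import LLM, SamplingParams

# Any model supported by vLLM works; this one is only an example.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["In one sentence, what changes in vLLM V1?"], params)
print(outputs[0].outputs[0].text)
```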

vLLM Production Stack Deep Dive | March 6, 2025 - 2:00PM ET / 11:00AM PT

Join us for an overview of the components in the vLLM Production Stack (https://github.com/vllm-project/production-stack) and practical guidance on deploying it effectively. We’ll dive into the technical details, including an in-depth look at the prefix-aware router and its role in optimizing request routing, as well as KV cache offloading and its impact on performance and scalability.

Register Here
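
As a teaser for the routing discussion, here is a toy sketch of the idea behind prefix-aware routing: requests whose prompts share a prefix land on the same replica, so that replica’s KV cache for the shared prefix can be reused. This illustrates the concept only; it is not the production-stack router, and the replica endpoints are hypothetical:

```python
# Toy prefix-aware router: hash the leading chunk of the prompt so requests
# sharing a prefix (e.g. the same system prompt) map to the same replica.
import hashlib

REPLICAS = ["http://vllm-0:8000", "http://vllm-1:8000"]  # hypothetical endpoints

def route(prompt: str, prefix_chars: int = 256) -> str:
    # Only the prefix participates in the hash, so two requests that diverge
    # later in the prompt still route to the same replica and reuse its cache.
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

print(route("You are a helpful assistant. Question: how does KV cache reuse work?"))
```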

Recent Recordings

Multimodal LLMs With vLLM V1 | Slides

Distributed Inference with vLLM | Slides | Blog

Blogs

Introducing vLLM Inference Provider in Llama Stack

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs, with a broad set of service providers supplying their implementations.

Keep Reading


Introducing Compressed Granite 3.1: Powerful Performance in a Small Package

Our new compressed Granite 3.1 models are designed for enterprise deployments, delivering 3.3X smaller models with up to 2.8X better performance and 99% accuracy recovery. The models and recipes are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.

Keep Reading


How Well Do Quantized Models Handle Long-Context Tasks?

We evaluated quantized Llama 3.1 models up to 128k sequence length and found 99%+ accuracy recovery for most quantization formats. See the details in the blog.

Keep Reading


Enhancing DeepSeek Models with MLA and FP8 Optimizations in vLLM

The vLLM community has rolled out its latest batch of enhancements to DeepSeek models, including support for MLA (Multi-Head Latent Attention) and optimized CUTLASS Block FP8 kernels. These improvements increase both generation throughput and memory efficiency, making long-context inference more scalable and cost-effective.

Keep Reading
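
In practice the serving code does not change: vLLM picks up MLA and the FP8 kernels from the model’s configuration rather than through special flags. A minimal sketch, assuming a recent vLLM build and a supported DeepSeek checkpoint:

```python
# Load a DeepSeek model; MLA (and, for FP8 checkpoints such as the larger
# V3/R1 releases, the block-FP8 kernels) is selected automatically.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # a small MLA model, used here as an example
    trust_remote_code=True,  # DeepSeek configs/tokenizers may require this
)

outputs = llm.generate(
    ["Briefly, how does multi-head latent attention shrink the KV cache?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```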


Multimodal Model Quantization Support Through LLM Compressor

LLM Compressor (v0.4.0) now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.

Keep Reading
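
For a flavor of the workflow, here is a minimal sketch of data-free FP8 quantization of a vision-language model with LLM Compressor’s oneshot API; the model choice and the ignore patterns are illustrative assumptions, not a prescribed recipe:

```python
# Quantize a vision-language model to FP8 with LLM Compressor (v0.4.0+).
# FP8_DYNAMIC is data-free, so no calibration dataset is needed.
from transformers import MllamaForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # example model
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",  # FP8 weights, dynamic per-token activation scales
    # Keep the output head, projector, and vision tower in higher precision.
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

oneshot(model=model, recipe=recipe, output_dir=model_id.split("/")[-1] + "-FP8-Dynamic")
```

The resulting directory can then be loaded and served directly with vLLM.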

Research From Our Labs 🧪

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Read Here

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning | Read Here

Activation-Informed Merging of Large Language Models | Read Here

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Read Here

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | Read Here

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | Read Here

Events

Stay engaged with the vLLM community

vLLM is nearing 39,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.

Read full bio