We’ve heard all the love for the vLLM meetups and we’re excited to announce that the next one is happening in New York City on May 7th! We’ll be hosting it at IBM One Madison Avenue, and we can’t wait to see you there. This is your heads-up to mark your calendars! As a subscriber to our newsletter, you’ll get first access to the registration page before it goes public. Keep an eye out for another email next week with all the details and the signup link.

We’re also planning to bring vLLM meetups to more cities and we want your input. Where should we go next? Let us know!

Submit a city

Bi-weekly vLLM office hours

Upcoming

vLLM Office Hours #23: Deep Dive Into the LLM Compressor

April 10, 2025 - 2:00PM ET / 11:00AM PT

Register here

vLLM Office Hours #24: Performance Optimization of vLLM on Google TPUs

April 24, 2025 - 2:00PM ET / 11:00AM PT

Register here 

Recordings you don’t want to miss

Introduction to vLLM V1 | Video | Slides | Blog

DeepSeek and vLLM | Video | Slides | Blog

Multimodal LLMs with vLLM V1 | Video | Slides | Blog

View all recordings

Blog highlights

Meet vLLM: For faster, more efficient LLM inference and serving

Have you ever wondered how AI-powered applications like chatbots and code assistants respond so quickly? Or perhaps you’ve experienced the frustration of waiting for a large language model (LLM) to generate a response, wondering what’s taking so long. Behind the scenes, there’s an open source project aimed at making inference, the process of getting responses from a model, more efficient.
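Want to see what that efficiency looks like in practice? Here’s a minimal sketch of offline inference with vLLM’s Python API; the model and sampling settings are just illustrative choices for a quick local test.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model and sampling settings below are illustrative, not recommendations.
from vllm import LLM, SamplingParams

prompts = ["Why do large language models feel slow to respond?"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Load a small model so the example runs quickly on modest hardware.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```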

Keep reading

3.5x faster vision-language models with quantization

Vision-language models (VLMs), such as the Pixtral and Qwen-VL series, are trained to generate text from combined image and text inputs. By pairing these expanded input types with the capabilities of large language models, they enable promising new use cases such as content moderation, image captioning and tagging, visual question answering, and document extraction and analysis, among others.
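To give a flavor of what serving a VLM with vLLM looks like, here’s a small sketch of image-plus-text generation. The LLaVA-style model ID and prompt template are assumptions for illustration (prompt formats vary by model), and a quantized checkpoint can be swapped in the same way.

```python
# A small multimodal inference sketch with vLLM; the model ID and prompt
# template are illustrative assumptions (prompt formats vary by model).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # any local image file

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image.\nASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```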

Keep reading

vLLM V1: Accelerating multimodal inference for large language models

In this article, we dive into the innovations behind vLLM V1 (V1 Alpha), which addresses the challenges of multimodal inference encountered in V0. We’ll explore the design decisions that enhance performance, from encoder caching to optimized data processing, and share benchmark results that highlight the improvements. Finally, we’ll outline our vision for future work to further push the boundaries of efficient, scalable AI.

Keep reading

How we optimized vLLM for DeepSeek-R1

DeepSeek and vLLM optimizations have been a top priority for our team and the vLLM community as a whole, and we are excited to share a deep dive into our work. In this article, we will cover the key inference improvements we have made, detail the integration of DeepSeek’s latest advancements into vLLM, and discuss how we are scaling DeepSeek-R1 for real-world deployment. Additionally, we will review the various open source contributions from DeepSeek and outline our roadmap for integrating them into vLLM.

Keep reading

Multimodal model quantization support through LLM Compressor

LLM Compressor is a unified library for optimizing models for deployment with vLLM. As of its 0.4.0 release, LLM Compressor now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.
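As a rough sketch of what a one-shot quantization run looks like with LLM Compressor (the model, dataset, and W4A16 scheme here are illustrative, and exact entry points may vary between releases):

```python
# A hedged sketch of one-shot W4A16 quantization with LLM Compressor.
# Model, dataset, and hyperparameters are illustrative; check the release
# notes for the exact APIs available in your version.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 4-bit weights, keeping the output head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-Chat-W4A16",  # the saved checkpoint loads in vLLM like any other model
)
```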

Keep reading

Unleash the full potential of LLMs: Optimize for performance with vLLM

Large language models (LLMs) are transforming industries, from customer service to cutting-edge applications, unlocking vast opportunities for innovation. Yet, their potential comes with a catch: high computational costs and complexity.

Keep reading

Research from our labs

We recently launched AI Research Hub, a destination for all research from Red Hat and Neural Magic labs. We plan to post all our research papers, research blogs, and accompanying code to this new location, so please bookmark it! Here are three papers we are currently featuring on the new page:

  • A probabilistic inference approach to inference-time scaling of LLMs using particle-based Monte Carlo methods
    arXiv | Code
  • GPTQ: Accurate post-training quantization for generative pre-trained transformers
    arXiv | Code
  • Unveiling the secret recipe: A guide for supervised fine-tuning small LLMs
    arXiv

Want to join our Friday discussions on cutting-edge AI research? Reply to this email and let us know!

Stay engaged with the vLLM community

vLLM is nearing 44,000 stars! Be sure to add your star and join the community. Thank you for your support.

Resources

Getting started with AI for the enterprise: A beginner’s guide

In this beginner’s guide, learn how Red Hat OpenShift AI and Red Hat Enterprise Linux AI can accelerate your AI adoption journey.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he served as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
