We’ve heard all the love for the vLLM meetups and we’re excited to announce that the next one is happening in New York City on May 7th! We’ll be hosting it at IBM One Madison Avenue, and we can’t wait to see you there. This is your heads-up to mark your calendars! As a subscriber to our newsletter, you’ll get first access to the registration page before it goes public. Keep an eye out for another email next week with all the details and the signup link.
We’re also planning to bring vLLM meetups to more cities and we want your input. Where should we go next? Let us know!
Bi-weekly vLLM office hours
Upcoming
vLLM Office Hours #23: Deep Dive Into the LLM Compressor
April 10, 2025 - 2:00PM ET / 11:00AM PT
vLLM Office Hours #24: Performance Optimization of vLLM on Google TPUs
April 14, 2025 - 2:00PM ET / 11:00AM PT
Recordings you don’t want to miss
Introduction to vLLM V1 | Video | Slides | Blog
DeepSeek and vLLM | Video | Slides | Blog
Multimodal LLMs With vLLM V1 | Video | Slides | Blog
Blog highlights
Meet vLLM: For faster, more efficient LLM inference and serving
Have you ever wondered how AI-powered applications like chatbots, code assistants and more respond so quickly? Or perhaps you’ve experienced the frustration of waiting for a large language model (LLM) to generate a response, wondering what’s taking so long. Well, behind the scenes, there’s an open source project aimed at making inference, or responses from models, more efficient.
3.5X Faster vision-language models with quantization
Vision-Language Models (VLMs), such as the Pixtral and Qwen-VL series, are trained to generate text from image and text inputs. With the expanded input types and the performance of large language models, they enable accurate and promising new use cases such as content moderation, image captioning and tagging, visual question answering, and document extraction/analysis, among others.
vLLM V1: Accelerating multimodal inference for large language models
In this article, we dive into the innovations behind vLLM V1 (V1 Alpha), which addresses the challenges of multimodal inference encountered in V0. We’ll explore the design decisions that enhance performance, from encoder caching to optimized data processing, and share benchmark results that highlight the improvements. Finally, we’ll outline our vision for future work to further push the boundaries of efficient, scalable AI.
How we optimized vLLM for DeepSeek-R1
DeepSeek and vLLM optimizations have been a top priority for our team and the vLLM community as a whole, and we are excited to share a deep dive into our work. In this article, we will cover the key inference improvements we have made, detail the integration of DeepSeek’s latest advancements into vLLM, and discuss how we are scaling DeepSeek-R1 for real-world deployment. Additionally, we will review the various open source contributions from DeepSeek and outline our roadmap for integrating them into vLLM.
Multimodal model quantization support through LLM Compressor
LLM Compressor is a unified library for optimizing models for deployment with vLLM. As of its 0.4.0 release, LLM Compressor now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.
Unleash the full potential of LLMs: Optimize for performance with vLLM
Large language models (LLMs) are transforming industries, from customer service to cutting-edge applications, unlocking vast opportunities for innovation. Yet, their potential comes with a catch: high computational costs and complexity.
Research from our labs
We recently launched AI Research Hub, a destination for all research from Red Hat and Neural Magic labs. We plan to post all our research papers, research blogs, and accompanying code to this new location, so please bookmark it! Here are three papers we are currently featuring on the new page:
- A probabilistic inference approach to inference-time scaling of LLMs using particle-based Monte Carlo methods
arXiv | Code
- GPTQ: Accurate post-training quantization for generative pre-trained transformers
arXiv | Code
- Unveiling the secret recipe: A guide for supervised fine-tuning small LLMs
arXiv
Want to join our Friday discussions on cutting-edge AI research? Reply to this email and let us know!
Stay engaged with the vLLM community
vLLM is nearing 44,000 stars! Be sure to add your star and join the community. Thank you for your support.
About the author
Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.