Welcome to our second edition of the monthly vLLM roundup! We are excited to continue sharing updates about the project, new features, and opportunities to engage with the vLLM community. Check out the December roundup here.

Keep reading for exciting updates, and please share this post with others who may benefit!

Upcoming bi-weekly vLLM Office Hours

Distributed inference with vLLM | January 23, 2025 - 2:00PM ET / 11:00AM PT
Join our upcoming vLLM Office Hours as we dive into distributed inference with vLLM. We'll explore common pitfalls, practical implementation strategies, and steps to get started, with insights tailored to real-world challenges like those discussed here.
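
If you would like to experiment before the session, here is a minimal sketch of multi-GPU (tensor-parallel) inference with vLLM's offline API; the model name and GPU count are placeholders to adapt to your hardware:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model's weights across 4 GPUs on one node;
# for multi-node deployments, vLLM also supports pipeline parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,                     # placeholder GPU count
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is distributed inference?"], params)
print(outputs[0].outputs[0].text)
```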

Recent recordings

vLLM’s 2024 wrapped and 2025 vision


vLLM v0.6.6 update & open discussion

Blog posts

Structured decoding in vLLM: A gentle introduction
vLLM is the high-throughput and efficient inference engine for running large language models (LLMs). In this post, we explore the annotated history of language models, describe the current state of structured decoding in vLLM and its recent integration with XGrammar, and share our tentative roadmap for future improvements. A short example of structured decoding in practice follows below.
Keep Reading
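
For a taste of what structured decoding looks like in practice, here is a hedged sketch using vLLM's offline API; GuidedDecodingParams is available in recent vLLM releases, and the model name is a placeholder:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# JSON schema the generated output must conform to.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(
    max_tokens=100,
    # Constrain generation so the output always parses as schema-valid JSON;
    # XGrammar is one of the backends vLLM can use for this.
    guided_decoding=GuidedDecodingParams(json=schema),
)

outputs = llm.generate(["Give me a JSON profile of a fictional person."], params)
print(outputs[0].outputs[0].text)
```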

vLLM 2024 retrospective and 2025 vision
The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine into the de facto serving solution for the open-source AI ecosystem. Celebrate vLLM's 2024 achievements and get a sneak peek into the 2025 roadmap.
Keep Reading

Installing and Developing vLLM with Ease
The field of LLM inference is advancing at an unprecedented pace. With new models and features emerging weekly, the traditional software release pipeline often struggles to keep up. With vLLM, we aim to provide more than just a software package. We are building a dynamic ecosystem that adapts to this rapid evolution, offering developers the tools, documentation, and community support they need to stay ahead.
Keep Reading
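
As a companion to the post above, here is a minimal smoke test you can run after installing vLLM (for example, with `pip install vllm`); the tiny model keeps the download and memory footprint small:

```python
from vllm import LLM, SamplingParams

# A small model so the test runs quickly on modest hardware.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, vLLM!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```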

2:4 Sparse Llama FP8: SOTA Performance for NVIDIA Hopper GPUs
Advancing AI efficiency is more critical than ever, and sparsity has proven to be a cornerstone in this pursuit. Building on our previous work at Neural Magic with the 2:4 Sparse Llama 3.1 8B foundation model, which improves efficiency by eliminating unnecessary parameters while preserving accuracy, we are excited to introduce the next step forward: sparse 8-bit floating point (FP8) models and the associated high-performance kernels for vLLM. A sketch of serving such a checkpoint with vLLM follows below.
Keep Reading
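
Since sparse FP8 checkpoints ship with their quantization and sparsity configuration embedded, serving one in vLLM looks the same as serving any other model. A hedged sketch, with a placeholder model ID:

```python
from vllm import LLM, SamplingParams

# Placeholder ID; substitute the actual 2:4 sparse FP8 checkpoint you want
# to serve. vLLM reads the compression config from the checkpoint metadata.
llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-FP8")
outputs = llm.generate(
    ["Why does 2:4 sparsity speed up inference on Hopper GPUs?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```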

Events

1️⃣ The year of full-stack OSS AI!
Optimizing LLMs for Cost-Efficient Deployment with vLLM
Michael Goin, Neural Magic [Red Hat]
Deploying LLMs is just the starting point; optimizing them for cost-efficient, high-performance serving is the real challenge. In this talk, we’ll explore cutting-edge compression techniques and advanced inference system optimizations that enable fast performance on your hardware of choice. Discover practical strategies and tools enterprises trust to scale deployments while minimizing costs.

2️⃣ West coast vLLM meetup
The first vLLM meetup in 2025 is on Wednesday, January 22nd in San Francisco. We will discuss vLLM's performant V1 architecture, Q1 roadmap, and Google Cloud's innovation around vLLM: networking, Cloud Run, Vertex, and TPU!

3️⃣ First-ever east coast vLLM meetup
It’s happening on March 11, 2025, in Boston! More details coming in early February.

In other news

It’s official! Red Hat completed the acquisition of Neural Magic! By acquiring Neural Magic, a leading commercial contributor to vLLM, Red Hat aims to continue supporting the vibrant vLLM community and enhancing Red Hat AI’s ability to support gen AI deployments anywhere and everywhere across the hybrid cloud. Read more on the completed acquisition here.

vLLM is nearing 34,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he led as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
