We are heading to NYC for our next vLLM Meetup on Wednesday, May 7th at the IBM Office (1 Madison Avenue). Hosted by Red Hat and IBM, this in-person gathering will feature deep dives and lightning talks from experts at AMD, IBM, Meta’s PyTorch Team, and the vLLM crew from Red Hat. Spots are limited, and registration approval is required, so make sure to request to join here before it fills up!

We hope to see you there!

Bi-weekly vLLM Office Hours

Upcoming

vLLM Office Hours #25: Structured Outputs in vLLM

May 8, 2025 - 2:00PM ET / 11:00AM PT

Register Here

vLLM Office Hours #26: Intro to torch.compile and How It Works with vLLM

May 29, 2025 - 2:00PM ET / 11:00AM PT

Register Here

Recordings you don't want to miss 

Performance Optimization of vLLM on Google TPUs | Video

Deep Dive Into the LLM Compressor | Video

Intro to vLLM V1 | Video

Red Hat AI Innovation team: Friday Random Samples weekly series

Random Samples is a weekly AI seminar series that bridges the gap between cutting-edge research and real-world application. Designed for AI developers, data scientists, and researchers, each episode explores the latest advancements in AI and how they’re being applied in production today. 

This week's topic: Synthetic Data Generation via SDG-Hub

May 2nd, 2025 @ 11:30AM ET | Join the live session here

The session will explore SDG Hub's core components: prompts, blocks, and flows, and demonstrate how users can compose, extend, or modify pipelines to fit specific tasks. It will also cover strategies for choosing the right teacher model depending on the use case (reasoning, translation, etc.), and walk through two real-world examples.

No Math AI podcast

Generative Optimization in Engineering Design | Watch Here

Inference-time scaling: How small models beat the big ones | Watch Here

AI blog highlights 

Cracking the code: How neural networks might actually “think”

Deep neural networks are achieving the incredible, pushing the boundaries of artificial intelligence in areas from medicine to language. But as these powerful AI systems become more integrated into our lives, a critical challenge looms: we often don’t understand how they arrive at their answers. They operate like inscrutable “black boxes,” making it hard to fully trust them. 

Keep Reading

Performance boosts in vLLM 0.8.1: Switching to the V1 engine

vLLM has rapidly become the go-to solution for efficient inference of large language and multimodal models. In this post, we'll demonstrate the substantial performance and usability improvements introduced in vLLM 0.8.1 compared to version 0.7.3, emphasizing crucial architectural overhauls and multimodal inference optimizations.

Keep Reading 

Transformers backend integration in vLLM

The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, transformers is the go-to toolkit for all.

But when it comes to deploying these models at scale, inference speed and efficiency often take center stage. Enter vLLM, a library engineered for high-throughput inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.

Keep Reading 

Accelerating RLHF with vLLM, Best Practice from OpenRLHF

As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead. 

Keep Reading 

AI research from our lab

We recently launched AI Research Hub, a destination for all research from Red Hat and Neural Magic labs. We plan to post all our research papers, research blogs, and accompanying code to this new location, so please bookmark it! Here are the papers we are currently featuring on the new page:

  • Towards Combinatorial Interpretability of Neural Computation
    Paper Link | Blog
  • LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation
    Paper Link | Code
  • Unveiling the Secret Recipe: A Guide for Supervised Fine-Tuning Small LLMs
    Paper Link | Code
  • Implicit In-Context Learning
    Paper Link | Code
  • A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
    Paper Link | Code
  • Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning 
    Paper Link
  • SQuat: Subspace-Orthogonal KV Cache Quantization 
    Paper Link
  • Activation-Informed Merging of Large Language Models 
    Paper Link

Stay engaged with the vLLM community

vLLM is nearing 46,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.
