We are heading to NYC for our next vLLM Meetup on Wednesday, May 7th at the IBM Office (1 Madison Avenue). Hosted by Red Hat and IBM, this in-person gathering will feature deep dives and lightning talks from experts at AMD, IBM, Meta’s PyTorch Team, and the vLLM crew from Red Hat. Spots are limited, and registration approval is required, so make sure to request to join here before it fills up!
We hope to see you there!
Bi-weekly vLLM Office Hours
Upcoming
vLLM Office Hours #25: Structured Outputs in vLLM
May 8, 2025 - 2:00PM ET / 11:00AM PT
vLLM Office Hours #26: Intro to torch.compile and How It Works with vLLM
May 29, 2025 - 2:00PM ET / 11:00AM PT
Recordings you don't want to miss
Performance Optimization of vLLM on Google TPUs | Video
Deep Dive Into the LLM Compressor | Video
Intro to vLLM V1 | Video
Red Hat AI Innovation team: Friday Random Samples weekly series
Random Samples is a weekly AI seminar series that bridges the gap between cutting-edge research and real-world application. Designed for AI developers, data scientists, and researchers, each episode explores the latest advancements in AI and how they’re being applied in production today.
This week's topic: Synthetic Data Generation via SDG-Hub
May 2, 2025 @ 11:30AM ET | Join the live session here
The session will explore SDG Hub's core components (prompts, blocks, and flows) and demonstrate how users can compose, extend, or modify pipelines to fit specific tasks. It will also cover strategies for choosing the right teacher model for a given use case (reasoning, translation, etc.) and walk through two real-world examples.
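To make the prompts/blocks/flows idea concrete ahead of the session, here is a toy sketch in plain Python. It is deliberately not the sdg_hub API (every name and signature below is a hypothetical stand-in), but it shows the composition model: prompts wrapped into blocks, blocks chained into a flow.

```python
# Conceptual illustration only; NOT the sdg_hub API. All names here are
# hypothetical stand-ins for the prompts -> blocks -> flows idea.
from typing import Callable

Block = Callable[[dict], dict]  # a block transforms one sample dict into another

def prompt_block(template: str, teacher: Callable[[str], str]) -> Block:
    """Wrap a prompt template plus a teacher-model call into a reusable block."""
    def run(sample: dict) -> dict:
        sample["response"] = teacher(template.format(**sample))
        return sample
    return run

def flow(*blocks: Block) -> Block:
    """Compose blocks into a flow: each block's output feeds the next."""
    def run(sample: dict) -> dict:
        for block in blocks:
            sample = block(sample)
        return sample
    return run

# A fake teacher stands in for a real LLM call.
fake_teacher = lambda prompt: f"[teacher answer to: {prompt}]"
qa_flow = flow(prompt_block("Answer concisely: {question}", fake_teacher))
print(qa_flow({"question": "What is synthetic data generation?"}))
```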
No Math AI podcast
Generative Optimization in Engineering Design | Watch Here
Inference-time scaling: How small models beat the big ones | Watch Here
AI blog highlights
Cracking the code: How neural networks might actually “think”
Deep neural networks are achieving the incredible, pushing the boundaries of artificial intelligence in areas from medicine to language. But as these powerful AI systems become more integrated into our lives, a critical challenge looms: we often don’t understand how they arrive at their answers. They operate like inscrutable “black boxes,” making it hard to fully trust them.
Performance boosts in vLLM 0.8.1: Switching to the V1 engine
vLLM has rapidly become the go-to solution for efficient inference of large language and multimodal models. In this post, we'll demonstrate the substantial performance and usability improvements introduced in vLLM 0.8.1 compared to version 0.7.3, emphasizing crucial architectural overhauls and multimodal inference optimizations.
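If you want to try the comparison yourself, the sketch below shows the basic setup. It assumes vLLM 0.8.1 or newer; the VLLM_USE_V1 environment variable toggles the V1 engine (already the default for supported configurations in 0.8.x), and the model choice is purely illustrative.

```python
# Minimal sketch: force the V1 engine and run a quick generation.
# Set VLLM_USE_V1 before importing vllm; "0" falls back to the legacy engine.
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative; any supported HF model works
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["vLLM's V1 engine improves throughput by"], params)
print(outputs[0].outputs[0].text)
```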
Transformers backend integration in vLLM
The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, Transformers is the go-to toolkit for working with these models.
But when it comes to deploying these models at scale, inference speed and efficiency often take center stage. Enter vLLM, a library engineered for high-throughput inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.
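As a quick illustration, recent vLLM releases expose a model_impl switch that routes a model through its Transformers modeling code instead of a native vLLM implementation. A minimal sketch (the model name is illustrative):

```python
# Opt into the Transformers backend via model_impl ("auto" | "vllm" | "transformers").
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative; any compatible HF causal LM
    model_impl="transformers",           # serve through Transformers modeling code
)
print(llm.generate(["Hello,"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```

The equivalent CLI flag is --model-impl transformers when launching vllm serve.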
Accelerating RLHF with vLLM, Best Practice from OpenRLHF
As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead.
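To see where vLLM fits, here is an illustrative rollout-phase sketch; it is not OpenRLHF's actual pipeline, and reward_model below is a hypothetical stand-in for a trained reward model. The point is that generation, typically the dominant PPO cost, gets batched through vLLM.

```python
# Illustrative RLHF rollout sketch (NOT OpenRLHF's implementation).
from vllm import LLM, SamplingParams

actor = LLM(model="facebook/opt-125m")  # illustrative stand-in for the policy model
sampler = SamplingParams(temperature=1.0, max_tokens=128)

def reward_model(prompt: str, response: str) -> float:
    """Hypothetical scorer; a real pipeline calls a trained reward model."""
    return float(len(response.split()))  # placeholder heuristic only

prompts = ["Explain why the sky appears blue.", "Summarize PPO in one sentence."]
rollouts = actor.generate(prompts, sampler)  # fast batched generation via vLLM
experience = [
    (p, r.outputs[0].text, reward_model(p, r.outputs[0].text))
    for p, r in zip(prompts, rollouts)
]
# These (prompt, response, reward) tuples feed the PPO update; the trainer then
# syncs updated policy weights back into the vLLM engine for the next round.
```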
AI research from our lab
We recently launched AI Research Hub, a destination for all research from Red Hat and Neural Magic labs. We plan to post all our research papers, research blogs, and accompanying code to this new location, so please bookmark it! Here are the papers we are currently featuring on the new page:
- Towards Combinatorial Interpretability of Neural Computation | Paper Link | Blog
- LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation | Paper Link | Code
- Unveiling the Secret Recipe: A Guide for Supervised Fine-Tuning Small LLMs | Paper Link | Code
- Implicit In-Context Learning | Paper Link | Code
- A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Paper Link | Code
- Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning | Paper Link
- SQuat: Subspace-Orthogonal KV Cache Quantization | Paper Link
- Activation-Informed Merging of Large Language Models | Paper Link
Stay engaged with the vLLM community
vLLM is nearing 46,000 stars! 🌟 Be sure to add your star and join the community. Thank you for your support.
Resources
Getting started with AI for enterprise: A beginner's guide
About the author
Saša Zelenović is a Principal Product Marketing Manager at Red Hat, joining in 2025 through the Neural Magic acquisition, where he served as Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.