Every team that starts experimenting with generative AI (gen AI) eventually runs into the same wall: scaling it. Running 1 or 2 models is simple enough. Running dozens, supporting hundreds of users, and keeping GPU costs under control is something else entirely.

Teams often find themselves juggling hardware requests, managing multiple versions of the same model, and trying to deliver performance that actually holds up in production. These are the same kinds of infrastructure and operations challenges we have seen in other workloads, but now applied to AI systems that demand far more resources and coordination.

In this post, we look at 3 practical strategies that help organizations solve these scaling problems:

  1. GPU-as-a-Service to make better use of expensive GPU hardware
     
  2. Models-as-a-Service to give teams controlled and reliable access to shared models
     
  3. Scalable inference with vLLM and llm-d to achieve production-grade runtime performance efficiently

The common challenges of scaling AI

When moving from proof of concept to production, cost quickly becomes the first barrier. Running large models in production is expensive, especially when training, tuning, and inference workloads all compete for GPU capacity.

Control is another challenge. IT teams must balance freedom to experiment with guardrails that strengthen security and compliance. Without proper management, internal developers may resort to unsanctioned services or external APIs, leading to shadow AI.

Finally, there is scale. Deploying a few models may be straightforward, but maintaining dozens of high-parameter models across environments and teams requires automation and stronger observability.

The following 3 strategies address these challenges directly.

1. Optimizing GPU resources with GPU-as-a-Service

GPUs are essential for gen AI workloads, but are both expensive and scarce. In many organizations, GPU clusters are distributed across clouds and on-premises environments, often underused or misallocated. Provisioning can take days, and tracking utilization across teams is difficult.

These challenges have led many enterprises to adopt GPU-as-a-Service, an approach that automates allocation, enforces quotas, and provides visibility into how GPUs are used across the organization.
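On Kubernetes-based platforms such as OpenShift, GPU quota enforcement builds on the cluster's own resource primitives. The sketch below shows the basic idea, assuming the kubernetes Python client, a hypothetical team-a namespace, and NVIDIA GPUs exposed as the nvidia.com/gpu extended resource; GPU-as-a-Service offerings layer self-service provisioning and usage visibility on top of mechanisms like this.

```python
# A minimal sketch of per-team GPU quota enforcement, assuming the "kubernetes"
# Python client and a hypothetical "team-a" namespace where NVIDIA GPUs are
# exposed as the extended resource "nvidia.com/gpu".
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.nvidia.com/gpu": "4",  # team-a may request at most 4 GPUs at once
            "limits.nvidia.com/gpu": "4",
        }
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)
print("GPU quota applied to namespace team-a")
```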

Case study: Turkish Airlines

Turkish Airlines offers a strong example of managing GPU access more efficiently. With about 20 data scientists and more than 200 application developers, the company manages over 50 AI use cases ranging from dynamic pricing to predictive maintenance. Its goal is to generate roughly US$100 million annually in new revenue or cost savings.

Already using Red Hat OpenShift, Turkish Airlines implemented Red Hat OpenShift AI to automate GPU provisioning. Instead of submitting service tickets and waiting hours or days for access, developers can now launch GPU-enabled environments in minutes. This self-service model eliminates delays while giving administrators full visibility into resource usage and cost.

Demo: Implementing GPU-as-a-Service

This 10-minute demo shows how GPU-as-a-Service can be deployed using OpenShift AI to automate GPU allocation, enforce quotas, and track utilization across users, all from a single administrative interface.

2. Improving control with Models-as-a-Service

Automating GPU allocation addresses cost, but many organizations then face the challenge of governance. As teams deploy more models, duplication and fragmentation grow. Developers often run their own copies of common open gen AI models, consuming additional GPUs and creating version drift.

A Models-as-a-Service (MaaS) pattern centralizes these efforts. Models like Llama, Mistral, Qwen, Granite, and DeepSeek can be hosted once and served through APIs, giving developers immediate access while allowing platform teams to manage permissions, quotas, and usage from a single control point.

From Infrastructure-as-a-Service to Models-as-a-Service

This shift replaces hardware provisioning with standardized model access. Models are deployed centrally and exposed through APIs that developers call directly. Platform engineers monitor utilization, enforce quotas, and can provide internal chargebacks or showbacks to align resource use with business priorities.

API-driven model access

Most enterprise large language model (LLM) workloads are already API-based. Red Hat AI Inference Server, available throughout the full range of Red Hat AI offerings, exposes an OpenAI-compatible API endpoint, allowing teams to integrate easily with frameworks such as LangChain or custom enterprise applications.
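As a rough illustration, the snippet below calls a shared model through an OpenAI-compatible endpoint using the openai Python package. The base URL, token, and model name are placeholders for whatever your platform team publishes.

```python
# A minimal sketch of calling a shared model through an OpenAI-compatible
# endpoint. The endpoint URL, API key, and model name are placeholders, not
# real values from any specific deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://models.example.internal/v1",  # hypothetical MaaS endpoint
    api_key="YOUR_PLATFORM_ISSUED_TOKEN",
)

response = client.chat.completions.create(
    model="granite-3-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize our GPU usage policy."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```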

If an organization already has an API gateway, it can connect directly to Red Hat AI for authentication and token-based metering. For those that don't, Red Hat AI 3 introduces a built-in, token-aware gateway that runs alongside the model within the cluster. Over the Red Hat AI 3 lifecycle, we'll deliver capabilities to automatically monitor usage, set limits, and collect analytics without deploying external components.

Models-as-a-Service establishes the foundation for more advanced use cases such as retrieval-augmented generation (RAG), retrieval-augmented fine-tuning (RAFT), synthetic data generation, and agentic AI systems built on top of shared model endpoints (See Figure 1).

Figure 1: API gateway access to a shared model.

3. Scaling inference with vLLM and llm-d

Even with efficient GPU allocation and model governance, inference performance often becomes the bottleneck. LLMs are resource-intensive, and maintaining fast response times at scale requires optimized runtimes and sometimes even distributed execution. This challenge is further complicated by the need to serve different use cases on a single model, such as varying large or small contexts and differing service level objectives (SLOs) around metrics like time-to-first-token (TTFT) and inter-token latency (ITL).
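As a rough illustration of how those SLOs can be measured in practice, the sketch below times a streaming request against an OpenAI-compatible endpoint and reports TTFT and mean ITL. The endpoint URL and model name are placeholders.

```python
# A rough sketch of measuring time-to-first-token (TTFT) and inter-token
# latency (ITL) against an OpenAI-compatible streaming endpoint. The URL,
# token, and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://models.example.internal/v1",  # hypothetical endpoint
    api_key="YOUR_PLATFORM_ISSUED_TOKEN",
)

start = time.perf_counter()
timestamps = []
stream = client.chat.completions.create(
    model="granite-3-8b-instruct",  # placeholder
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Record a timestamp for every chunk that carries generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        timestamps.append(time.perf_counter())

ttft = timestamps[0] - start
itl = (timestamps[-1] - timestamps[0]) / max(len(timestamps) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.1f} ms")
```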

The case for vLLM

vLLM has become the preferred open source inference runtime for large models. It offers high throughput and supports multiple accelerator types, enabling enterprises to standardize serving across hardware vendors. When combined with model compression and quantization, vLLM can reduce GPU requirements by a factor of 2 to 3 while maintaining 99% accuracy for models such as Llama, Mistral, Qwen, and DeepSeek.
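For a feel of the developer experience, here is a minimal offline-inference sketch using vLLM's Python API. It assumes vLLM is installed and a suitable GPU is available; the model ID is a placeholder for whichever (optionally quantized) checkpoint you have validated.

```python
# A minimal sketch of offline inference with vLLM. The model ID below is a
# placeholder; substitute a checkpoint you have access to and have validated.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/llama-3.1-8b-instruct-quantized")  # placeholder model ID
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["What drives GPU cost in LLM inference?"], params)
print(outputs[0].outputs[0].text)
```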

A large North American financial institution applied this approach to its on-premises deployments of Llama and Whisper models in a disconnected environment. Containerizing the workloads on OpenShift AI simplified management, improved data security, and delivered the necessary throughput. To further increase scalability, the team began exploring distributed inference through llm-d.

Introducing llm-d for distributed inference

The llm-d project extends proven cloud-native scaling techniques to LLMs. Traditional microservices scale easily because they are stateless, but LLM inference with vLLM is not. Each vLLM instance builds and maintains a key-value (KV) cache during the prefill stage, which is the initial phase where the model processes the entire input prompt to generate the first token. Re-creating that cache on a new pod wastes compute and increases latency.

llm-d introduces intelligent scheduling and routing to avoid that duplication. It separates workloads between the prefill and decode stages so each runs on hardware optimized for its needs. It also implements prefix-aware routing, which detects when a previous context has already been processed and sends the next request to the pod most likely to hold that cache. The result is faster response times and better GPU efficiency.
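To make the routing idea concrete, the following sketch is a deliberately simplified illustration of prefix-aware scheduling, not llm-d's actual implementation: requests that share a common prefix are steered to the pod most likely to already hold that prefix's KV cache. The pod names are hypothetical.

```python
# An illustrative sketch of prefix-aware routing (not llm-d's implementation):
# requests sharing a common prefix go to the pod that already built its KV cache.
import hashlib

DECODE_PODS = ["decode-pod-0", "decode-pod-1", "decode-pod-2"]  # hypothetical pods
cache_owner: dict[str, str] = {}  # prefix hash -> pod holding that prefix's KV cache

def route(shared_prefix: str) -> str:
    """Choose a pod, preferring one with a warm KV cache for this prefix."""
    key = hashlib.sha256(shared_prefix.encode()).hexdigest()
    if key not in cache_owner:
        # Cache miss: pick a pod deterministically and remember it as the owner.
        cache_owner[key] = DECODE_PODS[int(key, 16) % len(DECODE_PODS)]
    return cache_owner[key]

system_prompt = "You are a helpful airline support assistant."
print(route(system_prompt))  # first request builds the KV cache on one pod
print(route(system_prompt))  # follow-up requests land on the same warm pod
```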

Demo: Scaling with llm-d

This 5-minute demo highlights how llm-d uses intelligent routing and cache awareness to improve inference performance. It shows requests automatically routed to cached model instances, significantly reducing time to first token and improving throughput across GPUs.

Bringing it all together

Building scalable AI systems takes more than just adding GPUs. It takes a thoughtful design that balances performance, control, and simplicity. The patterns described here help teams do exactly that.

  • GPU-as-a-Service helps you make the most of your hardware by turning it into a shared, policy-controlled resource
  • Models-as-a-Service brings order to how models are deployed and accessed as APIs across projects
  • vLLM and llm-d make it possible to run those models at production scale without sacrificing performance or breaking budgets

Together, these help create a practical roadmap for anyone modernizing their AI platform with Red Hat AI. They make it easier to move fast, stay efficient, and scale with confidence.

Learn more


About the authors

A Red Hatter since 2019, James has a background in IT infrastructure engineering, cybersecurity, and traditional system administration. He built infrastructure as code and ran tooling in Linux containers for several years, spent two more years as an incident responder and threat hunter on DoD networks, and loves to tinker with whatever lives at the nexus of performance and security. His focus on security, including the use of modern technologies to enable Defense-in-Depth in IT infrastructure, is likely to shine through his writing.

Will McGrath is a Senior Principal Product Marketing Manager at Red Hat. He is responsible for marketing strategy, developing content, and driving marketing initiatives for Red Hat OpenShift AI. He has more than 30 years of experience in the IT industry. Before Red Hat, Will worked for 12 years as strategic alliances manager for media and entertainment technology partners.

