Every team that starts experimenting with generative AI (gen AI) eventually runs into the same wall: scaling it. Running 1 or 2 models is simple enough. Running dozens, supporting hundreds of users, and keeping GPU costs under control is something else entirely.

Teams often find themselves juggling hardware requests, managing multiple versions of the same model, and trying to deliver performance that actually holds up in production. These are the same kinds of infrastructure and operations challenges we have seen in other workloads, but now applied to AI systems that demand far more resources and coordination.

In this post, we look at 3 practical strategies that help organizations solve these scaling problems:

  1. GPU-as-a-Service to make better use of expensive GPU hardware
     
  2. Models-as-a-Service to give teams controlled and reliable access to shared models
     
  3. Scalable inference with vLLM and llm-d to achieve production-grade runtime performance efficiently

The common challenges of scaling AI

When moving from proof of concept to production, cost quickly becomes the first barrier. Running large models in production is expensive, especially when training, tuning, and inference workloads all compete for GPU capacity.

Control is another challenge. IT teams must balance freedom to experiment with guardrails that strengthen security and compliance. Without proper management, internal developers may resort to unsanctioned services or external APIs, leading to shadow AI.

Finally, there is scale. Deploying a few models may be straightforward, but maintaining dozens of high-parameter models across environments and teams requires automation and stronger observability.

The following 3 strategies address these challenges directly.

1. Optimizing GPU resources with GPU-as-a-Service

GPUs are essential for gen AI workloads, but are both expensive and scarce. In many organizations, GPU clusters are distributed across clouds and on-premises environments, often underused or misallocated. Provisioning can take days, and tracking utilization across teams is difficult.

These challenges have led many enterprises to adopt GPU-as-a-Service, an approach that automates allocation, enforces quotas, and provides visibility into how GPUs are used across the organization.
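On Kubernetes-based platforms such as OpenShift, GPU quota enforcement builds on the cluster's own resource primitives. The sketch below shows the basic idea, assuming the kubernetes Python client, a hypothetical team-a namespace, and NVIDIA GPUs exposed as the nvidia.com/gpu extended resource; GPU-as-a-Service offerings layer self-service provisioning and usage visibility on top of mechanisms like this.

```python
# A minimal sketch of per-team GPU quota enforcement, assuming the "kubernetes"
# Python client and a hypothetical "team-a" namespace where NVIDIA GPUs are
# exposed as the extended resource "nvidia.com/gpu".
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.nvidia.com/gpu": "4",  # team-a may request at most 4 GPUs at once
            "limits.nvidia.com/gpu": "4",
        }
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)
print("GPU quota applied to namespace team-a")
```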

Case study: Turkish Airlines

Turkish Airlines offers a strong example of managing GPU access more efficiently. With about 20 data scientists and more than 200 application developers, the company manages over 50 AI use cases ranging from dynamic pricing to predictive maintenance. Its goal is to generate roughly US$100 million annually in new revenue or cost savings.

Already using Red Hat OpenShift, Turkish Airlines implemented Red Hat OpenShift AI to automate GPU provisioning. Instead of submitting service tickets and waiting hours or days for access, developers can now launch GPU-enabled environments in minutes. This self-service model eliminates delays while giving administrators full visibility into resource usage and cost.

Demo: Implementing GPU-as-a-Service

This 10-minute demo shows how GPU-as-a-Service can be deployed using OpenShift AI to automate GPU allocation, enforce quotas, and track utilization across users, all from a single administrative interface.

2. Improving control with Models-as-a-Service

Automating GPU allocation addresses cost, but many organizations then face the challenge of governance. As teams deploy more models, duplication and fragmentation grow. Developers often run their own copies of common open gen AI models, consuming additional GPUs and creating version drift.

A Models-as-a-Service (MaaS) pattern centralizes these efforts. Models like Llama, Mistral, Qwen, Granite, and DeepSeek can be hosted once and served through APIs, giving developers immediate access while allowing platform teams to manage permissions, quotas, and usage from a single control point.

From Infrastructure-as-a-Service to Models-as-a-Service

This shift replaces hardware provisioning with standardized model access. Models are deployed centrally and exposed through APIs that developers call directly. Platform engineers monitor utilization, enforce quotas, and can provide internal chargebacks or showbacks to align resource use with business priorities.

API-driven model access

Most enterprise large language model (LLM) workloads are already API-based. Red Hat AI Inference Server, available throughout the full range of Red Hat AI offerings, exposes an OpenAI-compatible API endpoint, allowing teams to integrate easily with frameworks such as LangChain or custom enterprise applications.
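As a rough illustration, the snippet below calls a shared model through an OpenAI-compatible endpoint using the openai Python package. The base URL, token, and model name are placeholders for whatever your platform team publishes.

```python
# A minimal sketch of calling a shared model through an OpenAI-compatible
# endpoint. The endpoint URL, API key, and model name are placeholders, not
# real values from any specific deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://models.example.internal/v1",  # hypothetical MaaS endpoint
    api_key="YOUR_PLATFORM_ISSUED_TOKEN",
)

response = client.chat.completions.create(
    model="granite-3-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize our GPU usage policy."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```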

If an organization already has an API gateway, it can connect directly to Red Hat AI for authentication and token-based metering. For those that don't, Red Hat AI 3 introduces a built-in, token-aware gateway that runs alongside the model within the cluster. Over the Red Hat AI 3 lifecycle, we'll deliver capabilities to automatically monitor usage, set limits, and collect analytics without deploying external components.

Models-as-a-Service establishes the foundation for more advanced use cases such as retrieval-augmented generation (RAG), retrieval-augmented fine-tuning (RAFT), synthetic data generation, and agentic AI systems built on top of shared model endpoints (See Figure 1).

Figure 1: API gateway access to a shared model.

3. Scaling inference with vLLM and llm-d

Even with efficient GPU allocation and model governance, inference performance often becomes the bottleneck. LLMs are resource-intensive, and maintaining fast response times at scale requires optimized runtimes and sometimes even distributed execution. This challenge is further complicated by the need to serve different use cases on a single model, such as varying large or small contexts and differing service level objectives (SLOs) around metrics like time-to-first-token (TTFT) and inter-token latency (ITL).
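As a rough illustration of how those SLOs can be measured in practice, the sketch below times a streaming request against an OpenAI-compatible endpoint and reports TTFT and mean ITL. The endpoint URL and model name are placeholders.

```python
# A rough sketch of measuring time-to-first-token (TTFT) and inter-token
# latency (ITL) against an OpenAI-compatible streaming endpoint. The URL,
# token, and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://models.example.internal/v1",  # hypothetical endpoint
    api_key="YOUR_PLATFORM_ISSUED_TOKEN",
)

start = time.perf_counter()
timestamps = []
stream = client.chat.completions.create(
    model="granite-3-8b-instruct",  # placeholder
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Record a timestamp for every chunk that carries generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        timestamps.append(time.perf_counter())

ttft = timestamps[0] - start
itl = (timestamps[-1] - timestamps[0]) / max(len(timestamps) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.1f} ms")
```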

The case for vLLM

vLLM has become the preferred open source inference runtime for large models. It offers high throughput and supports multiple accelerator types, enabling enterprises to standardize serving across hardware vendors. When combined with model compression and quantization, vLLM can reduce GPU requirements by a factor of 2 to 3 while maintaining 99% accuracy for models such as Llama, Mistral, Qwen, and DeepSeek.
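For a feel of the developer experience, here is a minimal offline-inference sketch using vLLM's Python API. It assumes vLLM is installed and a suitable GPU is available; the model ID is a placeholder for whichever (optionally quantized) checkpoint you have validated.

```python
# A minimal sketch of offline inference with vLLM. The model ID below is a
# placeholder; substitute a checkpoint you have access to and have validated.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/llama-3.1-8b-instruct-quantized")  # placeholder model ID
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["What drives GPU cost in LLM inference?"], params)
print(outputs[0].outputs[0].text)
```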

A large North American financial institution applied this approach to its on-premises deployments of Llama and Whisper models in a disconnected environment. Containerizing the workloads on OpenShift AI simplified management, improved data security, and delivered the necessary throughput. To further increase scalability, the team began exploring distributed inference through llm-d.

Introducing llm-d for distributed inference

The llm-d project extends proven cloud-native scaling techniques to LLMs. Traditional microservices scale easily because they are stateless, but LLM inference with vLLM is not. Each vLLM instance builds and maintains a key-value (KV) cache during the prefill stage, which is the initial phase where the model processes the entire input prompt to generate the first token. Re-creating that cache on a new pod wastes compute and increases latency.

llm-d introduces intelligent scheduling and routing to avoid that duplication. It separates workloads between the prefill and decode stages so each runs on hardware optimized for its needs. It also implements prefix-aware routing, which detects when a previous context has already been processed and sends the next request to the pod most likely to hold that cache. The result is faster response times and better GPU efficiency.
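To make the routing idea concrete, the following sketch is a deliberately simplified illustration of prefix-aware scheduling, not llm-d's actual implementation: requests that share a common prefix are steered to the pod most likely to already hold that prefix's KV cache. The pod names are hypothetical.

```python
# An illustrative sketch of prefix-aware routing (not llm-d's implementation):
# requests sharing a common prefix go to the pod that already built its KV cache.
import hashlib

DECODE_PODS = ["decode-pod-0", "decode-pod-1", "decode-pod-2"]  # hypothetical pods
cache_owner: dict[str, str] = {}  # prefix hash -> pod holding that prefix's KV cache

def route(shared_prefix: str) -> str:
    """Choose a pod, preferring one with a warm KV cache for this prefix."""
    key = hashlib.sha256(shared_prefix.encode()).hexdigest()
    if key not in cache_owner:
        # Cache miss: pick a pod deterministically and remember it as the owner.
        cache_owner[key] = DECODE_PODS[int(key, 16) % len(DECODE_PODS)]
    return cache_owner[key]

system_prompt = "You are a helpful airline support assistant."
print(route(system_prompt))  # first request builds the KV cache on one pod
print(route(system_prompt))  # follow-up requests land on the same warm pod
```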

Demo: Scaling with llm-d

This 5-minute demo highlights how llm-d uses intelligent routing and cache awareness to improve inference performance. It shows requests automatically routed to cached model instances, significantly reducing time to first token and improving throughput across GPUs.

Bringing it all together

Building scalable AI systems takes more than just adding GPUs. It takes a thoughtful design that balances performance, control, and simplicity. The patterns described here help teams do exactly that.

  • GPU-as-a-Service helps you make the most of your hardware by turning it into a shared, policy-controlled resource
  • Models-as-a-Service brings order to how models are deployed and accessed as APIs across projects
  • vLLM and llm-d make it possible to run those models at production scale without sacrificing performance or breaking budgets

Together, these help create a practical roadmap for anyone modernizing their AI platform with Red Hat AI. They make it easier to move fast, stay efficient, and scale with confidence.

Learn more


About the authors

A Red Hatter since 2019, James has a background in IT infrastructure engineering, cybersecurity, and traditional system administration. He built infrastructure as code and ran tooling in Linux containers for several years, spent two more years as an incident responder and threat hunter on DoD networks, and loves to tinker with whatever lives at the nexus of performance and security. His focus on security, including the use of modern technologies to enable Defense-in-Depth in IT infrastructure, is likely to shine through his writing.

Will McGrath is a Senior Principal Product Marketing Manager at Red Hat. He is responsible for marketing strategy, developing content, and driving marketing initiatives for Red Hat OpenShift AI. He has more than 30 years of experience in the IT industry. Before Red Hat, Will worked for 12 years as strategic alliances manager for media and entertainment technology partners.

