Organizations are increasingly adopting an internal Model-as-a-Service (MaaS) pattern to solve these operational challenges and implement a reliable foundation for agentic AI. This operational model allows central IT to host, manage, and serve optimized AI models through standardized application programming interfaces (APIs). By treating models as shared, governed utilities, teams can standardize their AI consumption, reduce infrastructure costs, and gain the flexibility to meet a wide variety of gen AI and agentic use cases while retaining control.
To implement this pattern, organizations need a highly scalable and flexible inference engine to run AI profitably and efficiently. The engine should make the most of both individual accelerators and the wider available infrastructure. Organizations also need model optimization tools to reduce compute requirements and extract even more value from their current resources.
Targeted observability into gen AI-specific metrics helps teams maintain performance standards and track resource use. Furthermore, access to preoptimized, validated models helps accelerate deployment timelines and empowers developers to build faster.
Enterprise AI inference engine
Red Hat® AI Inference is an enterprise AI inference engine designed to power models across diverse environments. It provides a unified, hardware-agnostic platform to manage, orchestrate, and optimize AI workloads, acting as the core engine to deliver a flexible, private MaaS experience and a reliable foundation for agentic AI.
Solution benefits
- Reduce AI infrastructure costs by boosting inference capacity and sharing resources efficiently across development teams.
- Deliver hardware-agnostic AI performance across a wide ecosystem of accelerators, private datacenters, and public clouds.
- Scale AI inference with efficient, distributed routing across broader infrastructure, supported by targeted gen AI telemetry.
Improve efficiency
Enterprises can get more from hardware investments and reduce AI infrastructure costs by treating models as a shared, on-demand utility. Adopting a centralized MaaS strategy allows IT teams to serve models efficiently, decreasing fragmented and underutilized hardware. This approach helps organizations make the most of their existing compute resources.
The platform uses an optimized vLLM enterprise architecture to deliver fast and cost-effective inference. It also offers a model optimization toolkit that compresses both foundation and customized models using techniques like quantization and sparsity. This combined approach lowers the underlying compute requirements while maintaining response accuracy for complex tasks. Organizations optimizing models with these methods have observed significant reductions in compute hours, with customers seeing up to 40% in cost savings while preserving baseline accuracy.[2] These capabilities boost the performance of individual accelerators, and llm-d’s distributed inference compounds these benefits by efficiently distributing the inference load across the available fleet of GPUs.
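To make the idea of quantization concrete, the following is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration only and does not represent the toolkit's actual implementation: the function names and sample weights are hypothetical, and production tools typically work per channel or per group on full tensors rather than on a short list.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Hypothetical helper for illustration; returns (quantized values, scale).
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer representation."""
    return [v * scale for v in quantized]


# Made-up sample weights, not taken from any real model.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 6), max_error)
```

Each weight is stored in one byte instead of the four bytes used by a 32-bit float, which is where the reduction in memory and compute requirements comes from. The accuracy impact depends on the model and the method, so it should be measured rather than assumed.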
Scale with control
Organizations can scale AI inference operations confidently by establishing an efficient foundation for their MaaS strategy and agentic architecture. By optimizing individual accelerators and their broader infrastructure, the engine helps run models at scale. It integrates llm-d’s inference-aware routing and disaggregated serving to balance traffic, manage capacity, and orchestrate reasoning models efficiently. This fleet-wide orchestration helps manage the rapid, continuous compute requests generated by agentic AI loops, with testing showing the ability to sustain up to twice the baseline queries per second (QPS) under service-level objective constraints.[3]
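As a rough illustration of what inference-aware routing can mean, the sketch below sends each request to the replica that already holds the longest cached prefix of the prompt, and breaks ties by choosing the shortest queue. This is a simplified assumption for explanation, not llm-d's actual scheduling algorithm; the `Replica` class, its fields, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical view of one model server as seen by a router."""
    name: str
    queue_depth: int
    cached_prefixes: set = field(default_factory=set)


def cached_prefix_len(replica, tokens):
    """Length of the longest prompt prefix this replica already has cached."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in replica.cached_prefixes:
            return n
    return 0


def pick_replica(replicas, tokens):
    """Prefer the longest cache hit; break ties by the shortest queue."""
    return max(
        replicas,
        key=lambda r: (cached_prefix_len(r, tokens), -r.queue_depth),
    )


replicas = [
    Replica("a", queue_depth=2, cached_prefixes={("sys", "you", "are")}),
    Replica("b", queue_depth=0),
    Replica("c", queue_depth=5, cached_prefixes={("sys", "you")}),
]

# Replica "a" holds the longest matching prefix, so it is chosen
# even though replica "b" has an empty queue.
chosen = pick_replica(replicas, ["sys", "you", "are", "helpful"])
print(chosen.name)
```

A request with no cached prefix anywhere falls back to the least-loaded replica. A real router would also weigh factors such as memory pressure and latency targets.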
Platform engineers gain operational insights through gen AI-specific telemetry. Teams can track time-to-first-token, key-value (KV)-cache hit rates, and overall inference capacity alongside traditional central processing unit (CPU) and memory usage. These gen AI-specific metrics can integrate with existing tools like Prometheus and Grafana, providing the data IT needs to monitor usage and manage capacity effectively.
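The two gen AI metrics named above can be derived from simple per-request records. The sketch below assumes a hypothetical record layout (`sent_at`, `first_token_at`, `cached_blocks`, `total_blocks`) and made-up sample values; it shows only the arithmetic, not how the platform collects or exports these metrics.

```python
from statistics import mean

# Hypothetical per-request records; timestamps are in seconds.
records = [
    {"sent_at": 0.0, "first_token_at": 0.25, "cached_blocks": 6, "total_blocks": 8},
    {"sent_at": 1.0, "first_token_at": 1.5, "cached_blocks": 0, "total_blocks": 4},
    {"sent_at": 2.0, "first_token_at": 2.25, "cached_blocks": 2, "total_blocks": 4},
]


def time_to_first_token(record):
    """Seconds between sending the request and receiving the first token."""
    return record["first_token_at"] - record["sent_at"]


def kv_cache_hit_rate(items):
    """Share of KV-cache blocks served from cache across all requests."""
    total = sum(r["total_blocks"] for r in items)
    hits = sum(r["cached_blocks"] for r in items)
    return hits / total if total else 0.0


mean_ttft = mean(time_to_first_token(r) for r in records)
hit_rate = kv_cache_hit_rate(records)
print(f"mean TTFT: {mean_ttft:.3f}s, KV-cache hit rate: {hit_rate:.0%}")
```

In practice, percentiles of time-to-first-token are usually more informative than the mean, because a few slow requests can dominate the user experience.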
Run anywhere
Organizations can maintain deployment flexibility with hardware-agnostic AI capabilities that operate across hybrid cloud environments. They can execute workloads on various platforms, spanning datacenters, edge locations, and major public clouds. This architectural freedom supports strong collaboration with a diverse ecosystem of hardware and cloud partners, equipping enterprises to flexibly meet a wide variety of business requirements.
The engine is designed to offer operational consistency across models, accelerators, and environments. By decoupling AI applications from the underlying infrastructure, enterprises can transition between different accelerators, models, and hardware configurations as their needs evolve, while optimizing inference and serving models with a common set of capabilities. Organizations can dynamically adapt their hybrid cloud AI strategy based on resource availability, hardware and model advancements, and pricing variations.
Build on an open foundation
Organizations can achieve goals faster with an enterprise AI inference platform built on trusted open source innovation. The solution includes a curated catalog of validated, containerized, and versioned open models ready for immediate deployment. AI and machine learning (ML) developers and engineers can bypass lengthy model optimization and validation cycles and begin building applications in less time.
The inference platform natively integrates with Red Hat OpenShift® and supports third-party Kubernetes environments to fit existing operational workflows. By relying on established open standards, teams can benefit from continuous, community-driven performance enhancements. This open approach provides the stability enterprise IT requires without sacrificing the pace of open source AI innovation.