How llm-d brings critical resource optimization with SoftBank’s AI-RAN orchestrator

February 18, 2026 | Tushar Katarki | 3-minute read

As the technical reality of AI-RAN comes into focus, many telecommunication service providers are realizing that it’s no longer just about whether they can run AI and radio access network (RAN) on the same hardware – it’s about how they manage AI at scale.

In Red Hat’s latest collaboration with SoftBank Corp., we have integrated llm-d into SoftBank’s AI-RAN orchestrator, AITRAS. Founded by Red Hat alongside other industry leaders, llm-d is an open source framework for distributed large language model (LLM) inference. Here, it dynamically distributes inference requests across the GPUs available at RAN sites to improve efficiency and performance.

The problem: Unifying AI and RAN workloads at the service provider edge

Traditional RAN applications are widely deployed by service providers at the edge on CPUs and GPUs, often on Kubernetes platforms like Red Hat OpenShift. However, the recent surge in generative AI and transformer-based language models is enabling new forms of computation and insight at the edge. Now, in addition to traditional RAN workloads, there are AI-powered RAN applications and agents that require model runtimes and inference endpoints at the edge.

The critical question for service providers, therefore, is how to enable traditional RAN workloads and these new language models and agents to co-exist effectively at RAN locations in order to unlock new use cases, generate value, and open up monetization opportunities. This unification is essential for reducing operational expenditure (OpEx) and accelerating the time-to-market for new, revenue-generating edge services.

To make AI-RAN commercially viable, service providers need to treat AI workloads with the same flexibility as cloud-native network functions (CNFs) and applications. Enter the collaboration between SoftBank and Red Hat using llm-d and vLLM for AI-RAN.

llm-d: the bridge between inference and orchestrators

vLLM has emerged as the open source leader for AI inferencing, providing high-performance model deployment on a single GPU node. However, it is not designed to manage model deployment across a complex, multi-node footprint. That is the specific problem llm-d was built to solve. By leveraging Kubernetes, llm-d orchestrates vLLM across multiple nodes to achieve production-scale AI inference, extending vLLM's efficiency to a distributed environment.

By integrating llm-d into the SoftBank AITRAS orchestrator, service providers are able to achieve the following major breakthroughs:

Unified AI and RAN workloads: AITRAS orchestrates and optimizes RAN workloads and LLM requests across multiple GPU clusters, while llm-d and vLLM route inference requests to GPUs using prefix-aware, KV-cache-aware, and load-aware scheduling. This makes GPU resources easier to manage and enables autoscaling.
Hardware-aware optimization: LLM inference involves two distinct phases: prefill (compute-intensive prompt processing) and decode (memory-bandwidth-bound token generation). To maximize hardware utilization across heterogeneous configurations, llm-d enables AITRAS to disaggregate prefill and decode by dynamically assigning specialized GPU resources to each phase. Combined with other Kubernetes resource management capabilities, this helps mitigate the risk of demanding AI workloads starving the critical RAN functions that share the same hardware, which is essential for protecting network resiliency and quality of service (QoS) for all customers.
Autonomous scaling for variable demand: User requests for LLM services are highly variable. By using llm-d, AITRAS can automatically assign and scale prefill and decode worker roles based on the workload profile. This optimized allocation reduces latency for the user and lowers power consumption, driving down total cost of ownership (TCO) and supporting sustainability goals for the service provider.
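
To make the idea of prefix-aware and load-aware routing more concrete, here is a minimal Python sketch. It is an illustration only, not llm-d's or AITRAS's actual scheduler or API: the `Replica` class, the `route` function, the character-level prefix matching, and the `load_weight` value are all hypothetical simplifications. Production systems typically match on hashed token blocks in the KV cache rather than raw characters.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical router-side view of one inference replica."""
    name: str
    queued_requests: int = 0
    cached_prompts: list[str] = field(default_factory=list)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def score(replica: Replica, prompt: str, load_weight: float = 10.0) -> float:
    """Higher is better: reward likely cache reuse, penalize queue depth."""
    reuse = max(
        (common_prefix_len(prompt, p) for p in replica.cached_prompts),
        default=0,
    )
    return reuse - load_weight * replica.queued_requests


def route(prompt: str, replicas: list[Replica]) -> Replica:
    """Pick the best-scoring replica and record the request on it."""
    best = max(replicas, key=lambda r: score(r, prompt))
    best.queued_requests += 1
    best.cached_prompts.append(prompt)
    return best


if __name__ == "__main__":
    replicas = [Replica("gpu-a"), Replica("gpu-b")]
    system = "You are a network operations assistant. "
    for question in ("Summarize cell 12 alarms.", "Summarize cell 7 alarms."):
        chosen = route(system + question, replicas)
        print(question, "->", chosen.name)
```

In this sketch, requests that share a long system prompt tend to land on the replica that has already processed it, until that replica's queue grows deep enough for the load penalty to outweigh the expected cache reuse. The weighting between the two signals is an assumption chosen for readability, not a recommended setting.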

Why this matters for the future of 5G and 6G

The integration of llm-d into AITRAS effectively provides the operating system for AI at the edge. It allows SoftBank to run high-performance inference and RAN workloads on power-efficient architectures, including Arm-based systems, proving that AI-RAN can achieve the scalability and flexibility required for next-generation mobile networks. By moving away from manual configurations and toward an automated, llm-d-driven deployment model, service providers can remove the operational complexity that has historically held back edge AI.

Service providers are entering an era where the network doesn't just carry data – it processes it intelligently and efficiently. Learn more about the results of this integration at the Red Hat booth at MWC Barcelona 2026, where experts will be available to explain how llm-d and AITRAS are making the promise of AI-RAN a reality.

In the meantime, explore the benefits of Red Hat AI and learn more about Red Hat’s collaboration with SoftBank to develop AI-RAN technologies and optimize network performance.