As enterprises scale large language models (LLMs) into production, site reliability engineers (SREs) and platform operators face a new set of challenges. Traditional application metrics—CPU usage, request throughput, memory consumption—are no longer enough. With LLMs, reliability and efficacy are defined by entirely new dynamics—token-level performance, cache efficiency, and inference pipeline latency.

This article explores how llm-d, an open source project co-developed by leading AI vendors such as Red Hat, Google, and IBM and integrated into Red Hat OpenShift AI 3.0, redefines observability for LLM workloads.

New service level objectives (SLOs) in the age of LLMs

Users expect responsive applications, including those enhanced with AI, and enterprises require consistent performance to turn a pilot project into a profitable production application at scale. While SLOs in traditional microservice architectures are usually framed around request latency and error rate, the user experience for LLMs depends on more nuanced measures, including:

  • Time to first token (TTFT): Measures the delay before the initial token of a response is streamed to the user. A lower TTFT is needed to provide a responsive and immediate user experience, especially in interactive applications.
  • Time per output token (TPOT): Indicates the speed at which tokens are generated after the process begins. A consistent and low TPOT provides a smooth and efficient streaming of the complete response to the user.
  • Cache hit rate: Represents the proportion of requests that can use previously computed context stored in GPU memory. A high cache hit rate significantly reduces computational overhead and improves overall system throughput by avoiding redundant processing.
  • Prefill vs. decode latency: Distinguishes between the time taken for expensive prompt processing (prefill) and token-by-token decoding. Understanding this distinction helps to optimize resource allocation and identify bottlenecks in the different stages of request processing.
  • Goodput: The number of requests successfully served within their defined SLO budgets. Maximizing goodput is needed for the system to reliably meet its performance commitments.

Together, these represent what users experience: responsiveness, throughput, and consistency.
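
Before any server-side metrics exist, TTFT and TPOT can be sanity-checked from the client. Below is a minimal sketch in Python, assuming an OpenAI-compatible streaming endpoint exposed by your serving stack; the URL, model name, and payload are placeholders rather than llm-d-specific values, and each streamed chunk is treated as roughly one token for illustration.

import json
import time

import requests

# Placeholder endpoint and payload; point these at your own gateway and model.
ENDPOINT = "http://inference-gateway.example.com/v1/completions"
PAYLOAD = {"model": "example-model", "prompt": "Hello", "max_tokens": 64, "stream": True}

start = time.monotonic()
chunk_times = []

with requests.post(ENDPOINT, json=PAYLOAD, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip SSE keep-alives and comment lines
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        json.loads(data)  # parse the chunk to confirm it is well formed
        chunk_times.append(time.monotonic())

if not chunk_times:
    raise RuntimeError("no streamed chunks received")

ttft = chunk_times[0] - start  # time to first token
tpot = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)  # average time per output token
print(f"TTFT: {ttft:.3f}s, TPOT: {tpot:.3f}s over {len(chunk_times)} chunks")

Client-side numbers like these are useful for spot checks, but they cannot explain where the latency came from. That is exactly the gap the rest of this article addresses.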

The challenge of managing these new SLOs

Existing monitoring tools don’t know about tokens or caches. A traditional latency metric can tell you that a request took 400ms, but it can’t break down whether that was due to routing delays, cache misses, or GPU scheduling.

In a distributed inference stack—where requests move through gateways, routers, schedulers, and GPU workers—observability blind spots are everywhere. Without token-aware metrics, operators can’t tell whether performance degradation is caused by routing imbalance, prefill-cache fragmentation, or overloaded GPUs.

This makes it difficult to debug issues quickly in production, plan GPU capacity, or guarantee user-facing SLOs.

What is llm-d?

llm-d is a community-driven, Kubernetes-native project that disaggregates inference into composable services, including:

  • Endpoint picker (EPP): A semantic router that makes cache-aware and load-aware scheduling decisions (see the toy scoring sketch after this list).
     
  • Decode and prefill services: Separating heavy prompt ingestion (prefill) from sequential token generation (decode). Prefill can even run on CPU if GPU resources are constrained.
     
  • Key-value (KV) cache management: Centralized indexing to maximize cache reuse across workloads.
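
To make the endpoint picker's role concrete, the toy sketch below illustrates the idea behind cache-aware and load-aware scheduling. It is not the actual EPP implementation (the project defines its own scorers and weights); the pod state fields and the weighting are assumptions made purely for illustration.

from dataclasses import dataclass

@dataclass
class PodState:
    name: str
    cached_prefix_tokens: int  # prompt tokens assumed to already sit in this pod's KV cache
    active_requests: int       # current in-flight requests
    max_requests: int          # rough capacity limit

def score(pod: PodState, prompt_tokens: int, cache_weight: float = 0.7) -> float:
    """Toy score: favor pods that can reuse more of the prompt's KV cache
    and that still have headroom. The weight is illustrative only."""
    cache_affinity = pod.cached_prefix_tokens / max(prompt_tokens, 1)
    headroom = 1.0 - (pod.active_requests / max(pod.max_requests, 1))
    return cache_weight * cache_affinity + (1.0 - cache_weight) * headroom

def pick_endpoint(pods: list[PodState], prompt_tokens: int) -> PodState:
    # Cache-aware, load-aware choice: the highest combined score wins.
    return max(pods, key=lambda p: score(p, prompt_tokens))

pods = [
    PodState("decode-0", cached_prefix_tokens=900, active_requests=7, max_requests=8),
    PodState("decode-1", cached_prefix_tokens=0, active_requests=2, max_requests=8),
]
print(pick_endpoint(pods, prompt_tokens=1000).name)

The trade-off this captures is that a pod holding most of a prompt's prefix in cache can be worth routing to even when it is busier than a cold pod, because skipping redundant prefill often saves more time than the extra queueing costs.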

On OpenShift AI, llm-d provides an operationally consistent way to run vLLM and other serving engines with advanced observability.

How llm-d helps solve the observability gap

llm-d integrates deeply with Prometheus, Grafana, and OpenTelemetry, exposing both system-level and LLM-specific metrics:

  • Cache-aware metrics: Cache hit ratios, cache size utilization, and KV movements.
  • Token-level latency metrics: TTFT, TPOT, and end-to-end latency.
  • Routing transparency: Metrics and traces showing why requests were routed to specific pods.
  • Tracing across components: With OpenTelemetry, operators can follow a request from the Inference Gateway (IGW) through the scheduler to the vLLM workers.

This transforms llm-d from a black box into an auditable, measurable inference stack.
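
As a sketch of where such a trace can begin, the snippet below wraps a client-side inference call in an OpenTelemetry span and injects W3C trace context into the request headers so that downstream spans can join the same trace. It assumes the opentelemetry-sdk and requests packages and a placeholder gateway URL; the span names and attributes emitted by IGW, the scheduler, and the vLLM workers are defined by those components, not by this example.

import requests
from opentelemetry import propagate, trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Minimal tracer setup; a real deployment would export to an OTLP collector instead of the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-client")

GATEWAY_URL = "http://inference-gateway.example.com/v1/completions"  # placeholder

with tracer.start_as_current_span("llm-request") as span:
    span.set_attribute("llm.prompt_tokens", 128)  # illustrative attribute
    headers = {}
    propagate.inject(headers)  # adds a traceparent header so downstream components can join the trace
    resp = requests.post(
        GATEWAY_URL,
        json={"model": "example-model", "prompt": "Hello", "max_tokens": 32},
        headers=headers,
        timeout=60,
    )
    span.set_attribute("http.status_code", resp.status_code)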

Example PromQL queries

Here are some PromQL queries that illustrate how SREs can use llm-d metrics (as of llm-d version 0.2.0) to monitor token-level performance in production.

1. TTFT P95

histogram_quantile(0.95, sum(rate(llmd_inference_ttft_bucket[5m])) by (le))

Tracks the 95th percentile latency until the first token is emitted.

2. TPOT average

rate(llmd_inference_output_tokens_sum[5m]) / rate(llmd_inference_requests_total[5m])

Measures how quickly tokens are generated once decoding starts.

3. Cache hit rate

(sum(rate(llmd_kvcache_hits_total[5m])) by (pool)) / (sum(rate(llmd_kvcache_requests_total[5m])) by (pool))

Shows the ratio of cache hits to total cache lookups, per inference pool.

4. Routing efficiency (score distribution)

histogram_quantile(0.90, sum(rate(llmd_router_routing_score_bucket[5m])) by (le))

This observes routing scores from the EPP to validate that cache-aware routing decisions prioritize efficiency.

In OpenShift, these metrics can be scraped by setting monitoring.podmonitor.enabled=true in the llm-d configuration, along with the corresponding ServiceMonitor setting for the EPP, which handles the prefill-cache-based routing.
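
Beyond ad hoc dashboard queries, the same PromQL can drive automated SLO checks. The sketch below runs the TTFT P95 query against the Prometheus HTTP API and compares the result to an example budget; the Prometheus URL and the 500 ms threshold are placeholders, the histogram is assumed to be recorded in seconds, and on OpenShift you would normally authenticate against the cluster monitoring stack rather than an open endpoint.

import requests

PROMETHEUS_URL = "http://prometheus.example.com"  # placeholder monitoring endpoint
TTFT_P95_QUERY = 'histogram_quantile(0.95, sum(rate(llmd_inference_ttft_bucket[5m])) by (le))'
TTFT_SLO_SECONDS = 0.5  # example budget, not a recommendation

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": TTFT_P95_QUERY},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if not result:
    print("No TTFT samples in the last 5 minutes")
else:
    ttft_p95 = float(result[0]["value"][1])  # instant vector value: [timestamp, "value"]
    status = "within" if ttft_p95 <= TTFT_SLO_SECONDS else "violating"
    print(f"TTFT P95 = {ttft_p95:.3f}s, {status} the {TTFT_SLO_SECONDS}s budget")

The same pattern extends to goodput: run each SLO query on a schedule, count how many stay within budget, and alert when that ratio drops.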

Example dashboard

Having these metrics displayed in a single Grafana dashboard transforms LLM observability from guesswork into actionable insights. Instead of sifting through logs or raw counters, SREs and platform engineers can instantly see TTFT values, cache hit rates, and routing efficiency side by side, correlated across the entire llm-d stack on OpenShift AI.

This visibility makes it possible to spot real-time performance regressions, validate that SLOs are being met, and more easily diagnose whether latency is due to routing, caching, or GPU saturation, all from one dashboard.

Conclusion

The age of LLMs requires a new observability playbook. Traditional metrics fail to capture token-level performance and cache-aware efficiency. By combining vLLM’s engine-level metrics with llm-d’s routing and cache-aware insights, Red Hat OpenShift AI provides SREs and platform teams the tools they need to more effectively meet modern AI SLOs at production scale.

With llm-d on OpenShift AI, users gain:

  • Transparency into distributed inference workloads
  • Confidence in meeting token-level SLOs
  • Lower GPU costs through cache efficiency
  • Actionable insights with Grafana dashboards powered by Prometheus and OpenTelemetry

This is observability purpose-built for AI at scale, and it’s available today in the open source community and on the OpenShift AI 3.0 platform.


About the author

Christopher Nuland is a Principal Technical Marketing Manager for AI at Red Hat and has been with the company for over six years. Before Red Hat, he focused on machine learning and big data analytics for companies in the finance and agriculture sectors. After joining Red Hat, he specialized in cloud native migrations, metrics-driven transformations, and the deployment and management of modern AI platforms as a Senior Architect for Red Hat's consulting services, working almost exclusively with Fortune 50 companies until recently moving into his current role. Christopher has spoken worldwide on AI at conferences like IBM Think, KubeCon EU/US, and Red Hat's Summit events.
