Red Hat OpenShift delivers high-performance LLM inference for financial services

June 26, 2026 | 8-minute read | AI/ML

By Sebastian Jug, Senior Performance and Scale Engineer

The financial services industry, like many other sectors, is aiming to make the best use of its hardware in the age of resource-intensive AI workloads. Financial services companies have come to rely on the advantages of a container-based architecture, but may worry that container orchestration platforms add performance penalties that could exacerbate resource constraints.

The results of a new STAC-AI™ LANG6 (Inference-Only) audit, the industry-standard benchmark for evaluating large language model (LLM) inference performance in financial services, may help assuage those worries when it comes to Red Hat OpenShift. In collaboration with NVIDIA and Supermicro, the Red Hat performance and scale engineering team ran the full STAC-AI LANG6 benchmark suite on OpenShift, using 2 NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs in a Supermicro SuperServer SYS-222C-TN. These are the first audited STAC-AI results produced on a containerized Kubernetes platform.

Red Hat has a long history of demonstrating that OpenShift delivers bare-metal-like performance for the most demanding workloads, and we back that claim with independently audited results. In the STAC-N1 network benchmark, we showed that OpenShift matches (and in some cases improves upon) bare-metal latency for market data, tying the lowest mean and median latencies at both 100,000 and 1.4 million messages per second while reducing maximum latency by 37%. In the record-breaking STAC-A2 market risk benchmark, we set multiple records for GPU-accelerated Monte Carlo simulation on OpenShift with NVIDIA DGX A100 systems. And in STAC-A2 with Intel CPUs, we showed that the containerization overhead of OpenShift is negligible even for CPU-intensive financial workloads.

This new STAC-AI submission extends that track record into the domain of LLM inference. This is an area of rapidly growing importance as financial institutions adopt AI-driven workflows for fraud detection, security, sentiment analysis, and regulatory compliance.

A quick primer on STAC-AI LANG6

The Strategic Technology Analysis Center (STAC) produces standardized benchmarks that the world's largest financial institutions use to evaluate technology stacks. STAC-AI LANG6 measures LLM inference performance on financial-services use cases. Its datasets are derived from real SEC EDGAR financial filings, the kinds of documents that drive RAG (retrieval-augmented generation) and long-context workloads at financial firms.

The benchmark covers 2 inference modes. Batch mode measures maximum throughput by handing the full dataset to the system under test (SUT) in a single API call. Interactive mode simulates real-world usage by sending requests at varying arrival rates following a Poisson distribution, while measuring reaction time (analogous to time to first token), response time, and output rate under load.
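To make the interactive mode more concrete, the following is a minimal sketch of how Poisson-distributed request arrivals can be generated. It is not the STAC harness: the function name `poisson_arrival_times`, the seed, the rate, and the duration are all illustrative assumptions. The idea it shows is that a Poisson process with rate λ has independent, exponentially distributed gaps between requests with mean 1/λ.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Return request send times (in seconds) for a Poisson arrival process.

    The gaps between consecutive arrivals are drawn from an exponential
    distribution with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return times
        times.append(t)


if __name__ == "__main__":
    # Illustrative values only: 5 requests per second for 10 minutes.
    arrivals = poisson_arrival_times(rate_per_s=5.0, duration_s=600.0)
    observed = len(arrivals) / 600.0
    print(f"{len(arrivals)} requests, observed rate {observed:.2f} per second")
```

Because the gaps are random, requests occasionally arrive in bursts, which is why tail metrics such as 95th-percentile reaction time are more informative under this load pattern than averages alone.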

We tested 2 models (Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct) across 4 EDGAR datasets that vary in prompt length and complexity. The benchmark also measures fidelity: how closely the optimized model's output matches the same model running at native precision on a STAC reference SUT. In total, the specification calls for 7 required workloads.

Table 1. Model, dataset, and execution modes of the STAC-AI benchmark

Model	Dataset	Batch	Interactive
Llama-3.1-8B	EDGAR4a	Required	Required
Llama-3.1-8B	EDGAR5a	Required	Required
Llama-3.1-70B	EDGAR4b	Required	Required
Llama-3.1-70B	EDGAR5b	Required	—
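As a quick cross-check of the workload count, Table 1 can be transcribed and expanded into one entry per required run. This is only a sketch; the tuple layout and the helper name `required_workloads` are assumptions made for illustration.

```python
# Table 1 transcribed as (model, dataset, batch_required, interactive_required).
TABLE_1 = [
    ("Llama-3.1-8B", "EDGAR4a", True, True),
    ("Llama-3.1-8B", "EDGAR5a", True, True),
    ("Llama-3.1-70B", "EDGAR4b", True, True),
    ("Llama-3.1-70B", "EDGAR5b", True, False),
]


def required_workloads(table):
    """Expand the table into one (model, dataset, mode) entry per required run."""
    workloads = []
    for model, dataset, batch, interactive in table:
        if batch:
            workloads.append((model, dataset, "batch"))
        if interactive:
            workloads.append((model, dataset, "interactive"))
    return workloads


if __name__ == "__main__":
    runs = required_workloads(TABLE_1)
    for run in runs:
        print(run)
    print(f"{len(runs)} required workloads")
```

Four batch runs plus three interactive runs give the 7 required workloads.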

The stack

Hardware:

Supermicro SuperServer SYS-222C-TN, a 2U DC-MHS server
2x Intel Xeon 6730P CPUs (64 physical cores total)
2 TiB DDR5-5200 (32x 64 GiB DIMMs)
2x NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (96 GiB GDDR7 each)

Software:

Red Hat OpenShift 4.20 (single-node OpenShift)
Red Hat Enterprise Linux CoreOS 9.6
Node feature discovery operator
NVIDIA GPU operator
STAC-AI Operator (custom operator developed by Red Hat)
NVIDIA TensorRT-LLM 1.2.0rc2 (PyTorch backend) with NVFP4 quantization via NVIDIA Model Optimizer 0.37.0

The operator stack follows the same pattern we established in our STAC-A2 work. The node feature discovery operator is the prerequisite: it discovers hardware features on each node and surfaces them as Kubernetes labels. The NVIDIA GPU operator uses those labels to determine where to deploy the GPU driver containers, making GPUs available to the scheduler as nvidia.com/gpu resources. From that point on, a pod requests GPUs in the same way that it requests CPU or memory.
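The sketch below shows what requesting GPUs "in the same way" as CPU or memory looks like in a pod spec. It builds the manifest as a Python dictionary and prints it as JSON. The pod name, container image, and CPU and memory values are hypothetical placeholders; only the `nvidia.com/gpu` resource name comes from the text above.

```python
import json


def gpu_pod_manifest(name, image, gpu_count):
    """Build a Pod manifest that requests GPUs via the nvidia.com/gpu resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "inference",
                    "image": image,
                    "resources": {
                        # GPUs are listed under limits, next to cpu and memory.
                        # The cpu and memory values here are placeholders.
                        "limits": {
                            "nvidia.com/gpu": gpu_count,
                            "cpu": "8",
                            "memory": "64Gi",
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = gpu_pod_manifest(
        "gpu-smoke-test", "registry.example.com/inference:latest", 2
    )
    print(json.dumps(manifest, indent=2))
```

The scheduler places such a pod only on a node that advertises enough free `nvidia.com/gpu` capacity, which is the capacity the NVIDIA GPU operator exposes.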

With these 2 operators installed, GPUs become first-class Kubernetes resources. No manual driver installation, special host configuration, or custom kernel modules are required. The same approach works whether you have a single GPU node at the edge or a cluster of them in a datacenter.

On top of that foundation, we added the STAC-AI Operator to orchestrate the benchmark itself.

The STAC-AI Operator

Running a STAC-AI benchmark end to end involves a lot of moving parts: quantizing the model to the target precision, locking GPU clocks and power limits, executing the benchmark harness across multiple workloads and arrival rates, collecting power and temperature traces synchronously with each run, and, finally, running fidelity analysis against a reference model. Doing all of this manually is time-consuming, error-prone, and difficult to reproduce. The STAC-AI Operator makes the entire workflow declarative.

Custom resources

The operator introduces 3 custom resource definitions (CRDs):

Implementation registers an available LLM backend (such as TensorRT-LLM) along with its container images and supported models and datasets.
BenchmarkRun is the primary resource. It declares a single benchmark execution: which implementation to use, which model and dataset to test, the quantization method, GPU allocation, and whether to run post-processing fidelity analysis. The operator handles everything from there.
BenchmarkImageBuild manages the one-time container image build for the TensorRT-LLM stack.

Lifecycle management

When a BenchmarkRun is created, the operator drives it through a state machine:

Pending → Queued → Initializing → Quantizing → Building → Running → RunCompleted → PostProcessing → Completed

At each phase, the operator creates Kubernetes Jobs to perform the work. It automatically discovers and applies maximum GPU clock speeds and power limits using nvidia-smi, manages a sequential execution queue so that only 1 benchmark runs at a time (preventing GPU contention between runs), and handles the fidelity analysis phase using vLLM as the reference. If any step fails, the run transitions to a Failed state with a typed reason code, which makes diagnosis straightforward.
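The phase progression can be pictured as a small state machine. The following is a sketch of that idea only, not the operator's actual code (which is not shown in this post). The `next_phase` helper and the example reason code `QuantizationFailed` are hypothetical; the phase names are the ones listed above.

```python
from enum import Enum


class Phase(Enum):
    PENDING = "Pending"
    QUEUED = "Queued"
    INITIALIZING = "Initializing"
    QUANTIZING = "Quantizing"
    BUILDING = "Building"
    RUNNING = "Running"
    RUN_COMPLETED = "RunCompleted"
    POST_PROCESSING = "PostProcessing"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Happy-path order; Enum iteration follows definition order.
ORDER = [p for p in Phase if p is not Phase.FAILED]
TERMINAL = (Phase.COMPLETED, Phase.FAILED)


def next_phase(current, step_ok=True, reason=None):
    """Return (phase, reason) after one step of work.

    A failed step moves the run to Failed with a reason code.
    Terminal phases never change.
    """
    if current in TERMINAL:
        return current, reason
    if not step_ok:
        return Phase.FAILED, reason or "Unknown"
    return ORDER[ORDER.index(current) + 1], None


if __name__ == "__main__":
    phase = Phase.PENDING
    path = [phase.value]
    while phase not in TERMINAL:
        phase, _ = next_phase(phase)
        path.append(phase.value)
    print(" -> ".join(path))

    # A failure during quantization, with a hypothetical reason code.
    print(next_phase(Phase.QUANTIZING, step_ok=False, reason="QuantizationFailed"))
```

Keeping the failure reason as a typed value next to the phase is what lets a user tell a quantization failure from a benchmark failure without reading job logs.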

Running a benchmark

Here is what it takes to run an audited workload:

apiVersion: stac.ai/v1alpha1
kind: BenchmarkRun
metadata:
  name: stac-rtxpro6000-8b-e4a-batch
  namespace: stac-ai
spec:
  implementation: tensorrt-llm
  model: llama-3.1-8b
  dataset: edgar4a
  sut: RTXPRO6000-8B-E4a-Batch
  sutConfig:
    configMapName: sut-tensorrt-llm-rtxpro6000-batch
    key: RTXPRO6000-8B-E4a-Batch.yaml
  config:
    quantization: NVFP4
    gpuCount: 2
    useExecutor: false
  postProcessing:
    fidelity: true
  storage:
    workspacePVC: stac-workspace-pvc
    modelCachePVC: stac-model-cache-pvc
    logsPVC: stac-logs-pvc

That YAML is the entire user interaction. Apply it and the operator handles model quantization, GPU tuning, benchmark execution, power and temperature capture, and fidelity analysis. Workload-specific parameters (batch sizes, sequence lengths, arrival rates) are injected via a ConfigMap, so different configurations can be tested without rebuilding any container images.

To run the complete STAC-AI LANG6 suite, we applied a single BenchmarkRun referencing a SUT configuration that defines all 7 workloads. This way, the operator launches the benchmark harness (as defined by the custom resource) and the harness runs each workload sequentially (as defined by the SUT config). Power and inlet-air temperature are captured throughout each measurement period via a Yokogawa WT1804R precision power analyzer (queried over VXI-11/SCPI from the benchmark pod) and a Dallas DS18B20 probe over USB serial.

Results

The audited results in Tables 2 and 3 are taken from the official STAC report. All workloads were quantized to NVFP4 with FP8 KV cache, with each GPU running an independent model instance.

Table 2. Batch mode results

Workload	Model	Inference Rate (inf/s)	Throughput (words/s)	Energy Eff. (words/kWh)	Space Eff. (words/ft³·hour)
EDGAR4a	Llama-3.1-8B	32.9	5,549	9.320M	9.501M
EDGAR5a	Llama-3.1-8B	0.345	139	234.6K	237.4K
EDGAR4b	Llama-3.1-70B	5.28	834	1.358M	1.428M
EDGAR5b	Llama-3.1-70B	0.0411	13.2	22.34K	22.61K
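The columns of Table 2 are related by simple arithmetic, which makes a consistency check possible. The sketch below assumes that energy efficiency is words produced per kWh consumed and that space efficiency is words per hour divided by system volume, both over the same measurement period. STAC's exact definitions may differ, so the derived power and volume are implied values, not figures from the STAC report.

```python
# Batch results transcribed from Table 2:
# workload -> (inf/s, words/s, words/kWh, words/(ft^3*hour))
TABLE_2 = {
    "EDGAR4a": (32.9, 5549.0, 9.320e6, 9.501e6),
    "EDGAR5a": (0.345, 139.0, 234.6e3, 237.4e3),
    "EDGAR4b": (5.28, 834.0, 1.358e6, 1.428e6),
    "EDGAR5b": (0.0411, 13.2, 22.34e3, 22.61e3),
}


def derived_metrics(inf_rate, words_per_s, words_per_kwh, words_per_ft3_hour):
    """Derive per-inference output size plus implied power and volume."""
    words_per_hour = words_per_s * 3600.0
    return {
        "words_per_inference": words_per_s / inf_rate,
        # (words/hour) / (words/kWh) = kWh per hour = kW
        "implied_mean_power_kw": words_per_hour / words_per_kwh,
        # (words/hour) / (words/(ft^3*hour)) = ft^3
        "implied_volume_ft3": words_per_hour / words_per_ft3_hour,
    }


if __name__ == "__main__":
    for workload, row in TABLE_2.items():
        d = derived_metrics(*row)
        print(
            f"{workload}: {d['words_per_inference']:.0f} words/inference, "
            f"{d['implied_mean_power_kw']:.2f} kW, "
            f"{d['implied_volume_ft3']:.2f} ft^3"
        )
```

Under those assumptions, all four workloads imply a mean draw between roughly 2.1 and 2.2 kW and the same volume of about 2.1 ft³, which suggests the table is internally consistent.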

Table 3. Interactive mode results (highest sustained arrival rate)

Workload	Model	λ (inf/s)	Throughput (words/s)	95p Reaction Time (s)	95p Response Time (s)
EDGAR4a	Llama-3.1-8B	30.0	5,013	0.320	14.6
EDGAR5a	Llama-3.1-8B	0.320	128	29.1	126
EDGAR4b	Llama-3.1-70B	5.00	743	2.26	44.8
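One way to read Table 3 against Table 2 is to ask what fraction of the batch inference rate the system sustained under interactive load. The sketch below computes that ratio from the published figures. The helper name `sustained_fraction` is an assumption, and the ratio is an informal comparison, not a metric defined by STAC.

```python
# Inference rates in inf/s, transcribed from Table 2 (batch) and Table 3 (interactive).
BATCH_INF_PER_S = {"EDGAR4a": 32.9, "EDGAR5a": 0.345, "EDGAR4b": 5.28}
INTERACTIVE_LAMBDA = {"EDGAR4a": 30.0, "EDGAR5a": 0.320, "EDGAR4b": 5.00}


def sustained_fraction(workload):
    """Highest sustained interactive arrival rate as a fraction of batch rate."""
    return INTERACTIVE_LAMBDA[workload] / BATCH_INF_PER_S[workload]


if __name__ == "__main__":
    for workload in BATCH_INF_PER_S:
        print(f"{workload}: {sustained_fraction(workload):.1%} of batch rate")
```

For all three workloads the highest sustained arrival rate lands between about 91% and 95% of the batch inference rate.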

A few observations from the audited results:

Strong space efficiency from a 2U form factor. Per the STAC report, the 2U SYS-222C-TN delivered higher batch space efficiency than the larger NVIDIA GH200 system on every comparable workload (9.501 million vs 6.148 million words per ft³·hour on EDGAR4a, 1.428 million vs 799,500 on EDGAR4b, and 237,400 vs 226,900 on EDGAR5a) while completing the full reported workload set with just 2 GPUs.
NVFP4 makes 70B models comfortably single-GPU. Quantizing Llama-3.1-70B-Instruct to NVFP4 brings the on-disk model down to roughly 35 GB, which fits well inside the 96 GB of memory on a single RTX PRO 6000 with plenty of headroom for KV cache. Each GPU runs an independent model instance, and both GPUs are used concurrently in batch and interactive workloads.
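Both observations reduce to short calculations, sketched below. The memory estimate assumes a nominal 70 billion parameters at 4 bits each and ignores NVFP4 scaling factors and any layers left at higher precision, so it is a lower bound on the real footprint. The helper names are illustrative.

```python
# Batch space efficiency in words per ft^3*hour: (SYS-222C-TN, GH200), from the text.
SPACE_EFF = {
    "EDGAR4a": (9.501e6, 6.148e6),
    "EDGAR4b": (1.428e6, 799.5e3),
    "EDGAR5a": (237.4e3, 226.9e3),
}


def space_efficiency_ratio(workload):
    """How many times denser the 2U system is than the GH200 system."""
    ours, gh200 = SPACE_EFF[workload]
    return ours / gh200


def weight_size_gb(params_billion, bits_per_weight=4.0):
    """Approximate weight size in GB (10^9 bytes), ignoring scale factors."""
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    for workload in SPACE_EFF:
        print(f"{workload}: {space_efficiency_ratio(workload):.2f}x")

    weights = weight_size_gb(70)  # nominal 70B parameters at 4 bits
    print(f"weights: {weights:.0f} GB, headroom on a 96 GB GPU: {96 - weights:.0f} GB")
```

At 4 bits per weight, 70 billion parameters take about 35 GB, leaving roughly 61 GB of the 96 GB for KV cache and runtime overhead. The same model at 16 bits would need about 140 GB and would not fit on one of these GPUs.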

The STAC-AI Operator was responsible for every one of these runs. We applied a BenchmarkRun custom resource that referenced the SUT configuration for the full suite, and the operator handled quantization, tuning, execution, and post-processing without further intervention.

Why Red Hat OpenShift for LLM inference

A question we hear often is whether a containerized platform introduces overhead that matters for performance-sensitive workloads. The results in this audit give a strong indication for LLM inference: the GPU appears to be the bottleneck, not the platform, and there is no signal in the data that OpenShift's container runtime or scheduling adds meaningful overhead to the inference pipeline. STAC's own commentary in the report is consistent with this: "The container and orchestration layers did not appear to introduce material performance limitations in practice." Taken alongside our prior audited STAC-N1 (network) and STAC-A2 (GPU compute) results, we see this as a third data point reinforcing our view that OpenShift can deliver the kind of performance organizations expect from bare metal for demanding financial workloads.

But the real value of running on OpenShift goes well beyond raw performance. In production financial services environments, you also need security policies, role-based access control, audit logging, lifecycle management, and multitenancy. OpenShift delivers all of that out of the box. You do not have to choose between bare-metal-like performance and enterprise-grade operations.

You also don't need a parallel stack to run GPU workloads. The same cluster that ran this benchmark can run application workloads, virtualization, batch jobs, and CI/CD pipelines alongside it. It all runs under the same operations team, the same observability surface, and the same security policies. A team that already operates OpenShift picks up a GPU inference platform without standing up a second operational surface.

The operator pattern we used for this benchmark illustrates a broader point: complex GPU workloads can be managed through declarative Kubernetes-native APIs, whether they're benchmark suites or production LLM serving. Users describe what they want (model, quantization, dataset); the operator figures out how to deliver it. This pattern scales in the same way from a single-node test environment to a multinode production cluster.

This is also how Red Hat OpenShift AI is built. It runs on core OpenShift with the same node feature discovery and NVIDIA GPU operators, the same nvidia.com/gpu resource model, and the same operator pattern for managing model development, training, serving, and monitoring across cloud, on-premises, and edge. The infrastructure that ran this benchmark scales directly into a production AI platform without changing the foundation underneath it.

GPU management is similarly simplified. The node feature discovery and NVIDIA GPU operators handle hardware discovery, driver lifecycle, and resource scheduling across the cluster. Adding new GPU nodes is a matter of installing the operators; the rest is automatic. The same deployment model runs unchanged across a single-node OpenShift cluster like the one used here, a 3-node compact cluster suited to a branch office, remote worker nodes at the edge, or a multinode cluster in a regional datacenter. None of the YAML in this blog post would change to retarget any of those.

Finally, every benchmark run in this project is defined in version-controlled YAML. Every configuration is reproducible. When STAC performed their independent audit, we reproduced the exact same environment and results by applying the same resources. That level of reproducibility is a natural consequence of the Kubernetes deployment model; it is just as valuable for production workloads as it is for benchmark submissions.

Conclusion

These audited results show Red Hat OpenShift running the kinds of LLM inference workloads that financial services firms use to evaluate their AI infrastructure, with no indication of platform overhead holding the GPUs back. Combined with our prior STAC-N1 and STAC-A2 results, this reinforces a consistent picture: organizations can run their most demanding workloads with bare-metal-like performance, all on the same OpenShift platform.

The full audited results are available on the STAC website. For additional commentary on the TensorRT-LLM implementation and the SYS-222C-TN hardware platform, check out this blog post from NVIDIA.

About the author

Sebastian Jug

Senior Performance and Scale Engineer

Sebastian Jug, a Senior Performance and Scale Engineer, has been working on OpenShift performance at Red Hat since 2016. He is a software engineer and Red Hat Certified Engineer with experience in enabling performance-sensitive applications with devices such as GPUs and NICs. His focus is on automating, qualifying, and tuning the performance of distributed systems. He has spoken at a number of industry conferences, including KubeCon and STAC Global.
