In our previous article, "Context as architecture: A practical look at retrieval-augmented generation," we treated retrieval-augmented generation (RAG) as an architectural idea. We explored why retrieval exists, how it changes the system around a language model, and where its boundaries lie. That framing is necessary, but incomplete.
Once teams move beyond prototypes and begin operating RAG systems in production, a new reality sets in. Retrieval does not fail loudly. It fails subtly, probabilistically, and often convincingly. Systems return an answer, grounded in some source, even when that source is incomplete, outdated, or only loosely relevant.
This is the point where RAG stops being an idea and becomes a systems problem.
The myth of "simple" RAG
At a conceptual level, RAG looks straightforward: Store documents, retrieve relevant passages, pass them to the model. Many early implementations follow exactly this pattern—and appear to work.
Until they don’t.
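The "simple" pattern can be sketched end to end in a few dozen lines. The toy retriever below uses bag-of-words cosine similarity in place of a real embedding model, and the corpus, query, and prompt format are invented for illustration:

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda c: sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # "Store documents, retrieve relevant passages" as one ranking step.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    # "...pass them to the model": retrieved passages become the context.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The billing service retries failed payments three times.",
    "The onboarding flow sends a welcome email after signup.",
    "Payments are processed nightly by the billing service.",
]
print(build_prompt("How does billing handle failed payments?", corpus))
```

On a tiny, well-behaved corpus like this, the right passage ranks first and the system appears to work, which is exactly why early implementations look finished.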
The first failures are rarely catastrophic. Instead, teams notice patterns:
- Correct answers appear inconsistently
- Small phrasing changes produce different results
- Similar questions retrieve different context
- The model sounds confident even when it's wrong
What's happening is not a model failure. It's a retrieval failure.
Similarity-based retrieval is approximate by design. It optimizes for closeness in representation space, not factual correctness. As data volume grows and queries become more nuanced, this approximation begins to show cracks.
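The brittleness is easy to reproduce even in a toy setup. Below, two phrasings of essentially the same question land on different "nearest" documents under a simple lexical similarity; real dense embeddings soften this effect but do not eliminate it (the corpus and queries here are invented):

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy lexical "embedding"; dense embeddings are approximate too.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda c: sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

docs = [
    "Invoices are emailed to customers at the end of each month.",
    "Monthly statements summarize all customer charges.",
]

# Two phrasings of essentially the same question...
queries = [
    "When are invoices emailed?",
    "When do clients receive their monthly bill?",
]
# ...retrieve different context, because similarity tracks wording,
# not meaning or factual correctness.
results = {q: max(docs, key=lambda d: cosine(embed(q), embed(d))) for q in queries}
for q, d in results.items():
    print(f"{q!r} -> {d!r}")
```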
This is the moment teams realize that RAG is not a single technique, but a pipeline under constant pressure.
The retrieval gap revisited
Even in well-curated systems, the correct information may exist in the corpus and still not be retrieved. This retrieval gap becomes the dominant failure mode in production RAG systems.
The consequences are subtle but severe. If the system retrieves the wrong passage, the model still does its job, synthesizing, summarizing, and explaining. The output is fluent, grounded, and wrong.
Importantly, this failure cannot be corrected downstream. Once incorrect context enters the context window, generation is already compromised. This is why mature RAG systems prioritize retrieval robustness over generation quality.
Why "advanced RAG" has emerged
Most techniques described as "advanced RAG" exist for one reason: To reduce the probability of retrieving the wrong context. As systems mature, teams begin to introduce additional stages around retrieval:
- Pre-retrieval query transformation to improve recall
- Hybrid retrieval to balance semantic similarity with exact matching
- Re-ranking to refine results using more precise relevance models
- Post-retrieval filtering and compression to manage token pressure
Each addition compensates for a specific weakness in naive retrieval. None of them fundamentally changes the generation step. They exist to protect the context window. Seen this way, advanced RAG is not an upgrade, but an adaptation.
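Hybrid retrieval, for example, can be sketched as a weighted blend of two signals. The scorers below are deliberately simple stand-ins (word-count cosine for the semantic side, exact-term overlap for the lexical side), and the blend weight, corpus, and error code are invented; a re-ranker would then apply a costlier, more precise model to this blended top-k:

```python
import re
from collections import Counter
from math import sqrt

def tokens(text):
    return re.findall(r"\w+", text.lower())

def semantic_score(query, doc):
    # Stand-in for embedding similarity: cosine over word counts.
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = lambda c: sqrt(sum(v * v for v in c.values()))
    return dot / (norm(q) * norm(d)) if dot else 0.0

def lexical_score(query, doc):
    # Exact-term overlap: catches identifiers and error codes that
    # similarity in embedding space tends to blur together.
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, docs, alpha=0.5):
    # alpha balances semantic similarity against exact matching;
    # in practice it is tuned per corpus.
    score = lambda d: (alpha * semantic_score(query, d)
                       + (1 - alpha) * lexical_score(query, d))
    return sorted(docs, key=score, reverse=True)

docs = [
    "Error ERR_42 means the payment gateway timed out.",
    "Timeout errors usually indicate a network issue.",
]
print(hybrid_rank("What does ERR_42 mean?", docs)[0])
```

The exact-match signal is what pins the query to the document containing the literal token `ERR_42`, which a purely semantic retriever might rank below a generic passage about timeouts.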
From pipelines to decisions
As pipelines grow more complex, another limitation emerges: Rigid retrieval workflows struggle to handle varied query intent.
Not every question requires retrieval. Not every query should hit the same data source. Some questions require decomposition, others require synthesis, and some are best answered directly from the model’s parametric knowledge.
This is where agent-driven approaches begin to emerge. Instead of treating retrieval as a mandatory step, systems begin to treat it as a decision. Agents determine whether retrieval is necessary, which sources to consult, and how to combine results. Retrieval becomes one tool among many, rather than a fixed stage in a pipeline.
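A minimal sketch of that decision layer follows. The routing rules and source names are hypothetical; in production this step is typically an LLM call or a trained intent classifier, not keyword matching:

```python
def route(query: str) -> str:
    """Decide how to handle a query before any retrieval runs."""
    q = query.lower()
    # Internal or policy questions must consult governed sources.
    if any(w in q for w in ("policy", "runbook", "our internal")):
        return "retrieve:internal_docs"
    # Time-sensitive questions need fresh data, not a stale index.
    if any(w in q for w in ("latest", "today", "current")):
        return "retrieve:live_data"
    # General knowledge can come from the model's parametric memory.
    return "answer_directly"

for q in ("What is our refund policy?",
          "What is the latest deployment status?",
          "What is a vector database?"):
    print(q, "->", route(q))
```

The important shift is structural, not the rules themselves: retrieval sits behind a decision, so a query that needs no context never pays retrieval's latency or risks being misled by an irrelevant passage.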
This shift reflects a deeper insight: as RAG systems scale, control logic matters as much as retrieval quality.
Operational reality of RAG
Production RAG systems are not just about relevance. They are about operations. Teams must consider:
- Latency tradeoffs introduced by multi-stage retrieval
- Cost implications of embeddings, re-ranking, and inference
- Data freshness and re-indexing strategies
- Access control and document-level permissions
- Observability across retrieval and generation steps
These concerns are orthogonal to model choice. A better model does not fix poor retrieval. A faster model does not fix slow indexing. Operational discipline matters as much as architectural design.
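Observability, for instance, can start as simply as per-stage latency tracing. This is a sketch; a real system would emit these spans to a tracing backend (OpenTelemetry is a common choice), and the sleeps stand in for actual work:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, trace: dict):
    # Record wall-clock latency per pipeline stage so retrieval and
    # generation can be budgeted and alerted on independently.
    start = time.perf_counter()
    try:
        yield
    finally:
        trace[name] = time.perf_counter() - start

trace = {}
with stage("retrieve", trace):
    time.sleep(0.01)  # stand-in for vector search + re-ranking
with stage("generate", trace):
    time.sleep(0.02)  # stand-in for model inference
print({name: f"{seconds * 1000:.1f} ms" for name, seconds in trace.items()})
```

Separating the stages in telemetry is what makes the latency tradeoffs of multi-stage retrieval visible at all: without it, a slow re-ranker and a slow model are indistinguishable.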
This is why RAG success is rarely about a single tool or framework. It's about managing a distributed system with probabilistic components.
The hard boundary RAG cannot cross
Even the most sophisticated RAG system does not change how a model reasons. A model that misinterprets domain concepts will continue to do so, even when given perfect context. A model that struggles with consistency or tone will remain inconsistent. Retrieval can supply facts, but it cannot supply judgment.
At this point, teams often try to push retrieval further by adding more context, rules, and filtering. This usually increases cost and complexity without fixing the underlying problem. This is the architectural boundary RAG cannot cross.
Why this naturally leads to tuning
Once retrieval quality is under control and failure modes are understood, a new question emerges: If the model has the right information, why does it still behave incorrectly?
That question points beyond retrieval. Techniques such as fine-tuning, adapters, and synthetic data generation can modify model behavior. They shape how a model interprets information, how it reasons within a domain, and how consistently it applies rules.
These techniques do not replace RAG. They build on it.
In practice, effective systems combine:
- Prompting to shape interaction
- Retrieval to supply context
- Tuning to align reasoning
Understanding when to move from one layer to the next is the real architectural challenge.
Where this leaves us
RAG is not a shortcut to correctness. Rather, it is a commitment to system design.
It introduces new failure modes, new operational costs, and new architectural responsibilities. But it also enables capabilities that prompting alone can never reach: Grounding, traceability, and controlled access to real knowledge.
In production systems, however, retrieval eventually reaches a ceiling. Once retrieval is reliable, adding more context no longer helps. The model has the right information but still draws incorrect conclusions, applies rules inconsistently, or defaults to behavior that doesn’t align with the domain.
That’s not a retrieval problem. It’s a model problem.
Dive into the RAG AI quickstart in the catalog, or get started with the Red Hat 30-day Developer Sandbox.
About the authors
Frank La Vigne is a seasoned Data Scientist and the Principal Technical Marketing Manager for AI at Red Hat. He possesses an unwavering passion for harnessing the power of data to address pivotal challenges faced by individuals and organizations.
A trusted voice in the tech community, Frank co-hosts the renowned “Data Driven” podcast, a platform dedicated to exploring the dynamic domains of Data Science and Artificial Intelligence. Beyond his podcasting endeavors, he shares his insights and expertise through FranksWorld.com, a blog that serves as a testament to his dedication to the tech community. Always ahead of the curve, Frank engages with audiences through regular livestreams on LinkedIn, covering cutting-edge technological topics from quantum computing to the burgeoning metaverse.
As a principal technologist for AI at Red Hat with over 30 years of experience, Robbie works to support enterprise AI adoption through open source innovation. His focus is on cloud-native technologies, Kubernetes, and AI platforms, helping to deliver scalable and secure solutions using Red Hat AI.
Robbie is deeply committed to open source, open source AI, and open data, believing in the power of transparency, collaboration, and inclusivity to advance technology in meaningful ways. His work involves exploring private generative AI, traditional machine learning, and enhancing platform capabilities to support open and hybrid cloud solutions for AI. His focus is on helping organizations adopt ethical and sustainable AI technologies that make a real impact.