In our previous article, "Context as architecture: A practical look at retrieval-augmented generation," we treated retrieval-augmented generation (RAG) as an architectural idea. We explored why retrieval exists, how it changes the system around a language model, and where its boundaries lie. That framing is necessary, but incomplete.
Once teams move beyond prototypes and begin operating RAG systems in production, a new reality sets in. Retrieval does not fail loudly. It fails subtly, probabilistically, and often convincingly. Systems return an answer, grounded in some source, even when that source is incomplete, outdated, or only loosely relevant.
This is the point where RAG stops being an idea and becomes a systems problem.
The myth of "simple" RAG
At a conceptual level, RAG looks straightforward: Store documents, retrieve relevant passages, pass them to the model. Many early implementations follow exactly this pattern—and appear to work.
Until they don’t.
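The "simple" pattern can be sketched end to end in a few dozen lines. The toy retriever below uses bag-of-words cosine similarity in place of a real embedding model, and the corpus, query, and prompt format are invented for illustration:

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda c: sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # "Store documents, retrieve relevant passages" as one ranking step.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    # "...pass them to the model": retrieved passages become the context.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The billing service retries failed payments three times.",
    "The onboarding flow sends a welcome email after signup.",
    "Payments are processed nightly by the billing service.",
]
print(build_prompt("How does billing handle failed payments?", corpus))
```

On a tiny, well-behaved corpus like this, the right passage ranks first and the system appears to work, which is exactly why early implementations look finished.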
The first failures are rarely catastrophic. Instead, teams notice patterns:
- Correct answers appear inconsistently
- Small phrasing changes produce different results
- Similar questions retrieve different context
- The model sounds confident even when it's wrong
What's happening is not a model failure. It's a retrieval failure.
Similarity-based retrieval is approximate by design. It optimizes for closeness in representation space, not factual correctness. As data volume grows and queries become more nuanced, this approximation begins to show cracks.
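The brittleness is easy to reproduce even in a toy setup. Below, two phrasings of essentially the same question land on different "nearest" documents under a simple lexical similarity; real dense embeddings soften this effect but do not eliminate it (the corpus and queries here are invented):

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy lexical "embedding"; dense embeddings are approximate too.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda c: sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

docs = [
    "Invoices are emailed to customers at the end of each month.",
    "Monthly statements summarize all customer charges.",
]

# Two phrasings of essentially the same question...
queries = [
    "When are invoices emailed?",
    "When do clients receive their monthly bill?",
]
# ...retrieve different context, because similarity tracks wording,
# not meaning or factual correctness.
results = {q: max(docs, key=lambda d: cosine(embed(q), embed(d))) for q in queries}
for q, d in results.items():
    print(f"{q!r} -> {d!r}")
```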
This is the moment teams realize that RAG is not a single technique, but a pipeline under constant pressure.
The retrieval gap revisited
Even in well-curated systems, the correct information may exist in the corpus and still not be retrieved. This retrieval gap becomes the dominant failure mode in production RAG systems.
The consequences are subtle but severe. If the system retrieves the wrong passage, the model still does its job, synthesizing, summarizing, and explaining. The output is fluent, grounded, and wrong.
Importantly, this failure cannot be corrected downstream. Once incorrect context enters the context window, generation is already compromised. This is why mature RAG systems prioritize retrieval robustness over generation quality.
Why "advanced RAG" has emerged
Most techniques described as "advanced RAG" exist for one reason: To reduce the probability of retrieving the wrong context. As systems mature, teams begin to introduce additional stages around retrieval:
- Pre-retrieval query transformation to improve recall
- Hybrid retrieval to balance semantic similarity with exact matching
- Re-ranking to refine results using more precise relevance models
- Post-retrieval filtering and compression to manage token pressure
Each addition compensates for a specific weakness in naive retrieval. None of them fundamentally changes the generation step. They exist to protect the context window. Seen this way, advanced RAG is not an upgrade, but an adaptation.
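Hybrid retrieval, for example, can be sketched as a weighted blend of two signals. The scorers below are deliberately simple stand-ins (word-count cosine for the semantic side, exact-term overlap for the lexical side), and the blend weight, corpus, and error code are invented; a re-ranker would then apply a costlier, more precise model to this blended top-k:

```python
import re
from collections import Counter
from math import sqrt

def tokens(text):
    return re.findall(r"\w+", text.lower())

def semantic_score(query, doc):
    # Stand-in for embedding similarity: cosine over word counts.
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = lambda c: sqrt(sum(v * v for v in c.values()))
    return dot / (norm(q) * norm(d)) if dot else 0.0

def lexical_score(query, doc):
    # Exact-term overlap: catches identifiers and error codes that
    # similarity in embedding space tends to blur together.
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, docs, alpha=0.5):
    # alpha balances semantic similarity against exact matching;
    # in practice it is tuned per corpus.
    score = lambda d: (alpha * semantic_score(query, d)
                       + (1 - alpha) * lexical_score(query, d))
    return sorted(docs, key=score, reverse=True)

docs = [
    "Error ERR_42 means the payment gateway timed out.",
    "Timeout errors usually indicate a network issue.",
]
print(hybrid_rank("What does ERR_42 mean?", docs)[0])
```

The exact-match signal is what pins the query to the document containing the literal token `ERR_42`, which a purely semantic retriever might rank below a generic passage about timeouts.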
From pipelines to decisions
As pipelines grow more complex, another limitation emerges: Rigid retrieval workflows struggle to handle varied query intent.
Not every question requires retrieval. Not every query should hit the same data source. Some questions require decomposition, others require synthesis, and some are best answered directly from the model’s parametric knowledge.
This is where agent-driven approaches begin to emerge. Instead of treating retrieval as a mandatory step, systems begin to treat it as a decision. Agents determine whether retrieval is necessary, which sources to consult, and how to combine results. Retrieval becomes one tool among many, rather than a fixed stage in a pipeline.
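A minimal sketch of that decision layer follows. The routing rules and source names are hypothetical; in production this step is typically an LLM call or a trained intent classifier, not keyword matching:

```python
def route(query: str) -> str:
    """Decide how to handle a query before any retrieval runs."""
    q = query.lower()
    # Internal or policy questions must consult governed sources.
    if any(w in q for w in ("policy", "runbook", "our internal")):
        return "retrieve:internal_docs"
    # Time-sensitive questions need fresh data, not a stale index.
    if any(w in q for w in ("latest", "today", "current")):
        return "retrieve:live_data"
    # General knowledge can come from the model's parametric memory.
    return "answer_directly"

for q in ("What is our refund policy?",
          "What is the latest deployment status?",
          "What is a vector database?"):
    print(q, "->", route(q))
```

The important shift is structural, not the rules themselves: retrieval sits behind a decision, so a query that needs no context never pays retrieval's latency or risks being misled by an irrelevant passage.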
This shift reflects a deeper insight: as RAG systems scale, control logic matters as much as retrieval quality.
Operational reality of RAG
Production RAG systems are not just about relevance. They are about operations. Teams must consider:
- Latency tradeoffs introduced by multi-stage retrieval
- Cost implications of embeddings, re-ranking, and inference
- Data freshness and re-indexing strategies
- Access control and document-level permissions
- Observability across retrieval and generation steps
These concerns are orthogonal to model choice. A better model does not fix poor retrieval. A faster model does not fix slow indexing. Operational discipline matters as much as architectural design.
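Observability, for instance, can start as simply as per-stage latency tracing. This is a sketch; a real system would emit these spans to a tracing backend (OpenTelemetry is a common choice), and the sleeps stand in for actual work:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, trace: dict):
    # Record wall-clock latency per pipeline stage so retrieval and
    # generation can be budgeted and alerted on independently.
    start = time.perf_counter()
    try:
        yield
    finally:
        trace[name] = time.perf_counter() - start

trace = {}
with stage("retrieve", trace):
    time.sleep(0.01)  # stand-in for vector search + re-ranking
with stage("generate", trace):
    time.sleep(0.02)  # stand-in for model inference
print({name: f"{seconds * 1000:.1f} ms" for name, seconds in trace.items()})
```

Separating the stages in telemetry is what makes the latency tradeoffs of multi-stage retrieval visible at all: without it, a slow re-ranker and a slow model are indistinguishable.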
This is why RAG success is rarely about a single tool or framework. It's about managing a distributed system with probabilistic components.
The hard boundary RAG cannot cross
Even the most sophisticated RAG system does not change how a model reasons. A model that misinterprets domain concepts will continue to do so, even when given perfect context. A model that struggles with consistency or tone will remain inconsistent. Retrieval can supply facts, but it cannot supply judgment.
At this point, teams often try to push retrieval further by adding more context, rules, and filtering. This usually increases cost and complexity without fixing the underlying problem. This is the architectural boundary RAG cannot cross.
Why this naturally leads to tuning
Once retrieval quality is under control and failure modes are understood, a new question emerges: If the model has the right information, why does it still behave incorrectly?
That question points beyond retrieval. Techniques such as fine-tuning, adapters, and synthetic data generation can modify model behavior. They shape how a model interprets information, how it reasons within a domain, and how consistently it applies rules.
These techniques do not replace RAG. They build on it.
In practice, effective systems combine:
- Prompting to shape interaction
- Retrieval to supply context
- Tuning to align reasoning
Understanding when to move from one layer to the next is the real architectural challenge.
Where this leaves us
RAG is not a shortcut to correctness. Rather, it is a commitment to system design.
It introduces new failure modes, new operational costs, and new architectural responsibilities. But it also enables capabilities that prompting alone can never reach: Grounding, traceability, and controlled access to real knowledge.
In production systems, however, retrieval eventually reaches a ceiling. Once retrieval is reliable, adding more context no longer helps. The model has the right information but still draws incorrect conclusions, applies rules inconsistently, or defaults to behavior that doesn’t align with the domain.
That’s not a retrieval problem. It’s a model problem.
Dive into the RAG AI quickstart in the catalog, or get started with the Red Hat 30-day Developer Sandbox.
About the authors
Frank La Vigne is a seasoned Data Scientist and the Principal Technical Marketing Manager for AI at Red Hat. He possesses an unwavering passion for harnessing the power of data to address pivotal challenges faced by individuals and organizations.
A trusted voice in the tech community, Frank co-hosts the renowned “Data Driven” podcast, a platform dedicated to exploring the dynamic domains of Data Science and Artificial Intelligence. Beyond his podcasting endeavors, he shares his insights and expertise through FranksWorld.com, a blog that serves as a testament to his dedication to the tech community. Always ahead of the curve, Frank engages with audiences through regular livestreams on LinkedIn, covering cutting-edge technological topics from quantum computing to the burgeoning metaverse.
As a principal technologist for AI at Red Hat with over 30 years of experience, Robbie works to support enterprise AI adoption through open source innovation. His focus is on cloud-native technologies, Kubernetes, and AI platforms, helping to deliver scalable and secure solutions using Red Hat AI.
Robbie is deeply committed to open source, open source AI, and open data, believing in the power of transparency, collaboration, and inclusivity to advance technology in meaningful ways. His work involves exploring private generative AI, traditional machine learning, and enhancing platform capabilities to support open and hybrid cloud solutions for AI. His focus is on helping organizations adopt ethical and sustainable AI technologies that make a real impact.