What We’ve Learned After Two Years With Retrieval-Augmented Generation

Lessons learned and best practices for building RAG AI systems

Today’s edition is co-authored by Louis-François Bouchard, Co-Founder and CTO of Towards AI, home to thousands of AI articles and the new LLM Academy trusted by 100,000+ learners.

If you’re building real-world applications with AI, sooner or later you’ll come across the term “RAG” – which, in short, promises to give LLMs access to proprietary or up-to-date knowledge and to rein in the “hallucination” problem.

Over the past two years, we've been deeply involved with this technology from various angles, and I’m thrilled to have Louis-François as today’s co-author. He’s sharing what his team has learned from implementing RAG at scale, not only for various clients but also for the AI Tutor on the Towards AI Academy, which serves over a hundred thousand learners.

In this article, we’ll share what we’ve learned from building and scaling RAG systems – so you can build better ones, too!

Let's dive in!

What is RAG and Why It (Still) Matters

Retrieval-augmented generation (RAG) links off-the-shelf models to your own evolving knowledge base, updating them in real time and tailoring responses to your domain. Suddenly, models that once knew "a little about everything" can cite internal policies, answer support tickets, or walk someone through a fine-tuning tutorial—grounded in your organization’s latest data and practices.

Here’s how RAG works on a high level:

High-Level RAG Overview

It’s also how Louis-François and his team built the AI Tutor—a production-grade RAG system that teaches users how to build with LLMs (like RAG systems—meta, I know!).

RAG helped them keep the AI Tutor up-to-date with fast-moving tools like LangChain and LlamaIndex, filter stale content, and deliver context-aware answers—even as the AI ecosystem shifted week to week.

With recent announcements of models like Gemini 2.5 promising million-token context windows, some have questioned whether RAG is still necessary. After all, if a model can just read everything at once, why bother with retrieval? But as we'll explore more below, the reality is more nuanced. 

Spoiler alert: RAG isn't just surviving these advances, it's becoming even more essential. 

But first, let’s take a look at what happened:

RAG Over 2 Years – From Research To Mainstream

It's wild how much RAG has evolved in just two years. It has gone from cutting-edge research to a production-grade approach—faster, more accurate, and better at handling everything from legal PDFs to clinical notes and raw HTML.

As a topic, RAG’s even more popular than prompt engineering these days.

Under the hood, retrieval methods have gotten a serious upgrade. 

Let’s highlight some of the most impactful advances:

Adaptive Retrieval: Modern pipelines do more than just fetch documents—they adapt. Some use reinforcement learning to prioritize reliable sources (like peer-reviewed papers) over random blog posts. Others run multi-stage searches: start broad, then re-rank with context in mind.

Hybrid Retrieval: Dense models capture nuance. Sparse methods (like BM25) nail keyword matching. Most systems now use both, combining semantic understanding with literal precision.
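
To make the idea concrete, here is a minimal sketch of one common way to merge sparse and dense result lists: reciprocal rank fusion (RRF). The retriever outputs below are fabricated for illustration; in practice they would come from your BM25 index and your vector store.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of doc IDs into one ranking,
    rewarding documents that appear near the top of any list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fabricated retriever outputs: each is a ranked list of document IDs.
sparse_hits = ["doc_7", "doc_2", "doc_9"]  # e.g. BM25 keyword matches
dense_hits = ["doc_2", "doc_5", "doc_7"]   # e.g. embedding similarity matches

print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
# doc_2 and doc_7 rise to the top because both retrievers agree on them
```

The appeal of RRF is that it only needs ranks, not comparable scores, so you can fuse retrievers whose scoring scales differ wildly.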

Real-Time Integrations: RAG can now fetch from data that changed five minutes ago. With dynamic indexing and sub-second retrieval, modern pipelines refresh their knowledge base on the fly. 

Knowledge Graph Integration: By tying retrieval to structured knowledge, RAG systems can now connect the dots across documents. Think: the policy memo, the regulation it cites, and an expert breakdown—all linked. This is a game-changer for legal, healthcare, and compliance tasks requiring multi-hop reasoning.

Together, these advances show how RAG has evolved into a flexible, multi-talented architecture, ready to tackle messy, real-world problems with a lot more nuance. But as capabilities grow, so do the challenges. 

Here’s where things get tricky.

Key Challenges That Remain

RAG is powerful—but it’s not magic. As more teams move from prototype to production, a few common pain points keep surfacing.

Dependency on Retrieval Quality: RAG lives or dies by what it retrieves. If it pulls noisy chunks, stale docs, or loosely related content, the output suffers badly. Even high-quality content fails when it doesn’t align with the question’s intent.

Hybrid search and re-rankers help, but fine-tuning them takes time and experimentation. It gets harder with edge cases—ambiguous queries or niche topics—where relevant documents are rare. When that happens, retrieval becomes the bottleneck, no matter how good the generator is.

Latency and Scalability Trade-offs: Retrieval adds latency—it’s built into the loop. Fast ANN indices can return results in milliseconds, but scaling that speed to billions of documents often requires distributed infrastructure.

You can offset some of the delay with async workflows or caching, but at large scale, there’s always a cost. It’s a trade-off: speed vs. precision. In healthcare, a few seconds might be fine. In retail or chat, not so much.
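
As a small illustration of the caching idea, the sketch below memoizes query embeddings so repeated questions skip the embedding call; `toy_embed` is a stand-in for a real embedding-model request, and the cache size is arbitrary.

```python
from functools import lru_cache

def toy_embed(text: str) -> list[float]:
    # Placeholder for a real embedding-model call; deterministic toy values.
    return [float(ord(c) % 7) for c in text[:8]]

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple[float, ...]:
    """Memoize query embeddings keyed on the query string, so popular or
    repeated questions never hit the embedding model twice."""
    return tuple(toy_embed(query))  # tuple keeps the cached value immutable

vec = embed_query("What is hybrid retrieval?")        # computed and cached
vec_again = embed_query("What is hybrid retrieval?")  # served from the cache
```

The same pattern works one level up: caching retrieved chunks (or even final answers) for frequent queries trades a little freshness for a lot of latency.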

Integration Complexity: RAG isn’t plug-and-play. You’re managing multiple moving parts—retrievers, context windows, prompts, and generators—that all need to stay in sync.

Even small changes, like swapping an embedding model, can break prompt alignment or derail earlier tuning. As the system grows, so does the coordination overhead.

If you’re running into these challenges, the Beginner to Advanced LLM Developer Course* offers a deeper dive into building with LLMs and retrieval, from core concepts to practical workflows.

*Affiliate link

5 Lessons Learned

If the past two years of RAG deployments have taught us anything, it’s this: success is less about fancy models and more about thoughtful design, clean data, and constant iteration.

Here’s what’s worked:

1. Modular Design > Big Monoliths

The best RAG pipelines aren’t monoliths—they’re systems built for change. Each component—retriever, vector store, LLM—should be modular and easy to swap. The key is interface discipline: expose every piece through a config file, not hardcoded logic. A simple pipeline_config.yaml might let you toggle between Pinecone and Weaviate or flip from Claude to GPT‑4 without touching application code.
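
As an illustration, config-driven wiring might look like the sketch below. The YAML keys, registry entries, and component names are hypothetical placeholders, not the AI Tutor’s actual configuration.

```python
import yaml  # pip install pyyaml

# A hypothetical pipeline_config.yaml might look like this:
CONFIG_YAML = """
vector_store: weaviate
llm: gpt-4
chunk_size: 512
reranker: cohere-rerank-3
"""

# Registries map config strings to components. Here the "components" are
# just labels; in a real pipeline they would be client constructors.
VECTOR_STORES = {"pinecone": "PineconeStore", "weaviate": "WeaviateStore"}
LLMS = {"claude": "AnthropicChatClient", "gpt-4": "OpenAIChatClient"}

def build_pipeline(raw_yaml: str) -> dict:
    """Resolve config keys into concrete components, so swapping a vector
    store or model is a config edit rather than a code change."""
    cfg = yaml.safe_load(raw_yaml)
    return {
        "store": VECTOR_STORES[cfg["vector_store"]],
        "llm": LLMS[cfg["llm"]],
        "chunk_size": cfg["chunk_size"],
        "reranker": cfg["reranker"],
    }

print(build_pipeline(CONFIG_YAML))
```

The point isn’t the YAML itself: the registry is the only place that knows about concrete vendors, so the rest of the application stays untouched when you swap one out.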

We saw this in action with the AI Tutor. Swapping from OpenAI embeddings to Cohere’s, or toggling chunk sizes and rerankers, became a config change—not a refactor. That flexibility helped us keep pace with ecosystem churn and test new ideas quickly.

2. Smarter Retrieval Wins

Hybrid search is table stakes—dense vectors for nuance, sparse signals for keyword hits. But in the AI Tutor, we had to go further.

  • We layered in rerankers like Cohere’s Rerank-3 to clean up noisy top-k results. It reorders chunks by semantic relevance, so the final prompt includes what actually matters.

  • We also used source filters and metadata tags to scope queries to the right docs. If a user asked about LangChain, we didn’t want the retriever pulling from Hugging Face tutorials.

  • Later, we added sentence-level chunking with context windows—retrieving not just a sentence, but the ones around it. That reduced fragmented answers and helped the LLM reason in full paragraphs.

Good retrieval isn’t just finding the right stuff. It’s avoiding the wrong stuff—and getting it in the right order.
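
Here is a toy sketch of two of those ideas, metadata-scoped retrieval and sentence-window expansion. The mini-corpus is fabricated, and the keyword-overlap score is a crude stand-in for the embedding similarity and reranker you would use in production.

```python
# Fabricated mini-corpus with source metadata attached to each chunk.
CHUNKS = [
    {"id": 0, "source": "langchain_docs", "text": "LangChain chains compose LLM calls into sequences."},
    {"id": 1, "source": "langchain_docs", "text": "Retrievers in LangChain return relevant documents for a query."},
    {"id": 2, "source": "hf_tutorials", "text": "The Trainer class handles the fine-tuning loop."},
]

def retrieve(query, source_filter=None, top_k=2):
    """Scope by metadata first, then rank what's left. The keyword-overlap
    score is a toy stand-in for embedding similarity plus a reranker."""
    pool = [c for c in CHUNKS if source_filter is None or c["source"] == source_filter]
    words = set(query.lower().split())
    score = lambda c: len(words & set(c["text"].lower().split()))
    return sorted(pool, key=score, reverse=True)[:top_k]

def with_window(chunk, window=1):
    """Sentence-window style expansion: also pull neighbouring chunks from the
    same source, so the LLM sees surrounding context, not one isolated sentence."""
    wanted = range(chunk["id"] - window, chunk["id"] + window + 1)
    return [c["text"] for c in CHUNKS if c["id"] in wanted and c["source"] == chunk["source"]]

hits = retrieve("retrievers in langchain", source_filter="langchain_docs")
context = []
for hit in hits:
    for text in with_window(hit):
        if text not in context:  # avoid feeding the LLM duplicate chunks
            context.append(text)
print(context)
```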

AI Tutor with source customization

3. Build Guardrails For Graceful Failure

Early RAG systems tried to answer everything—and ended up hallucinating. Today’s systems do better by knowing when not to answer.

In our AI Tutor, we combined system prompts, routing logic, and fallback messaging to enforce topic boundaries and reject off-topic queries. If a question wasn’t about AI, the model would politely decline to respond. We also logged all fallbacks to spot coverage gaps and refine the knowledge base.
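
A minimal sketch of that route-or-decline pattern is below. The keyword check is a deliberately crude stand-in for whatever topic classifier or LLM-based router you actually use, and `run_rag_pipeline` is a hypothetical downstream call.

```python
import logging

logging.basicConfig(level=logging.INFO)

FALLBACK = ("I can only help with AI and LLM development questions. "
            "Could you rephrase your question in that context?")

# Toy topic boundary; a production router might use a classifier or an LLM call.
AI_KEYWORDS = {"rag", "llm", "embedding", "retrieval", "prompt", "fine-tuning"}

def is_on_topic(query: str) -> bool:
    return bool(AI_KEYWORDS & set(query.lower().split()))

def run_rag_pipeline(query: str) -> str:
    # Hypothetical call into the actual retrieval + generation stack.
    return f"(retrieved, grounded answer for: {query})"

def answer(query: str) -> str:
    if not is_on_topic(query):
        # Log every fallback so coverage gaps show up during review.
        logging.info("fallback triggered for query: %s", query)
        return FALLBACK
    return run_rag_pipeline(query)

print(answer("Best lasagna recipe?"))                 # politely declines
print(answer("How does RAG reduce hallucinations?"))  # routed to the pipeline
```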

Failing gracefully isn’t a flaw—it’s essential. Especially in domains like healthcare or education, it’s better to say “I don’t know” than to guess.

4. Keep Your Data Fresh (and Filtered)

RAG systems are only as good as the data behind them. For our AI Tutor, that meant continuously refining the knowledge base—not just sourcing good content, but keeping it clean, current, and relevant.

We saw measurable gains with small changes. Adding source filters in the UI—like limiting LangChain queries to LangChain docs—boosted hit rate from 0.21 to 0.46. Missed queries and fallback logs helped us spot blind spots and patch gaps.

What worked:

  • De-duping noisy files and stripping bloat from scraped documents

  • Boosting high-trust sources like research papers when confidence dipped

  • Tailoring chunk size and overlap based on content type

Treat your data like part of the product. Keep it live, structured, and responsive.
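
On the chunking point, here is an illustrative sketch of keying chunk size and overlap off content type. The numbers are made-up defaults to show the pattern, not tuned recommendations.

```python
# Illustrative defaults only; tune these against your own evaluation set.
CHUNK_PARAMS = {
    "api_reference": {"size": 256, "overlap": 32},    # short, self-contained entries
    "tutorial": {"size": 512, "overlap": 64},         # narrative, needs surrounding context
    "research_paper": {"size": 768, "overlap": 128},  # dense, with long-range references
}

def chunk_text(text: str, content_type: str) -> list[str]:
    """Character-based chunking with per-content-type size and overlap.
    Production systems usually split on sentences or tokens instead."""
    params = CHUNK_PARAMS.get(content_type, {"size": 512, "overlap": 64})
    size, overlap = params["size"], params["overlap"]
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk_text("Retrieval-augmented generation links models to your data. " * 40, "tutorial")
print(len(pieces), "chunks")
```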

5. Evaluation Matters More Than Ever

Standard model metrics don’t cut it. We designed custom evaluations focused on:

  • Retrieval precision (Hit Rate, MRR)

  • Faithfulness to the retrieved context

  • Hallucination rates

We used synthetic LLM-generated queries for rapid iteration, then validated with real user feedback. The most effective approach was short, continuous eval loops—run after every pipeline tweak, not just at milestones. This helped us catch regressions early and focus on changes that actually moved the needle.
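
To make the retrieval metrics concrete, here is a small sketch of Hit Rate and MRR computed over (query, expected document) pairs. The retrieval results below are fabricated for illustration.

```python
def hit_rate_and_mrr(results, expected, k=5):
    """results: one ranked list of doc IDs per query.
    expected: the doc ID that should have been retrieved for each query."""
    hits, rr_sum = 0, 0.0
    for ranked, gold in zip(results, expected):
        top_k = ranked[:k]
        if gold in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(gold) + 1)  # reciprocal rank of the hit
    n = len(expected)
    return hits / n, rr_sum / n

# Toy run: three queries; the gold chunk appears at rank 1, rank 2, and not at all.
retrieved = [["d1", "d4"], ["d9", "d7", "d2"], ["d5", "d6"]]
gold_docs = ["d1", "d7", "d8"]
hit_rate, mrr = hit_rate_and_mrr(retrieved, gold_docs)
print(f"hit rate={hit_rate:.2f}, MRR={mrr:.2f}")  # hit rate=0.67, MRR=0.50
```

Faithfulness and hallucination rates are harder to score automatically; in practice they usually involve an LLM judge or human review against the retrieved context.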

Done well, RAG turns static data into a live, self‑improving engine—delivering context‑faithful answers in milliseconds, minimizing errors, and scaling with your evolving needs.

The Rise of Long-Context LLMs: What It Means for RAG

With long-context models entering the scene, it’s fair to ask: if a model can read everything at once, do we still need retrieval?

Context windows of modern LLMs (May ‘25) via Artificial Analysis

Models like Gemini 2.5, GPT-4.1, and Llama 4 Scout now boast million-token context windows. In theory, you could just dump your docs into the prompt and skip retrieval altogether.

In practice? Not so fast.

Why Long-Context Alone Isn’t Enough

Stuffing entire corpora into a prompt sounds nice, but it’s slow, expensive, and noisy. The more you pack in, the harder it is for the model to zero in on the critical information. Even state‑of‑the‑art LLMs can lose track of what’s important when fed unfiltered, overly long inputs.

RAG solves this by filtering first. Instead of scanning everything, it pulls only what’s relevant—and builds a clean, focused prompt. That means:

  • Less compute

  • Faster responses

  • Sharper answers—especially with large or messy datasets

Long-context LLMs don’t replace RAG. They enhance it.

With more room in the prompt, RAG systems can:

  • Include richer context from multiple sources

  • Mix data types—text, tables, diagrams

  • Support agent-style workflows that combine retrieval + reasoning

And because many of these models are also multimodal or tool-aware, RAG can feed them structured data, images, or whatever else the task requires.

Even the largest context window can’t fix stale inputs. If you feed in outdated documents, you’ll still get outdated answers. For real-time queries—breaking news, updated product info, or live user data—RAG is how you keep things fresh.

Bottom line: RAG keeps inputs focused and current. Long-context LLMs give you room to reason deeper and generate better outputs, without wasting tokens or compute.

Looking Forward

Two years in, RAG has gone from academic footnote to enterprise staple. It’s now one of the most practical ways to anchor LLMs in real-world data.

But where’s RAG headed next?

With bigger models, richer data, and rising user expectations, the next wave of RAG systems will look very different from the early pipelines:

Multimodal RAG Advancements: We’re moving past text and images. Future RAG systems will handle structured data, time-series logs, and even cross-modal reasoning. That means tighter alignment across data types—and smarter retrieval logic. Expect systems that can search and reason across formats, timelines, and tools by default—not just text and pictures.

Agentic Retrieval-Based Systems: The old RAG loop—retrieve, then generate—is evolving into something more dynamic.

Agentic RAG systems can:

  • Decide what to search for

  • Refine their queries

  • Loop through multiple retrieval passes

It’s like switching from Q&A mode to research mode. Think DeepResearch with your custom data. Projects like LangChain Agents are already exploring this.

Enhancing RAG with Memory and Long-Term Learning: Most RAG systems today have no memory. Each prompt starts from scratch.

But that’s changing. Memory-augmented RAG can:

  • Cache past retrievals

  • Use shared scratchpads or graphs to track context

  • Carry knowledge across multi-turn conversations

This matters in real workflows, like enterprise copilots or support agents, where context builds over time.

Some teams are even testing systems that learn from feedback and adapt their retrieval strategy. It’s promising, but risky. Without good guardrails, you can end up overfitting or polluting your knowledge base with bad data.

RAG is no longer just a utility. It’s evolving into a dynamic, interactive architecture.

If you’re building for what’s next, this is where things are going.

Despite the hype around long-context models, retrieval isn’t going anywhere. If anything, it’s become more important. You don’t need more data in every prompt—you need the right data.

That’s what RAG does best.

Put simply, retrieval brings relevance, generation brings reasoning. Together, they create systems that are smart, accurate, and grounded. And as LLMs get bigger and more capable, that partnership will matter more than ever.

Keep building!

Louis & Tobias

If you're working with Retrieval-Augmented Generation or planning to integrate retrieval into your LLM workflows, the Beginner to Advanced LLM Developer Course* is designed to support you.

With over 50 hours of hands-on content, the course walks through the engineering principles behind modular system design, hybrid retrieval techniques, prompt construction, evaluation strategies, and scaling RAG in production.

It also covers how long-context models shift the retrieval landscape—and how to adapt your architecture accordingly.

Whether you're just starting out or optimizing enterprise-grade systems, this course equips you with the practical skills to build RAG pipelines that are reliable, efficient, and ready for real-world deployment.

*Affiliate link
