Deep Dive: Understanding RAG for AI Applications

The most popular LLM architecture for business use cases

Hi there,

Generative AI is still very young, but I've noticed a steady pattern in my consulting work.

Most enterprise AI applications use something called Retrieval Augmented Generation (RAG). In short, RAG lets a Large Language Model access new data without needing to be retrained, which is essential for many business use cases.

RAG is easy to prototype, but hard to master in production. That’s why I’ve written a deep dive below so you’ll understand what it is, where it can be used and how you can tweak its performance.

Let’s go!

There’s no way around RAG – for now

When you want to build a ChatGPT-like application for your business, you’ll likely encounter one of two scenarios:

Either you want someone (internal or external) to have a conversation with a chatbot to learn more about your business (e.g. they have some questions regarding a specific product or service).

Or you want to find a certain kind of information in your organization (e.g., the right form to submit travel expenses, a summary of all ongoing projects, etc.).

In both cases, the "baked-in" knowledge of the LLM won't get you very far. You need your "own" data in the LLM.

So how do you do that?

There are two main ways to let an LLM access new data.

The first way is fine-tuning. This involves updating the model's weights to incorporate new knowledge. However, fine-tuning requires a lot of time and high-quality training data. It's also not guaranteed that the model will consistently use the new knowledge. Therefore, fine-tuning is usually used to adjust the model's behavior rather than its knowledge. If you want the model to be more or less chatty, fine-tuning can help. But if you want it to know data that is updated daily, it's not a good fit.

The second way is RAG.

High-Level RAG Architecture

RAG doesn't change the model, but provides access to additional information by augmenting the prompt with information retrieved from an external document source.

This is great because:

  • we can pull in real-time data as needed

  • we’re able to trace knowledge back to a source (which isn't really possible with fine-tuning)

  • we don’t share our entire data with the model, but just the currently needed information.

That’s why you’ll probably encounter RAG sooner rather than later when dealing with LLMs in production.

RAG: Core Components and Workflow

The core idea behind RAG is to use the large language model as a kind of fancy search engine. This makes the model itself a swappable component in the overall architecture. GPT-3.5 and GPT-4 work well for RAG, but so do open source models like Llama-2 or Mistral.

On a high level, here’s what the process looks like, using customer support as an example:

  1. Understand a user query, e.g. “How do I return my product?”

  2. Retrieve pieces of information from a relevant document, e.g. "Return_policy.html - To return a product a customer has to […]”

  3. Augment the context of the LLM with the relevant information “chunks”, i.e. include the chunks in the prompt.

  4. Generate the response to the user in natural language, e.g. “To return your product, you have to do XYZ”

This all happens in real-time!

To make this workflow possible, we have to consider two main phases: data ingestion and information retrieval.

What sounds super complicated can actually be implemented in roughly 15 lines of code using open source frameworks like Llamaindex (which is imo the best).
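Here’s a minimal sketch of what such a playground version can look like with Llamaindex’s high-level API (assuming a recent llama-index version, an OpenAI API key in your environment, and your documents sitting in a local data/ folder):

    # Minimal RAG sketch: load, index, query.
    # Recent versions use the llama_index.core namespace; older ones import from llama_index directly.
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

    documents = SimpleDirectoryReader("data").load_data()         # 1. load your own data
    index = VectorStoreIndex.from_documents(documents)            # 2. chunk, embed and index it
    query_engine = index.as_query_engine()                        # 3. retrieval + augmentation
    response = query_engine.query("How do I return my product?")  # 4. generate the answer
    print(response)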

This is where most introductions to RAG stop.

And if you don't care about the technical details, you can jump off now and enjoy your day – see you next Friday!

But if you want to dive deeper and learn how to do this in the real world, read on.

But be aware - it will get a little nerdy!

The code above is obviously not production-ready. When you build production-scale RAG systems, things get much more complicated than this.

The good news is that every RAG application essentially boils down to 5 fundamental steps. Master each, and you master RAG. Easy, right?

Let’s see!

The steps below are based on implementing RAG using Llamaindex. If you're using another framework like LangChain or a platform like Azure, the terminology may be different, but the main concepts should be similar.

Step 1: Loading

To make your data available for the model, it needs to be loaded from its original source – whether it’s files from a Sharepoint, HTML pages from a website, or tables from a database. Llamaindex provides hundreds of pre-built connectors in its LlamaHub library.

This is where things can get tricky in many business situations. Usually there are special permissions for different users, so not everyone can see all the documents in the company. That's why, in most cases, you need to provide user information and an authentication token when connecting to an internal data source, like Jira.
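Here’s a rough sketch of what that can look like with the community Jira reader from LlamaHub (the exact import path and parameter names depend on the reader version you install, so treat the details as illustrative):

    # Sketch: loading from an authenticated internal source (Jira).
    # Install the reader first, e.g. pip install llama-index-readers-jira
    from llama_index.readers.jira import JiraReader

    reader = JiraReader(
        email="service-account@yourcompany.com",  # placeholder credentials
        api_token="YOUR_JIRA_API_TOKEN",
        server_url="yourcompany.atlassian.net",
    )
    documents = reader.load_data(query="project = SUPPORT")  # JQL query to select issues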

Step 2: Indexing

After you've loaded your data from its source, you need to create a structure that allows you to query the data easily. For LLMs, this usually involves creating so-called vector embeddings.

In very simple terms, a vector embedding is a long list of numerical values that represents the meaning (semantics) of a text item. When the numbers of two items are very similar, it means they have a similar meaning.
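Here’s a small sketch that makes this concrete, using the OpenAI embedding integration (the example sentences are mine, and any embedding model works the same way):

    # Sketch: similar meaning -> similar vectors (high cosine similarity).
    import numpy as np
    from llama_index.embeddings.openai import OpenAIEmbedding

    embed_model = OpenAIEmbedding()
    returns_1 = np.array(embed_model.get_text_embedding("How do I return my product?"))
    returns_2 = np.array(embed_model.get_text_embedding("What is your return policy?"))
    unrelated = np.array(embed_model.get_text_embedding("Our office dog loves tennis balls."))

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(returns_1, returns_2))  # high: both sentences are about returns
    print(cosine(returns_1, unrelated))  # noticeably lower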

Typically, you split your documents into different “chunks” (more on that below) to get better performance.

Besides the embedding, the indexer will typically also include metadata (such as original source, filename, timestamps, etc.) to ensure that contextually relevant information can be easily and accurately found.

Step 3: Storing

After we've done the heavy lifting, we want to store the index and metadata somewhere. There are specialized (vector) databases for this task. Popular examples include open source projects like Weaviate or fully managed services like Pinecone.

Llamaindex supports a variety of vector stores.
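For a small project you can simply persist the default index to disk and load it back later; in production you’d typically plug in one of the vector stores mentioned above. A minimal sketch, assuming the index from the earlier snippet:

    # Sketch: persist the index (embeddings + metadata) and reload it later
    # without re-embedding everything.
    from llama_index.core import StorageContext, load_index_from_storage

    index.storage_context.persist(persist_dir="./storage")

    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
    # For a dedicated vector database, you'd pass its vector store to
    # StorageContext.from_defaults(vector_store=...) when building the index.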

Take a quick breath.

Steps 1-3 complete the first phase of our RAG system: data ingestion.

You would typically run this ingestion either in batch intervals (like weekly, daily, or hourly, depending on the volume and velocity of your data), or trigger real-time updates when a data source is changed. For example, Llamaindex supports advanced techniques like incremental updates, so there’s no need to update the whole store when only one document changes.
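Here’s a sketch of what an incremental update can look like, assuming the index from above (stable document IDs, here simply the file names, are the key ingredient):

    # Sketch: re-ingest only documents that are new or have changed.
    from llama_index.core import SimpleDirectoryReader

    documents = SimpleDirectoryReader("data", filename_as_id=True).load_data()
    refreshed = index.refresh_ref_docs(documents)  # one boolean per document
    print(f"{sum(refreshed)} of {len(refreshed)} documents were (re-)ingested")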

Did I mention I love Llamaindex? ❤️

Step 4: Querying

The querying stage is where the actual information retrieval begins.

If you want to keep it simple, Llamaindex provides a default query engine where you can just throw in a user query and get the answer from the top k relevant documents from your vector store.
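A sketch of what that looks like, assuming the index from before (the top-k value and the German example query are just for illustration):

    # Sketch: default query engine, fetching the 3 most relevant chunks.
    query_engine = index.as_query_engine(similarity_top_k=3)
    response = query_engine.query("Wie kann ich mein Produkt zurückgeben?")
    print(response)                  # answer, generated in the user's language
    print(response.source_nodes[0])  # inspect which chunk was retrieved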

Since most modern embedding models are multilingual, this works out of the box for all major languages - the embeddings for "cat" and "Katze" end up very close to each other. And thanks to the LLM, the information from the original document gets translated into the user's language on the fly. In Sam Altman’s words: "Pretty cool, huh?"

Of course, the query engine gives you a lot of things to customize under the hood (you’ll find a small sketch of how to wire them together after this list):

  • A Retriever defines “how” you find the most relevant information from your index: a summary index, a tree index, SQL, or a knowledge graph index - it’s up to you to choose! (more on this below)

  • Postprocessing is used to filter, rerank, or transform the retrieved results. For example, if you fetch several documents with similar (or even contradicting) information, you could rank them by their timestamp metadata and get only the most recent one.

  • Synthesizers send the query, the results and your prompt back to the LLM to generate a user-facing response. This includes adding more information to the user output, such as the document sources, if necessary.
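Here’s a sketch of how you can wire these three pieces together yourself, assuming the index from the earlier snippets (the cutoff value and response mode are just examples):

    # Sketch: custom retriever + postprocessing + synthesizer.
    from llama_index.core import get_response_synthesizer
    from llama_index.core.retrievers import VectorIndexRetriever
    from llama_index.core.postprocessor import SimilarityPostprocessor
    from llama_index.core.query_engine import RetrieverQueryEngine

    retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
    postprocessor = SimilarityPostprocessor(similarity_cutoff=0.75)  # drop weak matches
    synthesizer = get_response_synthesizer(response_mode="compact")

    query_engine = RetrieverQueryEngine(
        retriever=retriever,
        node_postprocessors=[postprocessor],
        response_synthesizer=synthesizer,
    )
    response = query_engine.query("How do I return my product?")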

You can make this query process as complex as you want. There are literally hundreds of knobs and dials to tweak to optimize the process for your use case.

Which leads us to the last step…

Step 5: Evaluation

How do you decide which of the dozen retriever modes works best for your data? Or which language model you should use for the embeddings?

Your RAG system requires an evaluation framework to check if performance improves or declines after making changes. Surprisingly, you can utilize an LLM to develop these evaluation systems for you.

For example, you could use a model like GPT-4 to create questions based on your data, and store the generated input/output pairs as a reference dataset. In Llamaindex, there’s a tool called QuestionGeneration which does exactly that for you.
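Here’s a sketch of the general idea (I’m using the DatasetGenerator class here, and assuming the documents and query_engine from the earlier snippets):

    # Sketch: let an LLM generate questions from your data and store the
    # question/answer pairs as a reference dataset.
    import json
    from llama_index.core.evaluation import DatasetGenerator

    data_generator = DatasetGenerator.from_documents(documents)
    questions = data_generator.generate_questions_from_nodes()

    reference = [
        {"question": q, "answer": str(query_engine.query(q))}
        for q in questions[:20]  # keep the first pass small
    ]
    with open("reference_dataset.json", "w", encoding="utf-8") as f:
        json.dump(reference, f, indent=2, ensure_ascii=False)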

Once you have a set of reference questions and answers, you can dive deeper into different evaluation strategies for both automated and manual testing of your RAG system.

There are different types of evaluation metrics, but some of them should be familiar to you, especially if you’re reading this newsletter.

Challenges of RAG

When you run the playground example from above with Llamaindex, you will probably notice that the RAG system works surprisingly well out of the box. It’s not uncommon to reach a 70-80% accuracy level in just a few hours!

The challenge, however, starts when you tackle the remaining 20-30%. And the closer you get to 100%, the harder it becomes.

Best case, there’s just too much fluff in the answer. Worst case, the answer is wrong because an incorrect (e.g. outdated) document was retrieved. There are also problems inherent to the LLM, such as hallucination, or losing important information from the context, especially if the context is very long (for example because you loaded 10 document chunks into it).

How can we overcome these challenges?

Improving Your RAG Performance

To get the most out of your RAG system, here’s a set of optimization strategies that worked well for me in practice:

Optimize your input data: This is the biggest performance lever you can pull. Make sure your data is high quality and easy to index: for example, automatically convert PDFs with complicated layouts into flat text documents before ingesting them into a vector store, or use a table parser to extract tables from your documents separately.
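As an example, here’s a minimal sketch of flattening a PDF with pypdf before ingestion (any PDF-to-text tool works, and the file names are placeholders):

    # Sketch: convert a PDF with a complicated layout into flat text.
    from pypdf import PdfReader

    reader = PdfReader("complicated_layout.pdf")
    flat_text = "\n\n".join(page.extract_text() for page in reader.pages)

    with open("complicated_layout.txt", "w", encoding="utf-8") as f:
        f.write(flat_text)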

Experiment with chunk sizes: In Llamaindex, the default chunk size is 1024. When you decrease this size, your embeddings will be more precise, but there’s a higher risk you’ll miss important context information. Larger chunk sizes capture more context, but might dilute important nuances in the embedding. Start with a small chunk size and expand gradually while monitoring the performance of the system. Tip: In some cases, it’s best to align the chunking with the document structure. For example, PowerPoint presentations are typically best embedded slide by slide.
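A sketch of how to experiment with this in Llamaindex (the 512/50 values are just a starting point, and documents is the loaded data from earlier):

    # Sketch: control chunk size and overlap via the node parser.
    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter

    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # default chunk_size is 1024
    index = VectorStoreIndex.from_documents(documents, transformations=[splitter])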

Try hybrid search: Believe it or not, for a lot of user queries, good old keyword search will do just fine. A vector store that allows for hybrid search gives you the best of both worlds.
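A sketch of what this can look like, assuming your index is backed by a hybrid-capable vector store such as Weaviate or Qdrant:

    # Sketch: hybrid search combines keyword and vector retrieval.
    query_engine = index.as_query_engine(
        vector_store_query_mode="hybrid",
        alpha=0.5,              # 0 = pure keyword search, 1 = pure vector search
        similarity_top_k=3,
    )
    response = query_engine.query("travel expense form")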

Do prompt engineering: Two things here work really well. The first is adding few-shot examples dynamically to your query so you get consistent output formats. The other strategy is to change the system prompt to suppress undesirable behavior. Something as simple as "Respond with 'I don't know – please contact us directly' when you have conflicting information" really did the trick in a recent support chatbot use case.
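Here’s a sketch of the system-prompt idea via a custom QA template, applied to the index from before (the exact wording is just an example):

    # Sketch: a custom QA prompt that suppresses undesirable behavior.
    from llama_index.core import PromptTemplate

    qa_template = PromptTemplate(
        "You are a support assistant. Answer using only the context below.\n"
        "If the context is missing or contains conflicting information, respond with:\n"
        "'I don't know - please contact us directly.'\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "Question: {query_str}\n"
        "Answer: "
    )
    query_engine = index.as_query_engine(text_qa_template=qa_template)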

Choose the right LLM: You can try different LLMs for your use case and see which one works best. Especially if you’re using open source models like Llama-2 or Mistral, you might experiment with different sizes and see which one is just large enough for your use case. Fine-tuning the embedding model on a reference Q&A dataset might further improve retrieval quality and overall performance.
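A sketch of how to swap the model globally (the model names are just examples):

    # Sketch: choose the LLM once, all query engines pick it up.
    from llama_index.core import Settings
    from llama_index.llms.openai import OpenAI

    Settings.llm = OpenAI(model="gpt-4")  # hosted model

    # or run an open source model locally, e.g. via Ollama:
    # from llama_index.llms.ollama import Ollama
    # Settings.llm = Ollama(model="mistral")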

Remember: The key to improving your RAG system really is continuous experimentation. Chances are you’ll get it to a pretty decent level really quickly, and then you can improve it over time.

Conclusion

RAG systems are (currently) the go-to architecture for building production-grade LLM systems in the enterprise, and it doesn't look like that's going to change any time soon.

They are easy to prototype, but hard to master. This shouldn’t keep you from trying them out; if anything, it should encourage you to get started.

Start small and see if your users like the idea of retrieving information this way.

If you need help, reach out any time!

I hope you enjoyed this (technical) deep dive today.

Next week, I’ll be back with a hands-on use case!

See you next Friday,

Tobias

PS: If you found this newsletter useful, please leave some feedback! It would mean so much to me! ❤️
