A Simple Template for Evaluating AI Chatbots

How to get started making sure your bot works as expected

This year started off with a nice surprise - I just got recognized as a LinkedIn Top Voice! I'm honored and humbled. Thank you all for your support!

Surprise is also the topic of today's newsletter. Surprise is what you get when you work with non-deterministic AI services like GPT-4. But surprise is also what you want to avoid when you put these systems into production - like chatbots.

To make sure your chatbot works the way you want it to, you need to monitor some key performance metrics. But how do you do that? I'll share a simple template to get you started.

Let’s jump in!

Want to learn how to 10x your data analysis productivity with ChatGPT?

Register for my event on February 7-8, 2024!
(Subscription required, but you can join with a free trial)

Running chatbots is easy, but maintaining them is hard

Thanks to no-code tools like Chatbase.co, Stack AI, or Voiceflow, anyone can build a custom LLM-powered chatbot in minutes. But if you want to move beyond prototypes and POCs, there are many challenges.

Evaluating your chatbot's performance is one of them.

Evaluating your chatbot in production is critical, as it helps ensure that your chatbot is not only functional, but also efficient and user-friendly. But unlike more traditional AI systems, there's no simple set of universally accepted metrics for evaluating the performance of generative AI services, and chatbots in particular. It's complicated.

Training the chatbot on my book PDF and deploying it with Chatbase takes literally 5 minutes.

Then I can ask it:

“Who’s the author of the book?”

And the chatbot will respond:

The author of the book is Tobias Zwingmann.

Great! I happen to know that this is correct. The chatbot learned this information from the book itself.

But let’s say I ask it another important question like:

“Where can I buy the book?”

It would return:

Hmm… I’m not sure

That's because this information is not stored in the book itself, but only on the book's website.

So let's also crawl the book's website, aipoweredbi.com, and add that to the chatbot's knowledge.

Now when I ask the same question again, it comes back:

You can buy the book here: [list of resources].

Great!

But let’s go back and ask “Who’s the author of the book?” again.

After adding the website data, the chatbot responds:

The author of the book is Tobias Zwingmann. He is an experienced data scientist with a strong business background and more than 15 years of professional experience. He is also the managing partner of RAPYD.AI, a German AI consultancy.

Hmmmm… this answer is different from before. Why? Because the bot now has more knowledge (from the book’s website, where this bio is taken from)!

But is this new answer wrong?

Or is it even better?

Ultimately, it’s up to you to decide to what degree you "accept" an answer as correct. You should be able to make this decision deliberately!

What is a trivial example in this case becomes immensely complex when you work with dozens or hundreds of knowledge sources. If you don't have a system in place to track your chatbot's performance as you change its components (the knowledge base, model, prompt, etc.), you're in for a gamble.

So let's set up a simple system for this.

Building a simple chatbot evaluation with Google Sheets

Quick primer, before we dive in: When it comes to evaluating chatbots, there are generally two approaches:

  • End-to-end validation: this checks the general behavior of the chatbot, given some sample questions

  • Component-wise evaluation: this checks different parts of the chatbot, like the embedding performance, the document retrieval performance, etc.

Today I'm focusing on end-to-end validation. That's where you should start. Google Sheets is a straightforward tool for this because of its accessibility, ease of use, and ability to integrate with other applications. You can take it from there to make it more complex. Think of it as a first step, not the ultimate solution.

Here’s what the sheet looks like:

The template covers some essential principles:

Categories

Typically, your chatbot will not perform equally well across different types of knowledge. In my book chatbot example, I would distinguish between "about the book" and "book content" questions. You can also categorize by documents, domains, etc. This categorization will be useful for the evaluation later on, as we will see.

Reference QAs (Ground truth)

Even though we can't do a simple string comparison, we still need some "ground truth" question-answer pairs to judge the quality of the chatbot's answers. But where do we get them?

Option 1: Use AI. If you don't have reference question-answer pairs, you can use ChatGPT (GPT-4) to generate them from your documents. The Config worksheet in the template contains a prompt for this (a simplified sketch follows below).

Option 2: Use existing pairs. If your chatbot is already in production, copy and paste some user queries and chatbot answers that have worked well.

Of course, you can also combine both options.
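For Option 1, here is an illustrative sketch of what such a generation prompt can look like. It is not the exact prompt from the template's Config worksheet, just an approximation of the idea:

    You are preparing test cases for a chatbot. Read the document below and
    generate 10 question-answer pairs that a user could plausibly ask about it.
    Every question must be answerable from the document alone. Return the pairs
    as a two-column table: question, answer.

    [paste your document text here]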

Question variations

Your users will find countless ways to ask the same question. Fortunately, modern LLMs understand "Who's the book author" and "Who wrote the book" alike, but in some cases they still show silly behavior like this:

To make sure we get consistent answers, we still need to test different variations of our original questions. Thanks to AI, we don't have to do this manually. Using the GPT for Sheets and Docs plugin, we can simply call GPT-3.5 or GPT-4 from within Google Sheets while using a cell formula like this:
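To give you an idea, a variation formula can look roughly like this. This is an illustrative sketch that assumes the add-on's GPT() function and the original question in cell B2:

    =GPT("Rephrase the following question so that it asks for the same thing with different wording. Return only the rephrased question.", B2)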

 

You can find the prompt in the Config tab of the Google Sheet. I typically create 2 variations. You can do more if you like (and have the budget for it).

Chatbot Answers

That's the gist of our exercise! We now have a chatbot answer for each question (3 in total: 1 for the original question and 2 for the question variations).

Many chatbot platforms provide API access to their chatbots. You can access this API by writing a simple Google Apps script. In fact, the Google Sheet I've shared with you already comes with a Google Apps script that works for bots hosted on Chatbase. Just add your Chatbase API key and chatbot ID to the configuration tab and you're ready to use the =SEND_TO_CHATBASE function.

If you are using a different chatbot provider, you will need to adapt that script based on your platform's API documentation. ChatGPT will help you.
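To give you an idea of what such a script involves, here is a simplified sketch of a custom =SEND_TO_CHATBASE function. It is not the exact script from the template; the endpoint, request body, and response field are based on Chatbase's public API documentation, so double-check them against the current docs before relying on this:

    // Simplified sketch of a custom SEND_TO_CHATBASE function (Google Apps Script).
    // Assumes the API key and chatbot ID live in the Config tab; the cell addresses below are placeholders.
    function SEND_TO_CHATBASE(question) {
      var config = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Config");
      var apiKey = config.getRange("B1").getValue();    // Chatbase API key
      var chatbotId = config.getRange("B2").getValue(); // Chatbase chatbot ID
      var options = {
        method: "post",
        contentType: "application/json",
        headers: { Authorization: "Bearer " + apiKey },
        payload: JSON.stringify({
          messages: [{ role: "user", content: question }],
          chatbotId: chatbotId,
          stream: false
        })
      };
      var response = UrlFetchApp.fetch("https://www.chatbase.co/api/v1/chat", options);
      return JSON.parse(response.getContentText()).text; // the chatbot's answer
    }

You can then call =SEND_TO_CHATBASE(A2) in any cell to fetch the answer for the question in A2.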

Evaluation

Now that we have the groundwork - the reference answers and the chatbot results - we need to somehow compare them programmatically. That's the tricky part. I use the following prompt (with GPT-4) to do this:
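The exact wording is in the template; as an illustrative sketch, an evaluation formula of this kind (assuming the GPT() function, the reference answer in C2, and the chatbot answer in F2) can look like this:

    =GPT("Compare the chatbot answer to the reference answer. Reply with 1 if the chatbot answer contains the essential information of the reference answer and does not contradict it, otherwise reply with 0. Reply with the digit only. Reference answer: " & C2 & " | Chatbot answer: " & F2)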

Customize this prompt as needed for your use case. As a result, you should get a column of 1s and 0s, depending on whether each chatbot answer is acceptable or not.

If you want to get more advanced, you can adjust the prompt to implement more specific evaluation metrics, such as faithfulness or answer relevancy (see further resources below).

Important: Once you’re done, select all the data, copy it, and (while keeping it selected) choose Paste special → Values to store your results as static values.

Further Analysis

Now that we have an evaluation dataset, create a small pivot table (Insert → Pivot table) to check performance across the different categories. This will show you where to focus your improvement efforts.

Improving your chatbot

Let’s say your evaluation shows poor performance. How can we fix this? It depends; there are three common cases:

Case 1: The knowledge is not in the database.
Solution: Add more documents to your knowledge base.

Case 2: The knowledge is in the database, but can’t be found.
Solution: Perhaps the information is too sparse (i.e. your chatbot can't find the answer). Try adding more sample Q&As for this topic to your knowledge base. And guess what, you can use the QA variations from this spreadsheet.

Case 3: The knowledge is contradictory in the database.
Solution: This is where it gets tricky. For example, two documents suggest different answers, and maybe one of them is out of date. The simplest approach would be to delete the outdated document, but that's not always possible. Ideally, you control the components of the RAG architecture and only consider the most recent document, based on metadata. That's where a custom implementation comes in handy (see the sketch below).
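As a rough illustration of the metadata idea, here is a hypothetical snippet (the function and field names are assumptions, not part of the template) that keeps only the most recently updated chunk when several retrieved chunks cover the same topic:

    // Hypothetical sketch: prefer the newest source when retrieved chunks contradict each other.
    // Assumes each chunk carries a lastUpdated metadata field; adapt to your RAG stack.
    function keepMostRecent(chunks) {
      return chunks
        .slice()
        .sort(function (a, b) { return new Date(b.lastUpdated) - new Date(a.lastUpdated); })[0];
    }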

Next steps

This evaluation schema gives you a solid baseline to start from. If you want to dive deeper, here are some good resources:

Conclusion

I hope you have seen how you can evaluate your chatbots with relatively simple means. While not perfect, this approach gives you an easy way to avoid shooting in the dark.

Good luck with your chatbots, and may they not surprise you!

As always, feel free to reach out if you need any help.

See you next Friday,

Tobias

PS: If you found this newsletter useful, please leave a testimonial! It means a lot to me!
