
How to Improve Retrieval Systems in AI Products

At Newfront we’ve built Benji, an AI-powered conversational assistant that automatically helps employees navigate their company’s resources. Benji is powered by retrieval-augmented generation (RAG), a design architecture where relevant proprietary data is passed to an LLM via its prompt to provide additional context for its response. For Benji, we retrieve the most relevant data about a company’s benefits and HR policies, inject that data and the employee’s question into the LLM prompt, and instruct the LLM to generate an answer to the question.

While much focus in AI evaluation is on LLM-specific components, ensuring retrieval is performing well in RAG systems is important as well. The LLM’s response can only be as good as the data that we give it (“garbage in, garbage out”). This article explains why retrieval evaluation (or “eval” for short) is important and how it can supercharge the improvement of your AI products.

Answer Evaluation and Its Challenges

Evaluation is one of the most important processes in building AI products. For a chatbot, the most important evaluation is of the final response provided to the user. However, only evaluating answers poses a couple of challenges.

There are two complementary ways to evaluate answers. The first is human annotation, where the user query, AI answer, and knowledge base are shown to a labeler who grades whether the answer is correct. This is useful but very resource intensive. For example, searching through hundreds of PDF pages to verify whether three facts in an answer are supported is not an easy feat. The grading criteria for whether an answer is correct are also complicated and subjective. For example, the labeler needs to consider how much detail is appropriate or whether an unclear answer counts as incorrect.

A second approach is automating the grading process with the LLM-as-a-judge paradigm. This approach uses a judge LLM to grade whether the chatbot answer is correct given the user query and a predefined correct answer to reference. While attractive, this approach still requires resources to maintain another LLM system to ensure your automatically generated grades align with your expectations. In other words, the judge LLM also requires eval.
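For illustration, a bare-bones judge could look like the sketch below. The prompt wording and the `call_llm` helper are hypothetical placeholders, not Benji’s actual implementation.

```python
# Hypothetical sketch of an LLM-as-a-judge grader. `call_llm` stands in for
# whichever LLM client you use; it takes a prompt string and returns the
# model's text response.

JUDGE_PROMPT = """You are grading a chatbot answer.
Question: {question}
Reference answer: {reference_answer}
Chatbot answer: {chatbot_answer}

Reply with exactly one word: CORRECT or INCORRECT."""


def judge_answer(question: str, reference_answer: str, chatbot_answer: str, call_llm) -> bool:
    """Return True if the judge LLM grades the chatbot answer as correct."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        reference_answer=reference_answer,
        chatbot_answer=chatbot_answer,
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("CORRECT")
```

Even a simple grader like this needs its own spot checks to confirm that its verdicts line up with human judgment.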

The most prominent failure mode we discovered initially on Benji was poor answers resulting from the correct data not being passed to the LLM. Answer evaluation alone does not reveal this type of insight. To better debug issues and prioritize experiments, we needed to test the intermediate stages of the system instead of just the final result. Specifically, we needed to test our RAG system’s retrieval step to ensure that the context passed to the answering LLM is correct.

How Retrieval Eval Can Help

Retrieval eval measures how often the relevant parts of the knowledge base are retrieved and injected as context into the answering LLM’s prompt. To do so, we need to collect ground truth (pairs of queries and the source data required to answer them) and check that the retrieval system ranks the ground truth source data highly enough to pass it to the prompt. We will dive deeper into how this works later.

The first benefit of this approach is that it’s more objective than answer evaluation. For example, if an employee asks how their HSA works, we expect an unambiguous place in the knowledge base where this information is stored.

Second, retrieval eval has a much faster development cycle for three reasons:

  1. It does not require making LLM calls since it does not test answer generation, which makes our evals roughly 20 times faster to run than answer evals.

  2. Curating ground truth labels is faster. For retrieval eval, the labeler just has to copy the relevant excerpt from the source document. On the other hand, answer labels must be carefully crafted.

  3. Retrieval eval grading can be completely automated by conducting a string match between the ground truth and retrieval results, whereas answer eval grading is expensive, as discussed previously.

Third, retrieval evaluation is effective for conducting retrieval experiments, which is a high-leverage area for improving RAG products. Answer evaluation alone is not sufficient for evaluating these experiments because an incorrect answer does not provide enough information to determine whether the failure is due to the LLM or poor context.

For those reasons, retrieval eval has been instrumental in effective and fast AI experimentation on Benji. We walk through a concrete retrieval experiment in the next section.

The Reranking Experiment

The basic RAG setup uses embedding-based vector search for retrieval. This involves converting the user query and knowledge base from text into vectors of numbers, which are then mathematically compared for similarity. The parts of the knowledge base most similar to the user query, typically referred to as “chunks,” are passed to the LLM. This yielded decent results for Benji but needed improvement.
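As a rough illustration, the similarity step can be sketched as follows; `embed` is a hypothetical placeholder for whatever embedding model you call, and the cutoff of 8 chunks is just an example.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def vector_search(query: str, chunks: list[str], embed, top_k: int = 8) -> list[str]:
    """Return the top_k chunks most similar to the query.

    `embed` is a placeholder for an embedding model call that maps text to a
    vector; in practice the chunk embeddings would be precomputed and stored
    in a vector index rather than embedded on every query.
    """
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```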

Reranking is an additional ranking process after the initial vector search. It involves a slower but more powerful model that ranks search results more accurately. The idea is that the two stages trade off speed and accuracy: the vector search stage very quickly narrows down the pool of candidates, while the reranking stage more slowly but more accurately decides on the final ranking.

A quick example to grasp the intuition: imagine you are planning a stay at a hotel. You want to find a small set of hotels to share with your travel group. You search on Expedia and skim through the top results, deciding on the three most promising options. Expedia acts like the vector search. It’s very fast and good enough for a rough list of options to consider. Then, you act like the reranker. This step is slower and good for choosing the best candidates from a limited list.
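In code, the two stages compose naturally. The sketch below treats `vector_search` (such as the one sketched above) and `rerank` as hypothetical callables; `rerank(query, candidates)` stands in for a slower, more accurate model that returns one relevance score per candidate.

```python
def retrieve_with_reranking(query: str, chunks: list[str], embed, vector_search, rerank,
                            candidate_pool: int = 50, top_k: int = 8) -> list[str]:
    """Two-stage retrieval: fast vector search narrows the pool, a reranker orders it.

    `vector_search` and `rerank` are placeholders: the former could be the
    sketch above, the latter a slower cross-encoder-style scoring model.
    """
    # Stage 1: cheap vector search over the whole knowledge base.
    candidates = vector_search(query, chunks, embed, top_k=candidate_pool)

    # Stage 2: expensive reranking over the small candidate pool only.
    scores = rerank(query, candidates)
    reranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in reranked[:top_k]]
```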

Retrieval Eval Design

The goal of retrieval eval is to decide what retrieval system we should implement and how it should be configured. This means answering the following two questions for every retrieval experiment:

  1. Does the experimental change result in a better-performing retrieval system?

  2. How much context should be passed into the prompt?

Setting the context cutoff threshold involves balancing the trade-off between retrieval recall and retrieval precision. Retrieval recall measures how often the correct data is retrieved and passed into the context. Retrieval precision measures what proportion of the context includes relevant data. Increasing the cutoff threshold will increase recall since there are more chances that the correct data is captured. However, this decreases precision since more irrelevant data is passed to the LLM. Precision is important because more irrelevant context increases the likelihood of the LLM parsing the context poorly and generating an incorrect answer.
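To make the trade-off concrete, here is a toy calculation of recall and precision for a single query at different cutoffs; the ranks and numbers are purely illustrative.

```python
def recall_and_precision_at_cutoff(relevant_ranks: set[int], cutoff: int) -> tuple[float, float]:
    """Toy recall/precision for one query.

    relevant_ranks: 1-based positions where the ground-truth chunks were ranked.
    cutoff: how many top-ranked chunks are passed into the LLM context.
    """
    retrieved_relevant = sum(1 for rank in relevant_ranks if rank <= cutoff)
    recall = retrieved_relevant / len(relevant_ranks)
    precision = retrieved_relevant / cutoff
    return recall, precision


# Suppose the one relevant chunk for a query is ranked 5th:
print(recall_and_precision_at_cutoff({5}, 3))  # (0.0, 0.0)   - cutoff too small, chunk missed
print(recall_and_precision_at_cutoff({5}, 8))  # (1.0, 0.125) - chunk captured, but 7 of 8 chunks are noise
```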

In order to answer both of the questions in tandem, we calculate retrieval recall for a range of cutoff thresholds.

Retrieval recall is calculated in the following manner:

First, we label ground truth. For example, a common query like “What is my 401k match?” could be answered by a 401k section in the employee benefits guide. The relevant excerpt to label could look like:


Company Match

LLM Enterprises offers a dollar-for-dollar match on the first 5% of an employee’s contributions.

Matching contributions vest over three years:

Year 1: 33% vested
Year 2: 66% vested
Year 3: 100% vested


This excerpt is copied and pasted from the source document into our eval data Google Sheet. This is then ingested via Fivetran into our Snowflake data warehouse and queried in our evaluation Python code.
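For illustration, reading those labels back out of the warehouse can be as simple as the sketch below; the connection parameters, table, and column names are hypothetical placeholders for wherever Fivetran lands the sheet, not our actual schema.

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# All identifiers below are hypothetical placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="EVAL",
    schema="RETRIEVAL",
)
cur = conn.cursor()
cur.execute("SELECT query, source_excerpt FROM retrieval_eval_labels")
eval_cases = cur.fetchall()  # list of (query, labeled excerpt) tuples
cur.close()
conn.close()
```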

Second, we call the retrieval service to return the most relevant chunks for the query from a specified knowledge base.

Third, we match the labeled data against the retrieved results. Because we don’t know how the source data is OCR’ed or chunked at labeling time, we need to do a fuzzy string match between the label and each chunk. We prefer to keep the labeling chunk agnostic in order to support experiments that compare different chunking methods.
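One way to implement this step is a partial-ratio fuzzy match, for example via the rapidfuzz package. The sketch below is minimal, and the similarity threshold of 90 is an illustrative choice rather than Benji’s actual setting.

```python
from rapidfuzz import fuzz


def label_in_chunk(label: str, chunk: str, threshold: float = 90.0) -> bool:
    """Fuzzy check that the labeled excerpt appears inside a retrieved chunk.

    partial_ratio (0-100) tolerates small OCR and whitespace differences
    between the copied label and the chunk text.
    """
    return fuzz.partial_ratio(label.lower(), chunk.lower()) >= threshold


def label_is_retrieved(label: str, retrieved_chunks: list[str]) -> bool:
    """True if any retrieved chunk contains the labeled excerpt."""
    return any(label_in_chunk(label, chunk) for chunk in retrieved_chunks)
```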

Fourth, the retrieval result is graded. If the labeled data is included in a chunk that would be injected into the LLM context, the case is graded as correct; otherwise it is graded as incorrect.

Lastly, the results of all evaluation cases are aggregated. We calculate retrieval recall by dividing the number of correct retrieval cases by the total number of cases tested.
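Putting the grading and aggregation together, the recall calculation over a range of cutoff thresholds might look like the following sketch, which reuses the hypothetical `label_is_retrieved` helper from above.

```python
def retrieval_recall_by_cutoff(eval_cases, retrieve, cutoffs=range(1, 13)):
    """Retrieval recall per context cutoff threshold.

    eval_cases: list of (query, labeled_excerpt) pairs.
    retrieve:   function returning ranked chunks for a query.
    """
    recall_by_cutoff = {}
    for cutoff in cutoffs:
        correct = 0
        for query, label in eval_cases:
            top_chunks = retrieve(query)[:cutoff]
            if label_is_retrieved(label, top_chunks):  # fuzzy match sketched above
                correct += 1
        recall_by_cutoff[cutoff] = correct / len(eval_cases)
    return recall_by_cutoff
```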

The section below illustrates how this is put into practice!

Applying Retrieval Eval to Reranking

We tested various local and API reranking models with the Python rerankers library, which provides a standardized interface for calling popular models. The following chart shows retrieval recall over a range of context cutoff thresholds for the models we tested:

Cohere’s Rerank model performed the best in testing on a labeled set of 50 challenging cases, providing an 18% lift in retrieval recall at a context cutoff threshold of 8! In other words, before reranking, roughly 2 out of 10 questions asked to Benji never got the necessary data from the knowledge base into the LLM’s context, making a correct answer impossible; adding the reranker fixes those cases. We chose the cutoff threshold of 8 chunks because gains to retrieval recall diminished with additional context.
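For reference, calling a reranking model through the rerankers library looks roughly like the sketch below. The model name, constructor arguments, and candidate chunks are illustrative; check the rerankers documentation for the exact interface and supported models.

```python
from rerankers import Reranker

# Illustrative setup; the model name and arguments may differ in your version.
ranker = Reranker("cohere", api_key="YOUR_COHERE_API_KEY")

query = "What is my 401k match?"
# In practice these would be the candidates returned by the vector search stage.
candidate_chunks = [
    "LLM Enterprises offers a dollar-for-dollar match on the first 5% of contributions.",
    "Employees may enroll in the HSA during open enrollment.",
    "Matching contributions vest over three years.",
]

results = ranker.rank(query=query, docs=candidate_chunks)
top_chunks = results.top_k(2)  # keep only the best-ranked chunks for the LLM context
```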

Why We Rolled Our Own Retrieval Eval

As mentioned previously, we need to accomplish two goals simultaneously in a retrieval experiment: understand whether a change improves the system and decide which context cutoff threshold to select. The external solutions we researched accomplished the first goal but not the second. For example, RAGAS is a popular library for AI evaluation and supports RAG-specific eval metrics. While it supports the same concept of retrieval recall defined in our approach, it doesn’t offer anything that lets us make a data-driven decision about the context cutoff threshold.

Conclusion

Hopefully this post helps you systematically improve your RAG application quickly by unlocking retrieval experiments!

Thank you to Linda Yang, Patrick Miller, Michael Shum, and Gordon Winthrob for their contributions to this work and blog post.

The Author
Mathew Wang

Data Scientist

Mathew leads AI development on Benji, Newfront’s flagship AI product which automatically helps employees navigate their company’s resources and saves people teams time in fielding questions. He focuses on building LLM, retrieval and evaluation systems to improve AI products through rigorous and rapid experimentation. Before Newfront, Mathew worked on detecting spam and unsafe content on Pinterest’s Trust & Safety engineering team.

Connect with Mathew on LinkedIn