Seven Failure Points When Engineering a Retrieval Augmented Generation System

Author

Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, Mohamed Abdelrazek

Year

2024

Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, Mohamed Abdelrazek. 2024.

Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system involves finding documents that semantically match a query and then passing the documents to a large language model (LLM) such as ChatGPT to extract the right answer using an LLM. RAG systems aim to: a) reduce the problem of hallucinated responses from LLMs, b) link sources/references to generated responses, and c) remove the need for annotating documents with meta-data. However, RAG systems suffer from limitations inherent to information retrieval systems and from reliance on LLMs. In this paper, we present an experience report on the failure points of RAG systems from three case studies from separate domains: research, education, and biomedical.

The paper focuses on Retrieval Augmented Generation (RAG) systems, which combine large language models (LLMs) with semantic search. RAG systems aim to overcome limitations of LLMs by retrieving documents that semantically match a query and using them to generate the response.
Validation of a RAG system is feasible only during operation. The robustness of a RAG system evolves over time rather than being fully designed in at the start.
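The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the paper's case studies: `embed` is a toy bag-of-words stand-in for a real embedding model, the function names (`retrieve`, `build_prompt`) and the sample documents are hypothetical, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k (score, document) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query, retrieved):
    """Assemble the text that would be sent to the LLM (call not shown)."""
    context = "\n".join(
        f"[{i + 1}] {doc}" for i, (_, doc) in enumerate(retrieved)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical sample corpus, for illustration only.
DOCS = [
    "The library opens at 9am on weekdays.",
    "Course enrolment closes two weeks after the semester starts.",
    "The cafeteria serves lunch from noon until 2pm.",
]

question = "When does the library open?"
hits = retrieve(question, DOCS, top_k=1)
prompt = build_prompt(question, hits)
```

Numbering the sources in the prompt is one simple way to support the goal of linking references to generated responses; a production system would swap `embed` for a learned embedding model and a vector index.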
Key failure points:

Missing Content: Inability to answer questions due to lack of relevant documents.
Missed Top Ranked Documents: The answer is in a document, but that document did not rank highly enough to be returned to the user.
Not in Context - Consolidation Strategy Limitations: Documents containing the answer were retrieved, but the consolidation step did not fit them into the context window used to generate the answer.
Not Extracted: The answer is present in the context, but the LLM fails to extract it. This occurs when the context contains too much noise or contradictory information.
Wrong Format: Inability to adhere to specific formats (like tables or lists) when extracting information.
Incorrect Specificity: Providing answers that are either too general or too specific for the user's needs.
Incomplete: Partial answers that omit significant information despite its availability in the context.
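The first three failure points occur before the LLM generates anything, so they can be told apart by tracking a document that is known to contain the answer. The sketch below is a simplified diagnostic under stated assumptions, not a method from the paper: `diagnose` is a hypothetical helper, `ranked_ids` is assumed to be the retriever's full ranking of the corpus, token counts are made up, and consolidation is modeled as greedy packing that stops at the first document that does not fit.

```python
def diagnose(ranked_ids, relevant_id, top_k, doc_tokens, context_budget):
    """Classify where a known-relevant document is lost before generation.

    ranked_ids:     document ids ordered best-first over the whole corpus.
    relevant_id:    id of the document known to contain the answer.
    top_k:          how many ranked documents the retriever returns.
    doc_tokens:     mapping of id -> token count (hypothetical sizes).
    context_budget: max tokens the consolidation step may pass to the LLM.
    """
    # Failure point 1: the corpus has no document with the answer.
    if relevant_id not in ranked_ids:
        return "missing content"

    # Failure point 2: the document exists but ranks below the cut-off.
    if ranked_ids.index(relevant_id) >= top_k:
        return "missed top ranked documents"

    # Failure point 3: retrieved, but dropped while packing the context.
    used = 0
    for doc_id in ranked_ids[:top_k]:
        if used + doc_tokens[doc_id] > context_budget:
            break  # assumed strategy: stop at the first document that overflows
        used += doc_tokens[doc_id]
        if doc_id == relevant_id:
            return "in context"
    return "not in context"


# Hypothetical ranking and document sizes, for illustration only.
ranking = ["a", "b", "c"]
sizes = {"a": 80, "b": 50, "c": 10}

results = {
    "absent": diagnose(ranking, "z", 2, sizes, 100),
    "ranked_low": diagnose(ranking, "c", 2, sizes, 100),
    "truncated": diagnose(ranking, "b", 2, sizes, 100),
    "ok": diagnose(ranking, "a", 2, sizes, 100),
}
```

The remaining failure points (Not Extracted, Wrong Format, Incorrect Specificity, Incomplete) depend on the LLM's output and would need checks on the generated answer rather than on the retrieval trace.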