7  Retrieval augmented generation

Retrieval augmented generation (RAG) is the most common augmentation technique for large language models. It is easy to implement and offers very good value for the effort, especially in restricted domains where a vanilla LLM has limited knowledge.

A typical example is building a Q&A bot that can answer queries related to institutional or private knowledge. For example, a bot that knows about your company’s policies and internal rules. It is unlikely that any vanilla LLM, no matter how powerful, can answer a precise query about your organization.

To bridge this knowledge gap, we introduce a retrieval component that can access a knowledge base (database, folder of documents, etc.) and obtain a small chunk of relevant context for a particular user query. This context is then inserted into a prompt template, along with the user query, instructing the LLM to answer the query based on the given context. If the context is indeed relevant, this often works wonders.
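To make the flow concrete, here is a minimal sketch of the prompt-assembly step in Python. The `retrieve` and `call_llm` helpers are hypothetical placeholders for your retrieval engine and LLM provider, and the template wording is just one reasonable choice.

```python
# Minimal RAG prompt assembly (sketch). `retrieve` and `call_llm` are
# placeholders for your retrieval engine and LLM provider.

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context is not sufficient, say you don't know.

Context:
{context}

Question: {query}
Answer:"""

def answer(query: str, retrieve, call_llm, k: int = 3) -> str:
    chunks = retrieve(query, k=k)          # top-k relevant chunks
    context = "\n\n".join(chunks)          # concatenate into one context block
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    return call_llm(prompt)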

Different RAG approaches differ in at least two dimensions. First, you must decide how to index the background knowledge, which in turn determines how the retrieval works. For example, if the knowledge consists of a set of text documents (the most common scenario), you will almost certainly use a vector database for similarity queries. But you can also use a more sophisticated search strategy, such as a full-text search engine like ElasticSearch, or even a search service like Google or a commercial search API.

Second, you must decide how exactly the user query is used to locate the most relevant context. The simplest option is to provide the user query directly to the retrieval engine, e.g., embedding the user query if you’re using a vector store, or submitting it verbatim to the search engine. However, the user query is often not the most informative way to query the search engine, and you can resort to the LLM itself to modify, augment, or even completely change the query.

Retrieval strategies

These are the most common retrieval strategies.

Vector databases

In this approach, each document is split into meaningful chunks of text (e.g., paragraphs or sections) and each chunk is transformed into an embedding. At inference time, the user query is used to obtain a “key” embedding. This key is compared to each stored chunk and the most similar ones are inserted in the context. For large-scale scenarios, you would use an efficient vector store that can quickly locate the most similar vectors.
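As an illustration, the following is a toy in-memory version of this idea, using cosine similarity over normalized vectors. The `embed` callable is a stand-in for whatever embedding model you use; in production you would replace the linear scan with a proper vector store.

```python
import numpy as np

# Toy in-memory vector index (sketch). `embed` stands in for any embedding
# model that maps a string to a fixed-size vector.

class TinyVectorIndex:
    def __init__(self, embed):
        self.embed = embed
        self.chunks, self.vectors = [], []

    def add(self, chunk: str):
        v = np.asarray(self.embed(chunk), dtype=float)
        self.chunks.append(chunk)
        self.vectors.append(v / np.linalg.norm(v))     # store normalized vectors

    def search(self, query: str, k: int = 3) -> list[str]:
        q = np.asarray(self.embed(query), dtype=float)
        q = q / np.linalg.norm(q)
        scores = [float(q @ v) for v in self.vectors]  # cosine similarity
        top = np.argsort(scores)[::-1][:k]             # indices of the k best chunks
        return [self.chunks[i] for i in top]
```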

Structured databases

If the knowledge is stored in a traditional SQL database, then you must resort to SQL code generation for retrieval. The simplest solution is to have an LLM generate the appropriate SQL statement for a given user query in a single shot, but this process can be improved with multiple passes, as we’ll see in the chapter on code generation.
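A minimal sketch of that single-shot approach could look like this; `call_llm` and `run_sql` are placeholders for your LLM provider and database client, and in practice you would validate the generated SQL and run it against a read-only connection.

```python
# Single-shot text-to-SQL retrieval (sketch). `call_llm` and `run_sql`
# are placeholders for the LLM provider and the database client.

SQL_PROMPT = """Given the following schema, write one SQL query that
answers the user's question. Return only the SQL.

Schema:
{schema}

Question: {query}
SQL:"""

def retrieve_from_sql(query: str, schema: str, call_llm, run_sql):
    sql = call_llm(SQL_PROMPT.format(schema=schema, query=query))
    rows = run_sql(sql)          # ideally executed on a read-only connection
    return rows                  # the rows become the context for the answer
```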

Knowledge graphs

A compelling alternative for storing well-structured facts about a specific business domain is a knowledge graph. Explaining what a knowledge graph is in detail goes beyond the scope of this book, but in a nutshell, it is a network of the relevant entities in a domain and their interrelationships. For example, if you are in the medical domain, this graph could contain nodes for all known diseases, symptoms, and drugs, and edges indicating which diseases are associated with which symptoms, and which drugs can be prescribed for each disease.

Querying a knowledge graph depends on the underlying implementation. If you are using a graph database, such as Neo4j, this isn’t that much different from querying a traditional database. You will probably use an LLM to generate query statements in an appropriate query language (e.g., Cypher in the case of Neo4j).
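For example, a sketch of this could look as follows, with `call_llm` and `run_cypher` standing in for your LLM provider and the graph database driver (e.g., the Neo4j Python driver).

```python
# LLM-generated Cypher over a graph database (sketch). `call_llm` and
# `run_cypher` are placeholders for the LLM provider and the graph driver.

CYPHER_PROMPT = """Given this graph schema, write a Cypher query that
answers the question. Return only the Cypher.

Schema:
{schema}

Question: {query}
Cypher:"""

def retrieve_from_graph(query: str, schema: str, call_llm, run_cypher):
    cypher = call_llm(CYPHER_PROMPT.format(schema=schema, query=query))
    records = run_cypher(cypher)     # execute against the graph database
    return records                   # query results become the context
```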

However, you can also expose the graph structure to the LLM and use it as a controller to navigate the graph. The simplest approaches involve asking the LLM for the relevant entities and relations to focus on, and obtaining the corresponding induced subgraph. More advanced approaches involve constructing a relevant subgraph by iteratively querying the LLM about which relations (edges) are worth exploring at each iteration.
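A sketch of the iterative variant might look like this, where `get_neighbors` (returning (entity, relation, neighbor) triples) and `call_llm` are hypothetical helpers and the stopping criterion is deliberately simple.

```python
# LLM-guided graph exploration (sketch). The LLM repeatedly decides which
# edges to expand; `call_llm` and `get_neighbors` are placeholders.

def explore_graph(query: str, seed_entities: list[str],
                  call_llm, get_neighbors, max_steps: int = 3):
    subgraph = []                        # collected (entity, relation, entity) triples
    frontier = list(seed_entities)
    for _ in range(max_steps):
        candidates = []
        for entity in frontier:
            candidates.extend(get_neighbors(entity))   # (entity, relation, neighbor)
        if not candidates:
            break
        listing = "\n".join(f"{h} -[{r}]-> {t}" for h, r, t in candidates)
        decision = call_llm(
            f"Question: {query}\nCandidate edges:\n{listing}\n"
            "List the edges worth following, one per line, or reply DONE."
        )
        if decision.strip() == "DONE":
            break
        chosen = [c for c in candidates
                  if f"{c[0]} -[{c[1]}]-> {c[2]}" in decision]
        subgraph.extend(chosen)                       # grow the induced subgraph
        frontier = [t for _, _, t in chosen]          # expand from the new entities
    return subgraph
```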

Search APIs

Finally, you can store your relevant domain knowledge in a storage service that provides a search API. This can range from locally deployed document databases such as ElasticSearch or MongoDB, to cloud-hosted instances, to third-party search APIs like Google, Reddit, Wikipedia, and a myriad of other domain-specific services.

In these cases, your retrieval strategy will depend on the idiosyncrasies of the search service you use. For something simple like ElasticSearch full-text search or the Google API, you may simply submit the user query directly. However, if the search API has relevant parameters, this becomes an instance of function calling, which we’ll see in Chapter 8.
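As a concrete example, the snippet below queries Wikipedia’s public MediaWiki search endpoint with the raw user query; the same pattern applies to any other search service, with the parameters adjusted to its API.

```python
import requests

# Retrieval through a public search API (sketch), here Wikipedia's
# MediaWiki search endpoint; any search service plays the same role.

def search_wikipedia(query: str, k: int = 3) -> list[str]:
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": query, "srlimit": k, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json()["query"]["search"]
    # Snippets may contain HTML highlight markup; strip it if needed.
    return [f"{h['title']}: {h['snippet']}" for h in hits]
```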

Query strategies

The following are ways to use the user query for generating a suitable search query.

Search by query

The most direct strategy is to simply submit the user query to your retrieval engine. In the case of vector databases, this means embedding the user query directly, while in the case of search APIs it means sending the user query as the main argument to the API’s search method.

The obvious upside of this strategy is its simplicity, and the fact that it works surprisingly well in many cases. This of course depends on how robust your search engine is, and more specifically, on how closely the user query matches the kind of query language your search engine expects.

Search by answer

Specifically in vector databases and embedding-based retrieval, researchers have observed that the user query is often not informative enough to pinpoint the most relevant document chunk. For example, if your knowledge base is composed of research papers or technical documents in general, it is very unlikely that a user query formulated in imprecise, informal language will have an embedding that is most similar to the exact paragraph that holds the answer, especially if that answer is far from trivial.

In these cases, a neat trick is to use the LLM first to generate an answer on the fly, and then embed that answer and use it as the query key. The reason this works is that, even if the vanilla LLM doesn’t have the precise knowledge to answer the user query in detail, it is often capable of producing at least a plausible-sounding response that mimics the right language.
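A sketch of this trick, assuming the hypothetical `call_llm` helper and a vector index with a `search` method (such as the toy index shown earlier):

```python
# "Search by answer" (sketch): embed a drafted answer instead of the raw
# query. `call_llm` and `index` are placeholders.

def search_by_answer(query: str, call_llm, index, k: int = 3) -> list[str]:
    draft = call_llm(
        "Write a short, plausible answer to the question below. "
        "It does not need to be correct; match the style of a technical document.\n\n"
        f"Question: {query}"
    )
    return index.search(draft, k=k)   # the draft, not the query, is embedded
```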

Iterated retrieval & refinement

This is an extension of the previous approach in which, instead of a single shot, we let the LLM iteratively provide more relevant queries. The objective is to construct a relevant context one step at a time, by obtaining a query from the LLM, extracting the relevant chunk, and using self-reflection to let the LLM decide whether additional information is required.

This approach has the advantage that if the first query is not as informative as necessary, we still get a few shots to pinpoint the exact context we need. However, this can quickly get out of hand and produce a huge, semi-relevant or mostly irrelevant context that will confuse the LLM more than it helps.

To counter this effect, we can add a refinement step after each retrieval, as follows. We let the LLM produce a query, find the most relevant chunk, and then ask the LLM, given the query and the retrieved context, to extract a summary of the relevant points it mentions. This way, even if we end up extracting dozens of chunks, the final context can be a very concise and relevant summary of the necessary background knowledge.
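Putting the loop and the refinement step together, a sketch could look like this, again with `call_llm` and `index` as placeholders for the LLM provider and the retrieval engine.

```python
# Iterated retrieval with per-step refinement (sketch). `call_llm` and
# `index` are placeholders for the LLM provider and the retrieval engine.

def iterative_retrieve(query: str, call_llm, index, max_steps: int = 4) -> str:
    notes = []                                   # refined summaries collected so far
    search_query = query
    for _ in range(max_steps):
        chunk = index.search(search_query, k=1)[0]
        summary = call_llm(
            f"Question: {query}\nPassage:\n{chunk}\n"
            "Summarize only the points in the passage that help answer the question."
        )
        notes.append(summary)
        decision = call_llm(
            f"Question: {query}\nNotes so far:\n" + "\n".join(notes) +
            "\nIf the notes are enough to answer, reply DONE. "
            "Otherwise reply with the next search query to run."
        )
        if decision.strip().upper() == "DONE":
            break
        search_query = decision.strip()          # refined query for the next round
    return "\n".join(notes)                      # concise context for the final answer
```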