7 Retrieval augmented generation
Retrieval augmented generation (RAG) is the most common augmentation technique for large language models. It is easy to implement and provides very good value for money, especially in restricted domains where a vanilla LLM has limited knowledge.
A typical example is building a Q&A bot that can answer queries related to institutional or private knowledge. For example, a bot that knows about your company’s policies and internal rules. It is unlikely that any vanilla LLM, no matter how powerful, can answer a precise query about your organization.
To bridge this knowledge gap, we introduce a retrieval component that can access a knowledge base (database, folder of documents, etc.) and obtain a small chunk of relevant context for a particular user query. This context is then inserted into a prompt template, along with the user query, instructing the LLM to answer the query based on the given context. If the context is indeed relevant, this often works wonders.
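To make this concrete, below is a minimal sketch of the prompt-assembly step, assuming the OpenAI Python client as the backend; the template wording and model name are illustrative, and the chunks can come from any of the retrieval strategies discussed next.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say so explicitly.

Context:
{context}

Question: {query}"""

def rag_answer(query: str, chunks: list[str]) -> str:
    # `chunks` is whatever the retrieval component returned for this query.
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), query=query)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```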
Different RAG approaches differ in at least two dimensions. First, you must decide how to index the background knowledge, which in turn determines how the retrieval works. For example, if the knowledge consists of a set of text documents (the most common scenario), you will almost certainly use a vector database for similarity queries. But you can also use a more sophisticated search strategy, such as a full-text search engine like ElasticSearch, or even a search service like Google or a commercial search API.
Second, you must decide how exactly the user query is used to locate the most relevant context. The simplest option is to provide the user query directly to the retrieval engine, e.g., embedding the user query if you’re using a vector store, or submitting it verbatim to the search engine. However, the user query is often not the most informative way to query the search engine, and you can resort to the LLM itself to modify, augment, or even completely change the query.
Retrieval strategies
These are the most common retrieval strategies.
Vector databases
In this approach, each document is split into meaningful chunks of text (e.g., paragraphs or sections) and each chunk is transformed into an embedding. At inference time, the user query is used to obtain a “key” embedding. This key is compared to each stored chunk and the most similar chunks are inserted into the context. For large-scale scenarios, you would use an efficient vector store that can quickly locate the most similar vectors.
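A bare-bones version of this pipeline might look as follows; it uses the OpenAI embeddings endpoint and a brute-force cosine similarity search in NumPy, where a production system would swap in a real vector store (FAISS, pgvector, a hosted service, etc.). The sample chunks and model name are illustrative.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Indexing: split documents into chunks (here, already-split paragraphs) and embed them.
chunks = [
    "Employees may work remotely up to three days per week.",
    "Travel expenses must be approved by a manager before booking.",
    "The annual performance review cycle starts in January.",
]
chunk_vectors = embed(chunks)

def retrieve(query: str, k: int = 2) -> list[str]:
    # The user query becomes the "key" embedding.
    query_vector = embed([query])[0]
    # Cosine similarity between the key and every stored chunk.
    sims = chunk_vectors @ query_vector / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

print(retrieve("Can I work from home?"))
```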
Structured databases
If the knowledge is stored in a traditional SQL database, then you must resort to SQL code generation for retrieval. The simplest solution is to have an LLM generate the appropriate SQL statement for a given user query in a single shot, but this process can be improved with multiple passes, as we’ll see in ?sec-code-generation.
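A single-shot text-to-SQL retrieval step could be sketched like this, using an in-memory SQLite database; the schema, prompt, and model name are illustrative, and in practice you would validate the generated SQL before executing it.

```python
import sqlite3
from openai import OpenAI

client = OpenAI()

SCHEMA = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"

def sql_for(question: str) -> str:
    prompt = (
        f"Given this SQLite schema:\n{SCHEMA}\n"
        "Write a single SQL query that answers the question. "
        "Return only the SQL, with no explanation and no code fences.\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    sql = response.choices[0].message.content.strip()
    # Defensive cleanup in case the model wraps its answer in a Markdown code fence.
    return sql.strip("`").removeprefix("sql").strip()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 'Engineering', 90000)")

rows = conn.execute(sql_for("What is the average salary per department?")).fetchall()
# `rows` becomes the retrieved context passed to the final answering prompt.
```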
Knowledge graphs
A compelling alternative for storing well-structured facts about a specific business domain is a knowledge graph. Explaining what a knowledge graph is in detail goes beyond the scope of this book, but in a nutshell, it is a network of the relevant entities in a domain and their interrelationships. For example, if you are in the medical domain, this graph could contain nodes for all known diseases, symptoms, and drugs, and edges indicating which diseases are associated with which symptoms, and which drugs can be prescribed for each disease.
Querying a knowledge graph depends on the underlying implementation. If you are using a graph database, such as Neo4j, this isn’t that much different from querying a traditional database. You will probably use an LLM to generate query statements in an appropriate query language (e.g., Cypher in the case of Neo4j).
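For instance, a Cypher-generation step against a Neo4j instance might be sketched as follows, using the official neo4j Python driver; the graph schema, connection details, and model name are all assumptions.

```python
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

GRAPH_SCHEMA = """(:Disease)-[:HAS_SYMPTOM]->(:Symptom)
(:Drug)-[:TREATS]->(:Disease)"""

def cypher_for(question: str) -> str:
    prompt = (
        f"Graph schema:\n{GRAPH_SCHEMA}\n"
        "Write a single Cypher query that answers the question. "
        "Return only the Cypher statement, with no explanation.\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Defensive cleanup in case the model wraps its answer in a Markdown code fence.
    return response.choices[0].message.content.strip().strip("`").removeprefix("cypher").strip()

with driver.session() as session:
    records = session.run(cypher_for("Which drugs treat migraine?"))
    context = [record.data() for record in records]
# `context` (a list of result rows) is then formatted into the answering prompt.
```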
However, you can also expose the graph structure to the LLM and use the LLM as a controller to navigate the graph. The simplest approaches involve asking the LLM for the relevant entities and relations to focus on, and obtaining the corresponding induced subgraph. More advanced approaches involve constructing a relevant subgraph by iteratively querying the LLM about which relations (edges) are worthwhile to explore in each iteration.
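The simple entity-selection variant could be sketched like this with an in-memory networkx graph; the entities, relations, prompt, and model name are illustrative.

```python
import json
import networkx as nx
from openai import OpenAI

client = OpenAI()

# A toy medical knowledge graph.
G = nx.MultiDiGraph()
G.add_edge("migraine", "nausea", relation="HAS_SYMPTOM")
G.add_edge("sumatriptan", "migraine", relation="TREATS")
G.add_edge("flu", "fever", relation="HAS_SYMPTOM")

def relevant_context(question: str) -> list[str]:
    prompt = (
        f"Known entities: {sorted(G.nodes)}\n"
        f"Question: {question}\n"
        'Reply with a JSON list of the entities relevant to the question, e.g. ["a", "b"].'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model follows the JSON-only instruction.
    entities = json.loads(response.choices[0].message.content)
    subgraph = G.subgraph(entities)  # induced subgraph on the selected entities
    # Serialize its edges as simple facts to insert into the answering prompt.
    return [f"{u} -{data['relation']}-> {v}" for u, v, data in subgraph.edges(data=True)]

print(relevant_context("What can I take for a migraine?"))
```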
Search APIs
Finally, you can store your relevant domain knowledge in a storage service that provides a search API. This can range from locally deployed document databases such as ElasticSearch or MongoDB, to cloud-hosted instances, to third-party search APIs like Google, Reddit, Wikipedia, and a myriad of other domain-specific services.
In these cases, your retrieval strategy will depend on the idiosyncrasies of the search service you use. For something simple like ElasticSearch full-text search or the Google API, you may simply submit the user query directly. However, if the search API has relevant parameters, this becomes an instance of function calling, which we’ll see in Chapter 8.
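As a concrete example of the simple case, here is a sketch that submits the user query directly to Wikipedia’s public search endpoint over plain HTTP; the parameters follow the MediaWiki API, but check the documentation of whatever service you actually use.

```python
import requests

def search_wikipedia(query: str, k: int = 3) -> list[str]:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": query,  # the user query, submitted directly
            "srlimit": k,
            "format": "json",
        },
        timeout=10,
    )
    results = response.json()["query"]["search"]
    # Titles plus snippets become the retrieved context for the answering prompt.
    return [f"{r['title']}: {r['snippet']}" for r in results]

print(search_wikipedia("solar eclipse eye damage"))
```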
Query strategies
The following are ways to turn the user query into a suitable search query.
Search by query
The most direct strategy is to simply submit the user query to your retrieval engine. In the case of vector databases, this implies embedding the user query directly, while in the case of search APIs this involves sending the user query as the main argument in the API search method.
The obvious upside of this strategy is its simplicity, and the fact that it works surprisingly well in many cases. This of course depends on how robust your search engine is, and more specifically, on how closely the user query matches the query language your search engine expects.
Search by answer
Specifically in vector databases and embedding-based retrieval, researchers have observed that the user query is often not informative enough to pinpoint the most relevant document chunk. For example, if your knowledge base is composed of research papers or technical documents in general, it is very unlikely that a user query formulated in imprecise, informal language will have an embedding that is most similar to the exact paragraph that holds the answer, especially if that answer is far from trivial.
In these cases, a neat trick is to use the LLM first to generate an answer on the fly, and then embed that answer and use it as the query key. The reason this works is that, even if the vanilla LLM doesn’t have the precise knowledge to answer the user query in detail, it is often capable of producing at least a plausible-sounding response that mimics the right language.
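A sketch of this trick (sometimes called HyDE, for hypothetical document embeddings) is shown below; it reuses the `retrieve` helper from the vector database sketch above, and the prompt and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def retrieve_by_answer(query: str, k: int = 2) -> list[str]:
    # Step 1: let the LLM draft a plausible answer; it does not need to be
    # factually reliable, it only needs to "sound like" the target documents.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short, plausible answer to this question: {query}",
        }],
    ).choices[0].message.content
    # Step 2: embed the draft answer instead of the raw query and search with it.
    return retrieve(draft, k=k)  # `retrieve` from the vector database sketch above
```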
LLM-guided search
A more elaborate search strategy involves prompting the LLM to generate a suitable search query. This can work better than both previous strategies if the prompt is carefully constructed. By providing the LLM with both the context the user is coming from and the characteristics of the knowledge base, we can leverage the powerful natural language understanding capabilities of the LLM to bridge the gap between what the user says, what the user actually wants, and what the search engine needs.
The simplest example of this approach is prompting the LLM to, given a user query, produce a set of relevant search queries for a general-purpose search engine. For example, if you’re building a Medical Q&A bot backed by a custom knowledge base, and the user query is something like “What are the effects of looking straight into the Sun during an eclipse?”, it is unlikely that this query by itself will retrieve the right article. However, an LLM can easily determine that an appropriate query would be “Solar eclipses: medical recommendations”.
If you enhance this approach with a small set of examples, the LLM can quickly learn to map fuzzy user queries to much more precise and domain-specific queries. Thus, this approach works best when you’re dealing with custom search engines or knowledge bases that are not as capable as, say, Google, at providing a relevant context for an arbitrary user query.
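A sketch of this query-generation step might look as follows; the few-shot example, prompt wording, and model name are illustrative assumptions for a hypothetical medical knowledge base.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_queries(user_query: str, n: int = 3) -> list[str]:
    prompt = (
        "You write search queries for a medical knowledge base.\n"
        "Example:\n"
        "User: my head hurts after reading in the dark\n"
        'Queries: ["eye strain headache causes", "reading in low light vision effects"]\n\n'
        f"User: {user_query}\n"
        f"Reply with a JSON list of at most {n} search queries."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model follows the JSON-only instruction.
    return json.loads(response.choices[0].message.content)

queries = generate_queries(
    "What are the effects of looking straight into the Sun during an eclipse?"
)
# Each generated query is then submitted to the search engine and the results merged.
```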
Iterated retrieval & refinement
This is an extension of the previous approach in which, instead of a single shot, we let the LLM iteratively provide more relevant queries. The objective is to construct a relevant context one step at a time, by obtaining a query from the LLM, extracting the relevant chunk, and using self-reflection to let the LLM decide if additional information is required.
This approach has the advantage that if the first query is not as informative as necessary, we still get a few shots to pinpoint the exact context we need. However, this can quickly get out of hand and produce a huge, semi-relevant or mostly irrelevant context that will confuse the LLM more than it helps.
To counter this effect, we can add a refinement step after each retrieval, as follows. We let the LLM produce a query, find the most relevant chunk, and then ask the LLM to, given the query and the context, extract a summary of the relevant points mentioned in the context. This way, even if we end up extracting dozens of chunks, the final context could be a very concise and relevant summary of the necessary background knowledge.
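Putting the loop and the refinement step together might look like the following sketch, which reuses the `retrieve` helper from the vector database sketch above; the prompts, stopping criterion, and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def iterative_context(question: str, max_steps: int = 3) -> str:
    notes: list[str] = []
    for _ in range(max_steps):
        query = ask(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Write ONE search query that would fill the most important gap, "
            "or reply DONE if the notes are sufficient to answer."
        )
        if query.upper().startswith("DONE"):
            break
        chunk = "\n".join(retrieve(query))  # `retrieve` from the vector database sketch
        # Refinement: keep only the points relevant to the question, not the raw chunk.
        notes.append(ask(
            f"Question: {question}\nContext:\n{chunk}\n"
            "Summarize only the points relevant to the question."
        ))
    return "\n".join(notes)
```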