3 What can LLMs do?
Now that we understand how language models are built, let’s turn our attention to their capabilities. As we’ve seen so far, base models are, ultimately, just autocompletion models. Given an initial prefix, they can produce a mostly coherent continuation that is plausible as far as the data and the training procedure allow.
But autocompletion is far from the only task you can do with LLMs. As we will see in this chapter, a sufficiently powerful autocompletion engine can be coerced into performing many disparate tasks. Combine this with task-specific fine-tuning and you can turn a chatty, hallucination-prone LLM into a powerful tool for many domains.
We will start by examining what base models can do since, ultimately, all fine-tuning can do is unlock existing capabilities and make them easier to prompt. Then, we will survey many specific tasks LLMs can be, and have been, used for.
What can base models do?
Autocompletion on steroids, as cool as it sounds, doesn’t really sound like anything smart, right? Well, it turns out that if you are very, very good at completing any text prefix, you must be good at a wide range of cognitive tasks.
For example, suppose you want to build a question-answering engine. Take a question like “Who is the current president of the United States?” and turn it into a prompt like “The current president of the United States is…”. If you feed this to a powerful base LLM, the most likely continuation is the correct answer to the question. This means autocomplete on steroids gives you question answering for free.
And you can do this for a whole lot of tasks. Just turn them into an appropriate prefix and continuation. Do you want to translate a sentence? Use a prompt like “An English translation of the previous sentence is…” Do you want to summarize a text? Use a prompt like “A summary of the previous text is…” You get the point.
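To make the idea concrete, here is a minimal sketch in Python. It assumes the `openai` SDK and a completion-style model; those choices are incidental, and any base-model completion endpoint would work just as well.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str) -> str:
    """Return the model's most likely continuation of `prompt`."""
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # any completion-style model would do
        prompt=prompt,
        max_tokens=64,
    )
    return response.choices[0].text.strip()

# Question answering: rephrase the question as a statement to be completed.
print(complete("The current president of the United States is"))

# Translation: append a cue so the translation is the natural continuation.
print(complete(
    "Il pleut des cordes.\n"
    "An English translation of the previous sentence is:"
))
```

The sketches in the rest of this chapter will reuse a generic `complete()` helper like this one as a stand-in for whichever model you have access to.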
But it goes much farther than that! The scientists at OpenAI discovered that models the size of GPT-3 and above were capable of inferring the semantics of a task from examples alone, without being explicitly told what the task is. This is called in-context learning, and it works wonders. For example, if you want to use an LLM for sentiment analysis, you can use a prompt like the following.
Comment: This movie was so good!
Sentiment: Positive
Comment: This movie really sucks.
Sentiment: Negative
Comment: The book was better.
Sentiment: Neutral
Comment: I couldn't stop looking at the screen!
Sentiment:
That is, you build a prompt with a few examples of inputs and outputs and feed that to the LLM, leaving the last input unanswered. The most likely continuation is the right answer to the last input, so provided the base model has seen similar tasks in its training data, it will pick up the pattern and answer correctly most of the time.
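A sketch of how you might assemble such a few-shot prompt programmatically, reusing the hypothetical `complete()` helper sketched earlier:

```python
# A handful of solved examples; the last input is left unanswered.
EXAMPLES = [
    ("This movie was so good!", "Positive"),
    ("This movie really sucks.", "Negative"),
    ("The book was better.", "Neutral"),
]

def classify_sentiment(comment: str) -> str:
    # Stack the solved examples, then leave the final "Sentiment:" open
    # so the most likely continuation is the label we want.
    shots = "\n".join(f"Comment: {c}\nSentiment: {s}" for c, s in EXAMPLES)
    prompt = f"{shots}\nComment: {comment}\nSentiment:"
    return complete(prompt)

print(classify_sentiment("I couldn't stop looking at the screen!"))
```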
In-context learning is a surprising discovery at first, but when you think about it, it makes total sense. Since base LLMs are completion machines, provided they have seen examples of some arbitrary task in their training set, all you need to do is come up with a text prefix that makes the model “remember” that task. And that prefix is often just a set of examples of the task, because that is what is actually stored in the LLM weights: a loosely and implicitly connected set of similar text fragments.
In a sense, the input to the LLM is a key to retrieve a part of its training set, but not in an accurate way. Since LLMs only store correlations between words, anything you “retrieve” from an LLM is a fuzzy approximation and aggregation of many (possibly millions of) similar training examples. For this reason, we say base models already “know” everything, but it’s very hard for them to “remember” it, because you have to find the right key–i.e., the right context prefix.
But what if we could teach the LLM that some arbitrary instruction is equivalent to the right key for a given task? That is exactly what instruction tuning is about. By showing the LLM input/output pairs of, this time, precise instructions and their corresponding answers, we are rewiring some of its parameters to strengthen the correlation between the instruction and the response. In a sense, fine-tuning is like finding a path between the input space and the output space in the base model’s fuzzy web of word correlations and connecting those two subspaces of words with a shortcut, so that next time you input the instruction, the LLM will “remember” where the appropriate answer is.
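As a rough illustration, an instruction-tuning dataset is little more than a list of such pairs. The field names below are placeholders; the exact format varies from one fine-tuning API to another.

```python
# Hypothetical instruction-tuning records; field names vary across providers.
instruction_dataset = [
    {
        "instruction": "Translate the following sentence into English.",
        "input": "Il pleut des cordes.",
        "output": "It is raining cats and dogs.",
    },
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "...",   # the source paragraph
        "output": "...",  # the desired one-sentence summary
    },
]
```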
If all of this sounds overly anthropomorphic, it is because we have stretched the analogies a bit to make things easier to understand. In reality, there is no “remembering” or “knowing” happening inside a large language model, at least not in any way akin to how human memory and reasoning work. We will talk about this difference and its implications in Chapter 4. For the time being, please be cognizant that any analogy between LLMs and human brains is bound to break down pretty quickly and cause major misunderstandings if taken too seriously.
Use cases for fine-tuned LLMs
With proper fine-tuning in a concrete domain, you can turn LLMs into task-specific models for a huge variety of linguistic problems. In this section we’ll review some of the most common tasks in which LLMs can be deployed. These are rather abstract tasks; in ?sec-applications we will see many practical applications involving combinations of them.
When discussing the use cases of fine-tuned LLMs, we don’t talk about an “input prefix” anymore because, even if technically that is still what we are feeding the LLM, the response is not necessarily a direct, human-like continuation of the text. Instead, depending on which dataset it was fine-tuned on, the LLM will respond with something that looks more like an answer to a question or an instruction than a pure continuation. Actually, if you give a fine-tuned LLM like GPT-4 an incomplete text prefix, it will often reply with something like “I didn’t understand you entirely, but it appears what you are trying to do is […]” instead of casually continuing where you left off.
Thus, it is often best to interpret this process as “prompting” the LLM with an instruction, and this is the reason why the input text is called a “prompt”, and the process of designing, testing, and optimizing these prompts is called, sometimes undeservedly, “prompt engineering”.
Text generation
The simplest, most straightforward use case for large language models is, of course, text generation, whether for fictional content, technical articles, office work, homework, emails, or anything in between. But instead of using a base model, where you have to provide a prefix to continue, an instruction-tuned model can be instructed directly to write a paragraph, passage, or even a short essay on a given topic. Depending on how powerful and well-trained the model is, you can even provide hints about the intended audience, the complexity of the language to use, etc.
Text generation–and instruction following in general–often works better the more descriptive the prompt. If you simply ask the LLM to “tell me a fairy story”, yes, it will come up with something plausible, and it might even surprise you in a good way. But you most likely want finer control over the result, and thus crafting a well-structured and informative prompt is crucial. In Chapter 6 we will learn the most basic strategies to create effective prompts.
A common issue in text generation, especially in longer formats, is that the LLM can and often will steer away from the main points of the discourse. The longer the response, the more likely some hallucinations will happen, which may take the form of incoherent or plainly contradictory items, e.g., characters acting “out of character” if you’re generating fiction.
A battle-tested solution for generating coherent, long-form text is the divide-and-conquer approach. Instead of asking for the full text from the beginning, prompt the LLM to first generate an outline of the text, and then, sequentially, ask it to fill in the sections and subsections, potentially feeding it previously generated content to help it maintain consistency. In Chapter 19 we will see this approach taken to an extreme.
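Here is a minimal sketch of the divide-and-conquer idea, again with the generic `complete()` helper standing in for your model of choice and with illustrative prompt wording.

```python
# Reuses the complete() helper sketched earlier in this chapter.
def write_long_text(topic: str) -> str:
    # Step 1: ask for an outline, one section title per line.
    outline = complete(
        f"Write an outline for an essay about {topic}, "
        "as a plain list with one section title per line."
    )
    sections = [line.strip("-* ").strip() for line in outline.splitlines() if line.strip()]

    # Step 2: fill in each section, feeding back what has been written so far
    # to help the model stay consistent.
    essay = []
    for title in sections:
        written_so_far = "\n\n".join(essay)
        section = complete(
            f"Essay topic: {topic}\n"
            f"Full outline: {', '.join(sections)}\n"
            f"Already written:\n{written_so_far}\n\n"
            f"Now write the section titled '{title}':"
        )
        essay.append(f"{title}\n\n{section}")
    return "\n\n".join(essay)
```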
Summarization
Summarization is one of the most common and well-understood use cases of LLMs. In a sense, it is a special case of text generation–what isn’t, right?–but it has specific quirks that merit a separate discussion. In general, LLMs excel at summarizing. After all, that’s what they’ve been implicitly trained to do: construct a statistical model of the whole internet, which is, in a sense, a summary of all human knowledge.
However, summarization isn’t a trivial problem. Besides the usual concerns about the audience, complexity of the language, style, etc., you will probably also want to control which aspects of the original text the LLM focuses on. For example, rather than a simple compression of the text, you might want a summary that emphasizes the consequences of whatever is described in the original text, or that highlights and contrasts its benefits and limitations. This is a more abstract form of summary that produces novel value, beyond just being a shorter text.
There are important caveats with summarization, though. As we’ll see in Chapter 4, LLMs are very prone to hallucination, and the more you push the boundary between a plain summary and something closer to a critical analysis, the more the LLM will tend to ignore the original text and rely on its own pre-trained knowledge.
And just like before, the best way to counteract any form of rebellious generation is to be very intentional in your prompt, and make it as structured as necessary. For example, you can first ask the LLM to extract the key points, advantages, and limitations. Then, ask it to cluster the advantages and limitations according to whatever criteria you are interested in. And only then, ask it to provide a natural language summary of that semi-structured analysis. This gives you finer control over the end result and tends to reduce hallucinations, while also being easier to debug, since you can see the intermediate steps.
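A rough sketch of that three-step pipeline follows; the prompt wording is illustrative only, and the `complete()` helper is the hypothetical one from earlier.

```python
# Reuses the complete() helper sketched earlier in this chapter.
def structured_summary(text: str) -> str:
    # Step 1: extract a semi-structured analysis from the original text.
    key_points = complete(
        f"{text}\n\nList the key points of the text above, one per line:"
    )
    pros_cons = complete(
        f"{text}\n\nList the advantages and limitations mentioned in the text above, "
        "under the headers 'Advantages:' and 'Limitations:':"
    )
    # Step 2: cluster the items according to whatever criteria you care about.
    grouped = complete(
        f"{pros_cons}\n\nGroup the items above by the stakeholder they affect:"
    )
    # Step 3: only now ask for the natural language summary.
    return complete(
        f"Key points:\n{key_points}\n\nGrouped analysis:\n{grouped}\n\n"
        "Write a short summary of the above, contrasting benefits and limitations:"
    )
```

Because the intermediate outputs are visible, you can inspect exactly where the pipeline went wrong when the final summary looks off.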
Translation & style transfer
The text-to-text transformer architecture (the precursor and core component of modern language models) was originally designed for translation. By encoding the input sentence into a latent space of word correlations, detached from any specific language, and then decoding that sentence in a different vocabulary, these models achieved state-of-the-art translation results as early as 2017. The more general notion of style transfer is, deep down, a translation problem, but instead of between, say, English and French, between technical and plain language.
Modern LLMs retain this capability, and it will be more than enough for many practical translation tasks. However, beware that plenty of studies show LLM translations often fall short of professional translations in many linguistic respects. Translation is an art as much as, or more than, it is a science. It involves a deep knowledge of the cultural similarities and differences between readers of both languages, to correctly capture all the nuances that even a seemingly simple phrase can encode.
That being said, LLMs can help bridge the gap for non-native speakers in many domains where you don’t need–or can’t hope for–a professional translation. An example is inter-institutional communication, e.g., emails from co-workers who don’t speak your native language. In these cases, you must also be careful that nothing important is lost in translation, literally, but as long as everyone is aware of the limitations, this is one of the most practical use cases for LLMs.
Structured generation
Continuing with the topic of text generation capabilities, our next stop is generation from structured data. This is one specific area where LLMs mostly solve a long-standing problem in computer science: generating human-sounding explanations of dry, structured data.
Examples of this task are everywhere. You can generate a summary of your calendar for the day and pass it to a speech synthesis engine, so your personal assistant can send you an audio message every morning reminding you what you have to do, with cute linguistic cues like “Oh, and on the way to the office, remember to pick up your wife’s present.” We will see an example of this functionality in ?sec-planner.
Other examples include generating summaries of recent purchases for a banking app or product descriptions for an online store–basically, anywhere you’d have a dashboard full of numbers and stats, you can have an LLM generate a natural language description of what’s going on. You can pair this capability with the super skills LLMs have for question answering (at least when the answer is explicit in the context) to construct linguistic interfaces to any number of online services or apps. We will exploit this capability in many of our demos, starting with ?sec-shoping, where we build a bot to help you do your shopping.
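For instance, a morning briefing can be as simple as serializing the structured data into the prompt. The calendar entries below are made up for illustration, and `complete()` is again the hypothetical helper from earlier.

```python
import json

# Made-up structured data, e.g., pulled from a calendar API.
calendar = [
    {"time": "09:00", "event": "Stand-up meeting", "location": "Room 3B"},
    {"time": "13:00", "event": "Lunch with Laura"},
    {"time": "18:30", "event": "Pick up the birthday present on the way home"},
]

# Reuses the complete() helper sketched earlier in this chapter.
briefing = complete(
    "Here is today's calendar as JSON:\n"
    + json.dumps(calendar, indent=2)
    + "\n\nWrite a short, friendly, spoken-style summary of the day:"
)
print(briefing)
```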
Text classification
Text classification is the problem of categorizing a text fragment–be it a single sentence, a whole book, or anything in between–into one of a fixed set of categories. Examples range from categorizing comments as positive/neutral/negative, determining whether an email is spam, or detecting the tone and style of a text, to more specific tasks like extracting a user’s intent, e.g., when chatting with an airline bot.
To have an LLM correctly and robustly classify your text, it is often not enough to just instruct it and provide the intended categories. The LLM might come up with a new category you didn’t mention just because it makes sense in that context. And negative instructions, in general, don’t work very well. In fact, LLMs are lousy at interpreting negative instructions precisely because of the underlying statistical model. We will see in Section 4.2 why this is the case.
Instead of a dry, zero-shot instruction, you can improve the LLM’s classification capabilities substantially with a few examples (also called a k-shot instruction). It works even better if you select the examples dynamically based on the input text, a procedure that is eerily similar to k-NN classification but in the world of LLMs. Furthermore, many LLMs tend to be chatty by design, and will often fail to provide a single-word classification even if you instruct them to. You can mitigate this by using a structured response prompt, as seen in Chapter 6.
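A sketch of dynamic example selection might look like the following. Here `embed()` is a placeholder for whatever embedding model you use, and `complete()` is the hypothetical helper from earlier.

```python
import numpy as np

# Placeholder: swap in the embedding model of your choice; assumed to return
# unit-normalized vectors so the dot product behaves like cosine similarity.
def embed(text: str) -> np.ndarray:
    raise NotImplementedError

def pick_examples(query: str, labeled_pool: list[tuple[str, str]], k: int = 5):
    # k-NN in embedding space: keep the k labeled examples closest to the query.
    q = embed(query)
    ranked = sorted(labeled_pool, key=lambda pair: -float(q @ embed(pair[0])))
    return ranked[:k]

# Reuses the complete() helper sketched earlier in this chapter.
def classify(query: str, labeled_pool, labels=("Positive", "Negative", "Neutral")) -> str:
    shots = "\n".join(
        f"Text: {text}\nLabel: {label}"
        for text, label in pick_examples(query, labeled_pool)
    )
    prompt = (
        f"Classify each text as one of: {', '.join(labels)}.\n"
        f"{shots}\nText: {query}\nLabel:"
    )
    return complete(prompt).strip()
```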
Structured data extraction
A generalization of text classification is the problem of structured data extraction from natural language. A common example is extracting mentions of people, dates, and tasks from a text, such as the transcription of a video meeting. In the more general case, you can extract any entity-relation schema from natural text and build a structured representation of any domain.
But this capability goes much further. If you have any kind of structured input format–e.g., an API call for any online service–you can instruct (and probably k-shot) an LLM to produce the exact JSON-formatted input given a user query. This is often encapsulated by modern LLM providers in a functionality called “function calling”, which we will explore in Chapter 8.
As usual, the main caveat with structured generation is the potential for subtle hallucinations. In this case they come in two forms. The simplest one is when the LLM fails to produce the expected format by, e.g., missing a key in the JSON object or providing an invalid type. This type of error is what we call a syntactic hallucination and, although annoying, is often trivial to detect and correct, even if just by retrying the prompt.
The second form of hallucination is much more insidious: the response can be in the right format, with all values of the right type, yet they don’t match what’s in the text. The LLM hallucinated some values. The reason this is a huge problem is that detecting this form of semantic hallucination is as hard as solving hallucinations in general. As we’ll see in Section 4.5, we simply have no idea how to ensure an LLM always produces truthful responses, and it might be impossible even in principle.
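Syntactic hallucinations, at least, are easy to guard against with a validate-and-retry loop. Here is a sketch using only the standard library; the schema and prompt are illustrative, and `complete()` is the hypothetical helper from earlier.

```python
import json

REQUIRED_KEYS = {"person", "date", "task"}  # whatever your schema requires

# Reuses the complete() helper sketched earlier in this chapter.
def extract_task(text: str, max_retries: int = 3) -> dict:
    prompt = (
        f"{text}\n\nExtract the person, date, and task mentioned above as a JSON object "
        'with exactly the keys "person", "date", and "task". Reply with JSON only:'
    )
    for _ in range(max_retries):
        raw = complete(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # syntactic hallucination: malformed JSON, just retry
        if isinstance(data, dict) and REQUIRED_KEYS <= set(data):
            # The format is right; whether the values are faithful to the text
            # (no semantic hallucination) is a much harder question.
            return data
    raise ValueError("The model failed to produce valid JSON after several retries")
```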
Question answering
Question answering is one of the most surprising capabilities of sufficiently large language models. To some extent, question answering can be seen as a form of retrieval, when you’re asking about some facts explicitly mentioned in the training set. For example, if you ask “Who wrote The Iliad?”, it is not surprising, given what we know of LLMs, that a fine-tuned model can easily generate “Homer” as the most plausible response. The sentence “Homer wrote The Iliad” must have appeared thousands of times, in different forms, in the training set.
But modern LLMs can go way beyond simply retrieving the right answer to a trivia question. You can ask questions that involve a small set of reasoning steps, combining facts here and there to produce a response that is not, at least explicitly, in the training set. This is rather surprising because there is no explicit reasoning mechanism implemented in LLMs. All forms of reasoning that can be said to happen are an emergent consequence of learning to predict the next token, and that is at least very intriguing.
In any case, as we will argue in Section 4.2, the statistical modeling paradigm has some inherent limitations that restrict the types of reasoning LLMs can do, even in principle. This doesn’t mean that, in practice, for the types of problems you encounter, it can’t work. But in its most general form, long-term reasoning and planning is still an open problem in artificial intelligence, and I don’t think LLMs alone are equipped to solve it.
You can, however, plug LLMs into external tools to enhance their reasoning skills. One of the most fruitful research lines is to have them generate code to solve a problem, and then run it, effectively making LLMs Turing-complete, at least in principle, even if in practice they may fail to generate the right code. Which leads us to the next use case.
Code generation
Since LLMs are trained to autocomplete text, it may not be that surprising that, when fed enough training examples of code in several programming languages, they can generate small snippets of mostly correct code. However, for anyone who codes, it is evident that writing correct code is not as simple as concatenating plausible continuations. Programming languages have much stricter syntax rules that require, e.g., closing all parentheses and using explicit, consistent naming. Failing to place even a single semicolon in the right spot can render a program unusable.
For this reason, it is at least a bit surprising that LLMs can code. More surprising still is that they can not only autocomplete existing code but also generate code from scratch given natural language instructions. This is one of the most powerful capabilities in terms of integrating LLMs with external tools because code is, by definition, the most general type of external tool. There is nothing you can do in a computer that you can’t do with code.
The simplest use case in this domain is, of course, using LLMs as coding assistants embedded in developer tools like code editors. But this is just scratching the surface. As implied in the previous section, you can have an LLM generate code to solve a problem it would otherwise fail to answer correctly–e.g., perform some complex physics computations. Code generation allows an LLM to analyze large collections of data by computing statistics and running formulas. You can even have an LLM generate the code to output a chart, and voilà, you just taught the LLM to draw!
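The sketch below captures the basic loop: ask for code, run it, and use whatever it prints as the answer. Executing model-generated code is obviously risky, so in anything beyond a toy you would run it inside a sandbox. As before, `complete()` is the hypothetical helper from earlier.

```python
import contextlib
import io

# Reuses the complete() helper sketched earlier in this chapter.
def answer_with_code(question: str) -> str:
    # Ask the model for a self-contained script that prints the answer.
    code = complete(
        f"Write a self-contained Python script that prints the answer to:\n{question}\n"
        "Reply with code only, no explanations."
    )
    # Run it and capture stdout. In any real application, model-generated code
    # should run inside a sandbox (container, restricted interpreter, etc.).
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {"__name__": "__main__"})
    return buffer.getvalue().strip()
```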
Given how powerful code generation is as an LLM skill, we will spend a significant amount of time in ?sec-applications exploiting this capability in different demo applications. For example, we will make a data analysis bot in Chapter 17 that can answer questions about a dataset in natural language, but also output charts and tables.
Code explanation
Code explanation is the inverse problem of code generation: given some existing code, produce a natural language explanation or, more generally, answer questions about it. In principle, this is a form of question answering that involves all the caveats about complex reasoning we have already discussed. But it gets harder.
The problem is that the majority of the most interesting questions about code cannot be answered in general: they are undecidable, meaning no algorithm can exist that will always produce the right response. The most poignant example is the question “Does this function ever return?”. This is the well-known Halting problem, the most famous problem in computability theory, and the grandfather of all undecidability results. Similar questions, such as whether a variable is ever assigned or a method is ever called, are also undecidable in the general case.
And this is not just a theoretical issue. The Halting problem highlights one crucial aspect of computation: in the general case you cannot predict what an algorithm will do without running it. However, in practice, as anyone who codes knows, you can predict what lots of your code will do, if only because it is similar to code you’ve written before. And this is where LLMs shine: learning to extrapolate from patterns to novel specific instances, even if the general problem is unsolvable.
To top it all off, we can easily imagine an LLM that, when prompted with a question that seemingly cannot be answered from the code alone, could decide to run the code with some specific parameters and observe the results, drawing conclusions not from the syntax alone but from the execution logs. A debugging agent, if you will. In Chapter 20 we will play a little bit with this idea.
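A bare-bones version of that idea: run the code, capture whatever it prints or raises, and hand the logs back to the model along with the question. The function names are made up for illustration, and `complete()` is the hypothetical helper from earlier.

```python
import subprocess
import sys

# Reuses the complete() helper sketched earlier in this chapter.
def ask_about_code(code: str, question: str) -> str:
    # Run the snippet in a separate interpreter and capture its output and errors.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    logs = f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"
    # Let the model reason from the code *and* the execution logs.
    return complete(
        f"Code:\n{code}\n\nExecution logs:\n{logs}\n\nQuestion: {question}\nAnswer:"
    )
```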
Conclusions
These are the most important high-level tasks where LLMs can be deployed, but they span hundreds if not thousands of potential applications. Text classification, for example, covers a huge range of applications just by changing the classification target. In Part 3 of the book we will explore many of these tasks, and combinations thereof, in concrete applications.
One conclusion you can take from this chapter is that LLMs are some of the most versatile digital technologies we’ve ever invented. While we don’t know if artificial general intelligence is anywhere near, we’re definitely one step closer to general-purpose AI–models that can be easily adapted to any new domain without research or costly training procedures.
However, language modeling is not magic. We have already seen a glimpse of some of the fundamental limitations of this paradigm in the above discussion. In the next chapter, we will go deeper into how these models learn compared to humans, and what this difference entails in terms of their limitations.