2 How do LLMs work?
A generative language model, at its core, is just a statistical machine learning model trained to predict the continuation of a text sequence. Essentially, it’s a prediction machine. You input a text prefix, run it through the model, and receive the most likely next token–a token is more or less a word or component of a word.
Actually, you don’t really get just the next most likely token. The model provides a distribution across all possible tokens, giving you the probability of each one being the next continuation.
To use an LLM, we start with user input, like a query or text prefix, and run the model to get the next token. We append it to the sequence and repeat the whole process until we reach a maximum number of tokens or the model predicts a special STOP token.
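In Python, the whole generation loop is just a few lines. The toy model below returns random probabilities so that the sketch actually runs end to end; a real LLM and tokenizer would sit in its place.

```python
import random

# Toy stand-ins so the sketch runs; a real LLM and tokenizer would replace these.
VOCAB = ["the", "cat", "sat", "on", "mat", "<STOP>"]

def toy_model(tokens):
    """Pretend "model": returns a probability for every token in the vocabulary."""
    weights = [random.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_tokens, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = toy_model(tokens)                    # distribution over all possible tokens
        next_token = VOCAB[probs.index(max(probs))]  # pick one (here: the most likely)
        if next_token == "<STOP>":                   # the special STOP token ends generation
            break
        tokens.append(next_token)                    # append and repeat with the longer prefix
    return tokens

print(generate(["the", "cat"]))
```

The interesting part is how you pick the next token from the distribution, which is exactly what the following paragraphs discuss.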
There are choices to make in this process. Choosing only the most likely continuation can quickly lead to repetitive predictions. Instead, you can choose at random from, say, the 50 most likely tokens, weighted by their probability. This injects some variety into the generated text and is the reason why, for the same prompt, you can get different albeit semantically similar responses.
There are a few key parameters in this sampling process: the number of top tokens to consider (top-k), the cumulative probability threshold (top-p), and, most relevant of all, the temperature. The temperature affects the weights of the tokens you pick from for the continuation. If the temperature is 0, you always choose the most likely token. If it’s higher, the probabilities are smoothed out, making it more likely to choose less probable tokens. This increases the model’s variability.
That’s why some call high temperature “creative mode” and low temperature “precise mode.” It has nothing to do with actual precision or creativity, just how deterministic the response for a given prompt will be.
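Here is a minimal sketch of that sampling step, assuming the model has already handed us a probability for every token in the vocabulary. The function is illustrative, not taken from any particular library.

```python
import math
import random

def sample_next_token(probs, temperature=1.0, top_k=50):
    """Pick the index of the next token from a probability distribution."""
    if temperature == 0:
        # "Precise mode": always take the single most likely token (greedy decoding).
        return probs.index(max(probs))

    # Temperature rescales the distribution: values below 1 sharpen it,
    # values above 1 smooth it out, making less probable tokens easier to pick.
    scaled = [math.exp(math.log(p + 1e-12) / temperature) for p in probs]
    total = sum(scaled)
    rescaled = [s / total for s in scaled]

    # Top-k: keep only the k most likely tokens, then sample among them,
    # weighted by their rescaled probability. (Top-p would instead keep the
    # smallest set of tokens whose cumulative probability exceeds p.)
    top = sorted(range(len(rescaled)), key=lambda i: rescaled[i], reverse=True)[:top_k]
    weights = [rescaled[i] for i in top]
    return random.choices(top, weights=weights, k=1)[0]
```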
From this perspective, you can already see why some people say language models are “just autocomplete on steroids”. Certainly, that is the gist of their mechanics: you’re completing a text sequence by adding one token at a time until you decide to stop. However, this is just scratching the surface. There is so much more involved in getting these models to behave in a useful way, and we will talk about some of those aspects in the next section.
But before moving on, here is a key insight from this explanation of how LLMs work: A language model always performs a fixed amount of computation per token.
This means that, whatever limited form of “reasoning” can be said to happen in an LLM, the depth and complexity of that reasoning is directly proportional to the number of total tokens the model processes. This implies two things:
If the input prompt is larger, the model will perform more computation before starting to compute its answer. This is part of the reason why more detailed prompts are better. But crucially, if the output is larger, the model is also doing more computation.
This is why techniques like chain-of-thought–and basically anything that makes a model “talk more”–tend to improve its performance on some tasks. The model has more compute available to do whatever reasoning it can do. If you ask a model a quick question and instruct it to give a one-word answer, the amount of compute spent producing that answer is proportional to just the input size. But if you ask the model to reason step by step before giving the final answer, there is a higher chance you’ll get a better answer, simply by virtue of spending more computation.
At the risk of anthropomorphizing too much, I like to summarize this insight as follows: LLMs only think out loud. If you want them to “think” better, get them to “talk” more.
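To make this concrete, here is what the two prompting styles look like side by side. The ask function is a hypothetical stand-in for whatever API or library you use to call a model; only the prompts matter.

```python
# Hypothetical helper: send a prompt to some LLM and return its text reply.
def ask(prompt: str) -> str:
    ...

question = "A store sells pens in packs of 12. How many packs do I need for 150 pens?"

# One-word answer: almost all the computation is proportional to the input alone.
short_answer = ask(question + "\nAnswer with a single number only.")

# Chain-of-thought: asking for intermediate steps forces more output tokens,
# and therefore more computation, before the final answer is produced.
reasoned_answer = ask(question + "\nThink step by step, then give the final answer.")
```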
So, this is how a language model works from a user perspective. Let’s see how you build one.
How to make an LLM
So, how do you make a language model? There are three main steps.
Pre-training
The first step is called self-supervised pretraining. In this step, you take a raw transformer architecture with uninitialized weights and train it on a massive amount of data to predict the next token. You use a large corpus of data, such as news, internet blog posts, articles, and books, and train the model on trillions of words.
The simplest training method is next token prediction. You show the model a random text and ask it what the next token is. Take a random substring from the dataset, remove the last token, show the prefix to the model, and ask for likely continuations. Compute a loss function to determine how mistaken the model was in its predictions and adjust it slightly to improve future predictions.
So far this is a standard machine learning approach. We call it self-supervised learning because the targets are not given by humans, but chosen automatically from the input. But deep down, this is just supervised learning at scale.
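For the curious, a single training step of next-token prediction looks roughly like this in PyTorch. The model here is any network that maps a batch of token ids to per-position vocabulary logits; this is a sketch of the idea, not a specific codebase.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One step of next-token prediction.

    `batch` is a tensor of token ids with shape (batch_size, sequence_length);
    `model(inputs)` returns logits of shape (batch_size, sequence_length - 1, vocab_size).
    """
    inputs = batch[:, :-1]   # every prefix of the sequence
    targets = batch[:, 1:]   # the "label" at each position is simply the next token

    logits = model(inputs)
    # Cross-entropy between the predicted distributions and the actual next tokens.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    loss.backward()          # nudge the weights to make better predictions next time
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```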
Now, that being said, scaling this training process to billions of parameters and trillions of tokens presents a massive engineering challenge. No single supercomputer in the world can handle training GPT-4 from scratch, so you must resort to distributed systems, splitting the model across hundreds or thousands of GPUs for extended periods of time and synchronizing its different parts across many machines. In other words, while the conceptual part of training an LLM is pretty straightforward, actually building something like GPT-4 is nothing short of an engineering feat.
Once pre-training is completed, you have what is called a “base model”, a language model that can continue any sentence in a way that closely resembles existing text. This model is already extremely powerful. Give it any prefix of text with any content whatsoever and the model will complete it with a mostly coherent continuation. It’s really autocompletion on steroids!
However, these base models, as powerful as they are, are still very hard to prompt. Crucially, they do not understand precise instructions, mostly because their training data doesn’t contain many examples of instructions. They are just stochastic parrots, in a sense. The next step is to tame them.
Instruction tuning
At this point the LLM already has all the knowledge in the world somewhere hidden in its weights–metaphorically speaking–but it is very hard to locate any concrete piece of knowledge. You must fiddle with rephrasing your questions into just the right prompts to find a pattern that matches what the model has seen.
The way to solve this problem is to include another training phase, but this time much shorter and focused on a very well-curated dataset of instructions and responses. Here, the quality is crucial, much more than the quantity. You won’t teach the model anything new, you will just tune it to expect instruction-like inputs and produce answer-like outputs.
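To give a sense of what this data looks like, an instruction-tuning example is usually just an instruction/response pair rendered into a fixed template before the same next-token training is applied. The template below is made up for illustration; every model family uses its own.

```python
# A single, made-up instruction-tuning example.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained on massive text corpora ...",
    "response": "Language models learn to predict text from very large corpora.",
}

# Render it into a fixed template. The model is then trained, exactly as before,
# to predict the tokens of the response given everything that comes before it.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)
print(TEMPLATE.format(**example))
```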
Once finished, you have what’s called an instruction-tuned model. These models are much more robust and easy to prompt compared to the base model, and this is the point where most open-source models end. But this is not the end of the story.
Instruction-tuned models are still not suitable for public-facing products for one crucial reason: they can be coerced into answering anything at all, including producing biased, discriminatory, or hateful speech, and giving instructions on how to build bombs and deadly poisons.
Since base models are trained on the whole Internet, they are full of all the good and bad you can read online–some effort is put into cleaning the pretraining dataset, but it’s never enough. We must teach the model that some questions are better left unanswered.
Preference tuning
The final step is to fine-tune the model to produce answers that are more closely aligned with user preferences. This is primarily used to avoid biased or hateful speech and to reject questions that the developers training the model deem unethical. However, it also has the effect of making the model more polite in general, if you wish.
The way this process works is to turn the problem from supervised learning into the realm of reinforcement learning. In short, the main difference is that, while in supervised learning we give the model the correct answers (as in instruction tuning), in reinforcement learning we don’t have access to ground-truth answers.
Instead, we use an evaluator that ranks different answers provided by the LLM, and a feedback loop that teaches the LLM to approximate that ranking. In its original inception, this process was performed with a human evaluator, thus giving rise to the term “reinforcement learning from human feedback”, but since including humans makes the process slower and more expensive, smaller organizations have turned to using other models as evaluators.
For example, if you have one strong model, like GPT-4, you can use it to rank responses from a smaller, still-in-training model. This is one example of a more general concept in machine learning called “knowledge distillation”, in which you attempt to compress the knowledge of a larger model into a smaller one, gaining in efficiency without sacrificing too much performance.
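One common way to implement the evaluator’s feedback, whether it comes from humans or from a stronger model, is to collect pairs of answers where one was preferred over the other and train a reward model to score them accordingly. A minimal PyTorch sketch of that pairwise objective, with a hypothetical reward_model, looks like this:

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Pairwise preference loss: push the score of the preferred answer
    above the score of the rejected one."""
    r_chosen = reward_model(prompt, chosen)      # scalar score for the preferred answer
    r_rejected = reward_model(prompt, rejected)  # scalar score for the other answer
    # -log(sigmoid(difference)): small when the chosen answer scores higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then plays the role of the evaluator: it scores the answers the LLM produces, and the reinforcement learning loop adjusts the LLM toward answers that score higher.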
And finally, we now have something that works like GPT-4. The process was long and expensive: a massive pretraining followed by a carefully curated instruction tuning and a human-backed preference tuning. This is the reason why so few organizations have the resources to train a state-of-the-art large language model.
Beyond pure language modeling
Once a model is deployed into production, the most basic application you can implement is a ChatGPT-clone: a chat interface where you can interact with a powerful model and get it to work for you. But this is far from the limit of what current models can do.
With careful prompting and some augmentation techniques, you can integrate an LLM into more traditional applications to work either as a powerful natural language frontend, or as a backend tool for language understanding. This is where LLMs can really shine, beyond the basic chatbot application.
You have to be careful, though. There are many common pitfalls to using these models, including some inherent limitations like the dreaded hallucinations, which, although they can be mitigated to a certain extent, are probably impossible to solve altogether without a paradigm shift, as we’ll see in the next chapter.
However, despite their many limitations, large language models are one of the most transformative computational tools we’ve ever invented. Learning to harness their power will supercharge your skills, in whatever field you are working.