4 Language Modeling

By the end of this chapter you can narrate the loop step by step: how your sentence becomes numbers, how those numbers become more numbers, how one of those numbers gets picked, how that pick becomes the next word, and what you are paying for every spin of the cycle. You can also say, for every behaviour you notice in a chatbot — its manners, its apparent memory, its willingness to think before it answers, the price tag on your invoice — which training step put it there. Nothing in what follows is magic. The astonishing part is that it isn’t.

Language modeling, the core mechanic

In machine learning, language modelling is the task of assigning a probability to a sequence of words. The sentence “the sun rises in the east and sets in the west” is, statistically, very likely to be uttered by some English speaker somewhere; the sentence “purple banister axiom rhinoceros plough” is not. A language model is a function that takes a piece of text and returns a number between zero and one, and good language models give common sentences high numbers and rare ones low numbers. The practical version of this task is even narrower. Given a piece of text, predict the next unit of text. Iterate. That is generation.

The unit, in practice, is not the word but the token. Tokens are short stretches of characters — sometimes a full word, often a piece of one, occasionally a single character. A modern model has a vocabulary of fifty thousand to a few hundred thousand tokens, and any text you give it can be chopped into a sequence drawn from that vocabulary. We will come back to what a token actually is in §4; for now, just hold the idea that the model speaks token-by-token, not word-by-word.

The model itself, mechanically, is a function that takes a sequence of tokens and returns a probability distribution over the next one. Feed it “the sun rises in the east and sets in the” and the function emits a vector with one number per token in the vocabulary, summing to one. The number opposite “west” will be high. The number opposite “purple” will be tiny. To produce text, the model samples one token from that distribution, appends it to the sequence, and runs again. The whole conversation you ever have with a chatbot is exactly this loop, billions of arithmetic operations per token, run thousands of times until the model decides to stop. That is why a working description of an LLM is advanced autocomplete. The framing is correct, and the astonishing part is what falls out of it.
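The loop described above can be sketched in a few lines. This is a minimal illustration, not how any real system is built: `TOY_MODEL` is a hypothetical lookup table that only looks at the last token, where a real LLM computes the distribution from the whole sequence with a neural network. The names `next_token_distribution`, `generate`, and the `<stop>` token are all assumptions made for the example.

```python
import random

# Hypothetical toy "model": maps the last token to a probability
# distribution over the next token. A real LLM conditions on the
# entire sequence, not just the final token.
TOY_MODEL = {
    "the": {"sun": 0.5, "east": 0.25, "west": 0.25},
    "sun": {"rises": 0.6, "sets": 0.4},
    "rises": {"in": 1.0},
    "sets": {"in": 1.0},
    "in": {"the": 1.0},
    "east": {"<stop>": 1.0},
    "west": {"<stop>": 1.0},
}


def next_token_distribution(tokens):
    """Return {token: probability} for the token that follows `tokens`."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt, max_new_tokens=20, seed=0):
    """Sample one token, append it, run again, until <stop> or the cap."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice == "<stop>":
            break
        tokens.append(choice)
    return tokens


print(" ".join(generate(["the"])))
```

Everything else in the chapter is about what replaces the lookup table and how the weights in it were obtained; the outer loop stays this simple.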

Sampling — the step where one token is chosen from the probability vector — is not a single fixed rule. You can pick the most likely token every time. You can pick proportionally to the probabilities and accept some randomness. You can squeeze the distribution toward its peaks or smear it toward its tails. A parameter called the temperature controls how aggressively you do that, and turning it up gives you a model that surprises you more often and is wrong more often in the same breath. That is the one-sentence version. The full treatment, including why the same simple sampling step is what makes reasoning models work, is in §6.

What we have so far is the smallest possible machine: a fixed function, one cycle, one token at a time. Why this machine, run at enough scale and trained the right way, produces something that feels like a fluent assistant with manners, memory, and the ability to think before it speaks is the rest of the chapter.

The path to context — the mechanics of attention

For the better part of half a century, getting a machine to understand context was the open problem of natural-language processing. Three generations of techniques tried to solve it, each one stumbling on a structural flaw that the next one fixed. This section walks the three stages, not as a history lesson — the intro gave you that — but as a chain of architectural reasons.

The first stage was the n-gram. To estimate the probability of the next word, count how often it has followed the previous one, or the previous two, or the previous three, in a giant pile of text. The probability of “York” following “New” is high because that bigram appears constantly in newspapers. The probability of “York” following “the cat sat on the” is harder to estimate from counts alone, because that specific six-word run barely appears in any corpus, even an internet-sized one. This is the curse of dimensionality: the longer the window of context you try to track, the more possible windows there are, and the sparser your counts become. N-gram models capped out at four or five words of memory. Past that, the data ran out.
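A counting model of this kind fits in a dozen lines, which makes the sparsity problem easy to see for yourself. The sketch below is illustrative only; the tiny corpus and the function names `train_ngram` and `next_word_probability` are made up for the example, and real n-gram systems add smoothing that is omitted here.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def next_word_probability(counts, context, word):
    """Relative-frequency estimate; 0.0 if the context was never seen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())


corpus = "i flew to new york . she moved to new york . he likes new shoes .".split()
bigram = train_ngram(corpus, 2)
five_gram = train_ngram(corpus, 5)

# Short context: plenty of evidence.
print(next_word_probability(bigram, ["new"], "york"))
# Long context: this four-word window never occurred, so the count is zero.
print(next_word_probability(five_gram, ["cat", "sat", "on", "the"], "york"))
```

The bigram estimate is usable after seventeen words of text. The five-gram estimate is zero for almost every context you could ask about, and making the corpus a million times larger only moves the problem a word or two further out.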

The second stage was the word embedding. Instead of treating each word as an atomic symbol with no internal structure, represent it as a vector of, say, three hundred real numbers. Train the vectors so that words appearing in similar contexts end up close to each other in this high-dimensional space. The underlying claim, which the field calls the distributional hypothesis, is that a word’s meaning is the company it keeps. The result, in models like Word2Vec from 2013, was that you could now do arithmetic on words. The vector for “king” minus the vector for “man” plus the vector for “woman” lands remarkably close to the vector for “queen”. The geometry, for one fleeting moment, looked like semantics.

But embeddings had their own flaw. Each word got one vector, no matter where it appeared. The word “bank” in “I sat by the river bank” and the word “bank” in “I deposited the cheque at the bank” mapped to the same three hundred numbers. The representation could not see the context. To know what a word means in this sentence, the representation had to change depending on its neighbours. Embeddings were static. That was the wall.

The third stage was the transformer, introduced in 2017 by eight researchers at Google in a paper with one of the boldest titles in computer-science publishing: Attention Is All You Need.¹ The mechanism the paper introduced, self-attention, is the answer to the static-vector problem. For each token in the input, the model looks at every other token in the input and computes a weight saying how much that other token should contribute to the current one’s new representation. Then it mixes them, weighted, into a new vector. The result is that the representation of bank in a sentence about rivers is no longer the same vector as the representation of bank in a sentence about cheques. The vector has been recomputed in context.
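To see the recomputation happen, here is a deliberately stripped-down sketch of single-head scaled dot-product attention. It makes two simplifying assumptions that a real transformer does not: the queries, keys, and values are the input vectors themselves (no learned projection matrices), and the two-dimensional embeddings in `EMB` are invented by hand so that one axis loosely means "water" and the other "money".

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product attention with Q = K = V = vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # How much should each token contribute to this token's new vector?
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Weighted mix of all the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        out.append(mixed)
    return out


# Hypothetical static embeddings: axis 0 ~ "water", axis 1 ~ "money".
EMB = {"river": [2.0, 0.0], "cheque": [0.0, 2.0], "bank": [1.0, 1.0]}

bank_by_river = self_attention([EMB["river"], EMB["bank"]])[1]
bank_by_cheque = self_attention([EMB["cheque"], EMB["bank"]])[1]
print(bank_by_river, bank_by_cheque)
```

The static vector for “bank” is identical in both calls. The output vector is not: next to “river” it leans toward the water axis, next to “cheque” toward the money axis.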

Self-attention is precisely the structural piece chapter 3 said was missing. It is a sequence model without recurrence: every token attends to every other token in parallel, in one pass, with no hidden state being threaded along a chain. Once you can do that, you can stack the operation in many layers and train the whole tower with the same blame-propagation algorithm from the last chapter. What matters from here on is that the model’s representation of a word changes with its neighbours, and that this is the structural prerequisite for everything that follows. Without it, none of the manners-or-memory tricks below would work.

The rise of LLMs — four stages, not three

A transformer is an architecture. A large language model is what you get when you train an enormous transformer on an enormous pile of text, then keep working on the result through several more training steps until it stops behaving like an autocomplete and starts behaving like an assistant. Until late 2024, that recipe had three steps. There are now four. They stack on each other in a fixed order, and each one accounts for a different chunk of how a chatbot feels in your hands.

Self-supervised pretraining

The first step is pretraining. Take a freshly initialised transformer, point it at a corpus of trillions of tokens scraped from the public web, and train it on one and only one objective: predict the next token. This is the same self-supervised paradigm chapter 2 introduced, now at internet scale. No human labels anything. The data labels itself: for every sentence in the corpus, every prefix is an input, every following token is the answer. There are no manual annotations to bottleneck the work. The bottleneck is electricity.
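The phrase "the data labels itself" is quite literal. A minimal sketch, with the helper names `next_token_examples` and `cross_entropy` chosen for this example: one sentence yields one training example per position, and the loss on each example is the negative log of the probability the model gave to the token that actually came next.

```python
import math


def next_token_examples(tokens):
    """Turn one token sequence into (prefix, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted, target):
    """Loss for one example: -log of the probability given to the true token."""
    return -math.log(predicted[target])


sentence = ["the", "sun", "rises", "in", "the", "east"]
examples = next_token_examples(sentence)
for prefix, target in examples:
    print(prefix, "->", target)

# A model that is unsure between two options pays log(2) on this example.
print(cross_entropy({"sun": 0.5, "moon": 0.5}, "sun"))
```

Pretraining is this, repeated over trillions of tokens, with gradient descent adjusting the weights to push the loss down.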

Pretraining is where the model absorbs what is in the data. Grammar, factual knowledge, the texture of legal prose, the syntax of every popular programming language, the cadence of children’s stories, the conventions of academic writing. Nothing in this step teaches the model what to do with that knowledge. The output of pretraining is the base model — a fluent completion engine with no instinct for being helpful. Ask a raw base model a question and it might answer, or it might generate a list of related questions, or it might continue the prompt as if it were the next paragraph of a Reddit thread, because all of those are statistically plausible continuations of a question on the open internet.²

The empirical result from 2020 onward, published as the scaling laws, is that the loss of a pretrained model decreases predictably as you scale data, parameters, and compute together.³ In 2022, the Chinchilla paper sharpened the recipe by showing that most large models had been over-parameterised and under-trained, and that for a fixed compute budget you get more capability per dollar from a smaller model trained on more tokens, with parameters and training tokens scaled up in roughly equal proportion.⁴ Empiricism, the part of the field that had historically run on guesswork, now had something resembling physics.
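The Chinchilla result is often compressed into two rules of thumb, neither of which is stated in the text above, so treat both as assumptions of this sketch: training compute is roughly 6 floating-point operations per parameter per token, and compute-optimal training uses roughly 20 tokens per parameter. Under those two approximations, a compute budget fixes both the model size and the dataset size.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # assumed approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20           # assumed rule of thumb: D ~ 20 * N


def training_flops(num_params, num_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_PER_TOKEN * num_params * num_tokens


def compute_optimal_split(flops_budget):
    """Return (params, tokens) that spend the budget at ~20 tokens per parameter."""
    num_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    return num_params, TOKENS_PER_PARAM * num_params


# Roughly the scale of Chinchilla itself: 70B parameters, 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal_split(budget)
print(f"{budget:.2e} FLOPs -> {params:.2e} params, {tokens:.2e} tokens")
```

The constants are rough and have been revisited since; the point of the sketch is only that model size and token count are tied together once the budget is fixed.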

Instruction tuning

Pretraining gives the model knowledge. Instruction tuning gives it manners. You take the base model and continue training it on a small, carefully curated dataset of pairs: an instruction, and the kind of answer you wish the model produced. “Summarise this article.” → “Here is a three-sentence summary…” A few tens of thousands of such pairs, written by humans who care, are enough to teach the model that when prompted, it should answer rather than free-associate. The published evidence is that quality dominates quantity at this stage: a thousand good examples outperform a hundred thousand mediocre ones. Instruction tuning teaches a skill, not a fact.

Preference optimization

An instruction-tuned model is usable. It is not necessarily safe, not necessarily honest, and not necessarily good. The third step closes that gap. The canonical recipe, reinforcement learning from human feedback or RLHF, is the trick OpenAI made famous: have the model generate several candidate answers to the same prompt, have humans rank them best to worst, train a separate small reward model to predict the human rankings, then use reinforcement learning to nudge the language model toward higher-ranked answers.⁵ A small amount of human preference data, multiplied by RL, turns a competent autocomplete into a useful assistant.
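The reward model at the centre of this recipe is typically trained on pairs: for the same prompt, one answer the human preferred and one they rejected. A common formulation, sketched here under that assumption, is the pairwise logistic loss, which is small when the reward model scores the preferred answer higher and large when it scores it lower. The function name and the example scores are illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model agrees with the human ranking: low loss.
print(pairwise_preference_loss(2.0, 0.0))
# The reward model cannot tell the answers apart: log(2).
print(pairwise_preference_loss(0.0, 0.0))
# The reward model disagrees with the human ranking: high loss.
print(pairwise_preference_loss(0.0, 2.0))
```

Once the reward model is trained this way, the reinforcement-learning step uses its scores in place of a human sitting in the loop.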

RLHF is the canonical recipe, but a family of cousins now does the same job by different routes: direct preference optimisation dispenses with the separate reward model and trains directly on the ranked pairs; RLAIF and constitutional methods replace the human ranker with another language model judging against a written rubric.⁶ ⁷ The recipes differ; the trick is the same. Turn a small amount of preference signal into a behaviour change. The reason this whole step matters beyond the engineering is that it is also a live instance of the alignment problem: how do you get a learning system to do what you actually want, when what you want is something a reward function can only approximate? Part III takes that question apart in earnest.

Reasoning RL

The newest layer of the stack is also the strangest. The intro told you the story: in September 2024 OpenAI shipped o1, the first commercial reasoning model. In January 2025, DeepSeek-R1 reproduced the trick in the open with full technical details.⁸ By 2026 every major lab has its own reasoning model and non-reasoning models are the budget tier. What changed?

Mechanically, the change is small. Start from a model that has already been pretrained, instruction-tuned, and preference-optimised. Generate a question with a verifiable answer — a math problem, a coding problem, a logic puzzle, anything where the correctness of the final answer can be checked by program. Have the model generate a long answer, including all its intermediate reasoning steps, before producing the final result. Score the output by whether the final answer is right. Use reinforcement learning to push the model toward generating the kind of intermediate sequences that lead to right answers more often. Repeat at scale.
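The scoring step is the part that is easy to show concretely. In the sketch below, the convention that a completion ends with a line starting `Answer:` is hypothetical, as are the function names; real pipelines use whatever answer format and checker suit the task. The reinforcement-learning update itself is omitted.

```python
import re


def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion, expected):
    """1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if extract_final_answer(completion) == expected else 0.0


def score_samples(completions, expected):
    """Reward every sampled completion; an RL step would reinforce the 1.0s."""
    return [verifiable_reward(c, expected) for c in completions]


samples = [
    "17 + 25: first 17 + 20 = 37, then 37 + 5 = 42.\nAnswer: 42",
    "17 + 25 is about 40.\nAnswer: 40",
    "I am not sure.",
]
print(score_samples(samples, "42"))
```

Notice that the reward never looks at the intermediate steps. Only the final answer is checked, and the model is left to discover which kinds of scratch work tend to produce it.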

The result is a model that, when faced with a hard question, spends tokens thinking before it speaks. It writes out its reasoning, sometimes for thousands of tokens, before producing the final reply. Section 6 is where we unpack why this works at all and what it actually changes about the machine. For now, the structural point is that the four-step stack — pretrain, instruction-tune, preference-optimise, reason-RL — is what turns a transformer into the kind of object you actually talk to. Each step accounts for a distinct property of the assistant in front of you. You can now point at any conversational behaviour and ask which step put it there.

Tokens and the context window

The model does not see characters. It sees integers. Before any of the machinery in the previous sections can run, the text you typed has to be broken into a sequence of tokens drawn from a fixed vocabulary, and each token has to be mapped to its integer ID. The component that does this is called a tokenizer, and it sits underneath everything else.

The dominant scheme, called byte-pair encoding or BPE, is a greedy compression trick. Start with single characters as your vocabulary. Look at a giant corpus. Find the pair of adjacent symbols that occurs most often, glue them into a new symbol, add it to the vocabulary, and repeat — fifty thousand times.⁹ What you end up with is a vocabulary where common words like the, and, machine are single tokens, less common words like tokenizer might be two tokens (token and izer), and rare words like antidisestablishmentarianism break into half a dozen pieces. Whitespace and punctuation are folded in. Emoji become tokens. Programming-language operators become tokens. The whole written record of the internet, including the parts that are not English, gets squashed into one finite alphabet of subword pieces.
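The merge loop is short enough to write out. This is a toy sketch of the training procedure described above, run on a hand-made table of word counts rather than a corpus; the function names are chosen for the example, and production tokenizers add details (byte-level fallbacks, whitespace handling, special tokens) that are left out.

```python
from collections import Counter


def most_frequent_pair(words):
    """Find the adjacent symbol pair with the highest total count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Glue every occurrence of `pair` into a single new symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(word_counts, 4)
print(merges)
print(list(words))
```

After a handful of merges, the frequent ending “est” and the frequent word “low” have each become one symbol, while the rarer pieces are still spelled out character by character. Run the same loop tens of thousands of times on a real corpus and you have a vocabulary.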

The practical consequence is that characters per token is a slippery rule of thumb. English averages about four characters per token. Code averages closer to three. Languages like Mandarin or Arabic, where the tokenizer was trained mostly on English, can swell to one character per token or worse. This matters because cost scales with tokens. Every API I have ever used prices its services per token, both input and output. Talking to a model in a language poorly served by its tokenizer is, mechanically, more expensive than talking to it in English. That is a fact about engineering, not about the languages.
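The arithmetic is worth doing once by hand. The sketch below uses the characters-per-token averages from the paragraph above as rough inputs; the per-million-token prices are hypothetical placeholders, not any vendor's actual rates.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token count from character count; a real tokenizer will differ."""
    return max(1, round(len(text) / chars_per_token))


def estimate_cost(input_tokens, output_tokens, usd_per_million_in, usd_per_million_out):
    """Price of one call given per-million-token rates for input and output."""
    return (
        input_tokens * usd_per_million_in + output_tokens * usd_per_million_out
    ) / 1_000_000


message = "x" * 4000  # a 4,000-character message

# Hypothetical prices: 3 USD per million input tokens, 15 per million output.
well_served = estimate_cost(estimate_tokens(message, 4.0), 500, 3.0, 15.0)
poorly_served = estimate_cost(estimate_tokens(message, 1.0), 500, 3.0, 15.0)
print(well_served, poorly_served)
```

The same 4,000 characters cost four times as many input tokens at one character per token as at four, and the invoice follows the tokens, not the characters.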

The context window is the next thing to understand. The transformer cannot look at an arbitrary amount of text in one pass. It has a hard limit, set at training time, on the number of tokens it can process at once. Early models had context windows of two thousand tokens. The state of the art in 2026 is anywhere from a few hundred thousand to a few million, depending on the vendor. Inside that window, every token can attend to every other token. Outside it, the model has no idea anything ever existed.

The mechanism by which context windows get longer is not free. Self-attention, in its original form, scales quadratically with the length of the input: double the window, quadruple the computation. A long-context model is a model where engineers have spent enormous effort on tricks — rotary positional embeddings, sparse attention patterns, careful kernel implementations — to make the scaling more manageable without giving up the property that every token can in principle see every other one. The names of those tricks are interesting; the implementation is for engineers; the takeaway for you is this: context is finite, and longer context is more expensive.
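The quadratic claim is easy to check with a back-of-the-envelope count. In full self-attention every token scores itself against every token, so the number of scores in one layer is the context length squared. The function name below is made up for the illustration, and the count ignores everything else a layer does.

```python
def attention_score_count(context_length):
    """Number of query-key scores one full self-attention layer computes."""
    return context_length * context_length


for n in (2_000, 4_000, 8_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_score_count(n):,} scores per layer")
```

Going from two thousand tokens to a million multiplies the window by five hundred and the score count by a quarter of a million, which is why long context took years of engineering rather than a configuration change.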

Everything that looks like memory in a chatbot, and a large fraction of what makes one model cost more than another to operate, falls out of how long the window is and how the window is filled.

The conversation buffer

Here is what is actually sent to the model on the second turn of a chat. The whole conversation. From scratch. Every time.

Imagine you ask a chatbot a question; it answers; you ask a follow-up. On the surface, that looks like a dialogue with continuity. Underneath, it is nothing of the kind. What gets sent to the model on the follow-up is the concatenation of: an invisible system prompt set by whoever built the product, followed by your first message, followed by the model’s first reply, followed by your follow-up. The whole thing, every token of it, packed into the context window. The model reads it as one long document, produces the next token, then the next, and so on until it generates a stop signal. Then it stops. The next time you say something, the whole document is reassembled — system prompt, every turn so far, your new message — and shipped again.

The model has no persistent memory between turns. There is no state being held inside it. There is no internal narrative being updated. The chatbot’s sense of continuity is, mechanically, an illusion maintained by re-sending the entire conversation on every turn. It feels like memory because it is functionally like memory, but the mechanism is brute. The whole transcript is the memory, and the price of a turn scales with the length of the transcript.

This has consequences you can derive cleanly from the mechanism. Long conversations slow down and get expensive, because the input to the model is now thousands of tokens of history before your message even starts. Conversations eventually hit the context window’s ceiling, at which point either the chatbot starts dropping early turns silently or the application has to summarise the history into something shorter. Anything you told the model in turn three has to still be in the buffer in turn thirty for it to remember. The moment that turn slides off the end of the window, it is gone, and the model behaves as if the exchange never happened.
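Both consequences, the re-sending and the silent forgetting, show up in a few lines of application code. The sketch below is one plausible way a chat application might assemble its input, not a description of any particular product. `count_tokens` is a crude stand-in for a real tokenizer, and dropping the oldest turns first is only one of the policies mentioned above.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def build_prompt(system_prompt, turns, new_message, max_tokens):
    """Assemble what is sent on this turn, dropping the oldest turns if needed."""
    budget = max_tokens - count_tokens(system_prompt) - count_tokens(new_message)
    kept = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break  # this turn and everything older falls out of the window
        kept.append(turn)
        budget -= cost
    return [system_prompt] + kept[::-1] + [new_message]


history = [
    "user: my name is Ada",
    "bot: hello Ada",
    "user: what is two plus two",
    "bot: four",
]
question = "user: what is my name"

roomy = build_prompt("be helpful", history, question, max_tokens=100)
tight = build_prompt("be helpful", history, question, max_tokens=17)
print(roomy)
print(tight)
```

With a roomy window the model receives the turn where the name was given and can answer. With a tight one that turn has been dropped before the model ever sees the question, and nothing inside the model knows it was ever there.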

This is also why “how do you give a language model a memory that survives the conversation?” turns out to be the central question once you wrap an agent around it. The answer, in one word, is retrieval: store things outside the model, and put the relevant ones back into the window when you need them. The window is the only memory there is.

Sampling and the reasoning trace

We owe ourselves one piece of unfinished business from §1. The model emits a probability distribution over the vocabulary at every step. It does not emit a word. The step that turns a distribution into a word is sampling, and how you do it shapes how the model feels in your hands.

The simplest rule is greedy decoding: pick the most likely token, every time. Greedy decoding gives you a deterministic, repeatable output that is also often dull and occasionally pathological: the model can get stuck in repetitive loops, because the most likely token at step n is sometimes the one that makes the most likely token at step n+1 a repeat. To escape that, you sample stochastically, with two main controls. Temperature rescales the probabilities before sampling: low temperatures (say 0.2) sharpen the distribution toward its peak, making the output more conservative and more repeatable; a temperature of exactly 1.0 leaves the model’s distribution unchanged; temperatures above 1.0 flatten it, making the output more varied and more surprising. Top-p sampling, sometimes called nucleus sampling, keeps only the smallest set of tokens whose probabilities sum to some threshold and samples from that set. Combine the two and you have control over the trade-off between predictability and creativity.
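All three rules fit in one small function. This is a minimal sketch in plain Python: the model's raw scores (logits) are divided by the temperature before the softmax, the nucleus filter keeps the top of the resulting distribution, and a temperature of zero is treated as a request for greedy decoding. The function names and the example logits are illustrative.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax of logits / temperature: below 1 sharpens, above 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

Printing the distribution at a few temperatures is the quickest way to build intuition: at 0.2 nearly all the probability sits on the top token, and at 2.0 the three options are much closer to even.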

The failure modes are easy to recognise. Too cold, and the model writes plodding, slightly robotic prose that repeats itself within a paragraph. Too hot, and the model goes off the rails: it picks an unlikely token, then has to keep going from a sentence that no longer makes sense, and the whole reply spirals into nonsense. The right setting depends on the task. Factual answers want low temperatures. Brainstorming wants high ones. Code generation wants something in between with structured grammar enforced on top.

There is a deeper observation about sampling that turns out to matter enormously. A language model performs a fixed amount of computation per token. The matrix multiplications inside the network cost essentially the same for every token at a given context length, regardless of how hard the underlying question is. Therefore the only way for a model to spend more compute on a hard problem is to produce more tokens. When you ask it to think step by step, you are not giving the model a richer cognitive process. You are giving it more arithmetic. The intermediate tokens it generates are the thinking; the only place that thinking can happen is in the visible output stream.

This is what made chain-of-thought prompting work in 2022.¹⁰ On math problems, on logic problems, on multi-step reasoning of any kind, prompting the model to write out its working before producing the answer measurably improved its accuracy. Nothing about the model changed. The only thing that changed was that the model had been given more tokens in which to compute. An LLM only thinks out loud, and step-by-step prompting was the discovery that out-loud thinking is cheap and effective.

Reasoning models industrialise that discovery. Step 4 of the training stack, which we introduced in §3, is exactly the move that bakes out-loud thinking into the model’s default behaviour. The model is rewarded, during RL, for producing the kind of long intermediate reasoning sequence that leads to a verifiably correct final answer. By the end of training, when faced with a hard problem, it generates that scratch work without being asked. The cost shows up at inference time. A reasoning model can spend tens of seconds and tens of thousands of tokens deliberating before it answers, and on hard problems it really is more accurate than its non-reasoning sibling for the same underlying parameters.

This is the new scaling law of 2024 onward. The first scaling law said: bigger pretrained models are better. The second one — test-time compute scaling — says: the same model, given more tokens to think with at inference, is also better. The first scaling law is something only the labs can ride, because it requires hundreds of millions of dollars of training compute. The second is something anyone with an API key can ride, simply by paying for more output tokens. That is a new lever, and the field is still working out where it bottoms out.

Whether what the reasoning model is doing when it writes out its deliberation is really reasoning, in the sense a philosopher or a mathematician would mean the word, is a different question. The token stream looks like reasoning; the reward signal during RL was correctness; the model converges on token sequences that correlate with correctness; the result is, mechanically, a very convincing performance. Whether the performance counts as the thing itself is a question we pick up in Part III. For now, what matters is that the trick works, and that the rest of this book lives in a world where it works.

Operationalizing language models

Up to this section, the chapter has been about the model. This section is about what changes when you have to serve it — give an answer to a real user in real time, on real hardware, for a real price. In 2026, the operational question is no longer mostly “how do I train a better model.” It is mostly “how do I serve this model cheaply enough to stay in business.”

The starting point is inference economics. Every API call has a per-token cost on input and output. Behind that cost sits a stack of decisions about hardware, batching, and architecture, made by the lab that hosts the model. The two operational levers that matter most are throughput (tokens per second per GPU) and latency (time-to-first-token, plus time-per-output-token thereafter). A high-throughput setup batches many users’ requests together and runs them through one big matrix multiply, which is cheap per token but bad for latency. A low-latency setup processes one request at a time, which feels snappy but costs more per token. Every serving system in the world is some compromise between these two, and every price tag you see is downstream of where on that curve the provider is sitting.

The other half of inference economics is model size. A model with seventy billion parameters costs roughly seventy times the GPU memory and roughly seventy times the per-token compute of a one-billion-parameter model. If you can serve the same task with a smaller model, you save the difference. That is what most of the historical optimisation literature is about. The techniques fall into a small family.

Fine-tuning is the move of taking a base model and continuing to train it on your own narrow data, hoping that a smaller fine-tuned model can match a larger general one on your specific task. In 2026, fine-tuning is mostly the wrong answer. State-of-the-art prompting with retrieval beats most fine-tuning on most tasks, costs less to set up, and survives the next model release. When fine-tuning does win, it is usually parameter-efficient, in particular LoRA — a technique that trains a tiny adapter on top of the frozen base model instead of updating the original weights. LoRA is cheap, portable, and respectable; it is also the floor of the technique, not the ceiling. Most teams that think they need fine-tuning would be better served by a good retrieval system and a strong base model.
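What "tiny adapter" means is easiest to see by counting. The sketch below assumes the usual low-rank formulation, which the paragraph above does not spell out: the frozen weight matrix W is left alone, and the trained update is the product of two thin matrices, B (output size by rank) and A (rank by input size). The layer size of 4096 and rank of 8 are illustrative choices, not figures from the text.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters if you update the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for the adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x), with W frozen. Matrices are lists of rows."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


print(full_update_params(4096, 4096))  # one full square layer
print(lora_params(4096, 4096, rank=8))  # the adapter for the same layer

# With B initialised to zeros the adapter starts as a no-op.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0]]
print(lora_forward([1.0, 2.0], W, A, B))
```

For this layer the adapter trains 256 times fewer parameters than a full update, and because the base weights never change, the adapter can be stored and shipped as a small separate file.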

Weight pruning is the move of setting some of the model’s weights to zero, on the theory that many of them barely contribute to the output and can be discarded without serious damage. There are two kinds in serious use. Unstructured pruning zeros out individual weights based on their magnitude and produces sparse weight matrices that are mathematically smaller but hard for standard GPUs to exploit. Structured pruning removes entire neurons, attention heads, or layers, which costs more accuracy per parameter but produces a model the hardware can actually run faster. Structured is what gets used in production; unstructured is mostly an academic story. Either way, pruning typically claws back a factor of two or so, no more, before the model degrades.
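Unstructured magnitude pruning, the first of the two kinds, is almost a one-liner. The sketch below operates on a flat list of weights for clarity; the function name is made up, and real implementations work per layer on tensors and usually fine-tune afterwards to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(magnitude_prune(weights, 0.5))
```

The output has the same shape as the input, with zeros scattered through it. That scatter is exactly why standard GPUs struggle to turn unstructured sparsity into speed: the matrix is no smaller, it just contains more zeros.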

Quantisation is the move of representing the model’s weights with fewer bits. The default during training is sixteen-bit floating-point. After training, you can compress the weights down to eight-bit or even four-bit integers, with arithmetic carefully designed so that the multiplications still produce sensible numbers. A well-quantised model loses something on the order of a percent or two of accuracy and gains a factor of two to four in memory and a similar factor in speed. Of all the optimisation techniques, this is the one with the cleanest cost-benefit ratio, and almost every model you actually run locally has been quantised.
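A minimal sketch of the idea, assuming the simplest scheme there is: symmetric quantisation with a single scale for the whole tensor. Production methods use per-channel or per-group scales and more careful rounding, so read this as an illustration of the principle rather than of any particular library. The memory helper counts weights only and ignores activations and caches.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to hold the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


weights = [0.5, -1.27, 0.01, 0.0]
q, scale = quantize_int8(weights)
print(q, dequantize(q, scale))

for bits in (16, 8, 4):
    print(bits, "bits:", weight_memory_gb(70e9, bits), "GB")
```

Each restored weight is off by at most half a quantisation step, and a seventy-billion-parameter model drops from about 140 GB of weights at sixteen bits to about 35 GB at four, which is the difference between a rack and a workstation.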

Knowledge distillation is the move of training a small student model to imitate the outputs of a larger teacher model. In its classic form, the student learns from the teacher’s full probability distributions rather than from hard labels, and a full distribution carries more information per example than a single correct answer. A simpler variant trains the student directly on text the teacher has generated. Distillation is the reason the open-weight reasoning scene exploded so quickly in 2025: smaller models distilled from DeepSeek-R1 work surprisingly well, because the teacher’s reasoning traces are themselves training data of exceptional quality.
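For the distribution-matching form of distillation, the training signal is usually a divergence between the teacher's and the student's next-token distributions. The sketch below uses the Kullback-Leibler divergence with a softening temperature, which is a common choice but an assumption here; the logits are invented for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Probabilities from logits, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 2.0, 0.5]

# A student that agrees on the winner but not on the runners-up still pays.
print(distillation_loss(teacher, [4.0, 0.5, 2.0]))
# A student that matches the teacher exactly pays nothing.
print(distillation_loss(teacher, [4.0, 2.0, 0.5]))
```

A hard label would treat both students above as equally correct, since both rank the first token highest. The soft target also tells the student which wrong answers are nearly right, which is the extra information the paragraph above refers to.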

Sparse architectures are the modern frontier of getting more capability per dollar at training time. The dominant variant is the mixture of experts (MoE).¹¹ An MoE model has many sets of feed-forward weights (the “experts”) and a small router that, for each token, picks one or two experts to actually run. Most of the parameters sit idle on any given token; only a fraction of the model is active per forward pass. The result is a model with the knowledge capacity of a large dense model and the per-token compute cost of a much smaller one, although every expert still has to be held in memory. Almost every frontier model in 2026 is some flavour of MoE.
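The routing step can be sketched without any neural-network library. In this toy version the "experts" are plain Python functions and the router scores are supplied by hand; in a real model both are learned, and the names `route_top_k` and `moe_layer` are chosen for the example.

```python
import math


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = ranked[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Four hypothetical experts that simply scale their input differently.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 10.0, 100.0, 1000.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]

print(route_top_k(router_logits))
print(moe_layer([1.0], experts, router_logits))
```

With four experts and two selected per token, half the feed-forward parameters are skipped on every forward pass. Frontier models push the same idea much further, with many more experts and a smaller active fraction.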

The closing principle is the one I would tattoo on every engineer building with these systems in 2026: the right question is rarely “how do I optimise this model.” It is “do I need a bespoke model at all?” The strong base models hosted by the three big labs are now good enough at most tasks that the optimisation pipeline you would have built for a smaller fine-tuned model is dominated, on cost and on quality, by a well-engineered prompt against a hosted frontier model. Optimisation matters at the operational scale, when you are spending hundreds of thousands of dollars a month on inference and a quantisation pass saves you forty per cent. Below that scale, in the hands of small teams and individual developers, it is mostly a distraction.

Limitations and caveats

Even with all four training stages applied and all the operational tricks in place, the model in front of you is still, mechanically, a statistical generator. That has consequences that no amount of post-training erases. The clearest one is hallucination: the model can produce text that is fluent, confident, and wrong. This is not a bug to be patched away. It is a structural feature of the cycle described in §1. The model picks tokens to maximise the probability of plausible continuations, and the world is full of plausible continuations that are also factually false. Plausibility and truth are different objectives; the model is trained on the first and only loosely correlated with the second. Better models hallucinate less, retrieval-grounded systems hallucinate less, but the floor is not zero.

The second limitation is inherited bias. A model trained on the internet absorbs the internet’s distribution of opinions, omissions, slurs, and stereotypes, in their statistical proportions. Preference optimisation in step 3 sands down the worst of this, but it cannot un-learn what pretraining baked in. A prompt for a portrait of a doctor generates more men than women, more white than not, more middle-aged than otherwise. The model is acting as a faithful mirror of its training corpus, which is most of what we have ever written down, and the mirror is not flattering.

The third limitation is the one §6 already named: the model’s apparent reasoning is pattern matching at scale, not formal logic. It is extraordinarily good pattern matching, sometimes good enough to compose into something that looks like deduction, but it has no internal proof checker, no symbolic reasoner, no guarantee against contradiction. On problems that genuinely require formal logic to solve — long arithmetic, complex constraint satisfaction, multi-step planning with hard correctness criteria — it fails in ways that a calculator or a SAT solver does not, and it fails fluently, which makes the failures harder to catch.

These three caveats are the calibration. Part III takes them apart in earnest, with the failure modes and the social consequences. Here, the job was only to name them, so that when chapter 5 takes the same machine and points it at pixels and waveforms, you carry forward an honest sense of what the machine is and what it is not.

The whole machine

You can now narrate the cycle end to end. The text you typed becomes tokens. The tokens become vectors. The vectors flow through a stack of attention layers that recompute each token’s representation in context. The final layer emits a probability distribution over the next token. A sampling rule picks one. The token is appended. The cycle runs again. It keeps running until the model emits a stop signal, and the whole reply you see is the trail of those choices.

You can also say, for any property of the chatbot, which training step put it there. Knowledge: pretraining. Manners: instruction tuning. Usefulness and refusal of obvious bad behaviour: preference optimisation. Step-by-step deliberation before answering: reasoning RL. Cost of a turn: tokens, multiplied by per-token prices, set by hardware and architecture and the operational choices in §7. Sense of continuity across turns: the conversation buffer, re-sent every time, with no actual memory inside the model. Hallucinations and bias and pattern-matching limits: structural features of the cycle that no amount of training fully removes.

Two questions are left open on purpose: how you give a model a memory that survives the conversation, and how seriously to take its apparent reasoning. The next chapter leaves language entirely and points the same statistical machinery at pixels and sound.

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. Attention Is All You Need. NeurIPS, 2017. arXiv:1706.03762.
2. Brown, T. B. et al. Language Models are Few-Shot Learners. NeurIPS, 2020. arXiv:2005.14165. The paper that demonstrated in-context learning as an emergent property of scale.
3. Kaplan, J. et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, 2020. The first published curves linking loss to model size, data, and compute over many orders of magnitude.
4. Hoffmann, J. et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556, 2022. The compute-optimal correction to Kaplan: most large models had been over-parameterised and under-trained.
5. Ouyang, L. et al. Training language models to follow instructions with human feedback. NeurIPS, 2022. arXiv:2203.02155. The InstructGPT paper that established RLHF as the canonical post-training recipe.
6. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS, 2023. arXiv:2305.18290.
7. Bai, Y. et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, 2022. The Anthropic recipe that replaces human preference data with a written rubric judged by another language model.
8. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025. The open-weight reasoning model that reproduced the o1 trick in public.
9. Sennrich, R., Haddow, B., & Birch, A. Neural Machine Translation of Rare Words with Subword Units. ACL, 2016. arXiv:1508.07909. The BPE recipe that almost every modern tokenizer descends from.
10. Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022. arXiv:2201.11903.
11. Fedus, W., Zoph, B., & Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 2022. arXiv:2101.03961. The mixture-of-experts recipe that anchors the modern sparse-architecture line.