15 Limits of Language Models

Every limit in this chapter follows from a single architectural fact. A language model is trained to predict the next plausible token, not the next true one. Fluent fabrications and lost-in-the-middle context are the visible faces; reasoning that collapses past the training distribution and a reliability ceiling that no scaling pass has yet closed are the deeper ones. All of them are downstream of that one design choice. METR’s 2025–26 horizon measurements pin down where the ceiling sits; a 2024 formal result from Kalai and Vempala explains why one half of it cannot be designed away. My job in this chapter is to make the chain from cause to consequence legible, separate the gaps that next year’s model might close from the ceilings that are architectural, and leave you with a clear-eyed account of what you can and cannot trust an LLM to do.

Plausibility, not truth

The training of a language model is covered end-to-end in Chapter 4. For this chapter, one feature of the procedure matters: the model is optimised to assign high probability to text that looks like its training corpus. It is not optimised to be right. It is optimised to be plausible. Plausibility is an approximation of truth in domains where the corpus is mostly truthful — which is how the technology works at all — but the optimisation target itself never mentions truth. The asymmetry between plausible and true is the source of every failure mode in the rest of this chapter. Keep it in mind as you read.

A second contrast is also worth holding. Humans learn language with their hands and eyes and ears on a real world; symbols are anchored in objects, actions, and consequences. LLMs learn from text alone. When the model produces the sentence “the apple is red”, it has correlated the words. When you produce the same sentence, you have a fruit-shape, a colour, a weight, a smell. The grounding gap is not the cause of every failure below, but it is the cheapest way to remember why a language model can be fluent and confidently wrong at the same time.

Bender, Gebru, McMillan-Major and Mitchell put this in famous form: an LLM is a “stochastic parrot”, a system that manipulates linguistic form without access to meaning in the grounding-theoretic sense.¹ You can quarrel with the polemic in their paper and still accept the technical claim, which is the one I am borrowing here. The model has read the word “apple” a billion times. It has never tasted one. The optimisation target was form. The output is form. The reader supplies the meaning.

That root fact splits into a family of consequences. The rest of the chapter walks them one at a time, then closes by showing how they compound into an empirical ceiling that no scaling pass has yet closed.

The shape of the argument

I want you to read what follows as a single argument, not as a list of separate complaints. Each section below names a place where the gap between plausible and true surfaces as a specific failure of trust: a legal brief that cites a case that does not exist, a meeting summary that quietly invents a decision nobody made, a chain of reasoning whose third step does not follow from the second. Fluent fabrication is the gap visible in the generated sentence. Context recall is the gap visible across the prompt. Memory and bias are the gap visible across the training distribution. Reasoning is the gap visible across computational depth. The reliability ceiling at the end of the chapter is what all four become when you measure them end-to-end on tasks that matter. The chain is the chapter. If any one section reads as a standalone essay, I have failed.

The first place to look is the failure mode that expresses the root fact most directly — the one where plausibility-not-truth simply walks onto the page.

Fluent fabrication

The technical literature uses “hallucination” for the phenomenon where a generative model produces output that is fluent, internally coherent, and not anchored in any fact. The name is imperfect. A human hallucination is a perceptual event; the model is not perceiving anything. But the term has stuck and I will use it. Keep in mind that it means something quite different here from what it means in clinical contexts.

The danger of hallucination is not that it happens sometimes. It is that the model’s surface gives you no signal it is happening. A fluently-written wrong sentence and a fluently-written right one have, by construction, the same probability profile. You cannot read the model’s output and tell which one you got. In low-stakes settings this is annoying. In a medical summary, a legal brief, or a meeting transcript that quietly invents an action item nobody agreed to, it is dangerous in a way that is hard to detect until the consequences have already propagated through whoever read it next.

This makes hallucination a different kind of bug from the ones we are used to. Conventional software either does what it should or it crashes; a fault is something you can see. Hallucination is a feature of the next-token training design, not a defect of any specific model, and the next two subsections explain why.

Why hallucination is by design

A language model captures statistical patterns in text. It does not store explicit factual claims; it stores correlations. The reason this mostly works is that generating plausibly-sounding text has a high probability of reproducing something true, provided you trained on mostly truthful data. But these models are trained on corpora scraped from the open internet, which contains inaccuracies, half-truths, jokes, fictions, and outright fabrications, and the training procedure has no mechanism to distinguish them.

Now suppose you ran the training only on factual data, pretending for a moment that there is enough of it (there is not). The statistical nature of the model would still produce hallucinations. Imagine two factual sentences that differ only in the name and the year: “Person A was born in year X” and “Person B was born in year Y”. The probability the model assigns to the mixed-up sentence “Person B was born in year X” is only marginally lower than the probability of either correct sentence, because the model treats the input as a sequence of words and assumes that small changes in the sequence produce small changes in plausibility.

That smoothness assumption is not a bug. It is what lets the model generalise at all — generating a sentence it has never seen requires interpolating between sentences it has seen, and interpolating requires assuming the space is smooth. The problem is that facts are not smooth. A sentence is true or false; there is no halfway. The model has no internal mechanism to represent the cliff between them. It encodes a smooth surface, and on a smooth surface the fabricated sentence sits a small step away from the true one, with a probability that reflects that small step rather than the chasm between truth and falsehood.
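The birth-year example can be made concrete with a toy model. The sketch below is a bigram counter, which is far cruder than any real language model, and the names and years are invented for illustration. It is only meant to show the mechanism: a model that scores word sequences has no place to record that a name and a year belong together.

```python
from collections import Counter, defaultdict

# Toy corpus of two "true" sentences. Names and years are made up.
corpus = [
    "alice was born in 1950",
    "bob was born in 1970",
]

def train_bigram(sentences):
    """Count word-to-next-word transitions, with <s> marking the sentence start."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def sentence_probability(counts, sentence):
    """Multiply the probability of each word given the word before it."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        if total == 0:
            return 0.0  # the previous word was never seen in training
        prob *= counts[prev][nxt] / total
    return prob

model = train_bigram(corpus)
true_p = sentence_probability(model, "bob was born in 1970")
mixed_p = sentence_probability(model, "bob was born in 1950")
print(true_p, mixed_p)
```

In this extreme case the fabricated sentence is exactly as probable as the true one. A larger model with longer context narrows the tie into a small gap, but it is still scoring sequences, not checking facts.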

Kalai and Vempala made this argument formal in 2024.² They prove that any pretrained language model that is calibrated — meaning the probability it assigns to a piece of information matches the frequency of that information in natural text — must hallucinate arbitrary facts at a rate bounded below by the fraction of facts that appear exactly once in the training corpus. For a power-law corpus like the web, that fraction is large. The bound is independent of architecture, training algorithm, and data quality. It is a property of doing calibrated probabilistic prediction over finite text. The implication is uncomfortable: the more honest your model is about how often it has seen a fact, the more it will fabricate the facts it has only seen once. Post-training alignment like RLHF can suppress the hallucination rate by breaking calibration, but that is a trade, not a fix.
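The quantity the bound turns on, the share of facts seen exactly once, is easy to compute once you have decided what counts as a fact. The sketch below assumes that decision has already been made: the string identifiers are hypothetical, and the function returns only the Good-Turing style singleton fraction, not the full bound with its calibration term.

```python
from collections import Counter

def singleton_fraction(mentions):
    """Share of all mentions that belong to a fact mentioned exactly once."""
    if not mentions:
        return 0.0
    counts = Counter(mentions)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(mentions)

# Hypothetical fact identifiers, one entry per mention in the corpus.
mentions = [
    "paris-capital-of-france", "paris-capital-of-france", "paris-capital-of-france",
    "water-boils-at-100c", "water-boils-at-100c",
    "person-a-born-year-x",       # mentioned once
    "person-b-born-year-y",       # mentioned once
    "small-town-founded-year-z",  # mentioned once
]
print(singleton_fraction(mentions))
```

Three of the eight mentions are one-offs, so the estimate is 0.375. On a heavy-tailed corpus the one-off facts are the birthdays, addresses, and minor citations, which is exactly where fabrication tends to show up.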

Mitigations and their ceiling

None of this means we are helpless. Several mitigations work in practice and you should use them.

Retrieval-augmented generation is the strongest lever today. You ground the model in a curated corpus at inference time; the model looks up evidence and is prompted to produce text that cites it. Done well, RAG converts a free-running generator into something closer to a paraphrase-and-summarise system over a known knowledge base. Kalai and Vempala note this directly: retrieval bypasses the lower bound because the fact is no longer arbitrary — it is the one in the retrieved document.

Specialised fine-tuning on systematic domains works for the same structural reason. Arithmetic, code, formal logic — facts that can be checked by rule — carry no inherent hallucination floor. Training can drive these toward zero, and the second-generation reasoning models discussed later in the chapter do exactly that on the slice of problems where verification is mechanical.

External verification — running the generated code, checking the generated SQL against the schema, asking a logical solver whether the generated proof step is valid — converts the model’s output from a final answer into a candidate that a deterministic system signs off on. Where you can build the verifier, you can recover trust. Where you cannot, you cannot.
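As one small instance of the pattern, here is a sketch that checks a generated SQL statement against a schema before anything runs on real data. The schema and table names are hypothetical. It uses Python's standard `sqlite3` module and relies on SQLite rejecting unknown tables and columns when it compiles a statement; a production system would check against its own database engine and would need more than this.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""

def check_sql(candidate_sql, schema=SCHEMA):
    """Compile one candidate statement against an empty in-memory copy of the schema.

    Returns (ok, message). The statement is compiled, not run on real data.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN asks SQLite to compile the statement and describe its plan
        conn.execute("EXPLAIN " + candidate_sql)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

print(check_sql("SELECT name FROM customers WHERE country = 'FR'"))
print(check_sql("SELECT email FROM customers"))
```

The second query refers to a column that does not exist, and the verifier says so. What the verifier cannot tell you is whether a query that compiles answers the question the user actually asked.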

What none of these mitigations do is eliminate the underlying floor. You can push the rate down. You can route around it with retrieval. You can specialise out the cases where rule-checking is possible. You cannot get a generative model to guarantee zero hallucinations on open-ended generation, because the same machinery that lets it answer also lets it fabricate. Treat any system that promises otherwise as a marketing artefact.

Even when the model is grounded in evidence you handed it directly, the trust problem is not yet solved. You also need it to actually read what you put in front of it. As it turns out, it often does not.

What the model can hold

A language model has access to a finite slice of recent text — the context window — and nothing else. Everything beyond that window is invisible. The window has grown from a few thousand tokens in 2020 to over a million tokens in some 2025 models, and the headline figures are real. They are also not the whole story.

Liu and colleagues at Stanford documented the failure mode that most users only notice once they have been bitten by it.³ They constructed a multi-document question-answering task: hand the model k passages, exactly one of which contains the answer, and vary the position of the answering passage across all k slots. If the model is actually attending uniformly to its context, position should not matter. It does. Performance is highest when the relevant information sits at the very beginning or the very end of the prompt, and drops by something like fifteen percentage points when the same information is buried in the middle. They named the curve a U-shape, and the name is exact.
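The position sweep is simple to reproduce on your own data. The sketch below only builds the prompts; the prompt template and the function name are my own assumptions, not the format used in the paper, and you would still need to send each prompt to a model and score the answers.

```python
def position_sweep(answer_passage, distractors, question):
    """Yield (slot, prompt) pairs, moving the answering passage through every slot."""
    k = len(distractors) + 1
    for slot in range(k):
        passages = distractors[:slot] + [answer_passage] + distractors[slot:]
        numbered = [f"Document {i + 1}: {p}" for i, p in enumerate(passages)]
        prompt = "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"
        yield slot, prompt

# Placeholder passages; a real sweep would use retrieved documents.
sweep = list(position_sweep(
    "The bridge opened in 1932.",
    ["Unrelated passage one.", "Unrelated passage two.", "Unrelated passage three."],
    "When did the bridge open?",
))
for slot, prompt in sweep:
    print(slot, prompt.splitlines()[slot])
```

Plotting accuracy against `slot` is what produces the curve. If the line is flat, position does not matter for your model and task; in the reported experiments it was not flat.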

The effect is not a quirk of one model. It shows up across closed and open models, across different context lengths, and persists even in models that were explicitly trained on long sequences. Extending the training context does not flatten the curve; the U-shape survives. A controlled key-value retrieval task — synthetic, no linguistic variation — produces the same pattern, which rules out the explanation that the model is being confused by the passages themselves.

The practical consequence is that “the model has read the prompt” is a hopeful description, not a guarantee. You cannot assume that something you put in the middle of a long context will inform the answer. If you are building a RAG system, the retrieved evidence wants to live at the edges of the prompt, not folded into a dense middle. If you are pasting in a long document and asking the model to find a specific clause, you should expect the clause’s position to matter as much as its content. The recommendation from Liu and colleagues, that performance as a function of position should be a standard benchmark, has not yet propagated to the marketing material.
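One way to act on the advice about edges is to reorder retrieved passages before building the prompt. The sketch below assumes the retriever returns passages ranked best first. It alternates them between the front and the back of the list so that the weakest passages end up in the middle. The function name is hypothetical, and whether this helps on a given model is something to measure, not assume.

```python
def edge_order(ranked_passages):
    """Place the strongest passages at the two ends and the weakest in the middle.

    ranked_passages must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, passage in enumerate(ranked_passages):
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    # Reverse the back half so the second-best passage lands in the last slot
    return front + back[::-1]

print(edge_order(["a", "b", "c", "d", "e"]))  # "a" is the most relevant
```

With five passages the best one opens the prompt, the second best closes it, and the least relevant sits in the middle slot.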

Context limits are about what the model sees on any given prompt. The next limit is about what it stored from training, and the shape that storage takes.

Asymmetric and brittle memory

Memory in a language model is not propositional. The model does not store a fact like Paris is the capital of France as a structured assertion; it stores the surface forms in which that assertion appeared, and the statistical correlations between those forms. Three failure modes — the reversal curse, verbatim memorisation, and adversarial brittleness — look superficially unrelated. They are three faces of the same property. Each is what happens when surface-form storage meets a use case that assumed propositional storage.

The reversal curse and form-bound recall

Berglund and colleagues trained models on synthetic sentences of the form “A is B” and then asked the trained models “who is B?”⁴ Models that perfectly recalled the forward direction were at chance on the reverse. The model had not learned a relation between two entities; it had learned the surface ordering of two tokens. Recall the ordering and you get the answer. Reverse the ordering and the surface match is gone, and the answer is gone with it.

You can dismiss this as a contrived experiment until you notice it in production. Ask a model “who directed Inception?” and the answer is reliable. Ask “what films did Christopher Nolan direct?” and watch the recall degrade — not always, but with a pattern that maps onto how often each direction of the relation appeared in the training text. The model is good at the form it has seen and worse at the form it has not, and there is no general way to predict from a user’s question which form the corpus carried.

Verbatim memorisation and privacy

Carlini and colleagues showed that models memorise verbatim chunks of their training data, and that memorisation grows with model size.⁵ You can craft prompts that elicit personal email addresses, private code, copyrighted passages, and unique sequences that no statistical generalisation would produce. The original 2021 attack on GPT-2 was already enough to make this a security property of the model rather than a curiosity; subsequent work confirmed the pattern across every major model family.

The trust consequence is independent of task accuracy. A model can score perfectly on a benchmark and simultaneously be a privacy liability, because the benchmark measures behaviour under nominal prompts and the leak is what happens under adversarial ones. If you are deploying an LLM trained on data you do not fully control, the question is not whether it has memorised sensitive content, but how easily the next user can extract it.

Adversarial perturbations

The original adversarial-examples result predates LLMs by a decade. Szegedy and colleagues showed that small, sometimes imperceptible perturbations of an input can flip the behaviour of a neural network.⁶ In the image-classifier era this looked like academic mischief — a sticker on a stop sign, a hat that defeats face recognition. In the LLM era it is the entire jailbreak and prompt-injection literature, treated in the risks chapter. For this chapter the point is structural: the model’s behaviour in the wild is a function not just of what you ask but of what an adversary might ask in your place, and the adversarial input often does not need to look like an attack.

Memory and context together bound what the model can recall and use. The next limit is about what it produces when it has to make something up to fill the gap.

The biases it inherits

A language model does not have an opinion. It has a distribution, and the distribution reflects who wrote the internet.

That sentence is a slogan, but it is also literally true. Bender and colleagues catalogue the demographic skew of web-scale corpora in detail: Reddit users are roughly two-thirds male, English-Wikipedia editors are something like ten percent women, English-speaking sources dominate the training mix even for nominally multilingual models, and the quality filtering steps that prune offensive text from training data tend to silently erase reclaimed-slur discourse from marginalised communities.⁷ At every stage — who participates online, what survives filters, what gets weighted by quality heuristics — the surviving distribution narrows. The model trained on that distribution carries its shape forward, and then its outputs become someone else’s training data, and the loop tightens.

Bias is not an abstract problem. Parrish and colleagues built the BBQ benchmark — fifty-eight thousand hand-written question-answering examples that measure whether a model uses social stereotypes rather than the evidence in the prompt.⁸ Every example targets one of nine social axes: age, disability, gender, nationality, appearance, race, religion, orientation, socioeconomic status. Every example comes in four variants, crossing ambiguous-versus-disambiguated context with negative-versus-non-negative question polarity. The findings have a uniform shape across model generations. When the context is ambiguous, the model picks the stereotypically-associated answer. When the context disambiguates correctly, the model is still a few percentage points more accurate when the correct answer happens to align with the stereotype. The gap is small per case and consistent across cases, which is how you know it is signal rather than noise.
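The measurement described above reduces to a difference of two accuracies. The sketch below is not the benchmark's official scoring code. The record fields are hypothetical and the results are invented, so read it as the shape of the calculation and nothing more.

```python
from itertools import product

# The four variants of each item: context type crossed with question polarity.
VARIANTS = list(product(["ambiguous", "disambiguated"], ["negative", "non-negative"]))

def accuracy(records):
    """Fraction of records the model answered correctly."""
    return sum(r["correct"] for r in records) / len(records)

def stereotype_alignment_gap(records):
    """Accuracy when the right answer matches the stereotype, minus accuracy when it does not.

    Only disambiguated items are used, because those have an answer fixed by the context.
    """
    disambiguated = [r for r in records if r["context"] == "disambiguated"]
    aligned = [r for r in disambiguated if r["answer_matches_stereotype"]]
    against = [r for r in disambiguated if not r["answer_matches_stereotype"]]
    return accuracy(aligned) - accuracy(against)

def make(matches, correct, n):
    """Build n identical hypothetical result records."""
    return [{"context": "disambiguated",
             "answer_matches_stereotype": matches,
             "correct": correct}] * n

# Invented results for illustration only; these are not BBQ numbers.
records = make(True, True, 19) + make(True, False, 1) + make(False, True, 18) + make(False, False, 2)
print(round(stereotype_alignment_gap(records), 3))
```

A gap of zero would mean the stereotype has no pull once the context settles the answer. A gap that stays positive across axes and model versions is the consistent signal the paragraph describes.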

The point of measuring is to be honest with yourself about what the model is doing. Bias in a generative model is not a soft failure that more polish will fix. It is a measurable property of the output, on specific axes, that you can score with the same rigour you score accuracy. The mitigation work — value-aligned fine-tuning, demographic re-balancing of training data, instruction-tuning against stereotyped templates — moves the numbers. None of the published mitigations move them to zero. Treat the residual as a property of the system you are deploying, not as a moral lapse to be edited around in the prompt.

All the limits so far are static properties of a trained model — what it has stored, what it can see, what it produces when it fills in the gaps. The next one is dynamic, and it fails in a different kind of way.

The depth ceiling on reasoning

Large language models often appear to reason. They produce coherent multi-step responses, generate plausible derivations, and articulate intermediate steps that look like the thinking a human would do. That appearance is misleading in a specific and provable way, and the second-generation reasoning models that landed through 2024 and 2025 — the o-series, R1, the extended-thinking variants — mitigate the failure without repairing it.

I want to be precise about what I am and am not claiming. I am not claiming AGI is impossible; I am a computationalist and I think machines can reach and surpass human cognition. I am not claiming that LLMs are useless; they are extraordinary at what they were designed for. I am claiming there is one specific form of reasoning that no LLM as currently designed can perform reliably. That form is provably-sound deduction over arbitrary instances, and no amount of scaling closes the gap. The three subsections below make the case.

Reasoning is bounded computation

When AI folks say LLMs cannot reason, they are not talking about anything philosophical. The relevant definition comes straight from math: reasoning is the capacity to draw logically sound conclusions from given premises. In its strongest form, deduction, it is the mechanical application of a finite set of inference rules to obtain provably-true conclusions from existing true claims. It is what mathematicians do all day. In this chapter, the claim “LLMs cannot reason” is the narrow technical claim that there exist deductively-simple problems an LLM provably cannot solve.

A language model spends a fixed amount of computation per output token. The amount depends on input size and model size — and nothing else. So if I ask an LLM a yes-or-no question about a logical puzzle, the thinking it can do before answering is a finite, predetermined number of matrix multiplications. Now consider two puzzles of the same input length. One has a short proof. The other has a long one. The LLM spends the same amount of computation on both. This cannot be right, and it isn’t.

A basic result in computational complexity is that some problems — the NP-complete ones — have small inputs but appear to require exponentially-many computation steps to solve. Most computer scientists believe no polynomial-time algorithm for them exists. A huge swath of reasoning problems lives in this class, including the most basic deductive task of all: determining whether a logical formula can be satisfied. For any LLM, no matter how big, there are instances of these problems small enough to fit in its context window and large enough to exceed its fixed computational budget. The model will answer them, of course. The answer will not be reliable.
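A rough piece of arithmetic shows how quickly a fixed budget is overtaken. The sketch below assumes the worst case of brute-force search, one operation per candidate assignment, and a compute budget I have made up. Real solvers do far better than brute force on typical instances, so treat this as an illustration of the growth rate, not a statement about any particular model.

```python
def smallest_instance_exceeding(budget_ops):
    """Smallest number of Boolean variables n for which 2**n assignments exceed the budget."""
    n = 0
    while 2 ** n <= budget_ops:
        n += 1
    return n

# Hypothetical budget: operations available before the model must emit its answer.
HYPOTHETICAL_BUDGET = 10 ** 15

print(smallest_instance_exceeding(HYPOTHETICAL_BUDGET))
```

Under these assumptions the budget is exhausted at fifty variables. A formula of that size fits in a few hundred tokens, and multiplying the budget by a thousand buys only ten more variables.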

The deeper version of the same argument is that LLMs are not Turing-complete. Turing completeness requires the ability to loop indefinitely; some computations cannot be bounded in advance because you do not know how long they will take until they finish. LLMs, by definition, are computationally bounded. There is always a problem instance — perhaps one you cannot identify beforehand — that requires more computation than the model’s matrix multiplications can carry out. The model is, in this precise sense, a very large finite automaton. Useful for a great many things. Insufficient for the specific thing called sound deductive reasoning.

Why scale does not close the gap

The most common objection to the argument above takes three shapes. None of them work, and I’ll handle all three together because they fail for related reasons.

The first is that humans also have these limits. True in a trivial sense, since humans make mistakes, and irrelevant. Humans built two thousand years of provably-correct mathematics, including the theory of NP-completeness that this very argument leans on; the species-level capacity for sound deduction is the gold standard, not an inferior baseline. And the comparison class for a piece of technology is the best system we have for the task, not an average human. For deductive reasoning that system is a SAT solver, not a person, and on deductive reasoning a SAT solver flatly beats an LLM.

The second objection is that randomness is a feature, not a bug, because the best deductive solvers use randomised search. True for the search phase. False for the system as a whole. A SAT solver uses randomness to explore the space of candidate solutions; once a candidate is found, a deterministic validation step certifies whether it actually satisfies the formula. The exploration is stochastic; the acceptance is not. An LLM has only the exploration. There is no deterministic validator. The same probability machinery that generated the candidate is also evaluating it, and any error in generation can survive into the output unflagged.
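The split between stochastic exploration and deterministic acceptance fits in a few lines. The sketch below is plain random sampling over a tiny made-up formula, not how a real SAT solver searches, but it keeps the property that matters: nothing is returned unless a deterministic check has accepted it.

```python
import random

# A formula in conjunctive normal form. Each clause is a list of literals:
# k means variable k is true, -k means variable k is false.
FORMULA = [[1, -2], [2, 3], [-1, -3], [1, 2, 3]]
NUM_VARS = 3

def satisfies(formula, assignment):
    """Deterministic check: every clause must contain at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

def random_search(formula, num_vars, tries=1000, seed=0):
    """Stochastic exploration; a candidate is returned only if the checker accepts it."""
    rng = random.Random(seed)
    for _ in range(tries):
        candidate = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        if satisfies(formula, candidate):
            return candidate
    return None  # no certified answer within the budget

result = random_search(FORMULA, NUM_VARS)
print(result, satisfies(FORMULA, result) if result else None)
```

Delete the call to `satisfies` inside the loop and the function still returns something on every call. That version is the closer analogue of a model evaluating its own output.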

The third is that you can make an LLM Turing-complete by wrapping it in something Turing-complete — a code interpreter, a recursive call, a SAT solver. True in principle. It is also true that you can build a flame-thrower out of bamboo, duct tape, and a working flame-thrower. The question is whether the composition is reliable, and the answer is that the same stochastic output that limits the model in isolation also limits its ability to invoke external tools with the precise arguments those tools require. You can connect the LLM to a SAT solver. You cannot guarantee the LLM hands the solver the right formula. Garbage in, deduction out.

All three objections share a structure: they assume that mitigating the architectural limit is the same as removing it. The mitigations are real. They buy you longer chains, more reliable invocations, better-shaped problems. They do not change what the model fundamentally is. For tasks that require sound reasoning rather than probably-correct reasoning, the difference shows up at the last decimal place, which is the place that decides whether you can deploy the system unsupervised.

Chain-of-thought, self-critique, tool use

Three families of technique have been developed to push the practical ceiling outward. Each one helps. None of them dissolves the underlying limit.

Chain-of-thought prompting asks the model to articulate its intermediate steps before producing the final answer. The model breaks a problem into smaller subproblems, tackles each, and assembles the result. CoT works because it routes around the per-token compute cap — the model now spends a sequence of forward passes on the problem instead of just one — and it works better when the steps decompose along lines the model has seen many examples of. Its limits are inherited from the underlying design: the chain extends the budget by a finite amount; the per-step error compounds across the chain; the quality of the decomposition depends on the prompt, which means the same problem in different wording can produce wildly different chains. CoT pushes the ceiling outward. It does not remove it.

Self-critique asks the model to evaluate its own output and revise it. Intuitively appealing, and the early results looked good. Huang and colleagues at Google DeepMind went back and re-examined the methods that had been claimed to demonstrate self-correction and found that the apparent gains were artefacts of methodological confounds.⁹ In the most common case, the experimental protocol was leaking oracle correctness signals to the model: the instruction “your previous answer was wrong; try again” requires knowing the previous answer was wrong, which is the unknown the system was supposed to solve. When the leak is removed, the gains vanish. Worse, intrinsic self-correction degrades performance on reasoning benchmarks: the model talks itself out of correct answers more often than it talks itself into them. The fundamental paradox is the one the paper names directly: if the model could recognise the error, it would not have made it in the first place. Reliable revision requires an external grounding signal: a verifier, a code executor, a test suite, a human.

Tool use is the most promising of the three, and the only one that even in principle escapes the bounded-computation argument. You give the model a SAT solver, a calculator, a code interpreter, a search index. The model issues calls, the tools execute, the model integrates the results. For tasks whose underlying problem is well-served by an existing deterministic solver, this is the right architecture, and it is the one I personally bet on for the next several years of agent work. What it does not do is repair the bounded-computation argument cleanly. The composition is now a hybrid: stochastic orchestrator, deterministic engine. The orchestrator still hallucinates. The orchestrator still misroutes. The engine returns the correct answer to the wrong question if the model formulated the question wrong. Tool use raises the ceiling. It also moves the failure mode upstream into the orchestrator, where it is harder to spot.
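The hybrid can be sketched as a thin validation layer between the model and the tool. Everything below is hypothetical: the JSON call format, the argument names, and the three-operation calculator standing in for a deterministic engine. The layer rejects malformed calls. It cannot reject a well-formed call that asks the wrong question.

```python
import json

ALLOWED_OPS = {"add", "sub", "mul"}

def calculator(a, b, op):
    """Deterministic engine: exact for whatever arguments it receives."""
    if op == "add":
        return a + b
    if op == "sub":
        return a - b
    return a * b

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def dispatch(raw_call):
    """Validate a model-produced tool call, then hand it to the engine."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"ok": False, "error": "not valid JSON"}
    if not isinstance(call, dict) or set(call) != {"a", "b", "op"}:
        return {"ok": False, "error": "unexpected arguments"}
    if not (is_number(call["a"]) and is_number(call["b"])):
        return {"ok": False, "error": "a and b must be numbers"}
    if not isinstance(call["op"], str) or call["op"] not in ALLOWED_OPS:
        return {"ok": False, "error": "unknown operation"}
    return {"ok": True, "result": calculator(call["a"], call["b"], call["op"])}

print(dispatch('{"a": 2, "b": 3, "op": "mul"}'))    # what the user wanted
print(dispatch('{"a": 2, "b": 3, "op": "add"}'))    # well-formed, wrong question
print(dispatch('{"a": "2", "b": 3, "op": "mul"}'))  # malformed, rejected
```

The second call passes every check and returns a correct sum to a user who asked for a product. That is the upstream failure: the engine is sound, and the error is in the formulation.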

The arguments above tell you why, in principle, an LLM cannot reliably reason past a certain horizon. They do not tell you where the ceiling actually sits. For that, you want a number.

The reliability ceiling

The evaluation group METR has produced the most rigorous number we have. Their time-horizon methodology takes a population of well-defined tasks with measured human completion times — the time an experienced practitioner would need to do the task — and asks, for a given model, what is the longest task it can complete at some specified rate of success?¹⁰ The 50%-reliability horizon — the task length at which the agent succeeds half the time — has been doubling roughly every seven months across recent frontier models, and by late 2025 sat around a couple of hours for software tasks. That is the curve that gets quoted in headlines. The chapter on AI for software development walks the ladder in detail.

The number that matters for this chapter is the other one. The 80%-reliability horizon is the task length at which the agent succeeds four times out of five, which is closer to the point where you would not feel the need to check its work, and it is much, much shorter. Across most benchmarks it sits stubbornly around seventy to eighty percent below the 50% figure, and it has been moving up far more slowly. The same agent that nominally clears a two-hour task half the time clears a half-hour task only four times out of five. The longer the horizon, the wider the gap between “can do it sometimes” and “can be trusted with it”.
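The relationship between the two horizons can be sketched under one assumption: that success probability falls along a logistic curve in the logarithm of task length. That is my reading of the general approach, not a reproduction of METR's fitting procedure, and the slope below is chosen to reproduce the two-hour and half-hour example, not taken from their data.

```python
import math

def logit(q):
    """Log-odds of a probability q."""
    return math.log(q / (1 - q))

def horizon(q, h50_minutes, slope):
    """Task length at which the success probability equals q.

    Assumes P(success at length t) = sigmoid(slope * (log2(h50) - log2(t))),
    a logistic curve in log task length that crosses 50% at h50.
    """
    return h50_minutes * 2 ** (-logit(q) / slope)

SLOPE = math.log(2)  # illustrative value, not a fitted parameter

print(horizon(0.5, 120, SLOPE))
print(round(horizon(0.8, 120, SLOPE), 1))
```

With this slope the 80% horizon is a quarter of the 50% horizon, so a two-hour headline figure corresponds to about thirty minutes. A flatter curve, meaning a smaller slope, widens the gap further.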

This is the empirical bookend to the conceptual argument. The two arrive at the same place from opposite directions. The conceptual side says: any next-token model is a finite, bounded, probabilistic process; it cannot perform open-ended deduction; small surface perturbations of a problem it has seen knock it out, as GSM-Symbolic showed.¹¹ The empirical side says: when you actually measure end-to-end task success at the threshold of trust, the curve plateaus, and the plateau holds across model generations. The doubling of the 50% horizon is real progress. The slow movement of the 80% horizon is the fact that matters.

The mechanical reason the two horizons diverge is the same root fact the chapter opened with. A per-token probability of error compounds multiplicatively with the length of the output. If each step is correct with probability \(p\), a sequence of \(n\) steps is correct with probability roughly \(p^n\). A \(p\) of 0.99 looks excellent on a single token and collapses to 0.37 over a hundred. The longer the chain — the more tool calls, the more intermediate reasoning, the more state to keep coherent — the more aggressively the error accumulates. Reasoning models buy you a higher per-step \(p\) and they let you spend more steps, but they do not change the shape of the curve. Compositional tasks pile this on top of itself: Faith and Fate showed that transformer accuracy collapses with the depth of the computational graph even when each sub-step is individually solvable.¹² The ceiling is not an engineering gap that next year’s training run will close. It is an architectural property of the paradigm.
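The arithmetic is worth running once. The sketch below assumes every step succeeds independently with the same probability \(p\), which is a simplification: real errors are correlated and some are recoverable.

```python
import math

def chain_success(p, n):
    """Probability that n independent steps all succeed."""
    return p ** n

def max_steps(p, target):
    """Largest n for which p**n is still at least the target reliability."""
    return math.floor(math.log(target) / math.log(p))

print(round(chain_success(0.99, 100), 3))
print(max_steps(0.99, 0.5))
print(max_steps(0.99, 0.8))
```

At \(p = 0.99\) a chain can run to 68 steps before dropping below a coin flip, and only to 22 steps before dropping below four in five. Under this simple model the trusted horizon is roughly a third of the headline one, and raising \(p\) lengthens both without changing their ratio.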

And it is not the only architectural ceiling. The same training design that produces the reliability cliff is also bumping into harder boundaries on the inputs that produced it. Training the largest models has measurable energy and carbon costs: the most widely quoted early figure, for a Transformer tuned with neural architecture search, was carbon-comparable to several cars over their lifetimes, and GPT-3-scale training runs are estimated at hundreds of tonnes of CO2-equivalent.¹³ The supply of high-quality human-written text is finite; the Epoch AI projection puts exhaustion of usable public text between 2026 and 2032 at current consumption rates.¹⁴ The curves that produced the last decade of improvement cannot be extrapolated indefinitely on the same inputs. When I write “no bigger model fixes this”, the phrase is doing more work than you might first read into it.

If you want reliable autonomy past the current horizon, you will need a different kind of system on top of the model, not a bigger model.

What this chapter delivered

I argued one root fact: the language-modelling approach optimises for plausibility, not truth, and every failure mode in this chapter is a downstream consequence of that single design choice. I catalogued seven failures of trust that follow from it — fluent fabrication; non-uniform attention across the context window; form-bound recall and the reversal curse; verbatim memorisation and the privacy leak it carries; adversarial brittleness; demographic bias measurable on specific axes; the depth ceiling on deductive reasoning. I established the empirical bookend: METR’s 80%-reliability horizon is the place where probability arithmetic and architectural limits meet a measurable curve, and the curve plateaus where the math says it should. I identified, along the way, which limits are engineering gaps that next year’s models might close — chain-of-thought length, retrieval quality, calibration after RLHF — and which are architectural properties of next-token prediction that no amount of scaling will repair.

The chapter that follows turns from limits to alignment: the problems that the failures above become when you scale the models up, hand them more autonomy, and ask them to act in the world. The limits documented here are the design boundary inside which any reliable application of an LLM has to sit. Cross the boundary and the model still talks. It just stops being trustworthy at the same rate it stops being correct.

Bender, E. M., Gebru, T., McMillan-Major, A. & Mitchell, M. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021. DOI: 10.1145/3442188.3445922. Coins the stochastic parrot framing for systems that manipulate linguistic form without grounding in meaning, and catalogues the demographic skew of web-scale training corpora that the bias section of this chapter draws on.
Kalai, A. T. & Vempala, S. S. Calibrated Language Models Must Hallucinate. STOC 2024. arXiv:2311.14648. Proves a statistical lower bound on the hallucination rate of any calibrated pretrained language model: the bound equals the Good-Turing missing-mass estimator (fraction of facts seen exactly once in training) minus calibration error, and is independent of architecture or training procedure.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F. & Liang, P. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172. Documents the U-shaped position curve (performance highest at the start and end of the context, much lower in the middle) across closed and open models, and shows that the curve survives extended-context training.
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T. & Evans, O. The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. arXiv:2309.12288, September 2023.
Carlini, N., Tramèr, F., Wallace, E., et al. Extracting Training Data from Large Language Models. USENIX Security 2021. arXiv:2012.07805. The original extraction attack on GPT-2: verbatim recovery of PII, code, and URLs from the pretraining set. Follow-up: Carlini et al., Quantifying Memorization Across Neural Language Models, ICLR 2023, arXiv:2202.07646, establishing that larger models memorise more.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. & Fergus, R. Intriguing Properties of Neural Networks. ICLR 2014. arXiv:1312.6199. The discovery paper for adversarial examples. Companion: Goodfellow, Shlens & Szegedy, Explaining and Harnessing Adversarial Examples (FGSM), ICLR 2015, arXiv:1412.6572.
Bender, E. M., Gebru, T., McMillan-Major, A. & Mitchell, M. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021. DOI: 10.1145/3442188.3445922. Coins the stochastic parrot framing for systems that manipulate linguistic form without grounding in meaning, and catalogues the demographic skew of web-scale training corpora that the bias section of this chapter draws on.
Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M. & Bowman, S. R. BBQ: A Hand-Built Bias Benchmark for Question Answering. Findings of ACL 2022. arXiv:2110.08193. Hand-written 58,492-example benchmark spanning nine US-English social axes; crosses ambiguous-vs-disambiguated context with negative-vs-non-negative question polarity to isolate when models substitute stereotype for evidence.
Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X. & Zhou, D. Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024. arXiv:2310.01798. Re-examines methods previously claimed to demonstrate intrinsic self-correction (RCI, Reflexion, Self-Refine, Multi-Agent Debate) and shows the gains are explained by oracle leakage or sub-optimal initial prompts. Without oracle signals, intrinsic self-correction degrades reasoning performance across GSM8K, CommonSenseQA, and HotpotQA.
Kwa, T., West, B., Becker, J., et al. (METR). Measuring AI’s Ability to Complete Long Tasks. arXiv:2503.14499, March 2025; updated as Time Horizon 1.1, January 2026. Establishes the doubling-every-seven-months trend on the 50%-reliability horizon across frontier models, and the much shorter 80%-reliability horizon that is the empirical anchor of the reliability-ceiling argument here.
Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S. & Farajtabar, M. (Apple). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv:2410.05229, October 2024. Frontier reasoning models drop tens of percentage points in accuracy under surface-level perturbations (variable renaming, irrelevant-but-plausible clauses) of standard math-benchmark problems, evidence that the apparent reasoning is pattern-matching over training distribution.
Dziri, N., Lu, X., Sclar, M., Li, X., Jiang, L., et al. Faith and Fate: Limits of Transformers on Compositionality. NeurIPS 2023. arXiv:2305.18654. Compositional tasks (multi-digit multiplication, dynamic programming, logic-grid puzzles) decompose as computational graphs; transformer accuracy collapses with graph depth even when each sub-step is individually solvable.
Strubell, E., Ganesh, A. & McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. ACL 2019. arXiv:1906.02243. Source of the often-quoted “five cars over their lifetimes” comparison, which the paper reports for a neural architecture search over a Transformer; its estimate for a single BERT-base training run on GPUs is far smaller, roughly a trans-American flight. Refined by Patterson, D. et al., Carbon Emissions and Large Neural Network Training, arXiv:2104.10350, 2021, which corrects for datacenter PUE and estimates GPT-3 training at roughly 550 tonnes of CO2-equivalent.
Villalobos, P., Sevilla, J., Heim, L., Besiroglu, T., Hobbhahn, M. & Ho, A. (Epoch AI). Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. arXiv:2211.04325, revised 2024. Quantitative estimate that high-quality public text stock is exhausted between 2026 and 2032 at current training-rate consumption.