The Road Ahead

This book opened on a seventy-year argument. The rationalists held that intelligence is rules and facts, laid down by hand. The empiricists held that intelligence is a function, fit from data. Both camps wanted the same machine. They never agreed on how to build it. The introduction promised that the argument did not produce a winner; it produced a synthesis. The book between the introduction and this page has been an unpacking of that synthesis, layer by layer.

What I want to do in this final chapter is something I have deliberately avoided doing in every other chapter of this book. I want to step back and look at the shape of the whole.

The synthesis is not a victory for one side. It is a working truce. A strong empiricist core, in the form of a language model that has absorbed most of the written record of humanity, sits inside a thin symbolic harness — named tools, explicit procedures, audit trails, verifiers — and the whole assembly is bounded in its outermost loop by human judgement. The chatbot in your browser, the agent that edited half a feature in your codebase last week, the protein folder that quietly compressed a fifty-year-old problem into months: these are all the same shape. The shape is not the model alone, and not the harness alone. The shape is the marriage.

The marriage works. It also creaks. The rest of this chapter is about what creaks, what we are learning to do about it, and what kind of future the work makes possible.

The challenges that remain

Three of the book’s open problems return here, because the epilogue is the place where they get re-framed rather than re-taught. Each is a structural ceiling on the empiricist core. None will be removed by training a bigger model. All three will, in the working systems of the next decade, be lived with — bounded by the harness around the core, rather than solved at the level of the core itself. This is the central claim of the book, and the epilogue is where it lands.

The first ceiling is reliability. A language model samples tokens; sampling is, by construction, a random process; randomness can always wander into a sequence that is fluent and wrong. The chapter on limitations made the empirical version of this argument concrete: even with the reasoning paradigm, models break on perturbations of problems they have previously solved, because the surface form was being matched and the symbolic step was always cosmetic. The honest stance is that hallucination is not a bug to be patched; it is a property of the substrate. What we can change is what surrounds the substrate. A verifier that catches the wrong answer, a sandbox that contains the wrong action, a human approver at the irreversible step — these are not workarounds. They are the actual solution.

The second ceiling is fairness. Models inherit the statistics of their training data; the statistics encode historical biases; reinforcement learning from human feedback shifts the distribution but does not eliminate the inheritance. The chapter on alignment argued that this is the easier half of a harder problem. Fairness is what we get when we can write the constraint down. The deeper trouble — that we often cannot agree on what the constraint should be — is the part that does not yield to engineering. We can build harnesses that audit who the system serves and how. We cannot build a model that decides, on our behalf, what fairness means. The first is operational. The second is political. The book has been careful, throughout, not to confuse them.

The third ceiling is understanding. Whether a language model trained only on linguistic interaction can develop a faithful model of the world it talks about is, in 2026, still genuinely open. The reasoning paradigm has narrowed the question without answering it. What the chapter on limitations established is that today’s frontier, even the most capable thinking model, remains a pattern-matcher whose performance falls off sharply when the surface drifts from the training distribution. The hopeful version of this is that linguistic competence may turn out to be sufficient for a wide class of useful work; the sober version is that there is a band of human cognition that the substrate, in its current form, does not reach. The book has tried to draw that band honestly, neither shrinking nor widening it.

The right response to all three ceilings is the same response. Stop chasing the unfixable property at the level of the model. Build the scaffold around the model that bounds the failure. This is the lesson the alignment chapter named operational alignment: when the inside cannot be perfectly specified, design the outside so that the inside cannot do irreversible damage. It is what the agentic chapter called the harness: the layer that exposes a bounded action space and audits every call. It is what every reliable production system in the last decade of AI deployment has, in practice, been about. The model is the centre of gravity in the discourse. The scaffold is the centre of gravity in the working systems.
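To make the harness idea concrete, here is a minimal sketch, assuming a toy setting in which tools are plain Python callables. The names `Tool`, `Harness`, and `approve` are hypothetical, chosen for illustration, and do not come from any particular agent framework. The sketch shows the three properties named above: a bounded action space, an audit entry for every call, and a human approver at the irreversible step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    run: Callable[..., Any]
    irreversible: bool = False   # irreversible tools need explicit human approval


@dataclass
class Harness:
    tools: Dict[str, Tool]                        # the bounded action space
    approve: Callable[[str, dict], bool]          # human-in-the-loop hook (hypothetical)
    audit_log: List[dict] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        """Execute a model-requested tool call inside the bounded action space."""
        entry = {"tool": name, "args": kwargs, "status": "pending"}
        self.audit_log.append(entry)              # every request is logged, even refused ones
        tool = self.tools.get(name)
        if tool is None:
            entry["status"] = "rejected: unknown tool"
            raise PermissionError(f"tool {name!r} is not in the action space")
        if tool.irreversible and not self.approve(name, kwargs):
            entry["status"] = "rejected: approval denied"
            raise PermissionError(f"human approver declined {name!r}")
        result = tool.run(**kwargs)
        entry["status"] = "ok"
        return result


if __name__ == "__main__":
    harness = Harness(
        tools={
            "read_file": Tool("read_file", lambda path: f"<contents of {path}>"),
            "delete_file": Tool("delete_file", lambda path: None, irreversible=True),
        },
        approve=lambda name, args: False,         # stand-in approver that always says no
    )
    print(harness.call("read_file", path="notes.txt"))
    try:
        harness.call("delete_file", path="notes.txt")
    except PermissionError as err:
        print("blocked:", err)
    print([e["status"] for e in harness.audit_log])
```

Nothing in the sketch depends on how capable the caller is. The model can request anything; only what is in `tools`, and approved where required, ever runs.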

The path to general-purpose AI

For decades, the field’s name for the long-term destination was Artificial General Intelligence — a system with human-like consciousness, capable of replacing human intellect across the board. AGI is a philosophical benchmark, not an engineering one. The chase for it has always been fraught with ambiguity, and ambiguity makes for poor planning. I prefer a target that can actually be aimed at.

I will use the name General-Purpose AI (GPAI) for what I think the field is, in practice, building. GPAI is not a replacement. It is the state of the AI infrastructure, tools, and techniques at which any developer can incorporate narrow AI into any specific problem. To put it succinctly, GPAI aims for anything, anywhere, anytime, in contrast to AGI’s often-implied everything, everywhere, all at once. The shift is from the philosophical question of “can it think?” to the engineering question of “what can we build with it?”

The honest way to talk about how close we are to GPAI is not to argue from analogy. It is to look at the curve.

The METR research group tracks the length of task, measured in human-minutes, that a frontier AI agent can autonomously complete at a fixed reliability threshold, across frontier models released since 2019. The 2025 fit gives a doubling time of approximately seven months on the 50%-reliability horizon, with a much shorter horizon at 80% reliability.¹ In early 2024 the frontier task length was around half an hour. By 2026 it is approaching half a workday. If the trend continues, it crosses a full workday within two years, and several workdays within five.

This is not a prediction. It is an extrapolation, on a curve, that may bend. Part III’s register applies in full: the doubling has held across four orders of magnitude of compute and across three distinct architectural eras, and the cleaner the verifier the steeper the curve has been. But trends bend. The right way to read the METR result is as the most disciplined statement available about where the frontier is and how fast it has moved, with the explicit acknowledgement that the projection is a projection.
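The arithmetic behind the extrapolation is short enough to write down. The sketch below assumes a clean exponential with the seven-month doubling time quoted above, reads "half a workday" as four hours, and takes a workday to be eight hours. Those readings and the function names are illustrative assumptions, not METR's methodology.

```python
import math

DOUBLING_MONTHS = 7.0      # doubling time quoted in the text (50%-reliability horizon)
WORKDAY_HOURS = 8.0        # assumption: one workday = 8 human-hours


def horizon_hours(start_hours: float, months_ahead: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Task horizon after `months_ahead` months of uninterrupted doubling."""
    return start_hours * 2.0 ** (months_ahead / doubling_months)


def months_to_reach(start_hours: float, target_hours: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months until the horizon reaches `target_hours` under the same assumption."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    start = 4.0  # assumption: "half a workday" read as 4 hours
    print(f"months to a full workday: {months_to_reach(start, WORKDAY_HOURS):.1f}")
    for years in (1, 2, 5):
        h = horizon_hours(start, 12 * years)
        print(f"after {years} y: {h:.0f} h (~{h / WORKDAY_HOURS:.1f} workdays)")
```

Changing `DOUBLING_MONTHS` by a month or two moves the five-year figure by a large factor, which is the practical meaning of "trends bend": the further out the projection, the more it depends on a parameter fitted to the past.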

What the curve does not measure is the second half of the story. Capability is the model. Reliability and safety are the scaffold. The METR curve charts the first half. The second half — harness design, verifier engineering, operational alignment, the boring infrastructure of bounded action spaces and audited tool calls — is where most of the actual difficulty lives.

This is the layered-scaffold thesis the book has been building toward. The future is not waiting on a perfectly aligned model. The future is being built, right now, by getting the scaffold around the model right. The capability layer scales on a measurable curve. The scaffold layer scales by deliberate engineering. The product of the two is what an agent can actually do in the world without supervision. A heroic model in a flimsy harness is a hazard. A modest model in a careful harness is a useful colleague. Almost every working AI deployment in the next decade will be the second sort.

For the GPAI vision to fully land, the same infrastructure has to extend outward. Foundation models reach beyond text and image into 3D scenes with realistic physics, complex relational data, time series, graphs. Interoperability protocols turn AI systems into citizens of the existing software stack, speaking cleanly to databases, operating systems, file systems, web browsers. Automated machine learning becomes a routine development task, integrated into IDEs in the way version control already is. Symbolic AI’s old toolkit — constraint satisfaction, planning, search — becomes accessible alongside the empiricist core, recovering capabilities the field has spent forty years pretending it could do without. Inference moves to the edge, with operating-system-level support, so that anywhere, anytime is not just a slogan. None of this requires a new theoretical breakthrough. All of it requires patient engineering.

And underneath the engineering sits the harder open problem: skill acquisition itself. Today’s models are capable because of emergent abilities that look like pattern-matching at scale. True general-purpose capability requires moving from out-of-training and out-of-distribution performance to genuine out-of-domain adaptation. The most promising engineering path is program synthesis: the ability to generate correct, executable code that solves novel problems. A language model that can reliably write, test, and run code is, by definition, part of a Turing-complete system. Perfect synthesis is impossible in general: by Rice’s theorem, no algorithm can decide any non-trivial semantic property of arbitrary programs, so no general procedure can certify that an arbitrary generated program does what was asked. The achievable target is the human standard: iterative writing, testing, debugging, refactoring, all inside an agentic loop. The future of GPAI is not a model that emits a perfect block of code in one shot. It is a scaffold-rich system that iterates its way to one.
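The write-test-retry loop can be sketched in a few lines. What follows is a toy under stated assumptions: the "model" is a hypothetical stub (`make_stub_proposer`) that returns a buggy draft and then a corrected one, and candidates are run with `exec` in-process, where a real system would use a sandbox. Passing the tests shows only that the candidate agrees with the tests, not that it is correct in general.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[tuple, object]          # (arguments, expected result)


def run_tests(source: str, func_name: str, tests: List[TestCase]) -> List[str]:
    """Execute candidate source and return a list of failure messages (empty = pass)."""
    namespace: dict = {}
    try:
        exec(source, namespace)          # a real system would run this in a sandbox
        func = namespace[func_name]
    except Exception as err:
        return [f"could not load {func_name}: {err!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as err:
            failures.append(f"{func_name}{args} raised {err!r}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures


def synthesize(propose: Callable[[List[str]], str], func_name: str,
               tests: List[TestCase], max_rounds: int = 5) -> Optional[str]:
    """Write, test, and retry with feedback until the tests pass or the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(feedback)    # the proposer sees the previous round's failures
        feedback = run_tests(candidate, func_name, tests)
        if not feedback:
            return candidate
    return None


def make_stub_proposer() -> Callable[[List[str]], str]:
    """Stand-in for a language model: a buggy first draft, then a corrected one."""
    drafts = iter([
        "def median(xs):\n    return sorted(xs)[len(xs) // 2]\n",
        "def median(xs):\n    s = sorted(xs)\n    m = len(s) // 2\n"
        "    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2\n",
    ])
    return lambda feedback: next(drafts)


if __name__ == "__main__":
    tests = [(([3, 1, 2],), 2), (([4, 1, 3, 2],), 2.5)]
    print(synthesize(make_stub_proposer(), "median", tests))
```

The first draft passes the odd-length case and fails the even-length one, so the loop asks again and accepts the second draft. The stub ignores the feedback it receives; a real proposer would condition on it.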

The hardest version of the lesson

The alignment chapter ended with a forward-pointer to this section. The hardest version of the scaffold thesis is the case where the system inside the scaffold is more capable than the people building it. We do not know when superhuman AI arrives, and we do not, today, know how to align it directly. The honest engineering response is not to wait. It is to ask whether the same move that works on a model we can almost match also works on a model we cannot.

It is not a law of nature that a stronger intelligence always wins. Three research programmes, all active in 2026, give that intuition some teeth. None of them claims to have solved the problem. Each is a working attempt to bound a strong system with weaker, trusted components.

The first is scalable oversight. The empirical question is whether a weak supervisor can elicit aligned behaviour from a stronger learner. The Burns et al. Weak-to-Strong Generalization paper from OpenAI’s superalignment team studied the analogous setup with current models: a weak model labels the training data; a stronger model is fine-tuned on those labels; the strong model, evaluated on a held-out test set, can recover much of the performance gap between its weak supervisor and the same strong model trained on ground-truth labels.² ³ The result is partial and the methodology is contested, but the existence proof is the point. A weaker process can, under the right conditions, transmit useful supervision to a stronger one.
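The paper's headline metric makes "performance gap" precise. Performance gap recovered (PGR) compares three accuracies: the weak supervisor's, the strong model fine-tuned on weak labels, and the strong model fine-tuned on ground truth. A small sketch follows; the function and argument names are my own, and the numbers in the usage line are invented for illustration and are not results from the paper.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the weakly supervised strong model matches its ground-truth ceiling;
    0.0 means it does no better than its weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


if __name__ == "__main__":
    # Illustrative accuracies only, not figures from Burns et al.
    print(performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80))
```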

The second is the debate-and-verification family. The original 2018 proposal from Irving, Christiano, and Amodei imagined two strong systems debating a question in front of a weaker judge, with each side incentivised to expose the other’s errors.⁴ The 2023 Doubly-Efficient Debate result from Brown-Cohen, Irving, and Piliouras tightened the formal guarantees, showing conditions under which the judge can be much weaker than the debaters and the truthful side still wins.⁵ The 2024 Khan et al. empirical follow-up demonstrated that, on real reading-comprehension tasks with current LLMs, more persuasive debaters lead non-expert judges to more truthful answers, not fewer.⁶ OpenAI’s 2024 Prover-Verifier Games paper goes further, training models to produce outputs that are not just correct but legible to a weaker verifier — a property that, if it generalises, would let small trusted checkers police large untrusted producers.⁷ The throughline is the same: the verifier need not be smarter than what it checks. It only needs to be sound.

The third is the guaranteed-safe programme. The 2024 Towards Guaranteed Safe AI paper, with Dalrymple, Skalse, Bengio, Russell, Tegmark, and others, sketched a framework where an AI system is bounded by a formal world-model and a formal specification, and only actions verifiable against the specification are permitted.⁸ The UK’s ARIA agency is funding the Safeguarded AI programme that operationalises this vision, with David “davidad” Dalrymple as its programme director.⁹ Bengio’s Cautious Scientist AI proposal pushes in the same direction: train systems to reason about the world rather than to act in it, and pass their conclusions through provable safety bounds.¹⁰ The bet is that the gap between acting safely and being safe is bridgeable by formal methods. Christiano’s older Eliciting Latent Knowledge report is the standing open problem these programmes have to keep honest: how do you know the system has actually reported what it believes, rather than what it has been trained to say?¹¹

The chapter on AI in science introduced a small argument that turns out to do most of the work here. Terence Tao, writing about machine-assisted proof, observed that a fast but fallible proposer composed with a slow but sound verifier yields a system that is, in aggregate, both fast and sound — neither component on its own can give you both. The point generalises. A 2023 paper from Yang and collaborators, LeanDojo, demonstrates the composition in working code: language models propose proof steps; the Lean theorem prover verifies them; together they navigate problems neither could solve alone, with the soundness of the prover protecting the trajectory from the model’s errors.¹² Where the output of a system can be expressed in a formal language with a sound checker — proofs, programs against type signatures, plans against safety specifications, predictions against measurable quantities — the same shape works. The producer can be arbitrarily strong. The checker can be much weaker, provided it is right.
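The composition is easy to demonstrate on a toy problem. The sketch below is not LeanDojo and involves no theorem prover; it assumes integer factorisation as a stand-in task, with a hypothetical random guesser as the fallible proposer and a multiplication check as the sound verifier. The point it illustrates is that nothing unverified ever leaves the loop, however unreliable the proposer is.

```python
import random
from typing import Iterator, Optional, Tuple

Candidate = Tuple[int, int]


def verify(n: int, candidate: Candidate) -> bool:
    """Sound checker: accepts only a genuine non-trivial factorisation of n."""
    p, q = candidate
    return 1 < p < n and 1 < q < n and p * q == n


def noisy_proposer(n: int, rng: random.Random) -> Iterator[Candidate]:
    """Fast, fallible producer: guesses a divisor and is often wrong."""
    while True:
        p = rng.randrange(2, n)
        yield p, n // p                  # wrong whenever p does not divide n


def propose_and_verify(n: int, proposer: Iterator[Candidate],
                       budget: int = 10_000) -> Optional[Candidate]:
    """Return the first proposal the verifier accepts, or None if the budget runs out."""
    for _, candidate in zip(range(budget), proposer):
        if verify(n, candidate):
            return candidate
    return None


if __name__ == "__main__":
    n = 91  # 7 * 13
    answer = propose_and_verify(n, noisy_proposer(n, random.Random(0)))
    print(answer)
```

The verifier is a single multiplication and two comparisons, far simpler than any strategy for finding factors. Swapping the random guesser for a much stronger proposer would change how fast an answer arrives, not whether the answer can be trusted.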

What this gives us is calibrated hope, not a promise. The superintelligence problem is hard. It is not, on the available evidence, obviously insoluble. The same scaffold thesis that works for the merely-impressive AI we deploy today seems to extend, in research-grade form, to the systems we have not yet built and would not yet know how to align. None of the programmes above will scale automatically. All of them have unsolved technical questions. Each of them is enough to take seriously. Together they make a case for hope that is at least as serious as the case for the worry that motivates them.

This is the second-hardest argument the book makes. The hardest argument is the one in the manifesto: that the future is genuinely undetermined and we are the ones who decide. The research above is what it looks like, in practice, to take that argument seriously when the stakes are at their most uncomfortable.

Why the future of AI is open

The full GPAI roadmap — new foundation-model modalities, interoperability protocols, mainstream AutoML, on-device inference, accessible symbolic methods, and a research-grade safety scaffold for systems stronger than we are — is a project of immense scale. It is precisely this scale that makes a closed, single-company approach to AI infrastructure unviable.

The fundamental problems of reliability and bias and alignment are too big for any one team to solve. The frontier labs keep their strongest models closed for a recognisable economic reason: the upfront investment in data and compute creates a competitive moat, and giving the moat away undercuts the business. The lesson from the history of software, though, is that foundational infrastructure thrives when it is open. Operating systems, databases, programming languages, web protocols, container runtimes — every layer of the stack on which the AI boom itself was built was, at the moment it became infrastructure, open. Foundation models are the new infrastructure. The same logic applies.

The transition will not be a single event. It is a set of overlapping phases unfolding over the next five to ten years. Foundational plumbing first: protocols that let AI systems speak cleanly to databases, operating systems, browsers, files. Framework accessibility second: modern machine learning and classical symbolic methods exposed in toolkits that any developer can adopt. Modality extension third: foundation models pushed beyond text and image into 3D, tabular, time-series, and graph data. AutoML mainstreaming fourth: model selection and tuning becoming a routine part of the development workflow. On-device execution fifth: operating-system-level support for running competent foundation models locally, so that AI becomes a utility rather than a remote service.

No single company, however well-funded, can drive all five fronts at once. Open source is not just preferable here. It is structurally necessary.

The same conclusion lands from a safety angle. Solving deep problems like hallucination and bias and operational alignment requires more than API access. It requires model-level access — weights, architectures, training data — so the global research community can collaboratively audit, critique, and improve the systems it has been asked to trust. The scarcity of human talent in alignment and interpretability is the actual bottleneck. The only way to scale that talent to the size of the challenge is to let everyone work on the same artefacts.

Your place in this new era

We are inside, by any reasonable measure, the most consequential era of artificial intelligence since the field was named. The next few years will be about crystallising many potential applications of AI into actual, useful products — and about getting the scaffold around those products right.

If you feel the hype has passed and you got left behind, worry not. Today is the best moment to get involved with artificial intelligence. The substrate is more capable than it has ever been; the scaffolding tools are improving every quarter; the open-source ecosystem is wide enough that you can read the actual code; and the problems worth working on outnumber the people working on them by a margin that should be embarrassing.

Whatever your profession and whatever your interests, there is something in this mission for you. If you care about fundamental theory, the open questions in reasoning, generalisation, and program synthesis are some of the most fascinating problems in mathematics and computer science today. If you care about engineering, building robust agentic systems and the harnesses that bound them is the grand challenge of the decade. If you care about science, the proposer-verifier paradigm is reshaping how new knowledge gets generated. If you care about ethics and policy, ensuring these tools are safe and fair is a critical task, and the field has been honest in admitting how short it is on the people willing to do the work. Technologists, humanists, scientists, artists, educators, lawyers, regulators — all of you have a role to play, and many of those roles do not yet exist as job titles.

The future is never certain, because technology is not deterministic. Its path is shaped by our choices. The techno-pragmatist way forward is to build human-centred tools that prioritise augmentation over automation — that empower us to think better, not to think less — and that we use thoughtfully, knowing what they are good for and what they are not. It requires shouldering a shared responsibility to make sure AI is developed sustainably and equitably for everyone.

The seventy-year argument the introduction opened on has ended, for now, in a working truce. The truce is what we have. What we do with it is still up to us.

1. Kwa, T., West, B., Becker, J., et al. (METR). Measuring AI Ability to Complete Long Tasks. arXiv:2503.14499, March 2025; updated as Time Horizon 1.1, January 2026. Establishes the doubling-every-seven-months trend on the 50%-reliability horizon across frontier models from 2019 to 2025, with a much shorter horizon at 80% reliability. This is the empirical anchor for the trajectory claim.
2. Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., et al. (OpenAI Superalignment). Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. arXiv:2312.09390, December 2023. The first empirical study of whether weak supervisors can elicit aligned behaviour from stronger learners; recovers most of the performance gap on a held-out test set under specified conditions.
3. OpenAI. Introducing Superalignment. July 2023. Wayback capture: https://web.archive.org/web/2024/https://openai.com/index/introducing-superalignment/. The research-programme launch post that framed scalable oversight as the team’s organising problem.
4. Irving, G., Christiano, P. & Amodei, D. AI Safety via Debate. arXiv:1805.00899, May 2018. The original proposal: two strong systems debate a question in front of a weaker judge, each side incentivised to expose the other’s errors.
5. Brown-Cohen, J., Irving, G. & Piliouras, G. Scalable AI Safety via Doubly-Efficient Debate. arXiv:2311.14125, November 2023. Tightened formal guarantees for the debate setup; conditions under which the judge can be much weaker than the debaters and the truthful side still wins.
6. Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., et al. Debating with More Persuasive LLMs Leads to More Truthful Answers. arXiv:2402.06782, February 2024. The empirical follow-up: on reading-comprehension tasks, more persuasive debaters lead non-expert judges to more truthful answers, not fewer.
7. Kirchner, J. H., Chen, Y., Edwards, H., Leike, J., McAleese, N. & Burda, Y. (OpenAI). Prover-Verifier Games Improve Legibility of LLM Outputs. arXiv:2407.13692, July 2024. Training models to produce outputs that are not just correct but legible to a weaker verifier, the property that, if it generalises, lets small trusted checkers police large untrusted producers.
8. Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., et al. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems. arXiv:2405.06624, May 2024. A framework where AI behaviour is bounded by a formal world-model and a formal specification, and only actions verifiable against the specification are permitted.
9. UK Advanced Research and Invention Agency (ARIA). Safeguarded AI programme. https://www.aria.org.uk/opportunity-spaces/mathematics-for-safe-ai/safeguarded-ai/. Programme director: David Dalrymple. Operationalises the guaranteed-safe framework as a multi-year research programme.
10. Bengio, Y. Towards a Cautious Scientist AI with Convergent Safety Bounds. yoshuabengio.org, February 2024. https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/. The proposal: train systems to reason about the world rather than to act in it, and pass conclusions through provable safety bounds.
11. Christiano, P., Cotra, A. & Xu, M. (Alignment Research Center). Eliciting Latent Knowledge. 2021. Wayback capture of the Alignment Forum mirror: https://web.archive.org/web/2024/https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge. The standing open problem: how do you know the system has reported what it actually believes, rather than what it has been trained to say?
12. Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., et al. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. arXiv:2306.15626, June 2023 (NeurIPS 2023 Datasets and Benchmarks). A working instance of the proposer-verifier composition: a language model proposes proof steps; the Lean theorem prover verifies them; the soundness of the prover protects the trajectory from the model’s errors.