10 AI for Software Development

A developer on a 2026 team opens a terminal in the morning, dictates a paragraph that begins “add OAuth login to the admin app, use the same provider as marketing, write the tests, open the PR”, and walks to make coffee. By the time the cup is cool enough to drink, an agent has read the codebase, found the marketing app’s auth module, written a new one alongside it, generated a test suite, run it twice, fixed the two failures it found on the first run, opened a pull request with a written rationale for each non-obvious choice, and tagged a teammate for review. The developer reads the diff with the kind of attention they would give to a colleague’s work. They send back three change requests. The agent makes the changes. They merge.

This is not a thought experiment. Some version of this happens, today, in the working week of every developer who has bothered to set the tooling up. The interesting question is not whether it works — it does — but what the developer is now actually doing. The mechanical job, write the code, has been mostly handed off. The intellectual job, decide what to build and whether what came back is right, has not moved at all. If anything, it has grown, because the agent will cheerfully produce three plausible implementations of a wrong specification before a careful human stops it.

The promise of this chapter is one specific thing. By the end of it you should be able to take any coding task and place it on a rung of the autonomy ladder — from typing alongside an autocomplete to handing a full feature to an agent overnight — and say what trusting the result on that rung actually costs. The promise holds whether you write code yourself or read about people who do. A developer gets an honest map of the working day. A policymaker, a manager, a curious reader gets the shape of how software engineering is changing, well enough to reason about the second-order effects on industries, hiring, and the regulations that touch software.

The mechanics of how a language model writes code at all are in Part I — the language-modelling chapter for the underlying capability, the agentic chapter for the loop that turns one model into a system that can do work. This chapter does not re-teach them. It assumes the machinery and asks the harder question: given the machinery, what is the work, and how reliable is the answer that comes back?

Software 1.0, 2.0, and 3.0

A useful way to read the last decade is as a slow change in what the word program means.

For most of computing’s history, a program was a sequence of explicit instructions written by a person. The programmer’s job was to think through the problem, decompose it into steps the machine could execute, and type those steps in a language the compiler could check. Andrej Karpathy named this regime Software 1.0 in a 2017 essay and contrasted it with what was already, by then, taking over inside machine-learning teams: a regime where the programmer no longer wrote the instructions but specified an objective and a dataset, and a learning algorithm produced the program in the form of a set of weights inside a neural network.¹ He called the second regime Software 2.0. The 1.0 program is legible — you can read it. The 2.0 program is a numerical artefact — you can run it and probe it, but the policy it encodes does not live in any one line. The two regimes were already coexisting in 2017. They still coexist. Image classifiers and speech recognisers and recommender systems have been Software 2.0 for years; the operating system underneath them has not.

In a June 2025 keynote at Y Combinator’s AI Startup School, Karpathy added a third regime.² Software 3.0 is the world in which the source of the program is a prompt and the operating system that runs it is a large language model. You describe the behaviour you want in English; the LLM, often wrapped in an agent, produces and runs the code that realises it. The prompt is no longer just a thing you type into a chat window — it is the artefact that, with enough surrounding scaffolding, is the software. Karpathy’s framing is now common shorthand in the field; the chapter borrows it once and carries the idea in its own words.

The point is not to draw three clean boxes and shove every system into one. By 2026 all three regimes coexist inside a single product. The frontend of a modern application is mostly 1.0 — humans wrote it, humans read it, humans debug it. The recommender that decides what the user sees is 2.0 — the team that owns it tunes data and loss functions, and what comes out is a frozen set of weights. The glue around the edges — a new internal tool, a one-off migration script, a feature spike for an experiment — is increasingly 3.0, drafted by a coding agent from a few paragraphs of intent.

What changes across the three regimes is where the difficulty lives. In 1.0, the difficulty is in the typing and the design at the same time, and a good programmer is good at both. In 2.0, the typing is cheap and the difficulty migrates upstream into curating data, designing objectives, and probing the learned policy for failure modes. In 3.0, the typing is nearly free and the difficulty migrates further upstream still — into specifying what the system is supposed to do, choosing which subsystem can be safely produced by a prompt, and verifying what the agent delivers. The reader who is not a developer should hold onto that arc. The discipline of software engineering did not get smaller as 3.0 arrived; it got more concentrated. Less of it is typing. More of it is judgement.

A developer reading this should see their own toolchain in the frame. The completion suggestions floating above the cursor are 3.0 in miniature — a prompt, a model, a probabilistic program produced on the fly. The CI bot that proposes a fix when a test breaks is 3.0 wrapped in a 1.0 trigger. The library being pulled in is still mostly 1.0. The work, taken as a whole, is now distributed across the three regimes, and the practitioner question is which to use, where, and at what cost.

The ladder of autonomy

Inside the 3.0 regime, the practitioner-facing distinction that matters most is not which model but how much of the loop you are handing off. Five rungs of a ladder, each one a step further from the keyboard.

Completion. The lowest rung is what GitHub Copilot popularised when it launched in 2021: a model trained on code that suggests the next few tokens, line, or block based on the surrounding file.³ The human is still typing. The model is shortening the typing. There is no plan, no decision, no autonomy. The output is reviewed before it is even saved.

In-IDE assistant. One rung up, the model has not just the cursor’s neighbourhood but a chat sidebar and, often, an indexed view of the whole project. The developer highlights a function and asks for a refactor, or selects a failing test and asks for a fix. Tools in this rung — Cursor and its competitors made the form mainstream — feel less like autocomplete and more like a junior colleague with extremely fast hands.⁴ The human still makes every decision; the model handles the mechanical implementation of decisions already taken.

Tool-using model. Higher still, the model can invoke things on the developer’s behalf: run a command, read a file, search the web, execute a test. The mechanism is the function-calling pattern from the agentic chapter, here put in service of code. The leap is qualitative, not just quantitative. A model that can run the test suite is a model that can tell whether its own answer worked.

Single agent. When the tool-using model is wrapped in an outer loop that lets it plan, act, observe, and iterate until a task is done, it becomes a coding agent. Cognition’s Devin, launched in early 2024 with the marketing line “the first AI software engineer”, was the early high-profile demonstration of the form.⁵ By 2026 the single-agent rung is mainstream (Claude Code, Cursor’s agent mode, OpenAI’s Codex CLI, and a dozen others are all versions of it), and the practitioner question is no longer whether to use one but what to give it.

Multi-agent workflow. The ceiling, at least in early-2026 hype, is multiple agents collaborating: a planner that decomposes the work, a coder that implements, a reviewer that critiques. The reality is messier. Cognition’s own engineering team, after a year of running production agents, published a widely-read post arguing that most multi-agent designs leak context across the agents and make decisions worse rather than better, and that the right default is a single, capable agent with a long context window and good tool use.⁶ The advice is contested in the literature but it is honest about what actually works at scale today: one strong agent, given the verifier it needs, is the floor of reliable coding-agent practice.

The mechanics behind this ladder live in Part I. The agentic chapter walks through function calling, the loop, harnesses, skills, and the Model Context Protocol; the language-modelling chapter explains why the model can produce plausible code at all. This chapter never re-teaches that machinery. It assumes it and asks where on the ladder a given task belongs.

A second distinction, sitting orthogonal to the rungs, is worth spelling out because every practitioner has run into the vocabulary and most readers half-understand it. Four words that recur in any conversation about coding agents in 2026: agents, commands, skills, MCP.

An agent is the loop itself — the entity that decides what to do next, invokes tools, reads their output, and decides again. Loop is the operative word; everything else attaches to it.

A command is a saved invocation. You write a short procedure once — run the tests, look at the failures, propose the smallest patch that makes them green — and give it a name. Next time, you type the name, and the agent runs the procedure. A command is a frozen instruction, like a shell alias for the natural-language interface.

A skill is a reusable piece of procedural knowledge the agent can rediscover on its own. Where a command is invoked by the human, a skill is invoked by the agent when it recognises a situation that the skill knows how to handle. The agent’s library of skills is its accumulated competence at the things its operator does often, retrieved into the prompt at the moment it becomes relevant.

The Model Context Protocol is the socket that everything else plugs into. Anthropic introduced MCP in late 2024 as a standard way for an agent to discover and use the tools, prompts, and data sources a user wants it to have. The protocol’s contribution is not the tools themselves; the protocol is the contract that means each tool’s author writes one server, each harness speaks one protocol, and the ecosystem around them gets to grow.⁷ The mechanical specification is in Part I. The practical consequence for the developer is that an agent today is rarely a single program in isolation; it is a small loop that talks to a constellation of MCP servers, each owned by a different vendor or maintained by the developer themselves.

These four words are the core vocabulary of agentic coding in 2026. A developer who half-knew them now has the distinctions cleanly. A non-developer who has heard them on a vendor demo now has a working model for what each one names.
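The relationship between the four words can be made concrete with a toy sketch. Everything in it is hypothetical: the tool names, the canned tool outputs, and the `decide_next_tool` stub that stands in for the model call. In a real setup the tools would typically be exposed by MCP servers rather than defined as local functions, and the decision step would be a model invocation rather than two `if` statements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Tools the agent may call. Stand-ins for what MCP servers would expose;
# the names and canned outputs are invented for illustration.
TOOLS: Dict[str, Callable[[], str]] = {
    "run_tests": lambda: "2 failed: test_login, test_logout",
    "read_failures": lambda: "AssertionError in test_login",
}

# A command: a saved instruction the human invokes by name.
COMMANDS: Dict[str, str] = {
    "fix-tests": "run the tests, look at the failures, "
                 "propose the smallest patch that makes them green",
}


@dataclass
class Skill:
    """Procedural knowledge the agent pulls in when it recognises a situation."""
    trigger: str  # substring the agent looks for in an observation
    advice: str   # text that would be retrieved into the prompt


SKILLS: List[Skill] = [
    Skill(trigger="AssertionError", advice="compare expected and actual values first"),
]


def decide_next_tool(observations: List[str]) -> Optional[str]:
    """Stub for the model call that picks the next action."""
    if not observations:
        return "run_tests"
    if "failed" in observations[-1]:
        return "read_failures"
    return None  # nothing left to do


def run_agent(command_name: str, max_steps: int = 5) -> List[str]:
    """The agent is the loop: decide, act, observe, decide again."""
    transcript = [f"prompt: {COMMANDS[command_name]}"]
    observations: List[str] = []
    for _ in range(max_steps):
        tool = decide_next_tool(observations)
        if tool is None:
            break
        observation = TOOLS[tool]()
        observations.append(observation)
        transcript.append(f"{tool}: {observation}")
        # Skills are invoked by the agent, not the human.
        for skill in SKILLS:
            if skill.trigger in observation:
                transcript.append(f"skill: {skill.advice}")
    return transcript


for line in run_agent("fix-tests"):
    print(line)
```

The human chooses the command; the loop chooses the tools; the skill appears only because an observation matched its trigger. That division of who invokes what is the distinction the four definitions draw.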

The shape of the working day

What does it look like, in practice, to use these tools across the lifecycle? The five rungs and the four words come into focus when they meet a real task.

A coding task has three phases the field has named for decades: figuring out what to build, building it, and figuring out whether what was built is right. AI now reaches into each of the three, with different leverage and different failure modes.

Ideation and design. Before any file is touched, the leverage is at its highest and the risk is at its lowest. A model used as a thinking partner (the partner mindset from the Part II intro, applied here) turns a vague feature request into a structured design memo. The architectural trade-offs of two database schemas, the libraries available for the protocol you have to speak, the failure modes other teams encountered when they tried the same thing: the model will surface all of these with little effort. The output is not the design; the output is the option space the developer then narrows. This is the phase in which a non-developer reader can also do useful work, because the conversation lives in English rather than syntax.

Implementation. Here the ladder matters most. A self-contained snippet — a parser, a small utility function, the glue between two libraries — sits comfortably on the in-IDE-assistant rung: the developer types a docstring or a chat instruction, reads the result, accepts or rejects it. A larger piece of feature work — adding a screen, wiring a new API endpoint to an existing service — usually sits on the single-agent rung: the developer hands the task off, watches the agent work, intervenes when it goes off-track. The choice is not aesthetic. It is a judgement about how well-specified the task is, how much surface area in the codebase it touches, and how cheaply the result can be verified. A well-specified task with a fast test suite belongs higher on the ladder; a fuzzy task in a fragile area belongs lower. A non-developer reader should hold onto the asymmetry: AI’s reach across the lifecycle is uneven, and the part that has changed most is exactly the part of the day that used to be longest — the typing.

Verification and explanation. This is the phase the chapter will spend a section on shortly, because verification is the single layer that decides whether the rest of the ladder is usable at all. Suffice it for now that the same models that write code also read it. They can summarise a function, generate unit tests, propose edge cases, and explain a legacy system written in a language nobody on the team still speaks. Legacy modernisation — taking a COBOL or old-Java codebase and producing an annotated description, a test scaffold, or even a translation to a modern stack — has gone from a high-cost specialist exercise to something an agent can do a first pass on in an afternoon. The first pass is never the final answer. It is a starting point that did not previously exist.

The canonical workflow stitches the three phases together. The example, repeated across teams enough that it has become an in-joke about how much AI tooling looks the same now: add OAuth login to an existing application. Ideation happens in chat — the developer talks through which provider to use, which library, where the session token should live, what the migration path looks like for existing users. The agreed plan is written down. Implementation happens at the single-agent rung — the developer hands the plan to the coding agent and watches it create the auth module, modify the user model, write the integration tests, and run them until they pass. Verification happens half-by-tool, half-by-human — the agent produces a diff and a written rationale, and the developer reads both with the suspicion they would apply to a competent colleague’s first draft, sending change requests until the patch is good enough to merge. Three phases, three different rungs, one feature that used to take a week and now takes most of a day. The shape of the working day, for the developer who has set this up, is the shape of the OAuth example, run again with a different feature attached.
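The rung choice made during implementation can be written down as a rough decision rule. The sketch below is only one possible reading of the three criteria named above (how well-specified the task is, how much of the codebase it touches, how cheaply the result can be verified). The `Task` fields and both thresholds are assumptions made for illustration, not recommendations from any tool or study.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; a team would tune these to its own codebase.
SMALL_CHANGE_FILES = 2
SLOW_SUITE_SECONDS = 600.0


@dataclass
class Task:
    well_specified: bool             # is there a written plan or acceptance test?
    files_touched: int               # rough surface area in the codebase
    verify_seconds: Optional[float]  # runtime of a trusted test suite, None if absent


def choose_rung(task: Task) -> str:
    """Map a task to a rung using the three criteria from the text."""
    if not task.well_specified:
        # Fuzzy tasks go back to the design conversation before any handoff.
        return "design conversation"
    if task.verify_seconds is None or task.verify_seconds > SLOW_SUITE_SECONDS:
        # No cheap verifier: keep a human in every step.
        return "in-IDE assistant"
    if task.files_touched <= SMALL_CHANGE_FILES:
        # Self-contained snippet: read the result and accept or reject it.
        return "in-IDE assistant"
    return "single agent"


print(choose_rung(Task(well_specified=True, files_touched=12, verify_seconds=90.0)))
```

The ordering of the checks carries the argument: specification is tested first and the verifier second, so a large task is handed to an agent only after both have passed.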

How reliable is autonomous coding, really

Everything above assumes the agent’s output is usable enough to be worth handing to. That assumption needs interrogation, because the gap between what an agent can attempt and what it can finish reliably is the gap that decides whether the workflow above is the new normal or a brittle stunt.

The most rigorous picture of where agents stand comes from METR, the evaluation group that has built a measurement, spanning several years of models, of how long a task an AI agent can complete on its own at a given level of reliability.⁸ Their headline result is a time-horizon curve: take a population of software tasks with known human completion times, score each task as solved or not for a given model, and read off the task length at which the model crosses some reliability threshold. The 50%-reliability horizon, the task length at which the agent succeeds half the time, has been doubling roughly every seven months across recent frontier models. By early 2025, frontier agents were finishing, about half the time, tasks that take an experienced human about an hour. By the end of the same year, the horizon at the same threshold had moved past two hours. The curve is the most legible thing the field has produced on the question of what these agents can actually do without supervision.
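The doubling claim is easy to check with arithmetic. The sketch below assumes pure exponential growth, a one-hour starting horizon, and a seven-month doubling time, all taken from the figures in the paragraph above. The gap of roughly ten months between "early 2025" and "the end of the same year" is an assumption made for illustration.

```python
def horizon_hours(months_elapsed: float,
                  start_hours: float = 1.0,
                  doubling_months: float = 7.0) -> float:
    """Task length at the 50% threshold after a number of months,
    assuming the horizon doubles every `doubling_months`."""
    return start_hours * 2 ** (months_elapsed / doubling_months)


for months in (0, 7, 10, 14):
    print(f"{months:>2} months: {horizon_hours(months):.2f} hours")
```

Ten months at this rate gives a horizon of about 2.7 hours, which is consistent with the horizon having "moved past two hours" by the end of the year. The same formula, run forward, is only an extrapolation; nothing in it guarantees the trend continues.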

The same paper, and others around it, surface a second finding that is at least as important. The 50% horizon is the headline, but the 80% horizon, the task length at which the agent finishes the work reliably enough to trust without checking every detail, is much shorter. Across most coding benchmarks, success rates sit stubbornly around the 70–80% mark on task lengths that the 50% number would suggest are well within reach. METR’s RE-Bench, which evaluates frontier agents against expert humans on machine-learning research-engineering tasks, sharpens the same picture: given a two-hour budget, the best agents score roughly four times higher than the human experts; given eight hours, the humans narrowly overtake them; and given thirty-two hours, the humans score about twice as high.⁹ What the practitioner reads off the curve is a counterintuitive instruction: the longer the task, the less you can trust the agent that nominally finished it.
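Why the 80% horizon sits so far below the 50% horizon follows from the shape of the curve. A minimal sketch, assuming success probability falls off as a logistic function of the logarithm of task length: the `slope` value and the two-hour 50% horizon below are hypothetical numbers chosen for illustration, not fitted to any published data.

```python
import math


def success_probability(task_hours: float, h50_hours: float, slope: float) -> float:
    """Logistic curve in log2(task length); equals 0.5 at task_hours == h50_hours."""
    x = slope * (math.log2(task_hours) - math.log2(h50_hours))
    return 1.0 / (1.0 + math.exp(x))


def horizon_at(reliability: float, h50_hours: float, slope: float) -> float:
    """Task length at which the curve above crosses `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return h50_hours * 2 ** (-logit / slope)


H50 = 2.0    # hypothetical 50% horizon, in hours
SLOPE = 0.6  # hypothetical steepness of the curve

h80 = horizon_at(0.8, H50, SLOPE)
print(f"80% horizon: {h80 * 60:.0f} minutes")
print(f"check: p = {success_probability(h80, H50, SLOPE):.2f}")
```

Under these assumed numbers, a two-hour 50% horizon corresponds to an 80% horizon of under half an hour. A shallower curve widens the gap further, which is why the headline number alone says little about what can be trusted unchecked.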

The other major benchmark families fill out the picture. SWE-bench, introduced by Jimenez and collaborators in 2023, is the now-canonical evaluation of whether an agent can resolve a real GitHub issue end-to-end: find the bug, write the fix, pass the project’s existing tests.¹⁰ When SWE-bench first appeared, frontier systems solved around 2% of the tasks. OpenAI’s human-verified subset, SWE-bench Verified, released in mid-2024 to screen out underspecified issues and unreliable tests, is the version most 2025–26 model launches now report against.¹¹ The same year, Paul Gauthier’s aider released a polyglot benchmark covering 225 Exercism exercises across six languages, evaluating end-to-end code editing rather than just generation.¹² And LiveCodeBench, from Jain and collaborators, supplies a continuously-refreshed contest-programming benchmark designed to be free of training-data contamination.¹³ No single benchmark is the truth. The point of the constellation is that across them, the picture is the same: frontier agents are dramatically better than they were a year ago, the curves are still going up, and the high-reliability ceiling on long tasks has not been beaten.

There is a deeper reason this ceiling is not just an engineering gap to be closed by next year’s model. The reason has a name. Rice’s theorem, a 1953 result from theoretical computer science, says that any non-trivial semantic property of a program (does it ever crash? does it always return a positive number? is it free of a particular kind of bug?) is undecidable in the general case. You cannot build an automated checker that takes an arbitrary program and an arbitrary semantic question about it and always answers correctly. The result is not a statement about today’s tools or this decade’s machine learning. It is a statement about what is possible, in principle, with any algorithm whatsoever. The asymmetry it implies is the central fact of agentic coding. The agent can produce code in seconds. Verifying whether arbitrary code is right, meaning that it does what the intent specified and only what the intent specified, is beyond the reach of any algorithm, the agent included, in the general case. Not difficult; impossible. What remains possible is checking a particular program against particular, explicitly stated expectations, which is exactly what a test suite does.
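The standard argument behind Rice’s theorem can be sketched in a few lines. Suppose someone claimed to have a checker, here given the hypothetical name `always_returns_zero`, that decides for any zero-argument function whether calling it returns 0. The helper below, also hypothetical, shows how such a checker would be turned into a solver for the halting problem, which Turing proved cannot exist. This is an illustration of the reduction’s shape, not a formal proof, and it ignores exceptions for simplicity.

```python
from typing import Any, Callable


def make_probe(program: Callable[[Any], Any], argument: Any) -> Callable[[], int]:
    """Wrap `program(argument)` in a function that returns 0
    if and only if the wrapped call finishes."""
    def probe() -> int:
        program(argument)  # may run forever
        return 0           # reached only if the call above halts
    return probe


# If a checker always_returns_zero(f) existed, then
#     always_returns_zero(make_probe(program, argument))
# would answer "does program halt on argument?" for every program,
# which is the halting problem. So no such checker can exist.

print(make_probe(len, "abc")())
```

The same wrapping trick works for any non-trivial behavioural property, not just "returns 0", which is why the theorem covers every semantic question of this kind at once.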

The practical consequence has shaped every working coding-agent setup in 2026 and is worth stating in one line. The verifier is what matters. The reason coding agents work as well as they do today is not the underlying model alone; it is that code uniquely admits cheap, fast, automated verifiers — type checkers, linters, unit tests, integration tests, the compiler. The agent’s loop is reliable to exactly the extent that the project has a test suite the agent can run, a CI pipeline that catches regressions the local tests do not, and a human reviewer with enough context to notice the kinds of error neither will catch. Strip out the test suite and the agent’s output is back to being a fluent guess. Build the test suite, and the agent becomes a tireless implementer of changes that the suite is structurally good at validating.
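The verifier’s role in the loop can be sketched as a gate: a patch is accepted only if every check command exits cleanly. The function names below are hypothetical, and the two demo commands are trivial Python one-liners standing in for a project’s real checks, which would more likely be a type checker and a test runner.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def first_failing_check(checks: Sequence[List[str]]) -> Optional[List[str]]:
    """Run each check command in order; return the first that exits non-zero."""
    for command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return command
    return None


def accept_patch(checks: Sequence[List[str]]) -> bool:
    """A patch is acceptable only if no check fails."""
    return first_failing_check(checks) is None


# Stand-in checks so the sketch runs anywhere Python does.
passing = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(accept_patch([passing]))           # every check exits 0
print(accept_patch([passing, failing]))  # one check exits 1
```

The gate knows nothing about the code it is guarding. Its reliability comes entirely from the quality of the checks handed to it, which is the sense in which the test suite, not the model, sets the ceiling.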

What this means for the practitioner is durable advice. Never ship what you cannot verify. If a piece of agent-produced code is not exercised by a test you trust, the test is the next thing to build, and it gets built before the patch is merged, not after. The single-agent rule from the previous section is downstream of this one: the reason the field has converged on one capable agent with a verifier rather than three agents debating each other is that more agents do not add more verification; they add more places for the verifier to miss. The non-developer reader is being given, here, the structural reason that the human’s job is to build the test suite. Not as a stopgap until the models get better. As the part the models, by a 1953 theorem, cannot in general do for themselves.

The sharp edges

The above is what works. The rest of the section is what cuts.

Hallucinations, three flavours. The model invents. Sometimes the invention is syntactic: a missing parenthesis, a reference to a function it just made up. These are the harmless kind — a linter or a compiler catches them in milliseconds. Sometimes the invention is semantic: the model uses a name that exists in the codebase but means something else, a function whose signature matches but whose contract does not, an API call shaped right but pointed at the wrong endpoint. These get past the compiler and have to be caught by tests or by a reading human. Worst, sometimes the invention is logical: the code reads plausibly, type-checks, runs, passes a few obvious tests, and is subtly wrong about an edge case. The variable that should have been updated inside the loop is updated outside; the off-by-one in the boundary is invisible in the common case and catastrophic when the data shifts. These are the bugs that take hours of human reading to find. They are also the kind of bug a competent senior developer would have made themselves, just on a different day. The asymmetry that matters is that the agent makes them faster than a human reviews them.
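The logical flavour is the hardest to picture in the abstract, so here is a small invented instance. Both functions below split a list into fixed-size chunks; the first has an off-by-one in its loop bound that is invisible whenever the list length is a multiple of the chunk size. The function names and data are made up for illustration.

```python
from typing import List


def chunk_buggy(items: List[int], size: int) -> List[List[int]]:
    """Plausible but wrong: the loop bound drops a trailing partial chunk."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


def chunk(items: List[int], size: int) -> List[List[int]]:
    """Correct: iterate over every start index, keeping the partial chunk."""
    return [items[i:i + size] for i in range(0, len(items), size)]


even = list(range(6))    # length divisible by 3: both versions agree
uneven = list(range(7))  # one leftover item: the buggy version loses it

print(chunk_buggy(even, 3) == chunk(even, 3))
print(chunk_buggy(uneven, 3))
print(chunk(uneven, 3))
```

The buggy version type-checks, runs, and passes the obvious test on six items. Only a test that deliberately probes the boundary, here a list of seven, exposes the lost element.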

Automation bias. A senior reviewer reading another human’s code expects bugs and reads accordingly. The same reviewer reading machine-generated code, in the form they have known machine-generated code for forty years — compiler output, generated parsers, code coming out of protoc — reads it differently, because that kind of generated code is, by construction, correct in the ways the underlying tool guarantees. A coding agent’s output looks like generated code but is not generated by a tool with that kind of guarantee. The bias slips in unnoticed: the reviewer’s eye softens, the diff scrolls faster, the bugs get through. The countermeasure is to read the agent’s output with the suspicion you would apply to a competent stranger’s first draft, not to compiler output.

The technical-debt double-edge. AI can be a powerful debt-reduction tool: analysing a codebase for dead branches, proposing refactors with tests, writing missing documentation. It can also be the largest debt-generator in software history. The pattern emerges anywhere a team uses an agent to ship features faster than the team can absorb the implications. The agent’s code is plausible, the tests pass, the feature ships, and three months later nobody on the team understands why a particular abstraction sits where it does. The original intent lived in the prompt the developer typed and threw away. This is a class of debt the industry has not yet developed reflexes against.

Security and licensing. A model trained on the public corpus has read every SQL-injection vulnerability ever shipped, every buffer overflow in C, every secret accidentally committed to a public repository. It will, given the chance, reproduce these patterns when they look like the local conventions of the code it is editing. It will also occasionally regurgitate verbatim training material, including code under restrictive licences. Both risks have to be managed at the workflow level — secret-scanners on every commit, licence-aware tooling on every dependency suggestion, a security review pass on any code that touches an authentication boundary — because the model will not flag them. The model has no notion of which patterns are dangerous and which are merely common.
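The SQL-injection pattern mentioned above is worth seeing once, because the vulnerable version looks entirely ordinary. The sketch uses Python’s built-in `sqlite3` module with an in-memory database; the table, the function names, and the payload are invented for illustration.

```python
import sqlite3
from typing import List, Tuple


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> List[Tuple[int]]:
    """Vulnerable: user input is pasted straight into the SQL text."""
    query = f"SELECT id FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> List[Tuple[int]]:
    """Parameterised: the driver passes the input as data, never as SQL."""
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "nobody' OR '1'='1"  # turns the WHERE clause into a tautology

print(find_user_unsafe(conn, payload))  # matches every row in the table
print(find_user_safe(conn, payload))    # matches nothing
```

Both functions return the same result for every well-behaved input, so a test suite that only exercises ordinary names will pass either one. That is why the text places this risk at the workflow level rather than leaving it to the model or to the happy-path tests.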

The economics, honest. The story sold by the tooling vendors is straightforward: the subscription costs a few tens of dollars per developer per month, and developer productivity goes up by some multiplier somewhere between one-point-something and several-times. The story measured by the rigorous studies is more uncomfortable. METR ran a randomised controlled trial in 2025 in which sixteen experienced open-source maintainers worked through 246 real issues in their own projects, with and without permission to use AI tools.¹⁴ The developers self-reported a 20% speedup from the AI-assisted condition. The measured outcome was a 19% slowdown. The same humans, on the same tasks, in their own codebases, were objectively slower when they had AI assistance available, while subjectively believing themselves faster. The result has not been replicated everywhere — the population was small, expert, and worked on familiar code — but it is the most rigorous single result on the question of where AI tooling actually pays off, and the answer it gives is not where the vendors say it does. The honest practitioner reading is that AI tooling probably accelerates the developer who is new to a codebase or to a stack, probably decelerates the senior maintainer doing surgical work on familiar ground, and that the team’s net productivity is the sum of those two effects, not the marketing slide. A non-developer reader should hold onto exactly this point: productivity claims about AI coding tools, in 2026, are not all wrong and not all right, and the honest answer depends on which job, in which codebase, by whom.

The future, and is coding dead?

The question is unavoidable enough that not addressing it would be cowardice. The short answer is no, but the long answer is more interesting than that, and saying no without saying why feeds the same hype cycle that asks the question in the first place.

The term vibe coding — coined in early 2025, now near-ubiquitous — names a particular working style: describe the rough behaviour or aesthetic of a piece of software in natural language, hand the description to an agent, and accept what comes back without writing a line of code yourself. For a class of work, vibe coding is fantastic. Prototypes that would have taken a weekend take an hour. A non-developer with a clear idea of what they want can ship an internal tool, a side project, a small site, without ever opening an editor. Hackathons are full of these. So is the long tail of one-off scripts that a developer would not previously have bothered to write at all. For this kind of work — disposable, exploratory, bounded — the workflow is real and durable.

For mission-critical software, vibe coding is bounded by exactly the constraint Rice’s theorem made structural. A natural-language description of a system is, almost by definition, ambiguous about the edge cases that determine whether the system is correct. The agent produces a program that satisfies the description; whether it is the program the operator meant requires the operator to have known what they meant precisely enough to test for it, which is the discipline software engineering has always been about. Long experience with professional code points the same way: precision, maintainability, and adherence to architectural constraints are not vestigial habits that AI eliminates. They are the properties that decide whether the software still works in six months. Vibe coding is a tool for speed and exploration. It is not a replacement for the engineering posture on anything whose failure is costly.

That said, coding itself — the activity of writing executable software, by human or by agent — is nowhere near saturation. The world has far more need for software than the population of people who can write it. Every productivity leap in this industry, from assembly to compilers, from compilers to high-level languages, from high-level languages to frameworks, has been met with more demand, not less. The result has consistently been more programmers, not fewer, because the threshold to participate dropped and the things worth automating multiplied. The current wave is, on the surface, the most dramatic of these leaps; underneath, it is the same pattern.

The deeper claim is that the hard parts of software development never were the typing. The hard parts are: understanding what a customer actually needs, designing the system that meets the need without painting the team into a corner, deciding what to build and what to not build, navigating the messy human collaboration that turns one engineer’s idea into a product that ships. These are human-centric activities. A language model can help with each of them — surface options, draft a memo, role-play a stakeholder — but a language model cannot be the customer, cannot own the consequences, cannot stake its salary on the call. The end user remains human. Human users are slow, get angry easily, do not understand the application, and do not know what it is they do not like about it. Until something can replicate what it feels like to be one — and at that point, the question of whether we still call it artificial becomes the more interesting one — the human stays in the loop.

Two corollaries follow, one for the developer and one for the non-developer.

For the developer, the role elevates rather than evaporates. Less of the day is typing; more of it is architecting, specifying, reviewing, and verifying. The skill of choosing which rung of the ladder a task belongs on, and of building the verifier that decides whether the work that came back is good, replaces the skill of recalling syntax. Knowing how code works, how algorithms work, how systems compose — none of that gets less valuable. The leverage of knowing it gets larger, because every developer who understands the substrate is now multiplied by the throughput of the tooling, and every developer who doesn’t is increasingly at the mercy of whatever the agent decided to do.

For the non-developer (the policymaker, the manager, the citizen), the picture worth holding is that programming literacy is on the same arc as numeracy. Most people know enough arithmetic to get by in daily life without hiring a mathematician, and the future plausibly has people knowing enough programming to automate the small annoyances of their lives without hiring a developer. “When I get home, turn the lights on if it is dark and the electricity rate is below the daily average” will be a sentence that produces working software, written by an agent the user does not have to think about. That expansion enlarges the field; it does not shrink it. It also means that the political and economic questions about who controls these tools, who owns the artefacts they produce, and who is liable when the artefacts fail will, increasingly, touch everyone: not because everyone codes, but because everyone uses software written this way.
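It is worth seeing how small the program behind the lights sentence is, and how many decisions the sentence leaves open. The sketch below is one plausible reading. The function name, the parameters, and every tie-breaking choice flagged in the comments are assumptions; a different agent, or a different user, could reasonably pick otherwise.

```python
from statistics import mean
from typing import Sequence


def should_turn_on_lights(is_home: bool,
                          is_dark: bool,
                          current_rate: float,
                          rates_today: Sequence[float]) -> bool:
    """One reading of: turn the lights on when I get home, if it is dark
    and the electricity rate is below the daily average."""
    if not rates_today:
        # Assumption: with no price data, do nothing rather than guess.
        return False
    # Assumption: "below" is strict, so a rate equal to the average does not count.
    # Assumption: "daily average" is the plain mean of today's observed rates.
    return is_home and is_dark and current_rate < mean(rates_today)


print(should_turn_on_lights(True, True, 0.10, [0.10, 0.20, 0.30]))
```

Each flagged assumption is an edge case the English sentence does not settle. For a living-room lamp the choice hardly matters; for software whose failure is costly, settling such choices deliberately is the specification work that stays with the human.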

So no, coding is not dead. It is becoming more important and more accessible at once, which is the same direction every other piece of software-engineering productivity has gone for fifty years. The skill that lasts is the one that lasted before: think carefully about the problem, build the thing that solves it, and check that what you built is what you meant.

Which rung, and who verifies

The chapter began with the cup of coffee getting cool while an agent shipped a feature. The promise was a way to place any coding task on the ladder of autonomy and say what trusting the result on that rung costs. Three things have to be true for the cup of coffee to mean what it looked like it meant.

The first is that the task was on the right rung. A small, well-specified change with a fast verifier belongs on the single-agent rung; a fuzzy architectural decision belongs in the partner-mindset conversation that happens before the keyboard. Sending the wrong task to the wrong rung is the most common failure mode of teams adopting these tools, and it usually looks like overreach, not underreach. The temptation to hand the agent a problem you have not yet thought through is real and ruinous. The discipline is to think the problem through to the point where the agent’s job is well-defined, and only then hand it off.

The second is that the verifier existed. The agent’s loop was reliable because the test suite was good. Where the test suite is thin, the agent’s reliability is thin, by exactly the same proportion. The work that does not get cheaper in a 3.0 world is the work of building the verifier that decides whether the 3.0 output is right. That work is largely human, and it is increasingly the most valuable part of the developer’s day. Rice’s theorem is not getting repealed.
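
The way the loop inherits the strength of its verifier can be sketched in a few lines of Python. Everything named here is hypothetical: `propose` stands in for whatever agent call produces a candidate change, and `thin_suite` and `good_suite` stand in for two test suites of different quality. The sketch only shows that the loop accepts whatever the verifier accepts.

```python
from typing import Callable, Optional, TypeVar

Candidate = TypeVar("Candidate")


def agent_loop(
    propose: Callable[[int], Candidate],
    verify: Callable[[Candidate], bool],
    max_attempts: int = 3,
) -> Optional[Candidate]:
    """Return the first candidate the verifier accepts, or None."""
    for attempt in range(max_attempts):
        candidate = propose(attempt)
        if verify(candidate):
            return candidate
    return None


# Toy task: the "agent" is asked for an absolute-value function and
# offers three attempts in order; only the last one is correct.
candidates = [lambda x: x, lambda x: -x, lambda x: x if x >= 0 else -x]


def propose(attempt: int):
    return candidates[attempt]


def thin_suite(f) -> bool:
    # Checks only a positive input, so the identity function passes.
    return f(2) == 2


def good_suite(f) -> bool:
    # Checks positive, negative, and zero inputs.
    return f(2) == 2 and f(-2) == 2 and f(0) == 0


accepted_thin = agent_loop(propose, thin_suite)
accepted_good = agent_loop(propose, good_suite)
print(accepted_thin(-2))  # -2: wrong, but the thin suite accepted it
print(accepted_good(-2))  # 2
```

The loop is identical in both runs. Only the verifier changed, and with it the correctness of what came back.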

The third is that someone read the diff. The most powerful thing the developer in the opening scene did was the part that looked the least dramatic. They read the agent’s output with the same attention they would give a colleague’s work. The reading is the irreducible move. It is not automatable, because the question it answers (“is this the change I actually meant?”) sits one level above the artefact the agent produces, in the part of the work that has not migrated and is not going to.

The technology is mostly harmless, in the same sense everything in this book is mostly harmless: in the right hands, on the right rung, with the right verifier, it is among the most extraordinary tools the trade has ever had. In the wrong hands, on the wrong rung, with no verifier, it ships fluent and confident garbage faster than any prior tool could have. The choice of which version any given team experiences is not made by the model. It is made by the humans deciding which rung, and owning the verification, and reading the diff.

Karpathy, A. Software 2.0. Medium, 11 November 2017. The essay introduced the framing of neural-network weights as a new kind of program and contrasted it with the explicit-instruction Software 1.0 regime.
Karpathy, A. Software Is Changing (Again). Keynote, Y Combinator AI Startup School, 17 June 2025. Extended the 2017 framing with Software 3.0: prompts as source, LLMs as operating system.
GitHub. Introducing GitHub Copilot: your AI pair programmer. 29 June 2021. The launch of the first mainstream IDE-integrated code completion product built on a large language model.
Anysphere. Cursor Series C announcement, 2025. The in-IDE-assistant rung made mainstream; the announcement is also a useful 2025 signal of the scale at which the rung now operates.
Cognition AI. Introducing Devin, the first AI software engineer. March 2024. The early high-profile demonstration of the single-agent rung framed as an autonomous-engineer product.
Yan, W. (Cognition AI). Don’t build multi-agents. June 2025. The widely-cited practitioner post arguing that most multi-agent designs leak context and degrade decisions, and that a single capable agent with strong tool use is the right default.
Anthropic. Claude Code best practices and the Model Context Protocol specification (2024–25). Together these define the vocabulary of agents, commands, skills, and MCP that the chapter uses; the MCP specification lives at https://modelcontextprotocol.io/.
METR. Measuring AI ability to complete long tasks. 2025. The time-horizon study that established the doubling-every-seven-months trend on the 50%-reliability horizon, and surfaced the much shorter 80%-reliability horizon that has anchored the field’s discussion of agentic coding reliability since.
Wijk, H. et al. (METR). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. arXiv:2411.15114, November 2024. With a two-hour time budget the best agents scored roughly 4× higher than expert humans; humans showed better returns to additional time, narrowly overtaking the agents at an eight-hour budget and pulling further ahead beyond it.
Jimenez, C. E. et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024. arXiv:2310.06770. The benchmark that turned coding-agent evaluation into a measurement against real, unmodified repositories.
OpenAI. Introducing SWE-bench Verified. 13 August 2024. A human-verified 500-task subset of SWE-bench designed to tighten the evaluation against ambiguous task specifications and contamination.
Gauthier, P. aider’s new polyglot benchmark. aider.chat blog, 21 December 2024. A 225-task multi-language code-editing benchmark used as a practitioner-facing leaderboard through 2025–26.
Jain, N. et al. LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. arXiv:2403.07974, March 2024. A continuously-refreshed contest-programming benchmark built specifically to be free of training-data contamination.
Becker, J., Rush, N., Barnes, B. & Rein, D. (METR). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv:2507.09089 and the companion METR blog post, July 2025. The randomised controlled trial in which sixteen experienced OSS maintainers were objectively slowed by ~19% while self-reporting a ~20% speedup.