9 AI for Scientific Research
Donald Knuth published a paper this year crediting Claude as a co-explorer. He is 88 years old, the most decorated living algorithmist, and he has been publicly cool on generative AI for years. He published it anyway.
The problem was a cycle-decomposition question on Cayley digraphs that Knuth had been carrying around for years as an exercise in The Art of Computer Programming. His friend Filip Stappers, also a mathematician, had verified small cases by hand and decided, on a hunch, to put the open conjecture to Claude Opus 4.6 — released three weeks earlier.1 Stappers ran 31 numbered explorations across his sessions with the model. Some led nowhere. Exploration 15 surfaced a fiber decomposition that nobody had written down. Exploration 31 produced a program that constructed valid decompositions, which Stappers ran against every odd \(m\) up to 101. Knuth, when he found out, sat down and wrote the paper. It is titled Claude’s Cycles. The author block has one human name on it.
Most accounts that open with this scene immediately take one of two off-ramps. Either AI is doing science now, the singularity is warming up, call your mother. Or the model is a stochastic parrot and Knuth was being talked at by a very well-indexed library — no discovery happened, move along. Both answers are wrong in the same way. Both are answers to the wrong question. By the end of this chapter you will be able to say who actually does the discovery in a setup like Claude’s Cycles, and why both the hype and the dismissal have been answering a question the facts do not fit.
The wrong question
The question every panel asks is some version of will AI replace scientists? You have heard it posed with alarm, with excitement, with the practiced neutrality of someone who has already written their column and is now gathering quotes. It is the wrong question. Not because the stakes are low — they are not — but because the framing presupposes a clean one-for-one substitution: one scientist out, one model in, the building remains standing. Nobody serious is proposing this. Nobody serious is building it. The question is a phantom, and chasing it burns time that would be better spent on the question the facts actually support.
Two off-ramps dominate the discourse. The first is the replacement narrative: a model that produces fluent mathematical prose, candidate protein structures, or proposed algorithms has crossed a threshold, and the only remaining question is how fast the rest of the discipline gets automated. The second is the stochastic-parrot dismissal: nothing the model emits qualifies as discovery because the model does not understand what it emits, and any output that looks like a result was already implicit in the training data. Both off-ramps make the same move. They reduce a multi-step process to a single-actor verdict. Did the AI discover it, yes or no? Once you accept that framing, you are sorting into camps rather than describing what happened.
What actually happened, in Knuth and Stappers’s case and in every other case examined below, has a shape. A human posed the problem. A model proposed candidates. A verifier — sometimes a formal proof assistant, sometimes a wet lab, sometimes the same human reading carefully — accepted or rejected each candidate. The human curated what survived. None of those roles collapses into the model. None of them collapses into the human alone, either. The interesting object is the loop, and the loop has been the central structure in computer-assisted discovery for fifty years. What changed since 2022 is one component inside the loop, not the loop itself. The rest of this chapter is about which component changed, what stayed the same, and what that means both for the people who do science and for the people who pay for it, regulate it, or rely on its output.
Replacement is empirically wrong
The clearest way to dispatch the replacement narrative is to walk the cases. Take the five most-cited examples of “AI doing science” from the past three years and ask, in each, who chose the question, who ran the loop, and whose name went on the published artifact. The answers are not subtle.
In Claude’s Cycles, Knuth chose the problem — he was the one who phrased the original case in graph-theoretic terms, the one who decided this object was worth a question.2 Stappers ran 31 numbered explorations with Claude. Some were wrong. Some were interesting. Some pointed at structure nobody had named. Stappers verified the candidate solutions by hand and by program, and Knuth read the chat logs, judged the construction valid, and wrote the paper. The chat log is not the discovery. The chat log is closer to a lab notebook from an unusually fast and tireless graduate student. The paper is the discovery, and the paper has a single human author who understood every line of it well enough to stake his name on it.
In Terence Tao’s machine-assisted proofs, the kernel is Lean — a formal proof assistant, human-built and human-audited, that either accepts the step or rejects it.3 The model makes suggestions inside that kernel. There is no appeal from Lean’s verdict. Strip out the kernel and you do not have a looser version of the same process; you have unverified text, which is not a proof. The kernel does the work of converting proposal into theorem, and the kernel is ours.
DeepMind’s AlphaFold predicted the structures of two hundred million proteins, an achievement that compressed a fifty-year-long problem into months.4 The follow-on, AlphaFold 3, extended the predictions to DNA, RNA, and ligand interactions.5 The breathless framing is real and deserved. The structural framing matters too: the choice to bet the company on protein-structure prediction was a human judgement. The evaluation was designed against the Protein Data Bank, a human-curated archive of wet-lab structures measured at enormous expense over decades. The follow-up validations were grown in physical labs against new crystal structures. The poster headline reads two hundred million. The science underneath it — what to predict, how to evaluate, which wet-lab measurements to trust — was human work all the way down. You do not get the poster without it.
DeepMind’s GNoME proposed 380,000 candidate stable crystals.6 7 Sit with that number for a moment. Then look at the other number: the autonomous A-Lab synthesised 41 of the predicted structures in 17 days of operation, the wet-lab side of the same project.8 The proposer was fast and prolific. The verifier was slow and expensive and irreplaceable. Both numbers appear in the published record because both steps happened. One is the screen. The other is the crystal in your hand.
AlphaProof earned a silver-medal score at the 2024 International Mathematical Olympiad.9 The result counts precisely because the proofs were formalised in Lean before anyone called them proofs. The formalisation is not a technicality. Without it you have a language model producing confident mathematical text, which can be fluent and wrong simultaneously. Lean is the part that turns output into theorem. The medal lives inside the kernel, not outside it.
Five cases, one shape. A named human posed the problem. A named human or a human-built formal system verified the answer. The model sat in the middle, proposing fast and without complaint. That middle role is new. It matters in a specific way I will name shortly. But the middle role alone is not the discovery. The loop is. And the loop has a human in it.
This isn’t new
The replacement off-ramp depends on the perception that 2022 was a clean break. It wasn’t. Computer-assisted discovery has a continuous fifty-year history; the architecture has been stable, and the architecture is the part that matters.
In 1976, Kenneth Appel and Wolfgang Haken proved the four-colour theorem by feeding a computer 1,482 reducible configurations and having it check them exhaustively.10 The proof was correct. The mathematical community spent years fighting about it anyway — not about the result, but about the question underneath: is it still mathematics if no human can read every step? The bruise is the same one the discipline is pressing on now about LLM-assisted proofs. The theorem changes. The anxiety doesn’t. Same furrow in the same brow, fifty years apart.
Thomas Hales proved Kepler’s conjecture in 1998 — the old intuition that hexagonal close packing actually is the densest way to stack cannonballs.11 His proof ran to three hundred pages of text and gigabytes of computation, and the referees, working in good faith for years, eventually told the journal they were ninety-nine percent certain the result was correct and could not get further. So Hales started Flyspeck, a formal verification effort in HOL Light that took another sixteen years and finished in 2014. Human poses the conjecture. Computer searches the case space. Human writes the argument. Formal system checks it cold. Different decade, same shape.
Between those waypoints sit the slower, less-celebrated efforts. Computer algebra systems in the 1980s and 1990s. SAT solvers that demolished open problems in combinatorics. Symbolic regression systems like AI-Feynman, which rediscovered Feynman’s hundred classical physics equations from raw data without anyone writing a parser for the underlying mathematics.12 The lineage is not LLM-shaped. It is loop-shaped: a proposer that explores a space, a verifier that prunes the candidates, a human that decides which candidates were worth proposing.
What actually changed in 2022 is exactly one thing: open-ended, cross-domain proposers. FunSearch generates the search program rather than executing one you already specified.13 AlphaEvolve evolves the algorithm itself.14 A general-purpose model sketches proof strategies, suggests representations, proposes intermediate lemmas you did not think to ask for. These are systems that generate the search rather than execute one a human specified. That is a real change. It closes a gap that sat open for fifty years — the gap between here is the space to search and here is how to define the space. Everything else in the architecture — verifier, curator, the human keeping score — traces back to Appel and Haken’s IBM 360 running through its cases in Urbana-Champaign in the summer of 1976.
And the unchanged parts are not footnotes. Verification: unchanged. Curation of which results are worth naming: unchanged. Naming the problem in the first place: unchanged. Choosing the representation that makes the problem tractable at all: unchanged. These are the hard parts. These are also the parts no current AI does on its own.
The shape of the loop
If one finding from this chapter is worth carrying forward, it is not any particular result. It is the shape of the loop that produced them. A human poses the problem and represents it formally. A model proposes candidates. A reliable verifier filters the candidates. A human curates what survives. You could draw this on the back of an envelope in thirty seconds. The striking thing is that the same envelope fits AlphaProof working through Euclidean geometry, GNoME searching crystal space, FunSearch hunting combinatorial extremes, and Stappers sitting at his desk asking Claude about Cayley graphs. Same loop, different domain.
The proposer and the verifier are different jobs
The proposer and the verifier are doing different jobs, and they have to be built differently. The proposer is expensive, creative, and allowed to be wrong — it is supposed to throw candidates at the wall. The verifier is cheap, reliable, and not allowed to be wrong — its entire value comes from saying no and meaning it. Mixing the two roles, or being vague about which tool plays which, is how these systems fail. Part I’s chapter on agentic systems names the structural reason: any loop without an external verifier eventually decides for itself when it is done, and the only honest stopping rule a model can give itself is “out of tokens.” This chapter is the working-scientist version of that point, applied to discovery rather than to coding agents.
Walk the verifiers case by case and the pattern is tangible. In AlphaProof, Lean’s kernel is the verifier — a small, formally specified piece of software whose only job is to reject invalid proofs. In FunSearch, an automated evaluator runs the candidate program and measures its output against a known criterion. In AlphaTensor, the verifier is a timer on real silicon — you run the generated matrix routine and clock it.15 In GNoME paired with A-Lab, the verifier is a wet lab: a synthesis team and a robotic platform physically grew dozens of the predicted structures and measured whether they were stable.16 In Claude’s Cycles, Stappers and Knuth re-derived the construction by hand, and Stappers ran the verifier program over odd \(m\) up to 101.17 Every loop has a verifier. Every verifier’s only job is to say no.
Weaken the verifier and the loop collapses into noise. Galactica is the cautionary tale here — a large language model trained on scientific text, released in 2022, retracted within days because it produced plausible-sounding research papers in any field on demand, citations and all, including for results that did not exist.18 The proposer was fluent. There was no verifier. The output was fluent nonsense at scale. The lesson generalises: a better proposer with a weak verifier produces more confident nonsense, faster. A modest proposer with a tight verifier extracts real discoveries. The research dollars currently chasing bigger proposers are buying the wrong axis.
Tao’s principle of composability
Tao’s design principle, stated almost in passing in his essay but worth elevating, is that you combine tools so that one counteracts the weakness of another.19 This is not a happy accident of AlphaProof’s architecture. It is the design philosophy the loop is built around. You do not pick the single best tool and run it alone. You ask what each tool cannot do, and pair it with something whose strengths cover that gap. The resulting system is more reliable than any of its components, in the same way a three-legged stool is steadier than any single leg.
Four pairings, stated plainly. A language model hallucinates; a proof assistant verifies. A proof assistant is tedious to drive; a language model autocompletes the routine steps. A machine-learning model is opaque; a SAT solver and human inspection probe its structure. A SAT solver chokes at scale; a language model proposes a decomposition that cuts the case count. No single tool is reliable end-to-end. The combination is. Each pairing is a handshake — one partner’s output is the other partner’s input, and the weakness of the first is exactly what the second was designed to handle.
The unit of analysis is the system, not the model. The question worth asking is not can GPT-N solve competition mathematics? but which combination of tools, arranged in which workflow, verified by which oracle, can solve competition mathematics? The first question reads like a benchmark. The second reads like a research programme.
What is genuinely new since 2022
What is actually new since 2022 is two changes inside the proposer slot. The first is open-ended program synthesis in the service of search. FunSearch, AlphaEvolve, and — in the cycles paper — Claude do not just propose answers. They write programs that search.20 21 The output is not a candidate theorem or a candidate crystal structure; it is a procedure that, when run, explores a space of candidates and surfaces the extremes. That is a different capability, not a scaling of the old one.
The second change is cross-domain transfer at useful fidelity. A model trained on the broad sweep of human text — code, proofs, biology papers, folklore, recipes — produces proposals that are useful in domains it was never custom-trained on. Stappers was not using a “combinatorics AI” tuned on Cayley-graph literature. He was using the same general-purpose model anyone has in their browser. The model’s breadth is the point. It means a researcher does not need to wait for a domain-specific system to be built and trained before they can start exploring; they can start today, and the proposals will be rough but not useless.
The tight LLM-to-formal-verifier loop is also new, and the precision matters. Lean existed before 2022. Language models existed before 2022. The working pair — where the model drafts proof steps, Lean checks them, the result feeds back, and the system iterates without a human in the inner loop — did not exist at useful scale before roughly 2023.22 That pairing is what AlphaProof runs on. The same pairing animates Coscientist, an LLM-driven chemistry planner that designs experiments, dispatches them to a robotic platform, reads the results, and revises the plan — without a human in the inner loop, but with a human-built lab as the verifier.23 ChemCrow extends the pattern with tool-augmented chemistry.24 These are not incremental over what existed in 2021. They are a new architectural species, and the species name worth using is AI lab member, not AI scientist.
What hasn’t changed
Humans still pose the problems. The IMO questions came from human mathematicians. The crystal-structure space for GNoME was shaped by human crystallographers who defined what counts as a stable structure worth caring about. The conjecture Stappers brought to Claude was written in full, careful sentences by Knuth in a textbook exercise twenty pages long. The model does not walk in from outside and announce a direction. It responds to one.
Curation is still human too. The results that make it into published papers are the ones a scientist looked at and decided to trust. Verification still matters. Without a verifier whose answer the community actually believes, every output from the proposer is noise wearing the costume of a result. Same shape as 1976, give or take a proof assistant. New ingredient in the middle. Same shape.
Two strong opposing views
The most useful thing this chapter can do at this point is to take the two strongest objections seriously, give each one the sharpest edge it can have, and then say precisely where each cuts wrong. Both have real force.
Start with the maximalist objection: AI is already doing science, and the hedged language above is just reluctance to admit it. The case is not trivial. FunSearch found a better bound on the cap-set problem — a result that human mathematicians had not found in years of trying. AlphaTensor discovered a faster matrix-multiplication algorithm. When Stappers gave Claude the Cayley-digraph conjecture, the model proposed a fiber decomposition that Knuth — who has been thinking about exactly this kind of object for sixty years — had not written down. The output surprised the room. Isn’t surprise the signature of discovery? If the human in the room learns something they did not know, and the new knowledge came from the AI, what else would you call that?
The answer matters and it is not a hedge. Surprise to the observer is necessary for discovery, but it is nowhere near sufficient. Discovery is a three-part act. Someone has to choose which question is worth asking. Someone has to pick the representation — the coordinate system, the abstraction level, the vocabulary in which a solution is even expressible. And someone has to judge which surprises matter, which ones are mathematical curiosities and which ones are real results worth pursuing. What FunSearch and AlphaTensor actually did is the third act, once the first two had already run. The humans chose cap sets. The humans chose matrix multiplication. The humans decided that a one-multiplication improvement in a \(4 \times 4\) algorithm was worth a Nature paper. The AI found the move. That is not nothing. But it is not the whole game. And the maximalist case quietly assumes that the verifier and the question-chooser will be automated next; the empirical record is that they are not, that nobody knows how to automate them, and that confident timelines for when this will change are best read as fundraising prose.
Now the opposing view from the other direction: AI is just a tool, and calling it a “lab member” is overheated language for what is functionally a very fast calculator. Scientists have always used tools to extend their reach — the telescope, the spectrometer, the Monte Carlo simulation, Excel. We don’t say Excel does finance. We don’t give the centrifuge a co-authorship. The hype around AI in science is a category error: the speed and fluency dazzle, but the structure is the same as every other tool, and calling it something grander doesn’t make it so.
This view is attractive. It keeps the ontology clean. It is also wrong about the structure, and the reason is not subtle. Excel does not propose conjectures. When AlphaTensor searched the space of possible algorithms, it was not executing a procedure a human had already specified — it was generating candidates that the human had not imagined, evaluating them against criteria, and surfacing the ones worth attention. That is the structure of hypothesis generation. We have a word for the kind of work that produces novel candidates for human evaluation: it is called scientific creativity, and the discipline has historically said it requires a scientist. Now something that is not a scientist is doing a recognisable share of it. You can decline to update your taxonomy, but the taxonomy has stopped tracking the object it was supposed to describe. Calling AlphaTensor “just a tool” is accurate in the sense that a hammer is a tool. It is wildly inaccurate in the sense that a hammer does not tell you where to put the nail.
Neither off-ramp lands. AI scientist overstates the agency and, more practically, dissolves accountability in a place where accountability matters enormously. AI tool understates the structural shift and lets the discipline off the hook from thinking carefully about what has changed. The right frame is AI lab member: a participant with a specific seat at the table, a specific kind of contribution, and no claim to the chair at the head of the room. Credit, responsibility, and direction stay with the humans who chose the question, wrote the paper, and will answer for it if something turns out to be wrong.
What this does to the publishable paper
The previous sections argued about who does the discovery. There is a related shift that lands one rung up the institutional ladder, on the unit of credit. When a single loop can read the background literature, propose the hypothesis, run the verification, and draft the paper, a published paper stops being a clean proxy for research was done. The unit of credit decouples from the act of discovery.
Take the affirmative case first, because it is the strongest reading of the change and the one most likely to age well. Strip the mechanical paper-production work — literature scan, background section, methods boilerplate, figure layout, references — and what is left as the scarce, valuable thing in the workflow is exactly what AI cannot supply: a novel question worth asking, and the taste to pick which problems matter. The proposer got cheap. Choosing what to prove did not. A field that can no longer coast on producing papers is forced back onto producing ideas, and ideas are the thing science was supposed to produce in the first place. For a researcher who has been measured on paper count for a decade, this is uncomfortable. For a researcher who has been measured on what they actually think about, this is liberating. The chapter’s own thesis, applied to incentives: the bottleneck moves to the part that AI does not touch, which is choosing the question and judging the answer.
Now the harder reading, because the same cheapening is a Goodhart accelerant on metrics that were already failing. We have been publishing more and more papers for decades without proportionally more knowledge. Park, Leahey, and Funk analysed 25 million papers and 4 million patents and found that disruption — by their CD-index measure of how much a work displaces what came before — has declined monotonically since 1945, while paper counts have exploded.25 Ioannidis documented hyperauthorship at a scale that made the unit of credit barely recognisable; thousands of scientists now publish a paper every five days on average, which is to say almost none of those names are doing the work the byline implies.26 The paper-mill problem — fabricated manuscripts, citation cartels, retraction waves measured in the tens of thousands per year — was already a structural crisis by 2024, before LLM-assisted fabrication had even hit its stride.27 Goodhart’s law, in the Strathern formulation, names the underlying pathology: when a measure becomes a target, it ceases to be a good measure.28 Citation count became a target decades ago. AI is not the cause of the breakage. It is the accelerant that makes the breakage impossible to ignore.
State both edges plainly. The optimistic edge: AI strips mechanical paper-production work, makes question-selection and taste-judgement the scarce skills, and forces an honest reckoning with what science is actually producing. The pessimistic edge: AI pours fuel on paper-count-as-metric, generates least-publishable-units at scale, multiplies reviewer load, and lets paper mills industrialise. The pragmatist reading is neither AI fixes science nor AI drowns science. The metric was already broken. AI is what makes the breakage acute enough to act on. That is the opportunity, and it is a real one.
For the working researcher, the practical implication is that optimising for paper count is now a losing game on every horizon longer than the next grant cycle. Taste, question-selection, and the ability to recognise which problems matter are the scarce and durable goods. For the funder, the executive, or the policymaker reading the numbers from a distance, the implication is harder: number of papers was never a good measurement of amount of science, and pretending it was has just become more expensive. Part III returns to the broader institutional consequences of measurement collapse; the move worth making here is to stop treating publication volume as a proxy for anything in particular and to start asking what each published paper actually claims, who verified the claim, and whether anyone could reproduce it.
What this means — for science, and for everyone
The dual-read of this chapter splits cleanly along the audience.
If you do science, the operating discipline follows directly from the shape of the loop. Keep proposer and verifier separate. Build the verifier first or buy the verifier off the shelf; it is the central component, and it is currently the underinvested one. Use the model for breadth — for proposals you would not have generated alone, for representations you had not considered, for connections across literatures you do not have time to read in full. Keep the human at the two ends of the loop: choose the question, judge the answer. Expect exploration-15 surprises, but never trust them unverified. Document the search the way Stappers documented the explorations — every numbered run, the prompt, the output, the verifier’s verdict. That document is the difference between a result you can defend and a result you merely believe. None of this is glamorous. All of it is the discipline that turns a fluent proposer into a reproducible practice.
If you don’t do science, but science’s outputs land on your desk — as a policymaker funding it, an executive consuming it, a citizen reading the headlines — the change worth tracking is how the knowledge supply chain is shifting. Proposal is faster and cheaper. Verification has not improved. Curation is bottlenecked on human judgement and probably will be for a long time. That asymmetry means more candidate discoveries arrive, but the certification that any one of them is real takes the same amount of work it always did. The temptation to celebrate the proposer’s output as if it were the verified result is the cognitive shape of every snake-oil cycle in the history of science. The pragmatic counter is to ask, of any AI-aided scientific claim, what was the verifier, and who built it? The answer is informative. Part III returns to what happens to public trust in science when this question is not asked.
Closing — who actually does the discovery
An 88-year-old mathematician, a problem he has been carrying around for years, a friend who decided on a hunch to put it to a language model. Filip Stappers ran 31 explorations with Claude. The breakthrough came at exploration 15, the construction at exploration 31, and Stappers verified the program over odd \(m\) up to 101. What Knuth did was different and harder. He read what had happened, judged the construction valid, and wrote a single sentence in his postscript that will outlive most papers published this year: I’ll have to revise my opinions about generative AI one of these days.29 That is the news. Not that a theorem was proved. That a person who had every reason to ignore this technology looked at it clearly and let it change his mind.
The loop does the discovery. Humans still own the question and the verifier. The author block has one human name on it for a reason. Whether the discipline of science survives the next decade in a shape worth recognising depends on which part of the loop the institutions decide to invest in. The proposer is the dazzling part. The verifier is the one that matters. Choose accordingly.
Knuth, D. E. Claude’s Cycles. Self-published note, 2026, with explorations by Filip Stappers. https://cs.stanford.edu/~knuth/papers/claudes-cycles.pdf. The 31 numbered explorations, the fiber decomposition at exploration 15, and the program verified for every odd \(m\) up to 101 are documented in the paper’s appendix. Knuth’s postscript on revising his opinions about generative AI is in the same source.↩︎
Knuth, D. E. Claude’s Cycles. Self-published note, 2026, with explorations by Filip Stappers. https://cs.stanford.edu/~knuth/papers/claudes-cycles.pdf. The 31 numbered explorations, the fiber decomposition at exploration 15, and the program verified for every odd \(m\) up to 101 are documented in the paper’s appendix. Knuth’s postscript on revising his opinions about generative AI is in the same source.↩︎
Tao, T. Machine-assisted proofs. Notices of the American Mathematical Society 72(1):6–13, January 2025. The canonical statement of the composability principle — combining tools so that one counteracts the weakness of another — appears in the closing section.↩︎
Jumper, J., Evans, R., Pritzel, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596: 583–589, 2021. https://www.nature.com/articles/s41586-021-03819-2. The 2024 Nobel Prize in Chemistry was awarded in part for the work behind this paper.↩︎
Abramson, J., Adler, J., Dunger, J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630: 493–500, 2024. https://www.nature.com/articles/s41586-024-07487-w. Extends the AlphaFold lineage to DNA, RNA, and small-molecule ligand interactions.↩︎
DeepMind. Millions of new materials discovered with deep learning. Blog post, 29 November 2023. The headline 380,000 stable-candidate figure is from this announcement; the primary paper is Merchant et al. 2023, cited separately.↩︎
Merchant, A., Batzner, S., Schoenholz, S. S., et al. Scaling deep learning for materials discovery. Nature 624: 80–85, 2023. https://www.nature.com/articles/s41586-023-06735-9. The primary paper for GNoME.↩︎
Szymanski, N. J., Rendy, B., Fei, Y., et al. An autonomous laboratory for the accelerated synthesis of novel materials. Nature 624: 86–91, 2023. https://www.nature.com/articles/s41586-023-06734-w. The A-Lab autonomous-synthesis platform that physically verified candidates from the GNoME screen — 41 syntheses in 17 days of operation.↩︎
DeepMind. AI achieves silver-medal standard solving International Mathematical Olympiad problems. Blog announcement, 25 July 2024. The AlphaProof and AlphaGeometry 2 systems, formalising candidate proofs in Lean before scoring.↩︎
Wikipedia, Four color theorem (consulted 2026). Appel and Haken’s 1976 computer-assisted proof; the human-readable history of the discipline’s response is documented in the article’s Simplification and verification section.↩︎
Wikipedia, Kepler conjecture (consulted 2026). Hales’s 1998 proof and the Flyspeck formal-verification project, completed 2014.↩︎
Udrescu, S.-M., & Tegmark, M. AI Feynman: a Physics-Inspired Method for Symbolic Regression. arXiv:1905.11481, 2020. https://arxiv.org/abs/1905.11481. Recovers 100 of the 100 equations in Feynman’s Lectures on Physics from numerical data, with no LLM in the loop — useful as the pre-2022 waypoint in the computer-assisted-discovery lineage.↩︎
Romera-Paredes, B., Barekatain, M., Novikov, A., et al. Mathematical discoveries from program search with large language models. Nature 625: 468–475, 2024 (DeepMind blog announcement, December 2023). Open-ended program search guided by an LLM; produced new lower bounds on the cap-set problem.↩︎
DeepMind. AlphaEvolve: a Gemini-powered coding agent for designing advanced algorithms. Blog announcement, May 2025. Evolutionary program synthesis over algorithm space.↩︎
DeepMind. Discovering novel algorithms with AlphaTensor. Blog announcement, October 2022. Faster algorithms for matrix multiplication discovered by reinforcement learning over tensor decompositions.↩︎
Szymanski, N. J., Rendy, B., Fei, Y., et al. An autonomous laboratory for the accelerated synthesis of novel materials. Nature 624: 86–91, 2023. https://www.nature.com/articles/s41586-023-06734-w. The A-Lab autonomous-synthesis platform that physically verified candidates from the GNoME screen — 41 syntheses in 17 days of operation.↩︎
Knuth, D. E. Claude’s Cycles. Self-published note, 2026, with explorations by Filip Stappers. https://cs.stanford.edu/~knuth/papers/claudes-cycles.pdf. The 31 numbered explorations, the fiber decomposition at exploration 15, and the program verified for every odd \(m\) up to 101 are documented in the paper’s appendix. Knuth’s postscript on revising his opinions about generative AI is in the same source.↩︎
Taylor, R., Kardas, M., Cucurull, G., et al. Galactica: A Large Language Model for Science. arXiv:2211.09085, 2022. https://arxiv.org/abs/2211.09085. The model Meta withdrew within days of release after demonstrators showed it would generate plausible-sounding but fabricated scientific papers on request — the canonical cautionary tale for a fluent proposer without a verifier.↩︎
Tao, T. Machine-assisted proofs. Notices of the American Mathematical Society 72(1):6–13, January 2025. The canonical statement of the composability principle — combining tools so that one counteracts the weakness of another — appears in the closing section.↩︎
Romera-Paredes, B., Barekatain, M., Novikov, A., et al. Mathematical discoveries from program search with large language models. Nature 625: 468–475, 2024 (DeepMind blog announcement, December 2023). Open-ended program search guided by an LLM; produced new lower bounds on the cap-set problem.↩︎
DeepMind. AlphaEvolve: a Gemini-powered coding agent for designing advanced algorithms. Blog announcement, May 2025. Evolutionary program synthesis over algorithm space.↩︎
DeepMind. AI achieves silver-medal standard solving International Mathematical Olympiad problems. Blog announcement, 25 July 2024. The AlphaProof and AlphaGeometry 2 systems, formalising candidate proofs in Lean before scoring.↩︎
Boiko, D. A., MacKnight, R., Kline, B., & Gomes, G. Autonomous chemical research with large language models. Nature 624: 570–578, 2023. https://www.nature.com/articles/s41586-023-06792-0. An LLM-driven planner that designs and executes chemical experiments on a robotic platform with a wet-lab verifier in the loop.↩︎
Bran, A. M., Cox, S., Schilter, O., et al. ChemCrow: Augmenting large language models with chemistry tools. arXiv:2304.05376, 2023. https://arxiv.org/abs/2304.05376. Tool-augmented LLM for synthesis planning; companion to Coscientist as evidence of the LLM-plus-tools architecture in chemistry.↩︎
Park, M., Leahey, E., & Funk, R. J. Papers and patents are becoming less disruptive over time. Nature 613: 138–144, 2023. https://www.nature.com/articles/s41586-022-05543-x. 25 million papers and 4 million patents analysed; the CD-index of disruption declines monotonically since 1945 while publication volume grows.↩︎
Ioannidis, J. P. A., Klavans, R., & Boyack, K. W. Thousands of scientists publish a paper every five days. Nature 561: 167–169, 2018. The hyperauthorship analysis grounding the “more papers, not more knowledge” claim.↩︎
Else, H., & Van Noorden, R. Science’s fake-paper problem: high-profile effort will tackle paper mills. Nature 626: 17–18, 2024. https://www.nature.com/articles/d41586-024-00159-9. The 2024 state of paper-mill retractions and the cross-publisher Quill–Sentinel response.↩︎
Wikipedia, Goodhart’s law (consulted 2026). Includes the Strathern (1997) formulation — when a measure becomes a target, it ceases to be a good measure — and the original Goodhart 1975 monetary-policy context.↩︎
Knuth, D. E. Claude’s Cycles. Self-published note, 2026, with explorations by Filip Stappers. https://cs.stanford.edu/~knuth/papers/claudes-cycles.pdf. The 31 numbered explorations, the fiber decomposition at exploration 15, and the program verified for every odd \(m\) up to 101 are documented in the paper’s appendix. Knuth’s postscript on revising his opinions about generative AI is in the same source.↩︎