6 Agentic AI

A language model, on its own, has no memory between turns, no eyes on the present moment, no hands to touch anything in the world. It can describe how to send an email; it cannot send one. It can list the airlines that fly to Madrid; it cannot tell you which seat is cheapest tonight. The fluent assistant of chapter 4 is, mechanically, a transcript completer. The chatbot you actually use is something else, and the difference between the two is what this chapter is about.

The difference is a loop. Wrap a language model in a procedure that lets it look things up, perform actions, and observe what happens, and you have something the field has been trying to build for fifty years. The shell is not new. The shell was always there, sitting in the appendix of the AI textbook, waiting for a brain competent enough to fill it. By the end of this chapter you can name every part of that loop: what feeds the model information, what lets it act, and the bookkeeping that keeps the whole thing accountable. One boundary up front: this is how an agent works, not how to use one well. Using one well is Part II’s subject.

From the classical agent to the linguistic agent

The shell is fifty years old. An agent is an entity that perceives its environment through sensors, deliberates over what to do, acts on the environment through actuators, and observes the consequences, repeating until done.¹ That sentence has been in the textbook since 1995. The picture has not changed.

What has changed, three times, is what fills the deliberation slot. In the 1970s and 1980s, the slot held a logic engine; STRIPS planning, expert-system rules, BDI architectures. In the 2010s, after deep reinforcement learning took off, the slot held a learned policy network; DQN, AlphaGo, the robots in OpenAI Gym. Since 2024, the slot has held a large language model. Call these the three brains: the same agent shell across half a century, a different cognition each time. Each new brain inherited a shell the previous generation had built and could not animate well enough to make useful.

For this chapter, define a linguistic agent as an agent whose deliberation slot is a language model and whose perception and action happen through text. The agent’s eyes are text the model gets to read; its hands are tool calls the model gets to emit. Everything else in the chapter is mechanism around those two points of contact.

Why one model cycle is not enough

Chapter 4 described the model as a fixed function that takes a sequence of tokens and produces a probability distribution over the next one, sampled to a single token, appended, run again. Run that cycle in isolation and you can do many useful things: summarise a document, translate a paragraph, draft an email, explain a concept. What you cannot do is anything that requires information the model does not already have in its weights, or any change to the world outside the transcript.

The Madrid example will run through the rest of this chapter, so introduce it now. Imagine the request is find me the cheapest flight from Havana to Madrid next Tuesday. A bare language model, by itself, fails this in three different ways at once. It does not know what next Tuesday refers to, because its training corpus has no concept of now. It has never seen the airline’s current price list, because the prices change daily and the model’s weights were frozen months ago. Even if it knew the right flight, it cannot book it, because booking requires an HTTP call to an airline, and the model does not make HTTP calls. It generates tokens.

The fix is not a smarter model. The fix is a control structure. Wrap the model in a loop that does three things the bare cycle cannot. First, supply it with the information it lacks, by pulling fresh text into the prompt before the model runs. Second, give it a way to act, by letting one kind of output it can produce be a request to execute a real function. Third, give it a way to observe results, by feeding the function’s return value back into the prompt for the next pass. The next three sections are exactly those three additions, each made mechanical.

Retrieval-augmented generation

The simplest augmentation is retrieval. Before the model runs, fetch the text most relevant to the user’s question, paste that text into the prompt, then let the model answer. The pattern, called retrieval-augmented generation or RAG, was the first widely deployed move beyond the bare model.

Mechanically there are four steps. Take the user’s query and convert it into a representation the retrieval system can search. For a vector store, this means embedding the query into a high-dimensional vector using the same embedding model that produced the vectors for the stored chunks. Search the store for the chunks whose vectors are closest to the query vector by some similarity metric, typically cosine similarity. Optionally rerank those results with a heavier model, because the cheap similarity metric overranks generic matches and underweights specifics. Finally, paste the top few chunks into the prompt together with the user’s question and an instruction along the lines of answer using only the context above.

The same shape works with other retrieval substrates, with different mechanics behind the search. Full-text engines like Elasticsearch use term overlap with statistical weighting; SQL databases require the model itself to write a query, often as a small function call; knowledge graphs require either a query language like Cypher or a navigation protocol the model can follow step by step. The substrate decides what relevant means; the structure is the same.

Apply this to Madrid. The user asks for the cheapest flight; before the model speaks, a retrieval step embeds the query and pulls the relevant fare-page snippet from a daily-refreshed cache, along with the airline’s policy on changing dates. The model reads both, plus the question, and answers in one cycle. It has not become smarter; it has become better informed. RAG adds knowledge, not capability. The model still has to reason over the retrieved text. If the page lists ten flights and the model picks the wrong one, retrieval did nothing wrong. The reasoning step did.

The mechanism has three well-known failure modes, all derivable from the description above. The first is bad retrieval. The embedding similarity finds chunks that share vocabulary with the query but not meaning, and the model dutifully answers from text that was not relevant in the first place. The second is retrieval ignored. The right chunk lands in the prompt, but the model leans on its own pretrained beliefs and produces an answer the context contradicts. The third is retrieval buried. So many chunks get pasted in that the answer is in the prompt somewhere, but the model attends to the wrong place. Each of these is a property of the cycle in chapter 4 meeting the cycle in this section. None of them is fixed by being clever about the prompt. The mitigations, when they exist, are operational, and they belong to Part II.

The structural fact worth carrying forward is small. RAG is the perception half of the agent shell, in its simplest form. The model’s eyes do not look at the world directly; they look at a prompt window that the system around them has filled with text. Retrieval is how you fill that window with text the model needs.

Function calling and tool use

Retrieval fixed perception. Function calling fixes action.

The mechanism, as it shipped from OpenAI in mid-2023 and was picked up by Anthropic and Google within months, has four steps.² First, the developer defines a set of functions the model may invoke: each function has a name, a natural-language description of what it does, and a JSON schema describing its arguments and their types. These descriptions are inserted into the model’s context along with the user’s message, by way of a system prompt that says, in effect, here are some tools; if invoking one would help, emit a tool call instead of a plain answer.

Second, when the model decides to invoke a tool, it does not call anything. It cannot. It is a token generator. What it does is emit a structured response: a JSON blob naming the function and supplying argument values, or, in Anthropic’s variant, a special tool-use block. Third, the harness sees that structured response, parses it, and actually invokes the named function with the supplied arguments. The harness is the piece of software between the model and the world; we devote §5 to it. Fourth, the harness takes the function’s return value, formats it back into the conversation as a tool-result message, and re-runs the model with the augmented context. The model now sees the new information and decides what to do next, which may be another tool call or, finally, an answer to the user.

Apply this to Madrid. The user asks for the cheapest flight; the model emits a call to search_flights with arguments {from: "HAV", to: "MAD", date: "2026-05-26"}. The harness invokes the real flight-search API, returns the rows it gets back, and re-runs the model. The model reads the rows, picks the cheapest, and either reports back to the user or emits a second call to book_flight with the chosen flight identifier and a payment token. The harness invokes the booking endpoint, returns the confirmation number, and the model writes a final reply that includes the booking. Three model cycles, two tool calls, one transaction in the world.

The conceptual payoff of this section, if you only take one thing from it: this is the moment the symbolic-agent action space of the 1970s finally meets a competent deliberator. The shell had always had named, discrete actions; STRIPS planning was built on them. What the shell never had, until now, was something that could choose between actions by understanding what the action would do. Fifty years of symbolic agents had crisp action spaces and a brain too narrow to make use of them. The 2024 linguistic agent inherits both.

There is a property of this arrangement worth stating once, because it shapes everything in §9. The tool, on the other side of the call, is deterministic; search_flights does what it does, reliably, every time. The decision to call the tool, and the choice of arguments, is statistical, made by a model that picks tokens to maximise plausibility. A reliable mechanism is being driven by a probabilistic caller. That asymmetry is why tool design, the act of shaping the function signature, the description text, the failure modes returned to the caller, decides far more than it looks like it should; getting it right is a Part II topic in its own right. The mechanism here is simple. The discipline of using it well is not.

The original function-calling release is by now the floor of the technique, not the ceiling. The community has demonstrated agents that interleave tool calls with explicit reasoning traces; Yao and collaborators’ ReAct pattern in 2022 was the canonical early form, in which the model alternates a thought, an action, and an observation before each step, and downstream agents inherit much of that scaffolding.³ Toolformer, also from 2023, went further and showed that a model can be fine-tuned to insert its own tool calls at training time, learning when a tool would help by being shown examples where the tool was useful.⁴ By 2024, structured tool calling is a baseline feature of every major model API.

One narrower mechanism deserves a paragraph because it answers a question that comes up later. The Model Context Protocol, introduced by Anthropic in late 2024, is the same idea standardised across vendors.⁵ An MCP server exposes tools, prompts, and data resources over a small protocol, and any MCP-capable client (a chat application, an IDE, a coding agent) can connect to that server and use what it exposes without bespoke integration code. The mechanical claim is small: instead of every harness writing its own integration with every tool, the harness speaks one protocol, and the tool’s author writes one server. The reason a standard matters is the same reason HTTP mattered for the web. Once consent, schemas, and capability negotiation are protocol primitives rather than per-application code, the ecosystem around them gets to grow. We will come back to MCP as a concrete instance in §5, after the harness is on the table.

The harness

This section introduces the chapter’s new conceptual primitive. The harness is everything that lives between the language model and the world. It is not the model. It is not the world. It is the system that takes the model’s outputs, decides whether and how to act on them, performs the action, captures what happened, and feeds the result back into the next prompt. In the function-calling description above, the harness is the piece that parses the tool call, runs the function, formats the result, and re-enters the model. In an agentic coding system, the harness is the file system, the shell, the test runner, the browser, the network, plus the schema layer that exposes those to the model as tools, plus the policy layer that decides which of them the model is allowed to touch and under what conditions. The harness is the new word for the symbolic structure that wraps the empiricist core.

The reason the harness is its own primitive is that it is where the accountability lives. The model can hallucinate; the harness cannot. The model can suggest deleting a directory; the harness is the layer that decides whether to actually run rm -rf or to refuse, log the attempt, and pass an error back. The harness is where permission gates sit, where dry-run defaults sit, where audit logs are written, where sandboxes are enforced. The asymmetry of the previous section, a probabilistic caller driving a deterministic mechanism, only becomes safe when something between the two can examine the call before it lands. That something is the harness.

A linguistic agent’s reliability is, in practice, a property of its harness rather than its model. Models hallucinate at a rate that varies smoothly with their size and training; the floor is not zero, and probably will not be for a long time. Harnesses, by contrast, can be made arbitrarily strict, because they are ordinary software. A harness that requires explicit confirmation for any irreversible action, sandboxes every file write to a working directory, and refuses tool calls that fall outside an allow-list cannot be made unsafe by a hallucinating model. The model can suggest anything it likes; the harness will not perform the suggestion. The same model wired into a permissive harness will, sooner or later, do something it shouldn’t. Reliability and safety, in agentic AI, live in the harness, not the weights. Part III returns to this, as the operational form of the alignment problem.

The Model Context Protocol from §4 lives at exactly this layer. An MCP server is a piece of harness that another harness can talk to. The protocol’s design treats consent as a first-class primitive: the client must request capability before invoking it, and the server can deny. That sounds like ceremony until you realise it is what makes a multi-vendor ecosystem of tools possible without each one being trusted by every harness. The protocol does not solve safety; it makes the safety policy explicit and exchangeable.

Two related mechanisms deserve a mention here, both because the chapter would feel incomplete without them and because both are recurring vocabulary in the agentic literature. Skills are reusable procedural artefacts the agent rediscovers on its own. Mechanically, a skill is a file (usually a script or a structured procedure) that lives inside the harness’s filesystem, with metadata describing when to invoke it. The agent finds the skill by retrieval, runs it as a tool, and accumulates a personal library over time. Voyager, the 2023 Minecraft agent from Wang and collaborators, was an early literature-grade demonstration of this pattern: the agent wrote programs that became reusable game actions, and the library grew across a long-running episode.⁶ Each new skill became, in effect, a new tool in the agent’s action space, discovered by the agent rather than authored by the developer. The mechanism is mechanical; the engineering judgment of when to lean on it is Part II.

The external loop is the second related mechanism. So far this chapter has assumed the agent runs synchronously: the user sends a message, the model and the harness interleave until the work is done, the user reads the reply. Real long-running agents do not look like that. They have a scheduler outside the model that wakes the agent on triggers (a cron firing, a webhook arriving, a file appearing in a directory), restores the durable state from somewhere on disk, runs one stretch of the loop, and writes state back out before going dormant again. The autonomous-agent open-source frameworks of 2023, AutoGPT and its many descendants, were early attempts to make this routine.⁷ The external loop is not a different agent architecture; it is the same loop, with the while condition supplied by a clock and a durable artefact instead of an open chat window. The internal mechanism is unchanged. The orchestration around it is what makes the difference between an agent that runs for ten seconds and an agent that runs for ten days.

The harness, in short, is the layer where the structure lives. The model thinks in soft, statistical, half-formed thoughts. The harness acts in crisp, named, accountable steps. The first is what the 2010s built; the second is what the 1970s built. The combination is what this chapter is about.

The agent loop, explicit

Everything in §§2–5 has been pieces. Here is the loop that uses them.

while not done:
    prompt   = system + history + latest_observation
    response = LM(prompt)
    if response.is_final_answer:
        done = True
    else:
        result = harness.execute(response.tool_call)
        history.append(response, result)

Six lines. The loop is the keystone of the chapter, and the rest of this section walks through it slowly, because everything else hangs off these six lines.

prompt = system + history + latest_observation assembles the input to the model. The system part is the developer’s instructions: who the agent is, what it can do, what tools it has. The history is the running conversation, from the user’s first request through every tool call and tool result so far, exactly as chapter 4 described the conversation buffer. The latest observation is whatever just changed; in the simplest case it is the user’s most recent message, but during a tool-call cycle it is the tool result that just came back from the harness.

response = LM(prompt) is the one full cycle of chapter 4. Tokens go in, attention layers process them, sampling produces tokens out. For a non-reasoning model this takes a few hundred milliseconds; for a reasoning model it can take minutes, because the model is spending tokens thinking before it answers. The mechanism is identical; the cost varies.

The conditional decides what the response is. If the model emitted a final answer in natural language, set done and return it to the user. If the model emitted a tool call instead, hand it to the harness. harness.execute is where the policy of §5 lives: it parses the call, checks whether the agent is allowed to make it, runs it if so (and asks for confirmation, returns an error, or refuses outright if not), captures the result, and packages the result into a form the model can read on the next pass. The new history is the old history with the model’s response and the tool result both appended. Then the loop runs again.

Apply this to Madrid one more time, this iteration by iteration. Iteration 1: history is the user’s request find me the cheapest flight from Havana to Madrid next Tuesday; the model reads it and emits a search_flights call with the parsed parameters; the harness runs the API and returns five rows; loop. Iteration 2: history now contains the request, the tool call, and the five rows; the model reads them, identifies the cheapest, and emits a confirmation message to the user along with a book_flight call; the harness, configured with a dry-run policy for booking, prompts the user to confirm rather than booking immediately. Loop pauses on the user. Iteration 3: user confirms, the harness performs the booking, the model reads the confirmation and writes the final reply. Done.

Each piece of this loop maps to a part of the classical agent shell from chapter 1. LM(prompt) is deliberate. harness.execute is act. The tool result that comes back is perceive, in the form text-shaped enough that the model can read it. The while is the loop the classical agent always had. The four-word shell perceive, deliberate, act, observe has not changed since 1995. What changed is that the deliberation step is now occupied by something fluent enough to make the other three worth the effort.

The loop also gives you a way to count what a system is doing. Every iteration is one model cycle and zero-or-one tool calls. The cost of running an agent is the number of iterations times the per-cycle compute, plus the cost of the tool calls. The time it takes is the same, plus the latency of whatever the tools talk to. A task that fits in one cycle should not be wrapped in an agent loop; an agent loop is the right tool when the task genuinely requires either fresh information or several real-world actions. We will return to this when §9 names the failure mode of the agent that should have been a single prompt.

Memory in agents

The conversation buffer in chapter 4 was a kind of memory: the whole transcript, re-sent every turn, used as the model’s working notes. Agents have two kinds of memory, and both are mechanical extensions of things already on the table.

Short-term memory is the conversation buffer of chapter 4, unchanged. Each turn, the harness rebuilds the prompt from the system instructions, the running history, and the latest observation. The model has no persistent state inside it; everything carried from one turn to the next is in the buffer. The buffer is bounded by the model’s context window. When the window fills, something has to give: the early turns get dropped silently, or the harness summarises the history into something shorter before re-sending. There is no fourth option. Whatever the agent appears to “remember” within a session is being held by the harness in plain text, not by the model in any internal way.

Long-term memory is RAG, pointed inward. The agent writes notes to a store the harness controls: a database of past actions, a log of decisions, a file of facts learned from previous sessions. At the start of a new task, before the model runs, the harness performs a retrieval against the store using the current query, and pastes the relevant notes into the prompt as additional context. Mechanically there is nothing new here. The mechanism is the RAG of §3. The only difference is that the source of the retrieved text is the agent itself, on previous occasions, rather than an external corpus.

The deflation is worth saying out loud. The phrase an agent that learns makes it sound like the model is being retrained, or like some internal representation is being updated. It is not. The model’s weights are frozen at deployment time; nothing about a long-running conversation changes them. What is changing is the contents of an external store, and what is making the agent appear to learn is that the store grows and the right notes get retrieved at the right moments. The agent that learns is the agent that writes notes and retrieves them. Generative-agent simulations from 2023, including the work that put twenty-five language-model characters in a sandbox town and watched them form social relationships, leaned on this exact mechanism: the agents kept journals, retrieved relevant entries each cycle, and the apparent richness of their behaviour emerged from the retrieval, not from any change to the model.⁸ A second line of work, Reflexion, did the same trick with a different angle: after a failure, the agent writes a verbal critique of what went wrong, stores it, and retrieves it at the start of the next attempt at the same task; performance improves across attempts not because the model learned but because the prompt got better each time.⁹

The cost of writing this story honestly is that the parametric learning loop (the kind that updates the model’s weights) is genuinely not happening at run time, in any agent you can actually buy in 2026. The agent is using two ordinary mechanisms (a conversation buffer and a retrieval store) and choreographing them well. That is enough to do remarkable things. It is also a sharply different mechanism from the one a layperson hears the word learning and assumes is in play.

Multi-agent systems

The loop in §6 is a single agent. The harness, if you let it, can call other agents the same way it calls tools. That is the entire architecture of every multi-agent system you will ever read about.

The shapes that come up most often in the literature are easy to enumerate. A coordinator agent breaks a task into sub-tasks, dispatches each to a worker agent with the right context, and assembles the results. A critic agent reads another agent’s draft output and produces a critique, which the original agent then incorporates on a second pass. A committee of independent agents each produces a candidate answer, and a judge agent picks the best one. In every case, the mechanism is the same: the loop of §6, with the harness configured so that one of its tools is “ask another agent.”

There is a deflation here that the field is still working through. Most systems described as multi-agent in 2024 and 2025 should have been single agents with better tools. Every additional agent in the system is another loop running, another inference budget being spent, another set of failure modes to debug, another place for the wrong piece of information to be passed forward. More loops mean more failure surface, full stop. The cases where multi-agent systems pay are the cases where the sub-tasks are genuinely independent and parallelisable, or where the critic agent really does catch errors the original cannot. The cases where they do not pay are the cases where a developer reached for the multi-agent pattern because the diagram looked clean. Cognition’s engineering team published a memo on this in 2025 that the field circulated widely; the title was Don’t build multi-agents, and the operational case it makes belongs to Part II. Mechanically, the point here is small: a multi-agent system is not a different kind of architecture. It is N copies of the §6 loop, with one of N’s tools being another copy of the loop. Whether that is the right design for a given problem is a judgment call, and one Part II takes apart.

A research benchmark in the area gives a sense of how the field measures these systems against each other. AgentBench, released by Liu and collaborators in 2023, is a suite of eight environments (operating systems, databases, knowledge graphs, card games, and others) on which different LLM-driven agents are scored end to end.¹⁰ What the benchmark exposes, as much as the leaderboard, is the gap between the model’s underlying capability and the agent’s ability to use it in a sustained loop. The mechanism in this chapter is the loop. The benchmark is how the field finds out which loops, around which models, actually finish the work.

What an agent is not

The mechanism in this chapter does three things well that no single model cycle can do, and the rest of this book lives in the world it makes possible. The job of this section, before we leave the chapter, is to name three things the mechanism does not do, all of which are properties of the loop itself rather than of any particular implementation.

The first failure mode is hallucination amplification. A single hallucination, in a non-agentic setting, is one wrong sentence that the user can spot and dismiss. Inside the loop, the same hallucination becomes a wrong belief that the agent then acts on. The agent calls the wrong tool with the wrong arguments, observes a result that does not match what it expected, and tries to fit the result into the wrong belief rather than abandoning it. The next iteration runs on a context that is now subtly poisoned. Errors that the single-cycle model would have dropped after one turn accumulate across iterations. This is not a flaw to be patched away by better prompting; it is what the mechanism does when the underlying model is uncertain. The mitigations are operational, and they live in Part II.

The second failure mode is the agent that should have been a single prompt. Every iteration of the loop costs at least one model cycle, plus the latency and money of any tool calls. A task that the model could have answered in one cycle, if it had been asked plainly, becomes ten cycles when wrapped in an agent loop, with no benefit beyond the cost. The temptation to reach for the agent loop, because agents are interesting and single prompts are not, is real and well documented. The mechanism does not protect you from this; the discipline of not invoking it is the developer’s. Again, the discipline is Part II’s territory.

The third failure mode is unbounded loops without a verifier. The while not done of §6 hides a question. What decides when done is true? In a well-built system, there is something outside the model that checks: a test suite that passes, a typechecker that returns clean, a calling user who confirms. In a system without that something, the model decides done for itself, and a model that decides for itself when it is done has no reason to ever stop while there are tokens left in its budget. Loops without an external verifier can run for hours, produce gradually worse output, and end in nothing. The literature has a name for the right structure here, reinforcement learning with verifiable rewards, and the practical version of that idea is the architectural reason coding agents work as well as they do; the test suite is the verifier. The mechanism of this chapter, by itself, does not supply a verifier. The deployer does, or the system fails.

Each of these failure modes hands off somewhere. The amplification problem is most acute in long-horizon agents, and the reliability engineering that addresses it is one of the chapters of Part II. The over-eager agent loop is a topic for the practitioner chapters that take apart what to use an agent for. The unbounded loop is, in its worst form, one of the systemic concerns Part III takes up under the heading of operational alignment: not whether the model is aligned, but whether the system around the model has the structure to keep its operation within bounds the deployer can defend.

The fifty-year shell is full. The deliberation slot, empty for so long, finally holds something fluent enough to make the other three steps of the loop worth running. The shell did not change. The brain did. The loop, now that it has both halves in good working order, is one of the most far-reaching things software has ever been able to do. Whether any particular instance of it actually does anything useful depends on how well the loop is wired. The tool is the same. The use is the variable.

Russell, S. & Norvig, P. Artificial Intelligence: A Modern Approach, §2, 4th ed. Pearson, 2020. The agent abstraction this chapter inherits is essentially unchanged across the textbook’s four editions.↩︎
OpenAI. Function calling and other API updates. 2023-06-13. The launch post that introduced tool use as a first-class API feature; the worked example walked the same parse-call-result-rerun loop this section describes.↩︎
Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023. arXiv:2210.03629. Established the thought / action / observation interleaving pattern that downstream agent frameworks inherit.↩︎
Schick, T. et al. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS, 2023. arXiv:2302.04761. Showed that tool-use behaviour can be learned at training time by self-supervised insertion of API calls into the corpus.↩︎
Anthropic. Introducing the Model Context Protocol. 2024-11-25. Specification: https://modelcontextprotocol.io/. The protocol exposes tools, prompts, and resources over a small client-server interface with consent as a first-class primitive.↩︎
Wang, G. et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291, 2023. Demonstrated a skill library accumulated by an LLM agent across a long-running Minecraft episode.↩︎
Significant-Gravitas. AutoGPT. GitHub, 2023. https://github.com/Significant-Gravitas/AutoGPT. One of the earliest open-source frameworks for external-loop autonomous agents; treated here as a vocabulary anchor rather than a recommendation.↩︎
Park, J. S. et al. Generative Agents: Interactive Simulacra of Human Behavior. UIST, 2023. arXiv:2304.03442. The twenty-five-character sandbox study that grounded the long-term-memory-as-RAG mechanism in a working simulation.↩︎
Shinn, N. et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS, 2023. arXiv:2303.11366. Verbal critiques written by the agent after a failure are retrieved on subsequent attempts; improvement comes from the prompt, not from weight updates.↩︎
Liu, X. et al. AgentBench: Evaluating LLMs as Agents. ICLR, 2024. arXiv:2308.03688. The first end-to-end benchmark suite covering OS, database, knowledge-graph, card-game, and other environments under a single scoring protocol.↩︎