12 Why Artificial Intelligence is Hard
When the COVID-19 pandemic struck, numerous researchers from laboratories worldwide took on the challenge of leveraging machine learning to address one of the most significant public health crises in living memory. One prevalent approach involved training an image classifier to analyze X-ray images of lungs for COVID detection.
Initially, this task appeared deceptively simple. By 2020, eight years had passed since the deep learning breakthrough on ImageNet, a turning point in computer vision. With ConvNets, ResNets, YOLO, vision transformers, and other sophisticated architectures, the field of computer vision had made remarkable progress in image classification, seemingly solving many related challenges.
Expectations soared as researchers believed that with a suitable dataset containing COVID and non-COVID lung X-ray images, training a classifier would be straightforward. Teams worldwide embarked on this mission, collaborating extensively to gather and exchange datasets, resulting in approximately 400 research papers pursuing similar objectives.
Following more or less the same methodology, these researchers sourced COVID and non-COVID lung image datasets, trained classifiers, and achieved impressive accuracy rates ranging from 80% to 95%. Many of them claimed to have successfully solved the task of COVID detection from X-ray images.
It turns out all of this was useless. The vast majority fell prey to the original sin of machine learning. To understand why, let’s go back a few decades to one of the longest-standing urban legends in artificial intelligence, and learn the most important lesson in machine learning.
A Cautionary Tale
It was the early 1960s—or so the legend tells. There are many variants of this tale, and some of them place it in different years or have different details. But details don’t matter. So let’s move on.
Where was I? Oh yeah… It was the early 60s, and artificial intelligence researchers were very excited about this new connectionist idea that you could have small computational units similar to brain neurons connected together, each performing a small amount of computation. This would allow you to compute some really complex functions when these networks of neurons—artificial neural networks, as they were called—grew large enough.
And so—our tale continues—the US government hired some of these early connectionist researchers to work on a neural network that could detect whether an image contained an enemy tank or it was an image of an empty space, a forest, some trees, a landscape, or… something not tank.
These researchers already knew that in order to have a statistically valid estimation of how well your model works, you have to have separate training and testing data sets. So they set out to take pictures of tanks and pictures of forests and savannas and places with no tanks and made a data set of, let’s say, 100 tanks and 100 non-tanks.
Then they split the data set into 50 pictures of each class for training and another 50 pictures of each class for testing, implemented their artificial neural network—still very simple at the time, probably a simple perceptron, the details don’t really matter—and the results were spectacular. Not only was the neural network able to completely learn the training set, it also showed very high performance on the testing set.
Very happy with the results, the researchers sent the model to their contractors, only to receive, a few weeks later, a letter telling them that whenever someone tested the model in real life, with real tanks and real non-tanks, it performed no better than random guessing.
So they went back to the drawing board, scratching their heads, trying to find out what was missing. The statistics were sound, the mathematics was sound, the code was sound, but there was one problem in the data that they had overlooked. It turns out two different teams took the pictures of the tanks and the non-tanks. The people who photographed the tanks did it on a cloudy day, while the people tasked with photographing non-tanks did it on a sunnier day.
In these pictures, you could accurately tell which of the two sets an image belonged to simply by looking at the average brightness of its pixels. The brighter pictures correlated with non-tanks, and the darker ones with tanks. So the neural network had learned exactly what the researchers asked it to do, but not what they wanted it to do.
It learned to differentiate pictures from two different sets, but it didn’t learn it by looking at the thing the researchers wanted. They wanted the network to detect a tank. Instead, the network learned to look at the average brightness of the pixels and perfectly captured the most salient difference between the two sets of photos. The brighter pictures belonged to one set, and the darker pictures belonged to the other. Problem solved. Now give me the money.
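To make this concrete, here is a minimal sketch of the kind of shortcut the network supposedly found. The data is entirely synthetic (the brightness values and the threshold are made-up numbers), but it shows how a “classifier” that only looks at average pixel brightness can perfectly separate two photo sets that differ in lighting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two photo sets: "tanks" shot on a cloudy
# (darker) day, "non-tanks" shot on a sunnier (brighter) day.
tanks = rng.normal(loc=0.35, scale=0.05, size=(100, 64, 64))
non_tanks = rng.normal(loc=0.55, scale=0.05, size=(100, 64, 64))

# A "classifier" that never looks at shapes at all, only at average brightness.
def predict_is_tank(image, threshold=0.45):
    return image.mean() < threshold

acc_tanks = np.mean([predict_is_tank(img) for img in tanks])
acc_non_tanks = np.mean([not predict_is_tank(img) for img in non_tanks])
print(f"accuracy on tanks: {acc_tanks:.2f}, on non-tanks: {acc_non_tanks:.2f}")
# Near-perfect separation, without ever learning anything about tanks.
```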
This is just a cautionary tale, probably apocryphal. Many people have told this story at different times, and nobody can find an actual reference. But it’s a story that we keep telling because it highlights one of the most important issues at the core of machine learning. This is a story that I tell my students over and over, every course. And they nod in agreement, but they always fail to understand what the core of the issue is, and they make this same mistake, over and over.
The Problem with Machine Learning
The crux of the issue lies in the fact that machine learning, and particularly classification, primarily deals with discriminative models. A discriminative model aims to identify the most prominent feature that distinguishes objects among different classes.
In classification problems, you typically have a set of objects, each described by a specific number of features. These features can be words in a text message, pixels in an image, or structured features, for example, in credit card transaction validation. The objective is to classify these objects into various classes according to those features.
There are three primary types of classification problems: binary, multi-class, and multi-label. A binary problem separates objects into two classes, such as positive and negative. This is common in the medical domain, where the goal is often to distinguish between patients with and without a particular disease. A multi-class problem involves more than two classes, and the goal is to assign each object to exactly one of them. A multi-label problem also involves multiple classes, but a single object can belong to several of them at once.
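If it helps to see the distinction in code, here is a tiny sketch of what the labels look like in each case. The arrays are arbitrary examples, not real data:

```python
import numpy as np

# Binary: each object gets one of two labels.
y_binary = np.array([0, 1, 1, 0])            # e.g., disease vs. no disease

# Multi-class: each object gets exactly one label out of several classes.
y_multiclass = np.array([2, 0, 1, 2])        # e.g., one of three conditions

# Multi-label: each object can carry several labels at once,
# encoded here as one indicator column per class.
y_multilabel = np.array([[1, 0, 1],          # this object has classes 0 and 2
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 0]])         # an object may also have none
```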
In all these cases, a classifier attempts to find a combination of features that correlates with one class more than the others. The classifier then fits an explanation to this correlation by finding a mathematical formula that computes the most likely class for a specific combination of features.
But here is the kicker. Classification algorithms are designed to find the simplest or smallest possible explanation. To understand why, it’s essential to revisit the basics of machine learning and the concepts of overfitting and underfitting.
When training a machine learning algorithm, you typically have a training set and a testing set. The algorithm is trained on a subset of data, and its performance is evaluated on a different subset of data that the algorithm hasn’t seen before. There are three possible regimes in this scenario.
First, the algorithm may perform poorly on both the training and testing data, indicating underfitting. This usually means that the algorithm isn’t powerful enough to capture the complex patterns and correlations in the data. The solution is to use a more intelligent or powerful algorithm.
Second, the algorithm may perform well on the training data but poorly on the testing data, indicating overfitting. In this case, the algorithm fails to capture a generalizable pattern, often because it’s too complex or the training data is too small. The algorithm is memorizing the answers in the training data instead of learning the underlying knowledge.
Finally, the algorithm may perform similarly on both the training and testing data, indicating that the algorithm has captured a real pattern in the training set that extends to the testing set.
To avoid underfitting, you can make the model stronger by adding more neurons in a neural network, more trees in a random forest, or more complex kernel functions in a support vector machine.
To avoid overfitting, you can regularize the model by adding constraints to the hypothesis or formula the model is building, which penalizes overly complex formulas. Regularization makes the model prefer solutions with smaller parameters, leading to smoother formulas that are more likely to generalize to unseen data.
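As a rough illustration, here is a small sketch using scikit-learn’s Ridge regression (an L2-regularized linear model) on synthetic data. The data and the regularization strengths are arbitrary, but they show how a stronger penalty shrinks the model’s parameters toward simpler, smoother solutions:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                        # 50 samples, 10 features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)   # only feature 0 matters

for alpha in (0.01, 1.0, 100.0):                     # increasing L2 penalty
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: sum of |coefficients| = {np.abs(model.coef_).sum():.2f}")
# Larger alpha -> smaller coefficients -> smoother, simpler hypotheses.
```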
In summary, good classification algorithms almost always regularize, leading to the simplest solution that explains the difference between your classes. For example, if you give a neural network two sets of images, one with tanks and one without, and the images with tanks are darker on average, the neural network will learn to differentiate between the darker and lighter images. This is the simplest possible explanation, and the neural network is more likely to find this explanation than a more complex one, provided both have roughly the same predictive power.
How can we avoid this?
By now, I hope you understand the difficulty of the problem we have in even the most basic machine learning scenarios. We want to make a classifier to decide whether an image of a lung has COVID or not. However, we have no way to explain to our machine learning algorithm that the thing we care about is COVID. The only thing the algorithm knows is that we have two sets of images, one for class A and one for class B. It needs to find the simplest way to discriminate between these two sets of images.
If we want to build a COVID classifier, we have to make sure that the images in the two sets are such that the simplest explanation that separates them is the presence or absence of COVID. But here’s the problem: there are a thousand ways in which two different images of lungs can differ. They can be from different people of different ages, genders, races, taken with different types of X-ray machines, using different techniques, having different sizes, resolutions, brightness, and contrast.
If we want our algorithm to learn the right discriminant—the right explanation—we have to make sure that our two sets of images have almost exactly the same composition across all the things that we don’t care about. We cannot have a set that has more images from one gender or race than the other set, or from one specific hospital than the other set—which might have used a different machine, and thus the images might be a little bit less or more blurry, or centered differently. Nor can we have images taken from before 2020 and images taken from after 2020 that may have—because they’ve updated the software—a different type of watermark in a small corner in the image.
We have to account for all of the possible variations that are irrelevant to our problem and make sure that our dataset is equally balanced across all classes for these variations. Otherwise, we run the risk that one of those variations is sufficiently prevalent in one class with respect to the other that our machine learning algorithm captures that as the most important difference. This becomes even more important the harder your problem is to solve. The harder it is to detect COVID in an image, the more things must be normalized, and everything else taken care of, so that the only reasonable explanation for differentiating one image from another is the presence of COVID.
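One practical habit, sketched below with a made-up metadata table and hypothetical column names, is to cross-tabulate your class labels against every variable you do not want the model to use, and treat any near-perfect correlation as a red flag:

```python
import pandas as pd

# Hypothetical metadata table for the images, one row per X-ray.
meta = pd.DataFrame({
    "label":    ["covid", "covid", "normal", "normal", "covid", "normal"],
    "hospital": ["A", "A", "B", "B", "A", "B"],
    "scanner":  ["X1", "X1", "X2", "X2", "X1", "X2"],
    "year":     [2021, 2021, 2018, 2018, 2021, 2018],
})

# Cross-tabulate the class against every variable we do NOT want the model
# to use. Any column that almost perfectly predicts the label is a red flag.
for col in ["hospital", "scanner", "year"]:
    print(pd.crosstab(meta["label"], meta[col], normalize="index"), "\n")
```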
I’m sure by now you know how our original story ends, and why all these people, all these hundreds of research papers on COVID, failed catastrophically. They all had methodological issues like the ones in the tank urban legend.
Some of them had COVID images from one hospital and non-COVID images from a different hospital. Others had recently taken COVID images and non-COVID images from an older dataset. Others still had small differences in the data collection that made the two sets differ in things other than the presence or absence of COVID. And the huge performance you see is just your classifier learning to differentiate between images taken in 2021 and images taken in 2018, or images taken with one X-ray machine versus another.
But it gets worse
Now, the most insidious part of this is that there is little you can do to notice this problem while doing the research. Here’s why.
The standard way in which you estimate the performance of your algorithms is by splitting your data into several chunks. Some of those chunks are used for training, parameter tuning, model selection, etc. In any case, you always leave aside at least one chunk of unseen data on which you test your final algorithm. This is the way you ensure generalization from training to test—generalization from seen to unseen data.
But if your data collection protocol is flawed—if you collected COVID images from one hospital and non-COVID images from another—then your test set will have the exact same flaws! Any statistics that you do will lie to you and tell you that you are doing it right because you are finding the same irrelevant differentiators in training as in test.
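Here is a small simulation of that failure mode. Everything is synthetic, and the “artifact” feature merely stands in for something like scanner brightness or a watermark, but it shows how an internal test split inherits the confound and reports great numbers, while data from an unconfounded source tells a very different story:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n, confounded):
    y = rng.integers(0, 2, size=n)                    # 0 = non-COVID, 1 = COVID
    signal = rng.normal(scale=0.1, size=(n, 5))       # weak "real" disease signal
    signal[:, 0] += 0.05 * y
    # Artifact (think scanner brightness or a watermark). In the flawed
    # collection it tracks the label; in external data it is unrelated.
    artifact = y + rng.normal(scale=0.1, size=n) if confounded else rng.normal(size=n)
    return np.column_stack([signal, artifact]), y

X, y = make_data(2000, confounded=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

X_ext, y_ext = make_data(2000, confounded=False)      # data without the confound
print("internal test accuracy:", round(clf.score(X_te, y_te), 3))   # looks stellar
print("external accuracy:     ", round(clf.score(X_ext, y_ext), 3)) # close to chance
```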
So these researchers were set up to fail from the get-go. From the moment they collected the data, they were doomed. There was nothing they could do once they decided on that collection mechanism. There was nothing they could do during their research process, during their validation process, or during their statistical analysis to realize the massive flaw they had at the very core of their assumptions.
Now, you can imagine this happens all over the place. It happened to two of my graduate students when they were in undergrad. They also trained a pretty standard ResNet on COVID vs. non-COVID images and obtained results that were too good to be true, because they weren’t. They collected the COVID images from one dataset and the non-COVID from a different dataset, and there were tiny differences in the image format, color composition, etc., not enough to be seen by the naked eye, but enough to help the neural net cheat on the COVID exam and gain extra points by looking at the wrong features.
And it also happened to superstar machine learning practitioner Andrew Ng and his superstar team of researchers. They were also analyzing images of lungs, and did all the right things. Except, when they split the data, different pictures of the same person fell into both the training and the test set. So the model could learn to detect subtle cues that appeared on both sides—maybe a patient had a scar from a previous condition—and obtain a few extra points of accuracy. Not exactly the same level of critical failure as the others, but still an example of how even the most professional among us can fall for this insidious mistake.
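The usual fix for this particular mistake is to split by patient rather than by image, so that all images of the same person land on the same side. Here is a minimal sketch using scikit-learn’s GroupShuffleSplit with made-up patient IDs and placeholder features:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: several images per patient, identified by patient_id.
patient_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
X = np.arange(10).reshape(-1, 1)          # stand-in features, one row per image
y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0])

# Split so that ALL images of a given patient end up on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

assert not set(patient_id[train_idx]) & set(patient_id[test_idx])
print("train patients:", sorted(set(patient_id[train_idx])))
print("test patients: ", sorted(set(patient_id[test_idx])))
```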
But this common pitfall is not restricted to COVID images or even to image classification. It is just one of the most salient recent examples of this problem, which shows up all over machine learning. This is why the vast majority of machine learning research that you see in papers fails to generalize.
I’m currently reviewing a very good paper on using LLMs to detect fake news. Very good, except they also fell prey to this mistake. They collected truthful news from around the web, but they couldn’t find a reliable source of fake news, of course. So they resorted to using ChatGPT to generate plausible-sounding but fabricated news. And they got something like 95% accuracy, breaking several benchmarks! But—this is my hypothesis—what their classifier learned was to differentiate regular news text from ChatGPT-generated text. And not any ChatGPT-generated text—which is unsolvable in practice—but ChatGPT using their specific prompt.
The bigger lesson
Most applied ML research looks excellent on paper. However, when you go to put it into practice and have to deal with the nitty-gritty of reality, it turns out that the performance you see in the paper is nowhere near real-life performance.
This is the hardest type of generalization. Generalizing from test set to application is the moment in which all your assumptions about data collection, about how data is distributed in real life, and about what is important or not in the data, start to fall apart.
So the bigger lesson from this super common and super insidious pitfall is this: data collection is where your assumptions begin. No ML problem starts at the training phase, or even the modeling phase. The way you decide to collect, filter, and sanitize your data already encodes a huge set of assumptions that separate the problem you’re actually solving from the problem you think you’re solving.
But we are teaching whole generations of machine learning researchers and practitioners to focus on finding the best algorithms and architectures. And that’s important, of course. But it is not the most critical part. Data collection and curation is the make-or-break of machine learning. Good data with a mediocre model beats bad data with an awesome model every single day.
The Alignment Problem
In the AI safety community, concerns range from existential risks, where some believe superintelligent AI will inevitably harm us, to present issues like the misuse of AI, perpetuating biases, discrimination, and spreading disinformation. However, most solutions to these problems rely on addressing the AI alignment problem.
AI alignment ensures that an artificial intelligence system’s output aligns with user preferences beyond superficial optimization metrics. The goal is for the AI to genuinely solve the problem we care about, rather than just what we told it to do. This challenge arises because describing user preferences is incredibly difficult. Often, what we tell the AI we want is only a rough approximation of our true desires. Alignment aims to bridge that gap by helping AI understand our genuine intentions from our limited explanations.
Now let’s explore why AI alignment may be the most difficult problem in computer science. We’ll start by looking more carefully at what alignment actually means.
Understanding AI Alignment
AI alignment involves ensuring that an artificial intelligence system behaves according to a user’s expectations, preferences, or values. The interpretations can vary depending on context, but in general, the expectation is that a system will act in line with the user’s interests, that is, in a way that satisfies the user’s actual needs and wants, and not just some superficial approximation of them.
First, let’s ask why alignment is necessary in artificial intelligence but not in other high-tech engineering products such as cars, planes, or rockets. The primary reason lies in the level of complexity and the nature of the interface when dealing with AI systems compared to other engineering tools.
What makes AI different from other high-end technologies we have?
The more advanced a tool, the more you can focus on telling it what you want to do instead of how to do it. For example, with simple tools like a hammer, you control every action. With complex tools like cars, you still don’t say “reach the destination” –at least not yet. Instead, you perform some mid-level actions like steering and accelerating, which in turn translate to lower-level actions like moving pistons.
That is, as the tool becomes more advanced, the way you use it becomes closer to what you want and farther from what the tool must do to achieve that objective. Artificial intelligence lies at the declarative end, the point where you only tell the tool your end goal and let it figure out all the steps to achieve it. Actually, we could make the case that AI can be defined precisely as the field dedicated to making tools that do what you want.
Consider driving a car –a regular car, not a self-driven one. To get where you want, you need the car to steer in the right direction and accelerate or brake at the right times. That is, your high-level objective –getting somewhere– is decomposed into many low-level instructions. You cannot simply ask the car to drive itself –again, a traditional car. You have to steer the car and accelerate it. So the system that is translating a high-level, perhaps abstract instruction like “get me home fast but safely” into precise low-level instructions, is you.
Now contrast this with a self-driven car. You just say “get me home” –the “fast but safely” part is assumed implicitly–, and the AI system in the car has to “understand” this high-level instruction and decompose it into the myriad of low-level instructions that actually make the car do the thing you want it to do.
But here is the crucial part: “get me home” encodes a far larger set of assumptions than what you usually imagine, and there are infinitely many ways in which an AI agent could be said to have fulfilled that request without actually doing what you intended it to do.
When you say “get me home” to a human taxi driver, they usually implicitly assume you’re also asking the following:

- do not cause me any physical or psychological harm;
- get there reasonably fast, but do not drive carelessly;
- take the fastest route if possible, but take a detour if it’s necessary, even if it costs me a bit more;
- do not engage in uncomfortable conversations, but do engage in reasonably well-mannered ones, or leave me alone altogether, depending on my mood;
- do not harm any pedestrians or animals, but if you must harm an animal to avoid a fatal or very dangerous accident, please do;
- …
These are all reasonable assumptions that any human knows from common sense, because we all share a common understanding of what it means to live and act in the human world. But an AI doesn’t come with that common sense preprogrammed.
In fact, common sense reasoning seems to be one of the hardest skills for modern AI to acquire, at least in part because, by virtue of being “common”, it is rarely spelled out explicitly, which means we don’t have large corpora of explicit examples of this type of reasoning, like we have for more specialized skills.
And that is the reason we need alignment. When we tell a tool what we want instead of how to do it, we need the tool to interpret that want in a context that is full of assumptions, restrictions, and trade-offs which are often implicit. Alignment means having an AI system apply the right implicit context, and find the solution to our request that is, as the name implies, most closely aligned with what we really want, instead of just any solution that superficially fits the explicit request.
The crucial reason alignment is hard is due to the interplay between two critical parts of the AI equation: the inherent complexity of the world and the unavoidable brittleness of the way we model it.
Let’s break it down.
Why AI alignment is hard
Many reasons make AI alignment one of the hardest problems a system designer can face. Some of those reasons involve our inability to produce a robust enough description of the task we want to solve –that is, we cannot fully describe the context and all the implicit assumptions and restrictions that apply in that context. These reasons are related to the nature of the problem itself –getting any intelligent agent to do what you want is intrinsically hard; if you’re a parent, you know exactly what I mean. Other reasons are related to the nature of the solutions we currently have, that is, systems built with machine learning, trained on imperfect datasets to maximize imperfect proxy metrics.
These are interrelated but separate challenges, so let’s take them one at a time.
Implicit contexts
When using a regular tool –like a hammer, a calculator, or Excel– you have an end goal in mind, but the tool doesn’t need to understand that goal; it just needs to follow your precise instructions. However, when working with AI, there are many assumptions about how the world works that aren’t explicitly described in your instructions.
For instance, if you tell an advanced AI to make coffee, there are numerous implicit restrictions: don’t destroy the coffee machine, don’t harm any animals, don’t walk through walls, etc. Humans generally understand these unstated rules because we share a common understanding of the world. So there is a significant difference between systems that require specific instructions on how to perform tasks and those that simply need to be told what tasks to accomplish.
When you want to tell a system what to do instead of how to do it, you must be very precise in specifying everything it needs to know. In a restricted domain, the constraints may be simple enough to be explicitly encoded or learned from data. For example, in a factory setting, a robotic arm is simply physically incapable of destroying the world, so it doesn’t need to know much about anything outside the narrow task of car painting.
However, training systems for open-ended decision-making in the real world is far more complex. It’s hard to imagine a training scenario that is as intricate as real life. Gaining all the necessary experience to understand the human world like humans do would require something like raising a child from scratch. And the majority of assumptions in those contexts can’t be learned from data, because we simply don’t have training data for “how to be a human”.
Unclear trade-offs
But the implicit context problem hides an even bigger challenge. While many of the things an AI must care about implicitly are restrictions –e.g., do not kill the passenger– the hardest problem is when they involve trade-offs instead.
This is a fundamental issue built into most optimization problems in the real world. On one hand, you want a system to achieve its objective as effectively as possible. On the other hand, you want it to do so with minimal side effects. These two goals are often contradictory – for example, driving fast versus driving safely.
Many of these unwanted side effects are implicit restrictions: you don’t want to kill any pedestrians or harm the passengers. However, some side effects are not hard constraints, but trade-offs. If you want zero chance of getting in a car accident, the only solution is not to drive. So, you indeed want your AI system to correctly trade off a small risk of getting hurt for the possibility of actually solving the task –getting you from A to B. Pragmatism involves trade-offs.
And we humans frequently make these trade-offs unconsciously, e.g., between getting somewhere faster and taking on a bit more risk on the highway, or going the longer, safer way. This kind of trade-off is at the heart of any complex problem –trade-offs are the very reason engineering problems are hard to begin with!
With an AI system, however, it becomes even worse, because the system not only needs to understand the many implicit constraints and trade-offs in the world, but also how you value those trade-offs. You would need not only to specify potential side effects but to give them appropriate negative values, so the system avoids them while still achieving its primary goal.
This challenge arises because most machine learning systems optimize performance metrics. And the nature of optimization involves comparing numerical quantities. To optimize your goal of a fast yet safe journey, you must quantify these trade-offs. For example, how much inconvenience is being late compared to the risk of a sudden hard brake? Is it worth risking a small head injury to arrive 20 minutes earlier? How do you put numbers to being early versus being safe?
Furthermore, for any value you fail to quantify, your AI will be compelled to prioritize performance over that factor, since there’s no penalty for it. So, it’s crucial to quantify all critical side effects. If you don’t specify a crucial dimension –like car damage–, you’re in big trouble. To save time, an AI system might trade off any arbitrarily high amount of car damage for a small reduction in time.
Since saving even a minute of time has some positive value, and no amount of car damage has any negative value, as long as the car reaches the destination –i.e., it is not absolutely destroyed– the AI is free to choose a marginally better route regardless of how much more damage the car takes. You’ll end up with a system that reaches the destination as fast as possible but considers every car disposable.
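A toy example makes this painfully obvious. The routes, times, and damage numbers below are invented, but they show how an optimizer that sees no penalty for damage will always pick the marginally faster, car-wrecking option:

```python
# Toy routes: (minutes to destination, expected damage to the car, arbitrary units)
routes = {
    "careful": (32, 0.0),
    "aggressive": (30, 5.0),   # 2 minutes faster, wrecks the suspension
}

def reward(minutes, damage, damage_weight):
    # Higher is better: we penalize travel time, and damage only if weighted.
    return -minutes - damage_weight * damage

for w in (0.0, 1.0):
    best = max(routes, key=lambda r: reward(*routes[r], damage_weight=w))
    print(f"damage_weight={w}: optimizer picks the {best!r} route")
# With damage_weight=0, any time saving justifies arbitrary damage.
```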
Imperfect metrics
It is by now well-known to every systems designer that when a proxy metric becomes an objective, it loses its usefulness as a metric. Yet, this is the daily job of a machine learning engineer. In machine learning, we are always turning proxy metrics into optimization objectives because, in a sense, that is the only thing we can do.
In a typical machine learning problem, we have to turn a complex task into something measurable that our AI can optimize for. So, for the AI, the metrics are the actual task. This isn’t too much of a problem if systems aren’t very good at optimizing metrics, since their results tend to stay close to your intended outcome. But as AI systems become better at optimizing our metrics, they can much better exploit the difference between the proxy metric and the actual desired performance.
This leads to an interesting paradox: the smarter the system, the more likely it will fail to do what you want it to do. A first reason for this seemingly paradoxical phenomenon is that imperfect metrics tend to match our desires in the general cases, but the differences with our true intentions are more accentuated in the more extreme cases. It’s like classical mechanics versus general relativity. The former works perfectly for most cases, except if you really need precise calculations of complex astrophysical phenomena.
In the same vein, imperfect metrics –such as getting a high score in a videogame, getting high grades in college, or running for a long distance without crashing– are easier to satisfy up to some degree if you’re actually doing the right thing –playing the game optimally, studying very hard, or driving safely. But the easiest way to satisfy these imperfect metrics to a very high level –like acing the SAT– is to game the system. Instead of studying super hard and really learning a lot, just study tons of SAT tests and learn to answer those exact questions, without really understanding much of the underlying theory.
This phenomenon is one of the many ways in which overfitting shows up in machine learning, and it’s well-known that the harder you optimize a metric the more likely your system will learn the quirks of that specific metric and fail to generalize to the actual situations where you expect it to perform.
And while this can and will happen by accident, there is an even more insidious problem. The smarter the system, the more likely it will, intentionally, learn to game the system.
Reward hacking
Imperfect metrics are a problem in all machine learning scenarios, but they become even more challenging in reinforcement learning. As a quick recap, reinforcement learning is when, instead of showing an AI examples of a well-done task, you just let it try things out, and reward it when those attempts lead to progress.
The reason we need reinforcement learning is because, for many complex problems, it is simply impossible to produce a sufficiently large dataset of good examples. Sometimes it’s unfeasible –e.g., collecting thousands of hours of expert drivers across many scenarios– and sometimes it is impossible even in principle –e.g., when you’re building a system to do something you yourself can’t do, like, I don’t know, flying a drone through a burning building?
So, instead of examples, we let the AI loose and evaluate whether it reaches the intended goal. For example, you let an AI take control of your car –say, in a simulated environment that is sufficiently detailed, like GTA5– and reward it for how many miles it can stay on the road without crashing.
Now, what is the easiest way to optimize that metric? Maybe something like drive at 2 km/h? That’s what your AI –if it’s smart– will learn to do. So now you add a new restriction, say, distance only counts if the AI goes at over 40 km/h. Then the AI will learn to drive forward for 100 meters, shift to reverse, drive back slowly, and repeat. You can keep adding constraints and making the evaluation metric as complicated as you want, but the key point is this: all metrics are gameable, and the smarter your AI system is, the better it will be at gaming whatever metric you design.
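You can see the first loophole with almost no code. Assume, purely for illustration, that crash risk per kilometre grows with speed; then the metric “kilometres driven before the first crash”, with no time limit, is maximized by crawling along as slowly as possible:

```python
# The metric we ask for: total kilometres driven before the first crash,
# with no time limit. Assume (hypothetically) crash risk per km grows with speed.
def expected_km_before_crash(speed_kmh):
    crash_prob_per_km = 1e-4 * speed_kmh      # made-up risk model
    return 1.0 / crash_prob_per_km            # expectation of a geometric process

for speed in (2, 40, 120):
    print(f"{speed:>3} km/h -> expected reward {expected_km_before_crash(speed):7.0f} km")
# The metric is maximized by crawling along at 2 km/h: the policy games the
# reward ("miles without crashing") while ignoring what we actually wanted.
```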
This happens, again, because the AI doesn’t know what you truly want, only what you are measuring. And when metrics become an objective, they cease to be good metrics.
To address this, instead of designing an explicit metric, we can let AI systems act and provide feedback on whether their actions are good or not. Then, another machine learning system learns to approximate the evaluators’ assessments, and acts as an implicit performance metric. This creates a two-level ML system, where each part is trying to game the other. This process is called reward modelling, or alternatively, reinforcement learning from human feedback (RLHF), and is our current best approach to prevent reward hacking.
However, there are still challenges even with RLHF. Your evaluator AI can learn the wrong model from your feedback, because, again, it is being trained to optimize some imperfect metric –like minimizing the error between its predictions and yours. In the end, you’re pushing the problem of reward hacking one level up, but not getting rid of it.
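For the curious, here is roughly what the core of a reward model looks like. This is a bare-bones sketch –a linear scorer and a Bradley-Terry style pairwise loss on made-up features– not any particular production RLHF pipeline:

```python
import numpy as np

def reward_model(features, w):
    """A stand-in reward model: a linear score over trajectory/response features."""
    return features @ w

def preference_loss(w, feats_preferred, feats_rejected):
    """Bradley-Terry style loss used in reward modelling: the model should
    assign a higher score to the response the human evaluator preferred."""
    margin = reward_model(feats_preferred, w) - reward_model(feats_rejected, w)
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))  # -log sigmoid(margin)

rng = np.random.default_rng(0)
w = rng.normal(size=4)                    # reward model parameters to be trained
feats_pref = rng.normal(size=(8, 4))      # hypothetical features of preferred answers
feats_rej = rng.normal(size=(8, 4))       # features of rejected answers
print("loss before training:", preference_loss(w, feats_pref, feats_rej))
# Gradient descent on this loss yields a learned, implicit performance metric,
# which the policy is then trained (and tempted) to optimize.
```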
And finally, even if your system behaves as intended, how can you know it is actually doing so because it truly understands your intentions?
Internal objectives
The final challenge I want to address is the interplay between internal and external objectives. Today, the most powerful learning algorithms and problem-solving methods we have are all based on optimization. Optimization algorithms power machine learning, symbolic problem-solving, operations research, logistics, planning, design, etc. As AI designers, if we turn to optimization to create powerful decision-making algorithms and train a highly intelligent AI, it’s likely that the AI’s internal processes will also involve optimization.
What this means is, suppose you train a highly capable AI agent to solve problems in the real world. This agent would be capable of long-term planning, self-reflection, and updating its own plan as it explores the world. It is sensible to think that, whatever this agent is doing internally to plan its solution, it will use some form of optimization algorithm. Maybe the agent will rediscover reinforcement learning and use it to train its own mini-agents in real time (like tiny homunculi inside its artificial mind).
If this looks like sci-fi, consider that we humans are basically intelligent agents optimized by evolution to solve the problem of staying alive –I know this is a huge oversimplification, but please, biologists out there, don’t crucify me yet, this is just a useful analogy. So, in the process of solving the problem of staying alive, we came up with optimization algorithms of our own that run inside our brains. A sufficiently intelligent AGI would presumably be able to do the same, right?
Now here is the problem. You give this AGI some external objective to solve, and it will come up with internal objectives to optimize for. But we might not be able to see this internal optimization algorithm at all. If the AI resembles anything we have today, it will be a massive black-box number-crunching machine.
Just like you can’t really read out of a human brain what their true objectives are –at least, not yet– we might never be able to truly understand what the AI is optimizing for internally as it strives to solve our problem. We can observe external behavior but might never see the actual internal objectives. All we can do is judge the system based on its actions.
In essence, we can only evaluate how agents –humans or not– act, but not their true motivations. And if someone always acts as if their motivations are aligned with ours, it may be difficult to identify any misalignment that could arise in the future. Maybe they are aligned with 98% of our objectives, or maybe only while there is no solar eclipse or some other weird stuff like that. We simply can’t know for sure.
Is it all lost?
I hope you now understand why this is a devilishly hard problem to solve. The very nature of intelligence makes this an adversarial situation. We want systems that are both highly self-sufficient and very dependable. We need them to think and act on their own, but we need to trust them. And the smarter they become, the blinder that trust has to be, and the greater the potential for catastrophe.
All is not lost, though. There’s an enormous body of research on the alignment problem, and while there are no silver bullets yet –and perhaps never will be– we’ve made significant progress.
One final thought. So far we’ve been focusing on internal challenges to AI alignment –challenges related to the task and the solution. But there’s an elephant in the room. AI alignment is literally having an AI aligned with our values. But whose values? We’re all different, and we all have different opinions about what’s important. So that’s a crucial conversation we need to have as these systems start to impact the daily lives of people all around the world.