9  Why AI Alignment is So Hard

In the AI safety community, concerns range from existential risks –where some believe superintelligent AI will inevitably harm us– to present-day issues like the misuse of AI, the perpetuation of biases, discrimination, and the spread of disinformation. However, most solutions to these problems rely on addressing the AI alignment problem.

AI alignment ensures that an artificial intelligence system’s output aligns with user preferences beyond superficial optimization metrics. The goal is for the AI to genuinely solve the problem we care about, rather than just what we told it to do. This challenge arises because describing user preferences is incredibly difficult. Often, what we tell the AI we want is only a rough approximation of our true desires. Alignment aims to bridge that gap by helping AI understand our genuine intentions from our limited explanations.

Now let’s explore why AI alignment may be the most difficult problem in computer science. We’ll start by defining what alignment means, and then examine the reasons it is so hard to achieve.

The AI alignment problem

In this section, we will explore why AI safety, and specifically AI alignment, is potentially the most difficult challenge in engineering.

Defining AI alignment

AI alignment involves ensuring that an artificial intelligence system behaves according to a user’s expectations, preferences, or values. The interpretation can vary depending on context, but in general, the expectation is that a system will act in line with the user’s interests, that is, in a way that satisfies the user’s actual needs and wants, and not just some superficial approximation of them.

The need for alignment in Artificial Intelligence

First, let’s ask why alignment is necessary in artificial intelligence but not in other high-tech engineering products such as cars, planes, or rockets. The primary reason lies in the level of complexity and the nature of the interface when dealing with AI systems compared to other engineering tools.

What makes AI different from other high-end technologies we have?

The more advanced a tool, the more you can focus on telling it what you want to do instead of how to do it. For example, with simple tools like a hammer, you control every action. With complex tools like cars, you still don’t say “reach the destination” –at least not yet. Instead, you perform some mid-level actions like steering and accelerating, which in turn translate to lower-level actions like moving pistons.

That is, as the tool becomes more advanced, the way you use it becomes closer to what you want and farther from what the tool must do to achieve that objective. Artificial intelligence lies at the declarative end, the point where you only tell the tool your end goal and let it figure out all the steps to achieve it. Actually, we could make the case that AI can be defined precisely as the field dedicated to making tools that do what you want.

Consider driving a car –a regular car, not a self-driving one. To get where you want, you need the car to steer in the right direction and accelerate or brake at the right times. That is, your high-level objective –getting somewhere– is decomposed into many low-level instructions. You cannot simply ask the car to drive itself –again, a traditional car. You have to steer the car and accelerate it. So the system that translates a high-level, perhaps abstract instruction like “get me home fast but safely” into precise low-level instructions is you.

Now contrast this with a self-driving car. You just say “get me home” –the “fast but safely” part is assumed implicitly– and the AI system in the car has to “understand” this high-level instruction and decompose it into the myriad of low-level instructions that actually make the car do the thing you want it to do.

But here is the crucial part: “get me home” encodes a far larger set of assumptions than you usually imagine, and there are infinitely many ways in which an AI agent could be said to have fulfilled that request without actually doing what you intended it to do.

When you say “get me home” to a human taxi driver, they usually implicitly assume you’re also asking the following:

- do not cause me any physical or psychological harm;
- get there reasonably fast, but do not drive carelessly;
- take the fastest route if possible, but take a detour if it’s necessary, even if it costs me a bit more;
- do not engage in uncomfortable conversations, but do engage in reasonably well-mannered conversation, or leave me alone altogether, depending on my mood;
- do not harm any pedestrians or animals, but if you must harm an animal to avoid a fatal or very dangerous accident, please do;
- …

These are all reasonable assumptions that any human knows from common sense, because we all share a common understanding of what it means to live and act in the human world. But an AI doesn’t come with that common sense preprogrammed.

In fact, common sense reasoning seems to be one of the hardest skills for modern AI to acquire, at least in part precisely because it is “common”: it is rarely written down explicitly, which means we don’t have large corpora of examples of this type of reasoning like we have for more specialized skills.

And that is the reason we need alignment. When we tell a tool what we want instead of how to do it, we need the tool to interpret that want in a context that is full of assumptions, restrictions, and trade-offs, which are often implicit. Alignment means having an AI system apply the right implicit context and find the solution to our request that is, as the name implies, most closely aligned with what we really want, instead of just any solution that superficially fits the explicit request.

The crucial reason alignment is hard is the interplay between two critical parts of the AI equation: the inherent complexity of the world and the unavoidable brittleness of the way we model it.

Let’s break it down.

Why AI alignment is hard

Many reasons make AI alignment one of the hardest problems a system designer can face. Some of those reasons involve our inability to produce a robust enough description of the task we want to solve –that is, we cannot fully describe the context and all the implicit assumptions and restrictions that apply in that context. These reasons are related to the nature of the problem itself –getting any intelligent agent to do what you want is intrinsically hard; if you’re a parent, you know exactly what I mean. Other reasons are related to the nature of the solution we currently have, that is, systems built with machine learning, trained on imperfect datasets to maximize imperfect proxy metrics.

These are interrelated but separate challenges, so let’s take them one at a time.

Implicit contexts

When using a regular tool –like a hammer, a calculator, or Excel– you have an end goal in mind, but the tool doesn’t need to understand that goal; it just needs to follow your precise instructions. However, when working with AI, there are many assumptions about how the world works that aren’t explicitly described in your instructions.

For instance, if you tell an advanced AI to make coffee, there are numerous implicit restrictions: don’t destroy the coffee machine, don’t harm any animals, don’t walk through walls, etc. Humans generally understand these unstated rules because we share a common understanding of the world. So there is a significant difference between systems that require specific instructions on how to perform tasks and those that simply need to be told what tasks to accomplish.

When you want to tell a system what to do instead of how to do it, you must be very precise in specifying everything it needs to know. In a restricted domain, the constraints may be simple enough to be explicitly encoded or learned from data. For example, in a factory setting, a robotic arm is simply physically incapable of destroying the world, so it doesn’t need to know much about anything outside the narrow task of car painting.

However, training systems for open-ended decision-making in the real world is far more complex. It’s hard to imagine a training scenario that is as intricate as real life. Gaining all the necessary experience to understand the human world like humans do would require something like raising a child from scratch. And the majority of assumptions in those contexts can’t be learned from data, because we simply don’t have training data for “how to be a human”.

Unclear trade-offs

However, the implicit context problem carries an even bigger challenge. While many of the things an AI must care about implicitly are restrictions –e.g., do not kill the passenger– the hardest problem is when they involve trade-offs instead.

This is a fundamental issue built into most optimization problems in the real world. On one hand, you want a system to achieve its objective as effectively as possible. On the other hand, you want it to do so with minimal side effects. These two goals are often contradictory – for example, driving fast versus driving safely.

Many of these unwanted side effects are implicit restrictions: you don’t want to kill any pedestrians or harm the passengers. However, some side effects are not hard constraints, but trade-offs. If you want zero chance of getting in a car accident, the only solution is not to drive. So you do want your AI system to correctly trade off a small risk of getting hurt for the possibility of actually solving the task –getting you from A to B. Pragmatism involves trade-offs.

And we humans frequently make these trade-offs unconsciously, e.g., between getting somewhere faster and taking on a bit more risk on the highway, or going the longer, safer way. This kind of trade-off is at the heart of any complex problem –trade-offs are the very reason engineering problems are hard to begin with!

With an AI system, however, it becomes even worse, because the system not only needs to understand the many implicit constraints and trade-offs in the world, but it also needs to understand how you value those trade-offs. You would need to not only specify potential side effects but to give them appropriate negative values, so the system avoids them while still achieving its primary goal.

This challenge arises because most machine learning systems optimize performance metrics. And the nature of optimization involves comparing numerical quantities. To optimize your goal of a fast yet safe journey, you must quantify these trade-offs. For example, how much inconvenience is being late compared to the risk of a sudden hard brake? Is it worth risking a small head injury to arrive 20 minutes earlier? How do you put numbers to being early versus being safe?

Furthermore, for any value you fail to quantify, your AI will be compelled to prioritize performance over that factor, since there’s no penalty for it. So, it’s crucial to quantify all critical side effects. If you don’t specify a crucial dimension –like car damage–, you’re in big trouble. To save time, an AI system might trade off any arbitrarily high amount of car damage for a small reduction in time.

Since saving even a minute of time has some positive value, and no amount of car damage has any negative value, as long as the car reaches the destination –i.e., it is not absolutely destroyed– the AI is free to choose a marginally better route regardless of how much more damage the car takes. You’ll end up with a system that reaches the destination as fast as possible but considers every car disposable.
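
To make this concrete, here is a minimal sketch in Python of what an unpriced side effect looks like to an optimizer. The routes, numbers, and weights are all invented for illustration; the point is only that when the objective omits a damage term, the cheapest way to shave a minute is to wreck the car.

```python
# Toy sketch with invented routes and numbers. The reward only prices travel
# time; car damage has no cost, so the optimizer happily trades any amount of
# damage for a marginally faster route.

routes = [
    {"name": "highway",    "minutes": 35, "car_damage": 0.0},
    {"name": "back roads", "minutes": 35, "car_damage": 0.1},
    {"name": "off-road",   "minutes": 34, "car_damage": 0.9},  # 1 minute faster, car wrecked
]

def reward_missing_term(route):
    return -route["minutes"]  # damage is simply invisible to the objective

def reward_with_damage(route, damage_weight=20.0):
    return -route["minutes"] - damage_weight * route["car_damage"]

print(max(routes, key=reward_missing_term)["name"])  # off-road
print(max(routes, key=reward_with_damage)["name"])   # highway
```

Of course, choosing a value like damage_weight is itself a value judgment: set it too low and the car is still disposable, set it too high and the system refuses to drive at all. That is precisely the quantification problem described above.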

Imperfect metrics

It is by now well known to every systems designer that when a proxy metric becomes an objective, it loses its usefulness as a metric. Yet this is the daily job of a machine learning engineer. In machine learning, we are always turning proxy metrics into optimization objectives because, in a sense, that is the only thing we can do.

In a typical machine learning problem, we have to turn a complex task into something measurable that our AI can optimize for. So, for the AI, the metric is the actual task. This isn’t too much of a problem if systems aren’t very good at optimizing metrics, as their results tend to stay close to your intended outcome. But as AI systems become better at optimizing our metrics, they can much better exploit the difference between the proxy metric and the actual desired performance.

This leads to an interesting paradox: the smarter the system, the more likely it is to fail to do what you want it to do. A first reason for this seemingly paradoxical phenomenon is that imperfect metrics tend to match our desires in the general case, but the differences with our true intentions become more accentuated in the more extreme cases. It’s like classical mechanics versus general relativity. The former works perfectly for most cases, except when you really need precise calculations of complex astrophysical phenomena.

In the same vein, imperfect metrics –such as getting a high score in a videogame, getting high grades in college, or running for a long distance without crashing– are easier to satisfy up to some degree if you’re actually doing the right thing –playing the game optimally, studying very hard, or driving safely. But the easiest way to satisfy these imperfect metrics to a very high level –like acing the SAT– is to game the system. Instead of studying super hard and really learning a lot, just study tons of SAT tests and learn to answer those exact questions, without really understanding much of the underlying theory.

This phenomenon is one of the many ways in which overfitting shows up in machine learning, and it’s well known that the harder you optimize a metric, the more likely your system will learn the quirks of that specific metric and fail to generalize to the actual situations where you expect it to perform.
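
Here is a tiny numerical sketch of that divergence. Both functions are invented for illustration: the proxy agrees with the true objective near ordinary values but rewards pushing toward extremes, so a stronger optimizer of the proxy ends up with a worse true outcome.

```python
import numpy as np

# Toy sketch (invented functions): the proxy tracks the true objective near
# x = 1 but diverges at the extremes, so a stronger optimizer of the proxy
# scores worse on the thing we actually care about.

def true_objective(x):
    return -(x - 1.0) ** 2                      # what we actually want: x close to 1

def proxy_metric(x):
    return -(x - 1.0) ** 2 + 0.05 * x ** 4      # agrees near x = 1, diverges far away

xs = np.linspace(-2.0, 6.0, 2001)
weak = xs[np.abs(xs) <= 2.0]                    # weak optimizer: small search space
x_weak = weak[np.argmax(proxy_metric(weak))]
x_strong = xs[np.argmax(proxy_metric(xs))]      # strong optimizer: searches everywhere

print(f"weak optimizer:   x = {x_weak:.2f}, true objective = {true_objective(x_weak):.2f}")
print(f"strong optimizer: x = {x_strong:.2f}, true objective = {true_objective(x_strong):.2f}")
# The stronger optimizer achieves a higher proxy score and a much worse true outcome.
```

In this toy example the proxy is only slightly wrong, and the limited optimizer stays harmless; only the stronger optimizer finds the region where the proxy and the true objective come apart.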

And while this can and will happen by accident, there is an even more insidious problem. The smarter the system, the more likely it is to learn, intentionally, to game the system.

Reward hacking

Imperfect metrics are a problem in all machine learning scenarios, but they become even more challenging in reinforcement learning. As a quick recap, reinforcement learning is when, instead of showing an AI examples of a well-done task, you just let it try things out, and reward it when those attempts lead to progress.

The reason we need reinforcement learning is that, for many complex problems, it is simply impossible to produce a sufficiently large dataset of good examples. Sometimes it’s unfeasible –e.g., collecting thousands of hours of expert drivers across many scenarios– and sometimes it is, even in principle, impossible –e.g., when you’re building a system to do something you yourself can’t do, like, I don’t know, flying a drone through a burning building?

So, instead of examples, we let the AI loose and evaluate whether it reaches the intended goal. For example, you let an AI take control of your car –say, in a simulated environment that is sufficiently detailed, like GTA5– and reward it for how many miles it can stay on the road without crashing.

Now, what is the easiest way to optimize that metric? Maybe something like driving at 2 km/h? That’s what your AI –if it’s smart– will learn to do. So now you add a new restriction, say, distance only counts if the AI goes over 40 km/h. Then the AI will learn to drive forward for 100 meters, shift to reverse, drive back slowly, and repeat. You can keep adding constraints and making the evaluation metric as complicated as you want, but the key point is this: all metrics are gameable, and the smarter your AI system is, the better it will be at gaming whatever metric you design.
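
Here is a toy sketch of that arms race, with invented policies and crash probabilities. Each policy is summarized only by the statistics the metric can see, and each patched metric ends up maximized by a new degenerate behavior rather than by actual good driving.

```python
# Toy sketch: invented policies, each described only by what the metric observes.
policies = {
    "careful driver":            {"speed_kmh": 60, "crash_prob_per_km": 1e-3},
    "crawl at 2 km/h":           {"speed_kmh": 2,  "crash_prob_per_km": 1e-5},
    "back-and-forth at 45 km/h": {"speed_kmh": 45, "crash_prob_per_km": 1e-4},  # zero net progress
}

def metric_v1(p):
    # reward: expected km on the road before the first crash
    return 1.0 / p["crash_prob_per_km"]

def metric_v2(p):
    # patch: distance only counts when moving faster than 40 km/h
    return 0.0 if p["speed_kmh"] < 40 else 1.0 / p["crash_prob_per_km"]

for name, metric in [("v1", metric_v1), ("v2", metric_v2)]:
    winner = max(policies, key=lambda k: metric(policies[k]))
    print(f"metric {name} is maximized by: {winner}")
# v1 -> crawl at 2 km/h; v2 -> back-and-forth at 45 km/h (which never gets anywhere)
```

Neither version of the metric mentions “actually get somewhere,” so neither winner does.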

This happens, again, because the AI doesn’t know what you truly want, only what you are measuring. And when metrics become an objective, they cease to be good metrics.

To address this, instead of designing an explicit metric, we can let AI systems act and have humans provide feedback on whether their actions are good or not. Then another machine learning system learns to approximate the evaluators’ assessments and acts as an implicit performance metric. This creates a two-level ML system, where one model provides the metric that the other is trying to optimize –and potentially game. This process is called reward modelling or, alternatively, reinforcement learning from human feedback (RLHF), and it is our current best approach to preventing reward hacking.
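
As a rough illustration of the reward-modelling idea, here is a minimal sketch in Python using PyTorch. Everything here is invented for illustration: behaviors are random feature vectors, and the “human feedback” is just which of two behaviors was preferred. A small network learns a scalar score that agrees with those preferences and can then stand in for a hand-written reward.

```python
# Minimal sketch of reward modelling from pairwise human preferences.
# All data and the tiny network are invented for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)

reward_model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Each pair: a behavior the human preferred vs. one they rejected
preferred = torch.randn(64, 4) + 1.0   # pretend "good" behaviors cluster here
rejected  = torch.randn(64, 4) - 1.0

for step in range(200):
    r_good = reward_model(preferred)
    r_bad = reward_model(rejected)
    # Bradley-Terry style loss: push preferred behaviors to score higher
    loss = -torch.log(torch.sigmoid(r_good - r_bad)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned model now acts as an implicit metric: a policy would then be
# trained to maximize its score instead of a hand-written reward, which is
# exactly where the next level of gaming can appear.
```

In a real RLHF pipeline, this learned reward model replaces the hand-designed metric during reinforcement learning, which is why, as discussed next, the gaming problem is displaced rather than removed.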

However, there are still challenges even with RLHF. Your evaluator AI can learn the wrong model from your feedback, because, again, it is being trained to optimize some imperfect metric –like minimizing the error between its predictions and yours. In the end, you’re pushing the problem of reward hacking one level up, but not getting rid of it.

And finally, even if your system behaves as intended, how can you know it is actually doing so because it truly understands your intentions?

Internal objectives

The final challenge I want to address is the interplay between internal and external objectives. Today, the most powerful learning algorithms and problem-solving methods we have are all based on optimization. Optimization algorithms power machine learning, symbolic problem-solving, operations research, logistics, planning, design, etc. As AI designers, if we turn to optimization to create powerful decision-making algorithms and train a highly intelligent AI, it’s likely that the AI’s internal processes will also involve optimization.

What this means is the following: suppose you train a highly capable AI agent to solve problems in the real world. This agent would be capable of long-term planning, self-reflection, and updating its own plan as it explores the world. It is sensible to think that, whatever this agent is doing internally to plan its solution, it will use some form of optimization algorithm. Maybe the agent will rediscover reinforcement learning and use it to train its own mini-agents in real time (like tiny homunculi inside its artificial mind).

If this looks like sci-fi, consider that we humans are basically intelligent agents optimized by evolution to solve the problem of staying alive –I know this is a huge oversimplification, but please, biologists out there, don’t crucify me yet; this is just a useful analogy. So, in the process of solving the problem of staying alive, we came up with optimization algorithms of our own that run inside our brains. A sufficiently intelligent AGI would presumably be able to do the same, right?

Now here is the problem. You give this AGI some external objective to solve, and it will come up with internal objectives to optimize for. But we might not be able to see this internal optimization algorithm at all. If the AI resembles anything we have today, it will be a massive black-box number-crunching machine.

Just like you can’t really read out of a human brain what their true objectives are –at least, not yet– we might never be able to truly understand what the AI is optimizing for internally as it strives to solve our problem. We can observe external behavior but might never see the actual internal objectives. All we can do is judge the system based on its actions.

In essence, we can only evaluate how agents –humans or not– act, but not their true motivations. And if someone always acts as if their motivations are aligned with ours, it may be difficult to identify any misalignment that could arise in the future. Maybe they are aligned with 98% of our objectives, or maybe only while there is no solar eclipse or some other weird stuff like that. We simply can’t know for sure.

Is it all lost?

I hope you now understand why this is a devilishly hard problem to solve. The very nature of intelligence makes this an adversarial situation. We want systems that are both highly self-sufficient and very dependable. We need them to think and act on their own, but we need to trust them. And the smarter they become, the blinder that trust has to be, and the greater the potential for catastrophe.

All is not lost, though. There’s an enormous body of research on the alignment problem, and while there are no silver bullets yet –and perhaps never will be– we’ve made significant progress.

One final thought. So far we’ve been focusing on internal challenges to AI alignment –challenges related to the task and the solution. But there’s an elephant in the room. AI alignment is literally having an AI aligned with our values. But whose values? We’re all different, and we all have different opinions about what’s important. So that’s a crucial conversation we need to have as these systems start to impact the daily lives of people all around the world.