10 The Road to AGI
Artificial General Intelligence (AGI) has been one of the ultimate goals of Artificial Intelligence since its inception, starting with Alan Turing and the Turing test. In the 60s and 70s, the primary aim was to create something as intelligent as humans. However, for a while, this objective took a backseat as researchers grappled with the immense challenges of developing even the simplest AI systems.
Recently, with the emergence of large language models that are nearly capable of passing the Turing test, discussions about AGI have resurfaced. Many people believe we are on the cusp of a major breakthrough that will lead us to AGI, and some even think that GPT-4 is already an AGI.
In this essay, I want to challenge this notion by exploring what Artificial Intelligence and Artificial General Intelligence truly entail. I will discuss what generalization means in the context of AI and explain why achieving AGI is so difficult. Ultimately, I will argue that everything we currently have and know how to create falls short of true General Intelligence, and I will sketch what we have to do instead to get there.
General AI is not (just) scaled-up narrow AIs
Let’s begin by quickly reviewing what we mean when we say “narrow” and “general” AI, and then I’ll tell you why I think narrow times x for any x is not equivalent to general.
AI is cognitive automation. It’s about building computer programs that solve tasks for which we humans rely on our cognitive abilities: from puzzles to perception to reasoning to planning, and anything else where humans employ their intelligence to solve problems.
Narrow AI is just AI within a concrete, well-defined domain. Playing chess at the world-champion level is narrow AI. Playing Go like freaking gods is narrow AI. Self-driving cars are narrow AI. And yes, ChatGPT is narrow AI. I’ll tell you in a moment why.
In contrast, general AI is AI that works across all domains in which human intelligence can be effectively deployed. This includes, crucially, domains that we don’t even know exist today, but when we find them, we will function there just as effectively as in everything else —just like modern physics didn’t exist in Aristotle’s time, but there’s no cognitive difference between Einstein and Aristotle. Raise an Aristotle in 1900, and he would master relativity just fine.
The difference between narrow and general AI is in the level of generalization. I will borrow a bit from Francois Chollet’s paper “On the Measure of Intelligence,” which I truly recommend, although I’ll rephrase things in my terms.
The notion of generalization in AI, and more specifically machine learning, is the ability of a model to perform effectively outside the conditions in which it was designed —or trained— to work.
In its simplest form, we expect at least out-of-training generalization, i.e., that models perform well on instances they didn’t see during training. The opposite of that is called overfitting: performing well on training instances but terribly on everything else. The only thing we need for out-of-training generalization is for our model to interpolate correctly between training samples. Any sensible model that neither overfits nor underfits the training set will exhibit this mediocre level of generalization.
The next step is out-of-distribution generalization, i.e., models that work not only for unseen instances that are pretty similar to training instances but also for instances that are quite different. Of course, “quite” is doing much of the work in that sentence. Technically, you want the model to extrapolate beyond the training examples. For that, you need sensible priors that capture what you expect the data distribution to be outside the convex hull of the training data. This is the level of generalization most of us have in mind when we say that a model generalizes well to unknown inputs.
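To make the gap between these first two levels concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption of mine: the target function, the nearest-neighbor "model", and the two test ranges. It only shows that a model which interpolates well between training samples can still fail badly once the inputs leave the training range.

```python
def target(x: float) -> float:
    """The 'true' function the model should learn (an arbitrary choice)."""
    return x * x


def fit_nearest_neighbor(xs, ys):
    """Return a predictor that copies the label of the closest training point."""
    data = list(zip(xs, ys))

    def predict(x: float) -> float:
        _, nearest_y = min(data, key=lambda point: abs(point[0] - x))
        return nearest_y

    return predict


def mean_abs_error(predict, xs) -> float:
    """Average absolute difference between predictions and the true function."""
    return sum(abs(predict(x) - target(x)) for x in xs) / len(xs)


# Training set: 11 evenly spaced points in [0, 1].
train_x = [i / 10 for i in range(11)]
model = fit_nearest_neighbor(train_x, [target(x) for x in train_x])

# Unseen points that fall between training samples (interpolation).
in_range = [i / 10 + 0.05 for i in range(10)]
# Unseen points far outside the training range (extrapolation).
out_of_range = [2.0 + i / 10 for i in range(11)]

in_error = mean_abs_error(model, in_range)
out_error = mean_abs_error(model, out_of_range)
print(f"error inside the training range:  {in_error:.3f}")
print(f"error outside the training range: {out_error:.3f}")
```

The nearest-neighbor predictor has no prior about what happens beyond the data it saw, so outside the training range it simply repeats the last label it knows. Doing better there requires building an assumption about the shape of the function into the model.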
And finally, the ultimate step is out-of-domain generalization, i.e., models that can be applied to new types of problems in completely new domains without retraining. And we still have no idea what it takes to get there.
So this is the line that separates narrow from general AI. Everything up to and including out-of-distribution generalization is narrow AI, and that is all we currently have.
The reason is that this type of generalization is developer-aware, to put it in Chollet’s terms. This means that even if the test instances are unknown to the model, they are known to the developer. That’s why we can find sensible priors and inductive biases that extrapolate effectively. Because we know beforehand how the test instances will look, we can design models that will perform well on those instances, even if the models are not exposed to those exact instances during training.
Extreme generalization, as out-of-domain generalization is also called, is a completely different beast. To get a narrow AI in any given domain, our model has to master one skill: playing chess, identifying pedestrians, or folding molecules. But crucially, in each of these cases, the actual learning algorithm is designed ad hoc.
But to get to general AI, our models must master skill acquisition. That is, they have to learn to learn. In a sense, a general AI is an AI researcher, automated. And yes, there is research in this direction —the field is aptly called meta-learning— but we’re far from even automatically discovering the most basic learning algorithms we have already designed.
Thus, back to my initial claim: no amount of scaling-up narrow AIs will lead us to general AI. There is no point where learning disparate skills (object detection, speech synthesis, chess playing; you name it) suddenly leads us to learn how to learn. There’s a qualitative jump there, in the same sense that building ever taller skyscrapers won’t lead us to the moon. It’s not just a matter of scaling what has worked so far. You need a completely different theory.
But wait —I hear you say— aren’t you making some gross generalization when you gloss over the term “concrete skill”? Isn’t “actually learning” just another concrete skill?
Yes, it is. But it is not a skill that can be learned via the typical machine learning framework. No amount of data in any specific domain will lead a model to find a hypothesis outside its hypothesis space. And in any specific domain, at least for any machine learning approach we currently have, the hypothesis space only contains hypotheses about the entities in that domain. It doesn’t contain hypotheses about the hypothesis space itself.
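A small sketch can illustrate the point about hypothesis spaces. The setup is my own assumption, chosen only for simplicity: the hypothesis space contains straight lines, the data come from a parabola, and the fit is ordinary least squares in closed form. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def residual_error(n_samples: int) -> float:
    """Mean squared error of the best line fitted to n samples of y = x^2."""
    xs = [-1 + 2 * i / (n_samples - 1) for i in range(n_samples)]
    ys = [x * x for x in xs]
    slope, intercept = fit_line(xs, ys)
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / n_samples


# More data does not help: the best line is still a line.
errors = {n: residual_error(n) for n in (10, 100, 1_000, 10_000)}
for n, error in errors.items():
    print(f"{n:>6} samples -> mean squared error {error:.4f}")
```

The error settles at a nonzero value instead of shrinking toward zero, because the parabola is simply not among the hypotheses the learner can express. Fixing that takes a change to the hypothesis space, which in today's practice is made by the developer, not by the model.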
But what about LLMs, you ask. Don’t they display generalization to many unseen tasks with very few examples? Isn’t in-context learning an instance of learning to learn?
Well, kind of. To be clear, LLMs are the most general models we currently have. They generalize past what we expected at this point of technological development and exhibit an uncanny ability to adapt to new tasks as long as we frame them in an appropriate linguistic framework. Even so, I claim that LLMs, while super exciting and impressive, are still just narrow AI.
In short, while I believe NLP to be AI-complete (meaning that actually solving NLP entails solving all of AI), I strongly believe the stochastic language modeling paradigm we currently use in all LLMs is not NLP-complete. That means there are unsolved problems in NLP that no amount of data and computing power will solve until we find something better than LLMs.
Let’s go there, then.
Emergent abilities in LLMs
We already agree that narrow or domain-specific AI can’t simply scale to general-purpose AI with more data and computing power. The gist of the argument is that learning to learn —which is what is required for general AI— is a completely different beast than learning any other domain skill.
But one thing we left hanging in the previous section was the discussion of whether current large language models are something “above” narrow AI, even if not yet completely general. One argument often made in favor of their generality is that these models exhibit surprising abilities they were not explicitly trained for, so-called emergent abilities.
So let’s briefly analyze emergent abilities, where they come from, and how significant they are in discussing narrow vs. general AI. As I usually do, instead of simply giving you my opinionated views, I will try to lay out a framework for thinking and talking about emergent abilities in computational systems that helps you reach your own conclusions, though I will also tell you mine.
As an appetizer, I sincerely recommend you read this Twitter/X thread from Francois Chollet, one of my all-time favorite level-headed voices in AI.
What is emergence?
If you read Chollet’s thread you’ll notice he uses a particularly loose definition of “emergent”, taken as some surprising ability or property you didn’t plan explicitly. The problem with this definition is that it is very subjective: different people looking at the same system will see emergence or not depending on their expectations.
If we’re going to have a serious discussion about emergence, we need a more objective definition. I will propose a slightly better definition, although it won’t be without issues.
A property or ability in a system is emergent if it is displayed at the system level but not at the level of any of its components.
Now that we have a working definition, let’s run a sanity check. The least you want from a definition is that it is nontrivial. A trivial definition is always true or always false. A nontrivial one will apply to some cases but not all. So let’s see if we can find sensible examples of both emergent and non-emergent properties.
I’m focusing on computational systems in this essay, although this discussion extends to all sorts of systems. But let’s keep it simple and focused.
A straightforward example of a non-emergent property in computational systems is error tolerance. Find any error-tolerant system —e.g., a highly distributed database— and you can almost certainly pinpoint which components confer that property to the system, e.g., redundant storage, consensus algorithms, etc. The point is you don’t get error tolerance for free; rather, it is often baked into the system in the form of some explicit component implementing that functionality.
In contrast, a nontrivial emergent property in a computational system is Turing completeness, the ability to perform any possible computation. The simplest Turing-complete system, the basic Turing machine, is a smart wiring of a plain and simple finite state machine to a plain and simple read-write memory tape. Neither the state machine nor the tape is Turing-complete by itself. It is only their interaction that makes the system so capable.
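To see how little each component does on its own, here is a minimal sketch of a Turing machine simulator in Python. The transition table (a binary incrementer), the state names, and the `run` helper are all illustrative choices of mine. The table is a plain finite state machine, the dictionary is a plain tape that grows on demand, and the interesting behavior only appears in the loop that wires them together.

```python
from collections import defaultdict

BLANK = "_"

# Finite state machine: (state, symbol read) -> (next state, symbol to write, head move).
# This particular table adds 1 to a binary number written on the tape.
INCREMENT = {
    ("seek_end", "0"): ("seek_end", "0", +1),
    ("seek_end", "1"): ("seek_end", "1", +1),
    ("seek_end", BLANK): ("carry", BLANK, -1),
    ("carry", "1"): ("carry", "0", -1),
    ("carry", "0"): ("halt", "1", 0),
    ("carry", BLANK): ("halt", "1", 0),
}


def run(transitions, tape_input, start="seek_end", max_steps=10_000):
    """Run the machine until it halts and return the non-blank tape contents."""
    # The tape: unbounded in both directions, blank wherever nothing was written.
    tape = defaultdict(lambda: BLANK, enumerate(tape_input))
    state, head, steps = start, 0, 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted; the machine may never halt")
        state, tape[head], move = transitions[(state, tape[head])]
        head += move
        steps += 1
    cells = range(min(tape), max(tape) + 1)
    return "".join(tape[i] for i in cells).strip(BLANK)


print(run(INCREMENT, "1011"))  # binary 11 plus 1
print(run(INCREMENT, "111"))   # the carry runs past the left edge of the input
```

Note the `max_steps` guard: it is there only because a simulator running on a real computer cannot wait forever, which is a preview of the bounded-computation issue discussed later.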
These two examples highlight an alternative definition for “emergent” that aligns more with Chollet’s intuition and the common definition —that an emergent ability is something you discover in your system rather than explicitly designing for it.
A property or ability in a system is emergent if it is caused by the interaction between components and not by the function of any single component.
There is one final distinction to be made between weak and strong emergence. Weak emergence is when an emergent property can be reductively explained by looking at the relationships between system components. Strong emergence is when that’s not the case: no matter how you decompose the system into components, no reductive theory explains —i.e., predicts— the emergent property from those components.
In the above sense, Turing-completeness is a weakly emergent property, because we can explain how the interaction between a finite state machine and an infinite memory, neither of which is Turing-complete by itself, suddenly gives rise to this property.
So far, we don’t know if there are strongly emergent properties in the world, let alone computational systems, but a good candidate is consciousness. If materialists are right and consciousness is just an emergent property of matter, i.e., all mental states correspond to some physical states of the brain, then one possible solution to the hard problem is claiming that consciousness is a strongly emergent property in sufficiently complex information processing systems; thus, it cannot be reductively explained.
Emergence in language models
Now that we have non-trivial examples of emergent and non-emergent properties in computational systems, let’s turn to the specifics of machine learning.
Chollet uses a simple example of emergence in language models in his thread: encoding semantic properties in word embeddings. When we train a word embedding model, like word2vec, we don’t explicitly design it so that specific directions in the embedding space correspond to semantic properties like gender, time, location, or relations like synonymy, meronymy, or antonymy.
We train the model so that word embeddings are useful for predicting a word’s context distribution. It just happens that the best way to encode a word’s context in a fixed-length real-valued vector is such that words with similar semantic relations end up having similar geometric relations. Thus, this is an emergent property.
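To show what "similar semantic relations end up having similar geometric relations" means in practice, here is a toy sketch. Be warned that the vectors are written by hand for illustration; they are not learned embeddings, so nothing emerges here. The example only shows the geometry that a trained model such as word2vec ends up approximating, and all names are hypothetical.

```python
import math

# Hand-written toy vectors, NOT learned embeddings.
# Axis 0 loosely means "royalty", axis 1 loosely means "gender".
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a: str, b: str, c: str) -> str:
    """Solve 'a is to b as c is to ?' with the vector arithmetic b - a + c."""
    va, vb, vc = TOY_VECTORS[a], TOY_VECTORS[b], TOY_VECTORS[c]
    query = tuple(y - x + z for x, y, z in zip(va, vb, vc))
    candidates = [word for word in TOY_VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(TOY_VECTORS[word], query))


print(analogy("man", "king", "woman"))
```

In a real embedding model nobody writes those axes down. The surprising part is that training on context prediction alone tends to produce directions that behave approximately like them.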
However, from the previous discussion, I hope you agree this is a weakly emergent property. We understand pretty well why certain directions in the embedding space tend to approximately encode certain semantic properties. It is surprising but explainable, as much research has been published on the explanation. Crucially, we now know how to train word embeddings such that certain directions encode specific semantic properties, so we can design for this ability.
Now, the most interesting emergent ability in machine learning, this time in large language models like GPT-3, is, of course, in-context learning. This is the ability to prime a language model to solve a seemingly novel task via carefully constructed prompts without fine-tuning the weights. Most, if not all, of the surprising use cases of LLMs are ultimately reducible to particular cases of in-context learning.
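For readers who have not seen it, this is roughly what such a prompt looks like. The sketch below only builds the text; it does not call any model, and the `Input:`/`Output:` layout and the function name are assumptions of mine, not a format any particular LLM requires.

```python
def build_few_shot_prompt(instruction: str, examples, query: str) -> str:
    """Assemble an instruction, a few solved examples, and one unsolved query."""
    lines = [instruction, ""]
    for source, solved in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {solved}")
        lines.append("")
    # The final example is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Reverse the order of the words.",
    [("red green blue", "blue green red"), ("one two", "two one")],
    "cats chase mice",
)
print(prompt)
```

The task is specified entirely in the input text. The model's weights stay untouched, and whatever "learning" happens takes place while the model processes the prompt.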
But is in-context learning a weakly or a strongly emergent ability? We still don’t have a precise explanation for why it happens. Specifically, we still cannot say, “these are the exact interactions between these components that make in-context learning appear.” However, this doesn’t mean there isn’t an explanation; it just means we still haven’t found it. I personally believe that in-context learning is weakly emergent and that we will find a convincing explanation sooner rather than later.
Can LLMs reason?
One especially exciting thing about in-context learning is that, for some tasks, it implies LLMs need to construct internal world models. The canonical example, which appears in the famous “GPT-4 sparks of general intelligence” paper, is the ability to solve Othello puzzles. After analyzing the internal activations on certain Othello prompts, researchers discovered what seemed like a pretty explicit representation of the Othello board, together with correct topological relations between the pieces and their locations.
A recent preprint argues that, after a thorough evaluation, GPT-4 cannot reason. According to the author, the conclusion rests on a set of carefully designed reasoning problems with more sophisticated qualitative analysis. We must remember that this paper is not peer-reviewed, so take it with a grain of salt, but after its publication, there have been many more examples showing catastrophic reasoning failures in GPT-4. At the moment, no one serious would claim any of the existing models actually reasons at a level similar to most humans.
However, beyond empirical evaluations, we can make a formal argument to highlight why the current “pure” language modeling paradigm has to be incapable of fully-fledged reasoning. It’s a classic argument in computability theory, and it goes like this.
If any computational model can reason, to the full extent of the meaning of this word in the context of artificial intelligence, it must be able to perform Turing-complete computations, for if a model is not Turing-complete, that means there are decidable problems it cannot solve. One key aspect of Turing completeness is that it requires potentially unbounded computation. That is, there are some problems for which, on some inputs at least, the model must be able to spend an arbitrary amount of computation.
A simple example is answering an open math question, like, “What is the smallest odd perfect number?” We don’t know, first, if there are infinitely many perfect numbers; second, how far apart they are; and third, if there is an odd one. However, we can write a simple program that tries every number in order and stops when it finds an odd perfect number. The program will either never stop or find the smallest one.
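Such a program fits in a few lines. The sketch below is one straightforward way to write it in Python; the function names and the optional `limit` parameter are my own additions, the latter only so that the example can actually be run to completion.

```python
from itertools import count


def sum_proper_divisors(n: int) -> int:
    """Sum of the divisors of n that are smaller than n."""
    if n < 2:
        return 0
    total = 1  # 1 divides every n >= 2
    d = 2
    while d * d <= n:
        if n % d == 0:
            total += d
            if d != n // d:  # avoid counting a square root twice
                total += n // d
        d += 1
    return total


def is_perfect(n: int) -> bool:
    """A perfect number equals the sum of its proper divisors."""
    return n > 1 and sum_proper_divisors(n) == n


def smallest_odd_perfect(limit=None):
    """Try 3, 5, 7, ... in order; with limit=None this may never return."""
    for n in count(3, 2):
        if limit is not None and n > limit:
            return None  # gave up: says nothing about larger numbers
        if is_perfect(n):
            return n


print([n for n in range(1, 1000) if is_perfect(n)])  # the even ones show up quickly
print(smallest_odd_perfect(limit=10_000))            # bounded search, so it terminates
```

Called without a limit, `smallest_odd_perfect()` is exactly the program described above: it halts if and only if an odd perfect number exists, and nobody currently knows which of the two will happen.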
Now, by design, GPT-4, or any pure language model, cannot “think forever.” The amount of computation spent in answering any prompt is bounded by a fixed function of the input length, which depends only on the model size. Thus, by a simple argument, LLMs cannot reason, at least to the extent of answering semi-decidable questions.
Now, you can claim, “GPT-4 can reply that the answer to this question is unknown, and that reply is true.” And yes, it can. But you don’t need reasoning to give me that reply. You only need to repeat the Wikipedia entry on perfect numbers. We expect more from fully-fledged reasoning systems.
There is a way out, however. GPT-4 could generate the necessary code to answer this question and run it. And that is precisely what the newest iteration of ChatGPT does with Code Interpreter. Tying a language model to a code interpreter gives you, potentially, a system capable of Turing-complete reasoning, and trivially so because the code interpreter is already Turing-complete. However, it remains to be seen if the language model can generate the correct programs.
This is what gets me most excited about LLMs in the near term. Wiring them with formal systems can lead to qualitatively superior functionality. The cloud on the horizon is that last caveat: we don’t know if language models trained on existing code will be capable of generalizing to generate new, unseen code for novel problems. And we will never know for sure because proving that any given program has a non-trivial semantic property is generally undecidable.
However, humans are also bounded by this formal limitation —unless you believe in magic, which I don’t— and we get along with programming pretty well. Thus, we don’t need magical LLMs that can solve undecidable problems. We need LLMs that, in practice, can solve the same problems we can, just faster and slightly more reliably. That would already be a massive transformation of the entire field of Computer Science.
The road to general-purpose AI
LLMs, and most, if not all, practical machine learning models, are computationally bounded. That means that, for any given input, they perform a fixed amount of computation that is determined solely by the size of the model and the size of the input, irrespective of the content of the input.
This immediately restricts the class of problems these models can solve because, intuitively, two prompts of similar size can encode problems of widely different complexity. Formally, we can say that a computational system, to be Turing-complete, must have inputs for which you cannot predict beforehand the amount of computation necessary. It can even be an infinite amount. Thus, any computational system that performs a fixed amount of computation for all inputs is bound to be less than Turing-complete. There are questions it simply cannot answer correctly.
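A back-of-the-envelope sketch makes this tangible. It rests on two loudly stated assumptions: the common rule of thumb that a forward pass costs about two floating-point operations per parameter per token, and a crude whitespace "tokenizer" that stands in for a real one. The parameter count is hypothetical.

```python
def whitespace_tokens(prompt: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated pieces."""
    return len(prompt.split())


def forward_flops(n_params: int, n_tokens: int) -> int:
    """Rule-of-thumb cost of one forward pass: about 2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens


N_PARAMS = 1_000_000_000  # hypothetical model size

easy = "Is 7 a prime number ?"
hard = "Is P equal to NP ?"

easy_budget = forward_flops(N_PARAMS, whitespace_tokens(easy))
hard_budget = forward_flops(N_PARAMS, whitespace_tokens(hard))
print(easy_budget, hard_budget)
```

Both prompts have the same length, so under this estimate the model spends the same budget reading a schoolbook question and one of the most famous open problems in computer science. Generating a longer answer adds computation, but only up to a ceiling fixed by the context size, never an amount chosen by the difficulty of the problem.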
However, this previous description only applies to the “pure” language modeling paradigm in which the input and output are natural language text. We can attempt to bypass it by allowing the language model to output computer code and, crucially, execute that code before answering.
This makes the whole system Turing-complete, and trivially so because the language model could generate the correct code to answer any semi-decidable question and execute it. You get Turing completeness because one of the components of the system is already Turing-complete. So it’s not even an emergent ability in the sense in which we defined it earlier.
But… that an AI system is, in principle, Turing-complete only tells us a part of the story. There’s no guarantee that by hooking a language model up to a code interpreter, you can suddenly solve all solvable problems. For that, we need the model actually to output the correct code for a given prompt, and this is where things turn interesting.
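The wiring itself is almost embarrassingly simple, as the sketch below suggests. It is not how ChatGPT's Code Interpreter is implemented, which I don't know in detail; `fake_language_model` is a hypothetical stand-in that returns canned code, and the convention that the generated program stores its result in a variable called `answer` is my own.

```python
def fake_language_model(question: str) -> str:
    """Stand-in for a real model: returns canned Python source for one question."""
    return (
        "def divisors(n):\n"
        "    return [d for d in range(1, n) if n % d == 0]\n"
        "answer = next(n for n in range(2, 10_000) if sum(divisors(n)) == n)\n"
    )


def answer_with_interpreter(question: str, generate_code):
    """Ask the model for a program, run it, and return what it computed."""
    source = generate_code(question)
    namespace = {}
    # Running generated code is dangerous outside a sandbox; this is only a sketch.
    exec(source, namespace)
    return namespace["answer"]


result = answer_with_interpreter(
    "What is the smallest perfect number?", fake_language_model
)
print(result)
```

All the computational power lives in `exec`, and all the difficulty lives in `generate_code`. The interpreter will faithfully run whatever it is given, correct or not.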
Towards program synthesis…
This is the problem of program synthesis. On the surface, program synthesis looks like just one specific domain of generative AI. Instead of generating natural language text, or beautiful images of cats, the model is asked to generate code for a given problem. However, this problem is AI-complete.
The reason is simple. If a computer can generate a correct program for any given question, it can definitely answer any given question. Thus, you can reduce all of AI to program synthesis. Image generation? Generate a program that tells Photoshop what to do. Self-driving cars? Generate a program that tells the car how to move. Passing the Turing test? Generate a program that is a perfect chatbot.
I’m stretching this to an extreme to show that, since all a computer can do is run programs, if an AI system can always generate the correct program for any given question, it can do anything that any other computational system can do. This is, by definition, AGI.
Put another way. Suppose there exists some task T that a computer program P can do. Now take our program synthesis AI and ask it to solve that task T. Doesn’t know how to do it? Never mind, it can just generate and run the exact code of program P.
In fact, program synthesis is exactly what programming is about. We take a problem description and write code that solves it. We don’t need to know how to play chess best, the correct trajectory for a rocket to land on Mars, or anything else. We just need to be sufficiently smart to be able to code a program to solve that problem.
…and beyond
Does this mean ChatGPT with Code Interpreter is a step closer to AGI? Well, kinda, but we’re still far, far away. What current LLMs can do is a very restricted form of program synthesis. If you’ve used Copilot or any other AI code generator, you surely understand this. As impressive as it is, it still makes enough mistakes at anything but the most trivial prompts that it requires constant supervision by someone who can actually code.
We should expect this to improve, though. The question is whether the current LLM paradigm (self-supervised pretraining + instruction fine-tuning + RLHF) can get us all the way to full program synthesis from natural language.
As usual, we have reasons to suspect this problem is not entirely solvable. A formal argument against perfect program synthesis is Rice’s theorem. In short, determining whether a given program satisfies any non-trivial semantic property (such as never going into an infinite loop or never de-referencing a null pointer) is, in general, undecidable.
However, unless you believe human brains are qualitatively superior to Turing machines, we are also limited by this argument, and we consider ourselves general-purpose programmers —at least I do!
Thus, general-purpose program synthesis need not be perfect. It just needs to be good enough, at the level of the best human programmers. Can we get there?
I believe we can, but we need some paradigm shifts. A quick comparison between how humans learn to do, and actually do, program synthesis will show the vast chasm between humans and LLMs concerning coding skills.
First, humans don’t have a training mode or an inference mode. We are never-ending learners, always improving our own understanding of every action we take. Second, humans don’t write working code at once. We write it, test it, debug it, improve it, refactor it, and start over. Furthermore, we don’t interact with code sequentially, one token after the other. We insert, delete, and move things around as necessary. And finally, we don’t just take natural language prompts as input. We formulate follow-up questions, develop counter-examples, and generally maintain a natural language conversation parallel to the code evolution.
None of this is, I believe, outside the reach of machine learning. But we need a new paradigm. An agent that can write and tweak a codebase arbitrarily, interspersed with a natural language conversation about it, and frequently executing that code and observing its behavior, all that while being able to update itself —not necessarily through gradient descent— to learn continuously from the whole process.
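To give a flavor of the simplest piece of that loop (write, run, observe, revise), here is a deliberately tiny sketch. It is not a proposal for how such an agent should be built. `scripted_proposer` is a hypothetical stand-in for the model that returns a buggy program first and a fixed one after seeing feedback, and the part that matters most, the agent updating itself from the experience, is entirely missing.

```python
def run_tests(source: str, tests):
    """Execute candidate source and return a list of human-readable failures."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox; fine for a sketch
        solve = namespace["solve"]
    except Exception as error:
        return [f"code did not load: {error!r}"]
    failures = []
    for args, expected in tests:
        try:
            actual = solve(*args)
        except Exception as error:
            failures.append(f"solve{args} raised {error!r}")
            continue
        if actual != expected:
            failures.append(f"solve{args} returned {actual!r}, expected {expected!r}")
    return failures


def synthesize(propose, tests, max_rounds=5):
    """Alternate between proposing code and observing how it behaves."""
    feedback = []
    for round_number in range(1, max_rounds + 1):
        source = propose(feedback)
        feedback = run_tests(source, tests)
        if not feedback:
            return source, round_number
    return None, max_rounds


def scripted_proposer(feedback):
    """Hypothetical model: a wrong first draft, then a fix once it sees failures."""
    if not feedback:
        return "def solve(a, b):\n    return a - b\n"
    return "def solve(a, b):\n    return a + b\n"


tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
final_source, rounds = synthesize(scripted_proposer, tests)
print(f"passed after {rounds} round(s):\n{final_source}")
```

Even this toy version shows why the loop matters: the first draft is wrong, and the only thing that exposes the mistake is running the code and looking at what it does.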
General-purpose program synthesis from natural language is the hardest problem of Artificial Intelligence, if only because it subsumes all the other problems. We have made significant steps already, but we are still at the earliest stages in this race.
The road to AGI necessarily goes through program synthesis. But we can’t stop there. Useful and safe AGI must be able to reason and act while respecting human values if it is to help us reach our greatest potential as a species. However, making sure a sufficiently smart AI is safe, and not just pretending to be safe, may very well be the hardest engineering problem we’ve ever faced!