13  The limits of language models

In this chapter we will examine what are the main implications of how LLMs work, and what inherent limitations this entails. We’ll touch upon four main types of issues.

First, we will look at reasoning skills, and argue why LLMs lack the capability to perform open-ended reasoning to arbitrary levels of complexity. Next, we’ll discuss the many issues that arise from the limited context size, including limitations for long-term and precise recall. Then, we’ll talk about biases and misrepresentation issues arising from both the training data and the way these models are fine-tuned. Finally, we’ll look at the most insidious problem in statistical language modeling, hallucinations, and argue why eliminating them completely might require a new paradigm altogether.

But before that deep dive, let’s go back to how LLMs learn and contrast it with how humans learn. This comparison will give us some insights into the reasons for the limitations of LLMs.

How do LLMs learn?

Regarding the differences between how large language models and humans learn language, a few key points stand out. First, the scale is vastly different—large models require billions of training examples to generate grammatically correct sentences, while human children need far fewer. This has sparked debate in linguistics about nature versus nurture in language acquisition. Is there a pre-programmed structure in the brain for learning grammar, or is it all learned during childhood? Although I’m not an expert in linguistics, we know that machine learning models with built-in biases for specific problems can learn more easily and with less data. It makes sense that there might be similar architectural designs in the human brain.

However, we also know that backpropagation, the learning algorithm used in large language models, is biologically impossible. The neural networks in our brains function differently than those in artificial models. This fundamental difference in learning algorithms could impact how much these models truly understand.

Another important distinction is symbolic grounding. When humans learn language, they not only learn relationships between words but also connections to real-world objects. Current LLMs lack this grounding, which could affect their understanding of language.

Symbols, like the sound of cats or the word “cat” in a text, connect to real-life anchors, such as an actual cat. Grounding is crucial in language acquisition because children first learn concepts they experience daily, like mom, dad, food, and home. Only later do they make connections with abstract ideas like ethical values or metacognition, that can’t be directly tied to experiences.

So, humans learn language while interacting with the real world. Language is grounded in physical experiences and other sensory inputs like images, touch, sound, and taste. This contrasts with large language models that only learn correlations between words without grounding their meaning in experience.

For example, when you say “the apple is red” to a large language model, it recognizes this as a likely true sentence due to context. However, for a human, the same phrase connects the abstract symbols with experiences of what an apple looks like, tastes like, and feels like. This shows that language models reason differently than humans when it comes to real-world concepts.

One could argue that humans can reason about purely symbolic things, like math. Even though numbers might be grounded in physical notions of quantity, abstract fields of math involve humans reasoning by manipulating symbols and learning correlations between them. In this sense, there may be a case for language models to be able to reason similarly in certain contexts.

In summary, reasoning in LLMs, in its current limited form, involves understanding how words, sentences, and contexts relate and appear together. This provides a certain level of linguistic comprehension. However, this level of understanding differs greatly from that of humans or even animals with primitive intelligence.

Context

Note

This section is under construction.

Biases

Note

This section is under construction.

Hallucinations

The term “hallucination” in the context of AI refers to the phenomenon where a large language model (LLM) or other generative AI system produces outputs that appear plausible and coherent, but do not accurately reflect reality or the intended purpose of the system. These hallucinations manifest as the generation of false, inaccurate, or nonsensical information the AI system presents with confidence, as if it were factual.

First, a caveat. Unlike human hallucinations, which involve perceiving things that are not real, AI hallucinations are associated with the model producing unjustified responses or beliefs, rather than perceptual experiences. The name “hallucination” is therefore imperfect, and it often leads to mistakes as people tend to antropomorphize these models and make erroneous assumptions about how they work and the causes of these failures.

However, we will stick to this name in this article because it is the prevalent nomenclature used everywhere people talk about AI. Just keep in mind we’re talking about something completely different to what the term “hallucination” means in general.

Real-World Implications of AI Hallucinations

The occurrence of hallucinations in AI systems, and particularly in large language models (LLMs), can have significant consequences, especially in high-stakes applications such as healthcare, finance, or public safety. For example, a healthcare AI model that incorrectly identifies a malignant skin lession as benign can doom a patient. On the other hand, identifying a benign skin lesion as malignant could lead to unnecessary medical interventions, also causing harm to the patient. Similarly, in the financial sector, hallucinated outputs from an AI system could result in poor investment decisions with potentially devastating economic impacts.

However, even in low-stakes applications, the insidious nature of hallucinations make then a fundamental barrier to the widespread adoption of AI. For example, imagine you’re using an LLM to generate summaries from audio transcripts of a meeting, extracting relevant talking points and actionable items. If the model tends to hallucinate once in a while, either failing to extract one key item, or worse, producing an spurious item, it will be virtually impossible for anyone to detect that without manually revising the transcript, thus rendering the whole application of AI in this domain useless.

For this reason, one of the key challenges in addressing the real-world implications of language model hallucinations is the difficulty in effectively communicating the limitations of these systems to end-users. LLMs are trained to produce fluent, coherent outputs that can appear plausible, even when they are factually incorrect. If the end-users of an AI system are not sufficiently informed to review the output of the system with a critical eye, they may never spot any hallucinations. This leads to a chain of mistakes as the errors from the AI system propagate upstream through the layers of decision makers in an organization. Ultimately, you could be making a very bad decision that seems entirely plausible given all the available information because the source of the error –an AI hallucination– is impossible to detect.

Thus, the development and deployment of LLMs with hallucination capabilities raises important ethical considerations. There is a need for responsible AI development practices that prioritize transparency, accountability, and the mitigation of potential harms. This includes establishing clear guidelines for testing and validating LLMs before real-world use, as well as implementing robust monitoring and oversight mechanisms to identify and address hallucinations as they arise.

Crucially, there are absolutely zero generative AI systems today that can guarantee they don’t hallucinate. This tech is simply unreliable in fundamental ways, so every actor in this domain, from developers to users, must be aware there will be hallucinations in your system, and you must have guardrails in place to deal with the output of unreliable AIs. And this is so perverse because we are used to software just working. Whenever software doesn’t do what it should, that’s a bug. But hallucinations are not a bug of AI, at least in the current paradigm. As we will see in the next section, they are an inherent feature of the way generative models work.

Why Hallucinations Happen?

There are many superficial reasons for hallucinations, from data and modelling problems, to issues with prompting. However, the underlying cause of all hallucinations, at least in large language models, is that the current language modeling paradigm used in these systems is, by design, a hallucination machine. Let’s unpack that.

As we saw in ?sec-capabilities, generative AI models, including LLMs, rely on capturing statistical patterns in their training data to generate outputs. Rather than storing explicit factual claims, LLMs implicitly encode information as statistical correlations between words and phrases. This means the models do not have a clear, well-defined understanding of what is true or false, they can just generate plausibly sounding text.

The reason this mostly works, is because generating plausibly sounding text has a high probabilty of reproducing something that is true, provided you trained on mostly truthful data. But large language models (LLMs) are trained on vast corpora of text data from the internet, which contains inaccuracies, biases, and even fabricated information. So these models have “seen” many true sentences and thus picked up correlations between words that tend to generate true sentences, but they’ve also seen many variants of the same sentences which are slightly or even entirely wrong.

So one of the primary reasons for the occurrence of hallucinations is the lack of grounding in authoritative knowledge sources. Without a strong foundation in verified, factual knowledge, the models struggle to distinguish truth from falsehood, leading to the generation of hallucinated outputs. But this is far from the only problem. Even if you only train on factual information—assuming there would be enough of such high-quality data to begin with—the statistical nature of language models make them susceptible to hallucinate.

Suppose your model has only seen truthful sentences, and learned the correlations between words in these sentences. Imagine there are two very similar sentences, both factually true, that differ in just a couple of words –maybe a date and a name, for example “Person A was born in year X” and “Person B was born in year Y”. Given the way these models work, the probability of generating a mixed-up sentence like “Person B was born in year X” is only slightly smaller than generating either of the original sentences.

What’s going on here is that the statistical model implicitely assumes that small changes in the input (the sequence of words) lead to small changes in the output (the probability of generating a sentence). In more technical terms, the statistical model assumes a smooth distribution, which is necessary because the amount of data the model needs to encode is orders of magnitude bigger than the memory (i.e., number of parameters) in the model. Thus, the models has to compress the training corpus, and compression implies loosing some of the information.

In other words, statistical language models inherently assume that sentences very similar to what they have seen in the training data are also plausible sentences. They encode a smooth representation of language, and that’s fine, as long as you don’t equate plausible with factual. See, these models weren’t designed with factuality in mind. They were originally designed for tasks like translation, where plausibility and coherence are all that matters. It’s only when you turn them into answering machines that you run into a problem.

The problem is there is nothing smooth about facts. A sentence is eithre factual or not, there are no degrees of truthfulness —for the most part; let’s not get dragged into epistemological discussions here. But LLMs cannot, by design, define a strict frontier between true and false sentences. All the frontiers are fuzzy, so there is no clear cutoff point where you can say, if a sentence has less than X value of perplexity then it is false. And even if you could define such a threshold, it would different for all sentences.

You may ask why can’t we avoid using this “smooth” representation altogether. The reason is that you want to generate sentences that are not in the training set. This means you need to somehow guess that some sentences you have never seen are also plausible, and guessing means you have to make some assumptions. The smooth hypothesis is very reasonable —and computationally convenient, as these models are trained with gradient descent, which requieres smoothness in the loss function— again, as long as you don’t care about factuality. If you don’t compress the training data in this smooth, lossy way, you will simply I can’t wait for you to start training your own chatbot and building exciting applications with LLMs!not be able to generate novel sentences at all.

In summary, this is the underlying reason why the current paradigm of generative AI will always hallucinate, no matter how good is your data and how ellaborated are your training procedures or guardrails. The statistical language modeling paradigm, at its core, is a hallucination machine. It is concocting plausibly-sounding sentences by mixing and matching words that is has seen together in similar contexts in the training set. It has no inherent notion of whether a given sentence is true or false. All it can tell is that it looks like sentences that appear in the training set.

Now, a silver-lining could be this idea that even if some false sentences will unavoidably be generated, we can train the system to minimize their ocurrence by showing it lots and lots of high quality data. That is, can we push the probability of a hallucination to a sufficiently low value that, in practice, almost never happens? Recent research suggests that if there is a sentence that can be generated at all, no matter how low its base probability, then there is a prompt that will generate it with almost 100% certainty. This means that if we introduce malicious actors into our equation, we can never be sure our system can’t be jailbroken.

Mitigating Hallucinations in AI

So far we’ve argued that hallucinations are inherently impossible to eliminate completely. But this doesn’t mean we can’t do anything about it in practice. I want to end this article with a short summary of mitigation approaches that are being used today by researchers and developers.

One key strategy is to incorporate external knowledge bases and fact-checking systems into the AI models. By grounding the models in authoritative, verified information sources, the risk of generating fabricated or inaccurate outputs can be reduced.

Researchers are also exploring ways to develop more robust model architectures and training paradigms that are less susceptible to hallucinations. This may involve techniques like increasing model complexity, incorporating explicit reasoning capabilities, or using specialized training data and loss functions.

Enhancing the transparency and interpretability of AI models is also crucial for addressing hallucinations. By making the models’ decision-making processes more transparent, it becomes easier to identify and rectify the underlying causes of hallucinations.

Alongside these technical approaches, the development of standardized benchmarks and test sets for hallucination assessment is crucial. This will enable researchers and developers to quantify the prevalence and severity of hallucinations, as well as compare the performance of different models in this regard. Thus, if you can’t completely eliminate the problem, at least you can quantify it and make informed decisions about where and when it is safe enough to deploy a generative model.

Finally, addressing the challenge of hallucinations in AI requires an interdisciplinary approach, involving collaboration between AI researchers, domain experts, and authorities in fields like scientific reasoning, legal argumentation, and other relevant disciplines. By fostering cross-disciplinary knowledge sharing and research, the understanding and mitigation of hallucinations can be further advanced.

The Illusion of Reasoning

Large language models often seem to be able to reason effectively. They can generate coherent and contextually relevant responses that resemble human reasoning. However, this apparent capability can be misleading.

LLMs frequently make mistakes when faced with complex problems requiring extensive reasoning chains. Their responses may seem logical initially, but they often lack the depth and accuracy for sound reasoning. This is particularly evident in tasks that involve multiple steps or intricate logical deductions, where the model may falter and produce incorrect conclusions.

This limitation is most evident in what I call first generation LLMs, which is basically anything before OpenAI’s o1 model. Second generation LLMs are a different deal altogether. These models are explicitly designed for open-ended, complicated reasoning tasks, and they certainly surpass previous models significantly. However, I will argue that these newer models, although a significant leap forward, are still fundamentally incapable of a very concrete form of logical reasoning, just as all previous models.

Now, I know I will get a lot of flak, especially from some of the most ardient believers in the immediacy of AGI. To be clear, I’m not claiming AGI is impossible. On the contrary, I’m firmly on the computationalist side here, and I do believe machines are capable of achieving the same and superior levels of intelligence as any living, organic being. And I’m bullish on LLMs playing a major role in the AGI breakthrough. However, I think there are solid arguments to claim the current paradigm of ducktaping language models with external tools is fundamentally insufficient.

I am one of the many researchers working on better ways to make LLMs interact with other computational systems to enhance their capabilities. So, in adition to discussing the limitations of the language modelling paradigm, I will also highlight existing strategies to enhance LLM reasoning, including conventional prompt engineering techniques that can be applied to first-generation models, and newer techniques that probably underpin the enhanced capabilities we’re are seeing in second-generation models.

And finally, I will also attempt to address some of the most poignant counter arguments I’ve seen (and received myself) in discussions around the limitations of LLMs for reasoning. My hope is that, when you finish reading this article, we can have an informed and civil conversation, and, even if we disagree on the conclusions, we both come out better equiped to understand and deal with the upcoming wave of LLM applications.

What do we mean by reasoning (in AI)?

But before we begin, let me address one specific quirk of this discussion. “Reasoning” is a very loaded term, that means a lot of things for different people—even for different groups people who should agree on these definitions. So I will attempt to define precisely what I mean by the phrase “LLMs cannot reason”.

When AI folks claim LLMs cannot reason, they are not talking about any abstract, philosophical sense of the word “reason”, nor any of the many psychological and sociological nuances it may entail. No, they have a very specific, quantifiable, simplified notion of reasoning that comes straight out of math.

Reasoning is, simply put, the capacity to draw logically sound conclusions from a given premise. In math, there are two main reasoning types or modes: deduction and induction. Induction is somewhat problematic because it involves generalizing claims from specific instances, and thus, it requires some pretty strong assumptions. In contrast, deduction is very straightforward. It is about applying a finite set of logical inference rules to obtain new provably true claims from existing true claims. It is the type of reasoning that mathematicians do all day long when proving new theorems.

Thus, in the remaining of this article, when I say “LLMs cannot reason”, I’m simply saying there are—sometimes deceptible simple—deduction problems they inherently cannot solve. It is not a value judgement, or an opinion based on experience. It is a straightforward claim provable from the definition of deductive reasoning and the inherent limitations of LLMs given their architecture and functionality.

Why LLMs can’t reason

One significant limitation of language models regarding reasoning is their stochastic nature. These models generate outputs based on probabilistic predictions rather than deterministic logical rules. This means that even a well-structured prompt can yield different responses on different occasions due to the randomness in their decision-making process.

Consequently, an LLM might arrive at a wrong conclusion purely by chance, leading to inconsistencies in reasoning. For example, when asked to solve a mathematical problem or make a logical inference, the model’s response may vary significantly depending on the random seed used during generation, undermining trust in its reasoning capabilities.

Granted, you may set the temperature to zero effectively forcing the model to fix the output for a given input. But this output is still probabilistic, you’re just sampling the most likely continuation. The fact that the mapping between input and output hinges on a probabilistic distribution that encodes correlations between elements in the input and corresponding elements in the output is already suspicious. It would be very weird, although not impossible, that we just happened to converge on the right probability distribution that produces the correct output for every input, in terms of logical deduction rules.

However, this limitation is still not definitive. But it gets worse.

By design, large language models spend a fixed amount of computation per token processed. This means the amount of computation an LLM does before it produces the first output token is a function of just two numbers: the input size and the model size. So, if you ask an LLM to produce a yes or no question for a logical puzzle, all the “thinking” the model can do is some fixed—albeit huge—number of matrix multiplications that only depend on the input size. See where I’m going here?

Now, consider that you have two different logical puzzles with the same input size, i.e., the same number of tokens. But one is an easy puzzle that can be solved with a short chain of deduction steps, while the other requires a much higher number of steps. Here is the kicker: any LLM will spend exactly the same amount of computation in both problems. This can’t be right, can it?

A basic result in computational complexity theory is that some problems with very small inputs seem to require an exponentially high computational cost to be solved correctly. These are NP-complete problems, and most computer scientists believe there are no efficient algorithms to solve them. Crucially, a huge number of reasoning problems fall in this category, including the most basic logical puzzle of all—determining if a given logical formula can be satisfied.

When faced with an instance of an NP-complete problem, an LLM will produce an answer after a fixed amount of computation defined solely by the input size. Now, by sheer size, some larger models might just spend enough computation to cover many smaller instances of NP-complete problems. As it happens, a huge constant function can be larger than an exponential function for smaller inputs. But crucially, we can always find instances of NP-complete problems that require, even in principle, a sufficiently large amount of computation to surpass the computational capacity of any LLM, no matter how big.

But this means something even more profound. Ultimately, LLMs are not Turing-complete systems but essentially very large finite automata. While they can handle a wide range of tasks and produce outputs that appear sophisticated, their underlying architecture limits the types of problems they can solve.

Turing completeness is the ability of a computational system to perform any computation given sufficient time and resources. Modern computers and many seemingly simple systems, such as cellular automata, are Turing complete systems. But LLMs are not, ironically.

The reason is simple. We know from computability theory that any Turing complete system must be able to loop indefinitely. There are some problems—some reasoning tasks—where the only possible solution is to compute, and compute, and compute until some condition holds, and the amount of computation required cannot be known in advance. You need potentially unbounded computation to be Turing complete.

And this is the final nail in the coffin. LLMs, by definition, are computationally bounded. No matter their size, there will always be problem instances—which we may not be able to identify beforehand—that require more computation than is available in the huge chain of matrix multiplications inside the LLM.

Thus, when LLMs seem to tackle complex reasoning problems, they often solve specific instances of those problems rather than demonstrating general problem-solving capabilities. This might just be enough for practical purposes—we may never need to tackle the larger instances—but, in principle, LLMs are incapable of truly open-ended computation, which means they are incapable of true reasoning.

Case closed, right?

Counterarguments

But wait!

Argument 1: Humans Also Have these Limitations

The most common criticism I received against the assertion that LLMs cannot reason is that, sure, LLMs cannot reason, but neither can humans, right? I mean, humans can be stupendously irrational. But this argument is flawed on many levels, so let’s unpack it.

First, while it is true that humans can make errors in reasoning, the human brain definitely possesses the capacity for open-ended reasoning, as evidenced by the more than 2000 years of solid math we have collectively built. Moreover, all college students—at least in quant fields—at some point have to solve structured problem-solving exercises that require them to apply logical reasoning to arrive at correct conclusions, such as proving theorems. So, while humans can be pretty stupid at times, we are certainly capable of the most rigorous reasoning when trained to do so.

But even more importantly, this assertion is a red herring. Why the fact humans can’t do something immediately makes it ok for a piece of technology to suck at it? Imagine we did this with all our other tech. Sure, that airplane fell down and killed 300 people, but humans can’t fly, so there’s that. Or yes, that submarine imploded, but humans can’t breathe underwater. Or that nuclear power plant melted, but humans can’t stand 3000 degrees of heat, so what’s the big deal?

No, we don’t do that. We compare any new piece of technology with our current best solution, and only if the new thing improves upon the old—at least on some metrics—do we consider it worthwhile.

Granted, we often compare AI capabilities to human capabilities, but this is only because humans are the gold standard for the types of problems we often want AI systems to solve. So we compare LLM’s capacity to generate creative stories with our best writers, and we compare LLMs’ capacity for open-ended dialogue or for emphatic customer assistance with humans because there is nothing out there better than humans at these tasks.

However, there are well-established systems—such as traditional SAT solvers—that excel in structured logical deduction and reasoning tasks. These systems are designed with rigorous validation mechanisms that ensure correctness and reliability in their outputs. They are basically flawless and incredibly fast. So, instead of comparing LLMs to humans in deductive reasoning, let’s compare them with the best solution we currently have for this problem. And there, LLMs definitely suck.

Argument 2: Randomness is a Feature, Not a Bug

The second most common criticism I received was regarding the stochastic nature of language models. To recap, I claim that since LLMs generate tokens in a probabilistic fashion—which is a fundamental feature of the paradigm—, their output is inherently unreliable when you require absolute accuracy instead of versatility.

A lot of people correctly argued that, in fact, randomness is essential in problem-solving and a crucial feature of many of the same SAT solvers against I pretend to compare LLMs. How hypocritical of me, they claim, to posit randomness as a limitation when the most effective deductive reasoning algorithms we have are essentially random. And this is true, but only partially, and it makes all the difference. So let me explain.

Randomness plays a vital role in many computational problem-solving techniques, particularly in search algorithms for hard (read NP-complete or NP-hard) problems. Modern SAT solvers, for example, often employ randomized search strategies to efficiently explore vast solution spaces. By introducing randomness into the search process, these solvers can escape local optima and discover satisfactory solutions more quickly than deterministic methods might allow. This ability to leverage randomness is a powerful tool in the arsenal of computational techniques, enabling systems to tackle complex problems that would otherwise be intractable.

However—and here comes the crucial difference—using randomness in the search process does not imply that the entire reasoning process is inherently unreliable. Randomness is confined to the search phase of problem-solving, where it helps identify potential solutions—potential reasoning paths. However, once a candidate solution is found, a deterministic validation phase kicks in that rigorously checks the correctness of the proposed reasoning path.

The distinction between the search and validation phases is paramount in understanding how randomness contributes to effective problem-solving in general. During the search phase, algorithms may employ random sampling or other stochastic methods to explore possibilities and generate potential solutions. This phase allows for flexibility and adaptability, enabling systems to navigate complex landscapes of potential answers.

However, once a potential solution has been identified, it must undergo a validation process that is grounded in deterministic logic. This validation phase involves applying established rules and principles to confirm that the proposed solution meets all necessary criteria for correctness. As a result, any solution that passes this validation step can be confidently accepted as valid, regardless of how it was generated in the first place.

You can have millions of monkeys typing in a typewriter, and at some point, one of them will randomly produce Romeo and Juliet, but only Shakespeare can filter the garbage from the gold and decide which pamphlet to publish.

That silly metaphor means that randomness is good for exploring hypotheses but not for deciding which one to accept. For that, you need a deterministic, provably correct method that doesn’t rely on probabilities—at least if you want to solve the problem exactly.

However, in stark contrast to traditional problem-solving systems like SAT solvers, LLMs lack a robust validation mechanism. While they can generate coherent and contextually relevant responses based on probabilistic reasoning, some of which may be correct reasoning chains, they do not possess a reliable method for verifying the accuracy of those outputs. The verification process is also stochastic and subject to hallucinations, rendering it utterly unreliable.

So, since LLMs evaluate their own outputs using the same probabilistic reasoning they employ for generating them in the first place, there is an unavoidable risk that incorrect conclusions will be propagated as valid responses. The monkeys are the also the editors.

Argument 3: LLMs Can Loop Indefinitely

The final argument I want to address is the notion that LLMs can be made Turing-complete by duct-taping them with some Turing-complete gadget. Here’s a brief recap of what this means.

LLMs have a fixed computational budget—a fixed number of matrix multiplications they perform per input token. This means there are problems that are inherently outside the realm of what they can solve. These problems fall into two categories.

First, NP-Complete problems—such as the very straightforward problem of determining whether a logical formula is valid—are a class of decision problems for which no known polynomial-time solutions exist. Moreover, most experts believe no such algorithm can exist. Thus, these problems probably require an exponential amount of computation for sufficiently large instances. Thus, given the fixed computational budget of LLMs, no matter how big your stochastic parrot, there will always be a logical formula that is simply to large for it to solve.

On the other hand, we have semi-decidable problems, those for which an algorithm can confirm a solution if one exists but may run indefinitely if no solution is found. For these problems, we simply have no option but to keep searching for a potentially unbounded amount of time. And since LLMs are computationally bounded, there are solvable problem instances that simply would require more computing steps than the LLM can produce.

Now, all of the above is clear to anyone who even superficially understands how LLMs work. However, a common argument posited by critics is that LLMs can be rendered Turing complete by integrating them with external tools, such as code generators or general-purpose inference engines, or even easier, let’s wrap it in a recursive procedure that can simply call the LLM as many times as necessary.

And this is true. You can trivially make an LLM Turing-complete, in principle, by duct-taping it with something that is already Turing-complete. You can also build a flame thrower with a bamboo stick, some duct tape, and a fully working flame thrower.

However, simply making LLMs Turing complete in principle does not guarantee that they will produce correct or reliable outputs. The integration of external tools introduces complexity and potential points of failure, particularly if the LLM does not effectively manage interactions with these tools.

The problem is, when you combine stochastic output—prone to hallucinations—with external tools that require precise inputs, you get LLMs that, in principle, have access to all the resources they may need but are incapable of using them reliably.

When relying on external systems for reasoning tasks—for example, having your LLM call a SAT solver when necessary—it is crucial that LLMs can consistently identify the appropriate tool to use and provide it with the correct arguments. However, due to their probabilistic nature and susceptibility to hallucinations, LLMs struggle to do so reliably. And even if they successfully invoke an external tool, there is no guarantee that they will interpret or apply the tool’s output correctly in their reasoning process.

So, Turing-incompleteness or bounded computation may not be a knockout argument on its own, but when combined with the other inherent limitations of LLMs—crucially, their unreliability—it is clear there are no guarantees even the most advanced models won’t fail to solve some reasoning task.

And here is the final kicker: approximate reasoning is not good enough. If the LLM fails one out of every million times to produce the right deduction, that still means the LLM cannot reason. For all practical purposes, you may be happy with a model that gets it right 9 out of 10 or 99 out of 100, but in mission-critical tasks, nothing short of sound, reliable reasoning is good enough.

And that’s the claim: LLMs are incapable, by design, of sound reasoning.

Improving LLM reasoning skills

However, we need not throw the hat here. Researchers and practitioners have explored several innovative strategies, including Chain of Thought prompting, self-critique mechanisms, and integrating external tools to improve the reasoning skills of large language models.

CoT prompting encourages LLMs to articulate their thought processes, allowing them to break complex problems into manageable steps and improve their accuracy in reasoning tasks. On the other hand, self-critique aims to refine outputs through an internal evaluation process, yet it has shown mixed effectiveness in meaningfully correcting errors. Additionally, incorporating external tools such as reasoning engines and code generation systems can significantly augment the LLMs’ capabilities by providing structured logic and formal verification.

However, each approach has its own set of challenges, and their potential and limitations in fostering true reasoning abilities within LLMs need to be carefully examined.

Inference-time approaches

Chain of Thought

Chain-of-thought prompting has emerged as a promising technique for enhancing the reasoning capabilities of large language models. By guiding models to articulate intermediate reasoning steps before arriving at a final answer, CoT prompting helps decompose complex problems into manageable parts. This method has improved performance across various reasoning tasks, such as arithmetic and commonsense reasoning.

CoT prompting instructs LLMs to break down complex problems into simpler, sequential steps and then tackle each step independently. This structured approach enables the model to tackle each component individually, improving response accuracy and precision. Studies have shown that this technique can significantly boost performance on reasoning tasks, particularly when the model has a sufficient number of parameters (around 100 billion) to use the benefits of CoT prompting effectively.

By encouraging models to articulate their thought processes, CoT prompting taps into the extensive pool of knowledge that LLMs acquire during training. This mechanism helps models apply relevant information more effectively, addressing their inherent difficulties with logical reasoning and problem-solving.

Additionally, CoT makes the LLM “think harder” in the sense it forces the model to produce what we can consider “internal thought” tokens. Thus, we may view it as a way to produce additional computation on the input before deciding on the response.

However, despite its advantages, CoT prompting remains insufficient for several reasons.

The effectiveness of CoT prompting is highly contingent on the quality and diversity of the prompts used. If the examples provided are not representative or sufficiently varied, the model may struggle to generate coherent reasoning chains, leading to suboptimal performance. This reliance on effective prompt engineering can limit the technique’s scalability and generalizability.

And again, the stochastic nature of LLMs means that even with CoT prompting, outputs can vary significantly across different runs due to randomness in generation processes. This variability can lead to inconsistent reasoning outcomes, undermining the reliability of the model’s responses.

Ultimately, CoT extends the computation budget by a finite amount. Unless we try some cyclic scheme where the LLM is prompted to continue thinking, potentially indefinitely, until satisfied, their fundamental limitation on Turing incompleteness remains.

Self-evaluation

Another intuitive approach to improving reasoning is self-critique, which involves evaluating and refining an LLM’s responses with the same model, using prompts that instruct the model to read its previous output, highlight potential errors, and try to correct them. A form of after-the-fact chain-of-thought, if you might. However, recent research has highlighted significant limitations in the effectiveness of this self-critique capability.

While LLMs can generate multiple ideas and attempt to critique their initial outputs, studies indicate that they cannot often meaningfully self-correct. The assumption that verification of correctness should be easier than generation—a fundamental idea in computational complexity theory—does not seem to hold true, in general, for LLMs. This is particularly problematic in reasoning tasks where the model struggles to adequately assess its outputs’ accuracy. For example, if an LLM generates a flawed answer, its attempt to critique and revise it can lead to further errors rather than improvements.

Research shows that self-correction techniques in LLMs are heavily contingent on the availability of external feedback. In many cases, LLMs perform better when they have access to an external verifier or additional context rather than relying solely on their internal reasoning capabilities. For example, when solving complex problems, such as graph coloring or planning tasks, LLMs often fail to produce reliable solutions without external guidance.

Interestingly, attempts at self-critique can sometimes degrade performance rather than enhance it. Studies have shown that when LLMs engage in self-critique without external validation, they may generate false positives or incorrect conclusions. If you push harder, you can easily fall into a cycle of self-reinforcement of invalid or erroneous arguments, making the LLM increasingly more certain despite it getting worse and worse.

Function calling

Integrating external tools, such as reasoning engines or code generation systems, into large language models represents a promising—and, for me, the only really viable—approach to enhancing their reasoning capabilities.

Connecting LLMs to external reasoning engines or logical inference tools makes it possible to augment their reasoning capabilities significantly. These tools can handle complex logical deductions, mathematical computations, or even domain-specific knowledge that the LLM might not possess inherently. This integration allows for more accurate and reliable outputs, as the external tools can apply formal logic and structured reasoning that LLMs typically struggle with.

Similarly, external code generation systems enable LLMs to produce executable code for specific tasks. This capability can streamline software development processes and improve efficiency in generating functional code snippets. The external systems can provide rigorous checks and balances that help ensure the correctness of the generated code.

By leveraging these external resources, LLMs can potentially overcome some of their inherent limitations in logical reasoning and problem-solving. For starters, an external inference engine will be Turing-complete, so we scratch that problem down, right?

Not so fast. Unfortunately, this approach has many challenges, particularly regarding the LLM’s ability to generate the correct input for function calls or code execution. It all circles back to the original sin of LLMs: stochastic output.

First, the effectiveness of function calling or code generation hinges on the model’s ability to accurately interpret a task and generate appropriate inputs. If the model misinterprets the requirements or generates vague or incorrect prompts, the external tool may produce erroneous outputs or fail to execute altogether. This reliance introduces a potential failure point where the model’s limitations in understanding context and intent become apparent.

Many reasoning tasks require a nuanced understanding of logic and context that may exceed the capabilities of language models. For instance, when generating inputs for a logical inference engine, the model must understand the problem and articulate it in a way that aligns with the system’s requirements. If the model fails to capture these nuances, it may lead to incorrect deductions or ineffective reasoning processes.

Translating text into code or structured queries makes it more complex and can undermine reasoning capabilities. This conversion requires programming syntax and logic knowledge that may not be intuitive for an LLM trained primarily in natural language data. Mistakes in this translation can spread to the external system, causing more errors.

While external tools can, in principle, improve the reasoning capabilities of an LLM by providing structured logic and formal verification, they cannot compensate for LLMs’ basic limitations in generating precise inputs. Therefore, there is no formal guarantee that the outputs from this integration will be logically sound or appropriate for the context, simply because of the age-old adage: garbage in, garbage out.

Training-time approaches

Conclusions

The purpose of this article is to convince you of two claims:

  1. Large Language Models currently lack the capability to perform a well-defined form of reasoning that is essential for many decision-making processes.
  2. We currently have absolutely no idea how to solve this in the near future.

This matters because there is a growing trend to promote LLMs as general-purpose reasoning engines. As more users begin to rely on LLMs for important decisions, the implications of their limitations become increasingly significant. At some point, someone will trust an LLM with a life-and-death decision, with catastrophic consequences.

More importantly, the primary challenges in making LLMs trustworthy for reasoning are immense. Despite ongoing research and experimentation, we have yet to discover solutions that effectively bridge the gap between LLM capabilities and the rigorous standards required for reliable reasoning. Currently, our best efforts in this area are nothing but duct tape—temporary fixes that do not address the underlying limitations of the stochastic language modeling paradigm.

Now, I want to stress that these limitations do not diminish the many other applications where LLMs excel as stochastic language generators. In creative writing, question answering, user assistance, translation, summarization, automatic documentation, and even coding, many of the limitations we have discussed here are actually features.

The thing is, this is what language models were designed for—to generate plausible, human-like, varied, not-necessarily-super-accurate language. The whole paradigm of stochastic language modeling is optimized for this task, and it excels at it. It is much better than anything else we’ve ever designed. But when we ask LLMs to step outside that range of tasks, they become brittle, unreliable, and, worse, opaquely so.

The emergence of models like OpenAI’s o1, which boasts impressive reasoning abilities, may seem like a significant step forward. However, this approach does not represent a fundamentally new paradigm in logical reasoning with LLMs. Deep down, this is “just” a way to explicitly incorporate chain of thought prompting in a fine-tuning phase and teach the model via reinforcement learning to select mostly coherent paths of deduction.

Thus, while definitely an impressive technical and engineering feat, o1 (terrible name) —and any future models based on the same paradigm— will continue to share the same core limitations inherent to all LLMs, only mitigated using some clever tricks. While they may excel in certain contexts, caution must be exercised in interpreting their outputs as definitive reasoning.

If LLMs are to fulfill even some of our highly unrealistic expectations for them, we must prioritize solving the challenge of provably correct reasoning. Until then, all we have is a stochastic parrot—a fun toy with some interesting use cases but not a truly transformative technology.