Large Language Models Cannot Reason

Large language models often seem to be able to reason effectively. They can generate coherent and contextually relevant responses that resemble human reasoning. However, this apparent capability can be misleading.

LLMs frequently make mistakes when faced with complex problems requiring extensive reasoning chains. Their responses may seem logical at first, but they often lack the depth and accuracy required for sound reasoning. This is particularly evident in tasks that involve multiple steps or intricate logical deductions, where the model may falter and produce incorrect conclusions.

This article explores the fundamental limitations of large language models (LLMs) in reasoning — highlighting the gap between their sophisticated outputs and their inability to perform reliable logical deduction. By examining the stochastic nature of these models, their fixed per-token computational budget, and their lack of Turing completeness, we will uncover the reasons behind their failures in complex reasoning tasks.

Additionally, we will discuss current strategies to enhance LLMs’ reasoning capabilities, including chain of thought prompting and self-critique mechanisms, while critically assessing their effectiveness and underlying challenges. This article aims to foster a deeper understanding of what LLMs can and cannot achieve, urging caution in interpreting their seemingly intelligent responses.

What is reasoning (in AI)?

When we AI folks claim LLMs cannot reason, we are not talking about any abstract, philosophical sense of the word “reason”, nor any of the many psychological and sociological nuances it may entail. No, we have a very specific, quantifiable, simplified notion of reasoning that comes straight out of math.

Reasoning is, simply put, the capacity to draw logically sound conclusions from a given premise. In math, there are two main reasoning types or modes: deduction and induction. Induction is somewhat problematic because it involves generalizing claims from specific instances, and thus, it requires some pretty strong assumptions. In contrast, deduction is very straightforward. It is about applying a finite set of logical inference rules to obtain new provably true claims from existing true claims. It is the type of reasoning that mathematicians do all day long when proving new theorems.

Thus, when I say LLMs cannot reason, I’m simply saying there are—sometimes pretty simple—deduction problems they inherently cannot solve. It is not a value judgement, or an opinion based on experience. It is a straightforward claim provable from the definition of reasoning—understood as deductive reasoning—and the inherent limitations of LLMs given their architecture and functionality.

If this is clear, let’s move on to the counterarguments to this claim.

Why LLMs can’t reason

One significant limitation of language models regarding reasoning is their stochastic nature. These models generate outputs based on probabilistic predictions rather than deterministic logical rules. This means that even a well-structured prompt can yield different responses on different occasions due to the randomness in their decision-making process.

Consequently, an LLM might arrive at a wrong conclusion purely by chance, leading to inconsistencies in reasoning. For example, when asked to solve a mathematical problem or make a logical inference, the model’s response may vary significantly depending on the random seed used during generation, undermining trust in its reasoning capabilities.

Granted, you may set the temperature to zero, effectively forcing the model to fix its output for a given input. But that output is still probabilistic; you are just always picking the most likely continuation. The fact that the mapping between input and output hinges on a probability distribution encoding correlations between elements of the input and corresponding elements of the output is already suspicious. It would be very strange, although not impossible, if we just happened to converge on the right distribution, one that produces the logically correct output for every input.
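
To make this concrete, here is a minimal sketch, with made-up logits and no particular model in mind, of what temperature-zero decoding amounts to: you are still reading off a learned probability distribution, just always taking its argmax.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Toy decoder step: turn a vector of logits into the next token id."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # "Deterministic" decoding is just the argmax of the learned distribution.
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    probs = np.exp((logits - logits.max()) / temperature)  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # stochastic sampling

fake_logits = [2.0, 1.0, 0.2]          # pretend vocabulary of three tokens
print(sample_next_token(fake_logits, temperature=0))    # always token 0
print(sample_next_token(fake_logits, temperature=1.0))  # varies run to run
```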

However, this limitation alone is not definitive. It gets worse.

By design, large language models spend a fixed amount of computation per token processed. This means the amount of computation an LLM performs before it produces the first output token is a function of just two numbers: the input size and the model size. So, if you ask an LLM to answer a yes-or-no question about a logical puzzle, all the “thinking” the model can do is some fixed—albeit huge—number of matrix multiplications that depends only on the input size. See where I’m going here?

Now, consider that you have two different logical puzzles with the same input size, i.e., the same number of tokens. But one is an easy puzzle that can be solved with a short chain of deduction steps, while the other requires a much higher number of steps. Here is the kicker: any LLM will spend exactly the same amount of computation in both problems. This can’t be right, can it?
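
A rough back-of-the-envelope sketch, using the common rule-of-thumb approximation of about two floating-point operations per parameter per token, makes the point: the difficulty of the problem appears nowhere in the formula.

```python
def forward_flops(num_params: int, num_tokens: int) -> int:
    """Rule-of-thumb estimate (~2 FLOPs per parameter per token) of the
    compute a transformer spends before emitting its first output token."""
    return 2 * num_params * num_tokens

model_params = 70_000_000_000      # a hypothetical 70B-parameter model
easy_puzzle_tokens = 200           # solvable in a couple of deduction steps
hard_puzzle_tokens = 200           # needs a long deduction chain, same length

print(forward_flops(model_params, easy_puzzle_tokens))  # 2.8e13 FLOPs
print(forward_flops(model_params, hard_puzzle_tokens))  # exactly the same
```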

A basic result in computational complexity theory is that some problems seem to require computation that grows exponentially with the input size, so even modest inputs become intractable. These are the NP-complete problems, and most computer scientists believe there are no efficient algorithms to solve them. Crucially, a huge number of reasoning problems fall into this category, including the most basic logical puzzle of all: determining whether a given logical formula can be satisfied.

When faced with an instance of an NP-complete problem, an LLM will produce an answer after a fixed amount of computation defined solely by the input size. Now, by sheer size, some larger models might just spend enough computation to cover many smaller instances of NP-complete problems. As it happens, a huge constant function can be larger than an exponential function for smaller inputs. But crucially, we can always find instances of NP-complete problems that require, even in principle, a sufficiently large amount of computation to surpass the computational capacity of any LLM, no matter how big.
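
For a feel of the blow-up, here is a naive satisfiability checker that simply enumerates every assignment. Real solvers are far cleverer, but no known algorithm escapes the exponential worst case.

```python
from itertools import product

def brute_force_sat(clauses, num_vars):
    """Decide satisfiability by trying all 2**num_vars assignments.
    Clauses are lists of ints: 3 means x3 must be true, -3 means x3 must be false."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {v + 1: bits[v] for v in range(num_vars)}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None  # exhausted every assignment: unsatisfiable

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
print(brute_force_sat([[1, -2], [2, 3], [-1, -3]], num_vars=3))
# With 60 variables the same loop would need up to 2**60 iterations.
```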

But this means something even more profound. Ultimately, LLMs are not Turing-complete systems but essentially very large finite automata. While they can handle a wide range of tasks and produce outputs that appear sophisticated, their underlying architecture limits the types of problems they can solve.

Turing completeness is the ability of a computational system to perform any computation, given sufficient time and resources. Modern computers and many seemingly simple systems, such as certain cellular automata, are Turing complete. LLMs, ironically, are not.

The reason is simple. We know from computability theory that any Turing complete system must be able to loop indefinitely. There are some problems—some reasoning tasks—where the only possible solution is to compute, and compute, and compute until some condition holds, and the amount of computation required cannot be known in advance. You need potentially unbounded computation to be Turing complete.
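
A toy illustration of this kind of open-ended computation: the Collatz iteration below always seems to terminate, but nobody knows how to bound the number of iterations in advance, so any system running it must be allowed to keep looping.

```python
def collatz_steps(n: int) -> int:
    """Count iterations until n reaches 1 under the Collatz rule.
    No one knows how to bound this loop in advance for every n; a
    Turing-complete system must be allowed to keep going indefinitely."""
    steps = 0
    while n != 1:                       # potentially unbounded loop
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(collatz_steps(27))   # 111 steps; nearby inputs finish much sooner
```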

And this is the final nail in the coffin. LLMs, by definition, are computationally bounded. No matter their size, there will always be problem instances—which we may not be able to identify beforehand—that require more computation than is available in the huge chain of matrix multiplications inside the LLM.

Thus, when LLMs seem to tackle complex reasoning problems, they often solve specific instances of those problems rather than demonstrating general problem-solving capabilities. This might just be enough for practical purposes—we may never need to tackle the larger instances—but, in principle, LLMs are incapable of truly open-ended computation, which means they are incapable of true reasoning. Case closed.

Counterarguments

Argument 1: Humans Also Have these Limitations

The most common criticism I received against the assertion that LLMs cannot reason is that, sure, LLMs cannot reason, but neither can humans, right? I mean, humans can be stupendously irrational. But this argument is flawed on many levels, so let’s unpack it.

First, while it is true that humans can make errors in reasoning, the human brain definitely possesses the capacity for open-ended reasoning, as evidenced by the more than 2000 years of solid math we have collectively built. Moreover, all college students—at least in quant fields—at some point have to solve structured problem-solving exercises that require them to apply logical reasoning to arrive at correct conclusions, such as proving theorems. So, while humans can be pretty stupid at times, we are certainly capable of the most rigorous reasoning when trained to do so.

But even more importantly, this assertion is a red herring. Why would the fact that humans can’t do something make it OK for a piece of technology to suck at it? Imagine we did this with all our other tech. Sure, that airplane fell out of the sky and killed 300 people, but humans can’t fly, so there’s that. Or yes, that submarine imploded, but humans can’t breathe underwater. Or that nuclear power plant melted down, but humans can’t stand 3000 degrees of heat, so what’s the big deal?

No, we don’t do that. We compare any new piece of technology with our current best solution, and only if the new thing improves upon the old—at least on some metrics—do we consider it worthwhile.

Granted, we often compare AI capabilities to human capabilities, but this is only because humans are the gold standard for the types of problems we often want AI systems to solve. So we compare LLMs’ capacity to generate creative stories with our best writers, and we compare LLMs’ capacity for open-ended dialogue or empathetic customer assistance with humans, because there is nothing out there better than humans at these tasks.

However, there are well-established systems—such as traditional SAT solvers—that excel in structured logical deduction and reasoning tasks. These systems are designed with rigorous validation mechanisms that ensure correctness and reliability in their outputs. They are basically flawless and incredibly fast. So, instead of comparing LLMs to humans in deductive reasoning, let’s compare them with the best solution we currently have for this problem. And there, LLMs definitely suck.
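
As a point of comparison, here is a minimal sketch using the open-source python-sat package (assuming it is installed; any DIMACS-style solver would do) to settle the same kind of formula an LLM might stumble over:

```python
# pip install python-sat
from pysat.solvers import Glucose3

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3), as DIMACS-style integers
solver = Glucose3()
solver.add_clause([1, -2])
solver.add_clause([2, 3])
solver.add_clause([-1, -3])

print(solver.solve())       # True: the formula is satisfiable
print(solver.get_model())   # a concrete satisfying assignment, e.g. [-1, -2, 3]
solver.delete()             # free the underlying C solver
```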

Argument 2: Randomness is a Feature, Not a Bug

The second most common criticism I received concerned the stochastic nature of language models. To recap, I claim that since LLMs generate tokens in a probabilistic fashion—which is a fundamental feature of the paradigm—their output is inherently unreliable when you require absolute accuracy rather than versatility.

A lot of people correctly argued that randomness is, in fact, essential in problem-solving and a crucial feature of many of the same SAT solvers against which I propose to compare LLMs. How hypocritical of me, they claim, to posit randomness as a limitation when the most effective deductive reasoning algorithms we have are essentially randomized. And this is true, but only partially so, and the difference matters. So let me explain.

Randomness plays a vital role in many computational problem-solving techniques, particularly in search algorithms for hard (read NP-complete or NP-hard) problems. Modern SAT solvers, for example, often employ randomized search strategies to efficiently explore vast solution spaces. By introducing randomness into the search process, these solvers can escape local optima and discover satisfactory solutions more quickly than deterministic methods might allow. This ability to leverage randomness is a powerful tool in the arsenal of computational techniques, enabling systems to tackle complex problems that would otherwise be intractable.

However—and here comes the crucial difference—using randomness in the search process does not imply that the entire reasoning process is inherently unreliable. Randomness is confined to the search phase of problem-solving, where it helps identify potential solutions, that is, potential reasoning paths. Once a candidate solution is found, a deterministic validation phase kicks in that rigorously checks the correctness of the proposed reasoning path.

The distinction between the search and validation phases is paramount in understanding how randomness contributes to effective problem-solving in general. During the search phase, algorithms may employ random sampling or other stochastic methods to explore possibilities and generate potential solutions. This phase allows for flexibility and adaptability, enabling systems to navigate complex landscapes of potential answers.

However, once a potential solution has been identified, it must undergo a validation process that is grounded in deterministic logic. This validation phase involves applying established rules and principles to confirm that the proposed solution meets all necessary criteria for correctness. As a result, any solution that passes this validation step can be confidently accepted as valid, regardless of how it was generated in the first place.
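
Here is a deliberately naive sketch of that division of labor: a random search proposes candidate assignments, and a deterministic check decides which ones to accept. Nothing about the quality of the guesses affects the soundness of an accepted answer.

```python
import random

def satisfies(clauses, assignment):
    """Deterministic validation: does the assignment satisfy every clause?"""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def random_search_sat(clauses, num_vars, tries=100_000, seed=0):
    """Stochastic search phase: guess assignments at random. Any candidate
    that survives `satisfies` is correct, no matter how it was generated."""
    rng = random.Random(seed)
    for _ in range(tries):
        candidate = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        if satisfies(clauses, candidate):   # deterministic gatekeeper
            return candidate
    return None                             # gave up; proves nothing either way

print(random_search_sat([[1, -2], [2, 3], [-1, -3]], num_vars=3))
```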

You can have millions of monkeys typing on typewriters, and at some point one of them will randomly produce Romeo and Juliet, but only Shakespeare can filter the garbage from the gold and decide which manuscript to publish.

That silly metaphor means that randomness is good for exploring hypotheses but not for deciding which one to accept. For that, you need a deterministic, provably correct method that doesn’t rely on probabilities—at least if you want to solve the problem exactly.

However, in stark contrast to traditional problem-solving systems like SAT solvers, LLMs lack a robust validation mechanism. While they can generate coherent and contextually relevant responses based on probabilistic reasoning, some of which may be correct reasoning chains, they do not possess a reliable method for verifying the accuracy of those outputs. The verification process is also stochastic and subject to hallucinations, rendering it utterly unreliable.

So, since LLMs evaluate their own outputs using the same probabilistic reasoning they employ to generate them in the first place, there is an unavoidable risk that incorrect conclusions will be propagated as valid responses. The monkeys are also the editors.

Argument 3: LLMs Can Be Turing-Complete

The final argument I want to address is the notion that LLMs can be made Turing-complete by duct-taping them with some Turing-complete gadget. Here’s a brief recap of what this means.

LLMs have a fixed computational budget—a fixed number of matrix multiplications they perform per input token. This means there are problems that are inherently outside the realm of what they can solve. These problems fall into two categories.

First, NP-complete problems—such as the very straightforward problem of determining whether a logical formula is satisfiable—are a class of decision problems for which no known polynomial-time algorithms exist. Moreover, most experts believe no such algorithm can exist, so these problems probably require an exponential amount of computation for sufficiently large instances. Given the fixed computational budget of LLMs, no matter how big your stochastic parrot, there will always be a logical formula that is simply too large for it to solve.

On the other hand, we have semi-decidable problems, those for which an algorithm can confirm a solution if one exists but may run indefinitely if no solution is found. For these problems, we simply have no option but to keep searching for a potentially unbounded amount of time. And since LLMs are computationally bounded, there are solvable problem instances that simply would require more computing steps than the LLM can produce.
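
A tiny example of semi-decidability, using a made-up equation purely for illustration: if a solution exists, the search below eventually confirms it; if none exists, it simply never halts.

```python
from itertools import count

def find_solution(k: int):
    """Search for positive integers x, y with x**2 - 2*y**2 == k.
    If a solution exists, the loop eventually confirms it; if none exists,
    the search never returns. That is semi-decidability in a nutshell."""
    for bound in count(1):                  # unbounded outer loop
        for x in range(1, bound + 1):
            for y in range(1, bound + 1):
                if x * x - 2 * y * y == k:
                    return x, y

print(find_solution(7))   # (3, 1), found quickly; other values of k may never return
```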

Now, all of the above is clear to anyone who even superficially understands how LLMs work. However, a common argument posited by critics is that LLMs can be rendered Turing complete by integrating them with external tools, such as code generators or general-purpose inference engines, or, even more simply, by wrapping them in a recursive procedure that calls the LLM as many times as necessary.

And this is true. You can trivially make an LLM Turing-complete, in principle, by duct-taping it with something that is already Turing-complete. You can also build a flame thrower with a bamboo stick, some duct tape, and a fully working flame thrower.

However, simply making LLMs Turing complete in principle does not guarantee that they will produce correct or reliable outputs. The integration of external tools introduces complexity and potential points of failure, particularly if the LLM does not effectively manage interactions with these tools.

The problem is, when you combine stochastic output—prone to hallucinations—with external tools that require precise inputs, you get LLMs that, in principle, have access to all the resources they may need but are incapable of using them reliably.

When relying on external systems for reasoning tasks—for example, having your LLM call a SAT solver when necessary—it is crucial that LLMs can consistently identify the appropriate tool to use and provide it with the correct arguments. However, due to their probabilistic nature and susceptibility to hallucinations, LLMs struggle to do so reliably. And even if they successfully invoke an external tool, there is no guarantee that they will interpret or apply the tool’s output correctly in their reasoning process.
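
The sketch below, built around a purely hypothetical call_llm placeholder, shows where such a pipeline typically breaks: the trusted solver is only as good as the stochastic translation step that feeds it.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call (illustration only)."""
    raise NotImplementedError

def solve_with_sat_tool(puzzle: str, sat_solver):
    """Sketch of tool use: ask the model to translate a puzzle into CNF clauses,
    then hand them to a trusted solver. The weak link is the translation step,
    which remains stochastic and unverified."""
    raw = call_llm(f"Encode this puzzle as a JSON list of CNF clauses: {puzzle}")
    try:
        clauses = json.loads(raw)            # may be malformed or hallucinated
    except json.JSONDecodeError:
        return None                          # the trusted tool never even runs
    if not all(isinstance(c, list) and all(isinstance(lit, int) for lit in c)
               for c in clauses):
        return None                          # valid JSON, wrong schema
    return sat_solver(clauses)               # sound only if the encoding was faithful
```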

So, Turing incompleteness or bounded computation may not be a knockout argument on its own, but combined with the other inherent limitations of LLMs—crucially, their unreliability—it is clear there is no guarantee that even the most advanced models will not fail on some reasoning task.

And here is the final kicker: approximate reasoning is not good enough. If the LLM fails one out of every million times to produce the right deduction, that still means the LLM cannot reason. For all practical purposes, you may be happy with a model that gets it right 9 out of 10 or 99 out of 100, but in mission-critical tasks, nothing short of sound, reliable reasoning is good enough.

And that’s the claim: LLMs are incapable, by design, of sound reasoning.

Improving LLM reasoning skills

However, we need not throw in the towel just yet. Researchers and practitioners have explored several strategies to improve the reasoning skills of large language models, including chain-of-thought (CoT) prompting, self-critique mechanisms, and integration with external tools.

CoT prompting encourages LLMs to articulate their thought processes, allowing them to break complex problems into manageable steps and improve their accuracy in reasoning tasks. On the other hand, self-critique aims to refine outputs through an internal evaluation process, yet it has shown mixed effectiveness in meaningfully correcting errors. Additionally, incorporating external tools such as reasoning engines and code generation systems can significantly augment the LLMs’ capabilities by providing structured logic and formal verification.

However, each approach has its own set of challenges, and their potential and limitations in fostering true reasoning abilities within LLMs need to be carefully examined.

Chain of Thought

Chain-of-thought prompting has emerged as a promising technique for enhancing the reasoning capabilities of large language models. By guiding models to articulate intermediate reasoning steps before arriving at a final answer, CoT prompting helps decompose complex problems into manageable parts. This method has improved performance across various reasoning tasks, such as arithmetic and commonsense reasoning.

CoT prompting instructs LLMs to break down complex problems into simpler, sequential steps and then address each step independently. This structured approach enables the model to handle each component individually, improving the accuracy and precision of its responses. Studies have shown that the technique can significantly boost performance on reasoning tasks, particularly when the model has enough parameters (on the order of 100 billion) to take advantage of CoT prompting effectively.
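
As an illustration, a typical few-shot CoT prompt might look like the following; the wording and examples are my own, not taken from any particular paper.

```python
# An illustrative few-shot chain-of-thought prompt (wording is made up).
cot_prompt = """\
Q: A farmer has 17 sheep. All but 9 run away. How many are left?
A: Let's think step by step. "All but 9" means 9 sheep did not run away.
   So 9 sheep are left. The answer is 9.

Q: If 5 machines take 5 minutes to make 5 widgets, how long do 100 machines
   take to make 100 widgets?
A: Let's think step by step."""
# The worked example shows the model a reasoning chain; the trailing
# "Let's think step by step." nudges it to produce its own before answering.
```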

By encouraging models to articulate their thought processes, CoT prompting taps into the extensive pool of knowledge that LLMs acquire during training. This mechanism helps models apply relevant information more effectively, addressing their inherent difficulties with logical reasoning and problem-solving.

Additionally, CoT makes the LLM “think harder”, in the sense that it forces the model to produce what we can consider “internal thought” tokens. Thus, we may view it as a way to spend additional computation on the input before deciding on the response.

However, despite its advantages, CoT prompting remains insufficient for several reasons.

The effectiveness of CoT prompting is highly contingent on the quality and diversity of the prompts used. If the examples provided are not representative or sufficiently varied, the model may struggle to generate coherent reasoning chains, leading to suboptimal performance. This reliance on effective prompt engineering can limit the technique’s scalability and generalizability.

And again, the stochastic nature of LLMs means that even with CoT prompting, outputs can vary significantly across different runs due to randomness in generation processes. This variability can lead to inconsistent reasoning outcomes, undermining the reliability of the model’s responses.

Ultimately, CoT extends the computation budget by a finite amount. Unless we adopt some cyclic scheme in which the LLM is prompted to continue thinking, potentially indefinitely, until it is satisfied, the fundamental limitation of Turing incompleteness remains.

Self-critique

Another intuitive approach to improving reasoning is self-critique, which involves evaluating and refining an LLM’s responses with the same model, using prompts that instruct the model to read its previous output, highlight potential errors, and try to correct them. A form of after-the-fact chain of thought, if you will. However, recent research has highlighted significant limitations in the effectiveness of this self-critique capability.

While LLMs can generate multiple ideas and attempt to critique their initial outputs, studies indicate that they often cannot meaningfully self-correct. The assumption that verifying correctness should be easier than generating a solution—a fundamental idea in computational complexity theory—does not seem to hold, in general, for LLMs. This is particularly problematic in reasoning tasks, where the model struggles to adequately assess the accuracy of its own outputs. For example, if an LLM generates a flawed answer, its attempt to critique and revise it can introduce further errors rather than improvements.
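
A minimal sketch of the generate-critique-revise loop, again with a hypothetical call_llm placeholder: every step is sampled from the same stochastic process, so nothing in the loop guarantees convergence to a correct answer.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call (illustration only)."""
    raise NotImplementedError

def self_critique(question: str, rounds: int = 3) -> str:
    """Generate, critique, and revise with the same model. Every step is drawn
    from the same stochastic process, so the critic can be just as wrong as
    the answer it is judging."""
    answer = call_llm(f"Answer the following question:\n{question}")
    for _ in range(rounds):
        critique = call_llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any logical errors in this answer.")
        answer = call_llm(
            f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
            "Write an improved answer.")
    return answer   # no external check ever certifies the final result
```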

Research shows that self-correction techniques in LLMs are heavily contingent on the availability of external feedback. In many cases, LLMs perform better when they have access to an external verifier or additional context rather than relying solely on their internal reasoning capabilities. For example, when solving complex problems, such as graph coloring or planning tasks, LLMs often fail to produce reliable solutions without external guidance.

Interestingly, attempts at self-critique can sometimes degrade performance rather than enhance it. Studies have shown that when LLMs engage in self-critique without external validation, they may generate false positives or incorrect conclusions. If you push harder, you can easily fall into a cycle of self-reinforcement of invalid or erroneous arguments, making the LLM increasingly certain even as its answers get worse.

External tools

Integrating external tools, such as reasoning engines or code generation systems, into large language models represents a promising—and, for me, the only really viable—approach to enhancing their reasoning capabilities.

Connecting LLMs to external reasoning engines or logical inference tools makes it possible to augment their reasoning capabilities significantly. These tools can handle complex logical deductions, mathematical computations, or even domain-specific knowledge that the LLM might not possess inherently. This integration allows for more accurate and reliable outputs, as the external tools can apply formal logic and structured reasoning that LLMs typically struggle with.

Similarly, external code generation systems enable LLMs to produce executable code for specific tasks. This capability can streamline software development processes and improve efficiency in generating functional code snippets. The external systems can provide rigorous checks and balances that help ensure the correctness of the generated code.
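
One way to add those checks, sketched below under the assumption of a hypothetical call_llm code generator, is to accept generated code only if it passes a deterministic test suite; the guarantee then comes from the tests, and only extends as far as they do.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a code-generating LLM (illustration only)."""
    raise NotImplementedError

def generate_checked_function(spec: str, test_cases, attempts: int = 5):
    """Ask the model for code, but accept it only if it passes a deterministic
    test suite. The guarantee comes from the tests, not the model, and only
    covers the cases we thought to write."""
    for _ in range(attempts):
        source = call_llm(f"Write a Python function `solve` that {spec}")
        namespace = {}
        try:
            exec(source, namespace)                      # run the generated code
            solve = namespace["solve"]
            if all(solve(arg) == expected for arg, expected in test_cases):
                return solve                             # survived every check we have
        except Exception:
            continue                                     # broken code: discard and retry
    return None
```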

By leveraging these external resources, LLMs can potentially overcome some of their inherent limitations in logical reasoning and problem-solving. For starters, an external inference engine will be Turing-complete, so we scratch that problem down, right?

Not so fast. Unfortunately, this approach has many challenges, particularly regarding the LLM’s ability to generate the correct input for function calls or code execution. It all circles back to the original sin of LLMs: stochastic output.

First, the effectiveness of function calling or code generation hinges on the model’s ability to accurately interpret a task and generate appropriate inputs. If the model misinterprets the requirements or generates vague or incorrect prompts, the external tool may produce erroneous outputs or fail to execute altogether. This reliance introduces a potential failure point where the model’s limitations in understanding context and intent become apparent.

Many reasoning tasks require a nuanced understanding of logic and context that may exceed the capabilities of language models. For instance, when generating inputs for a logical inference engine, the model must understand the problem and articulate it in a way that aligns with the system’s requirements. If the model fails to capture these nuances, it may lead to incorrect deductions or ineffective reasoning processes.

Translating natural language into code or structured queries adds another layer of complexity that can itself undermine reasoning. This conversion requires knowledge of programming syntax and logic that may not come naturally to an LLM trained primarily on natural language data. Mistakes in this translation propagate to the external system, causing further errors.

While external tools can, in principle, improve the reasoning capabilities of an LLM by providing structured logic and formal verification, they cannot compensate for LLMs’ basic limitations in generating precise inputs. Therefore, there is no formal guarantee that the outputs from this integration will be logically sound or appropriate for the context, simply because of the age-old adage: garbage in, garbage out.

Conclusions

The purpose of this article is to convince you of two claims:

  1. Large Language Models currently lack the capability to perform a well-defined form of reasoning that is essential for many decision-making processes.
  2. We currently have absolutely no idea how to solve this in the near future.

This matters because there is a growing trend to promote LLMs as general-purpose reasoning engines. As more users begin to rely on LLMs for important decisions, the implications of their limitations become increasingly significant. At some point, someone will trust an LLM with a life-and-death decision, with catastrophic consequences.

More importantly, the primary challenges in making LLMs trustworthy for reasoning are immense. Despite ongoing research and experimentation, we have yet to discover solutions that effectively bridge the gap between LLM capabilities and the rigorous standards required for reliable reasoning. Currently, our best efforts in this area are nothing but duct tape—temporary fixes that do not address the underlying limitations of the stochastic language modeling paradigm.

Now, I want to stress that these limitations do not diminish the many other applications where LLMs excel as stochastic language generators. In creative writing, question answering, user assistance, translation, summarization, automatic documentation, and even coding, many of the limitations we have discussed here are actually features.

The thing is, this is what language models were designed for—to generate plausible, human-like, varied, not-necessarily-super-accurate language. The whole paradigm of stochastic language modeling is optimized for this task, and it excels at it. It is much better than anything else we’ve ever designed. But when we ask LLMs to step outside that range of tasks, they become brittle, unreliable, and, worse, opaquely so.

The emergence of models like OpenAI’s o1, which boasts impressive reasoning abilities, may seem like a significant step forward. However, this approach does not represent a fundamentally new paradigm in logical reasoning with LLMs. Deep down, this is “just” a way to explicitly incorporate chain of thought prompting in a fine-tuning phase and teach the model via reinforcement learning to select mostly coherent paths of deduction.

Thus, while definitely an impressive technical and engineering feat, o1 (terrible name)—and any future models based on the same paradigm—will continue to share the same core limitations inherent to all LLMs, only mitigated by some clever tricks. While they may excel in certain contexts, caution must be exercised in interpreting their outputs as definitive reasoning.

If LLMs are to fulfill even some of our highly unrealistic expectations for them, we must prioritize solving the challenge of provably correct reasoning. Until then, all we have is a stochastic parrot—a fun toy with some interesting use cases but not a truly transformative technology.