A recent study by Apple researchers raises important questions about the true reasoning capabilities of Large Language Models (LLMs).
The research, detailed in a paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," suggests that LLMs may not be as intelligent as they appear.
Research Reveals OpenAI, Google, and Meta LLMs are Flawed
The study focuses on "GSM8K" (Grade School Math 8K), a widely used benchmark for measuring LLM reasoning capabilities. The benchmark is a dataset of over 8,000 high-quality, diverse grade school math problems.
However, the researchers noted that the widespread use of this dataset creates a risk of data contamination, meaning the models might simply be recalling answers from examples they were trained on.
In other words, instead of performing true logical reasoning, the models may rely on sophisticated pattern matching, which limits their ability to solve complex problems reliably.
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the… pic.twitter.com/yli5q3fKIT
— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024
To test this theory, the researchers developed a new benchmark called GSM-Symbolic, which generates variants of standard reasoning problems by altering names and numbers, varying complexity, and in some cases adding irrelevant information.
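The general idea is simple to illustrate. The minimal sketch below uses a hypothetical template and name pool (not the paper's actual materials) to show how a symbolic template can re-sample surface details such as names and numbers while keeping the underlying arithmetic fixed.

```python
import random

# Hypothetical symbolic template in the spirit of GSM-Symbolic: the surface
# details (name, quantities) are placeholders that get re-sampled, while the
# underlying arithmetic stays the same.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

NAMES = ["Sophie", "Liam", "Ava", "Noah"]  # illustrative name pool

def generate_variant(seed: int) -> tuple[str, int]:
    """Return one (question, ground-truth answer) pair for a given seed."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y  # the answer follows directly from the template's logic

if __name__ == "__main__":
    for seed in range(3):
        question, answer = generate_variant(seed)
        print(question, "->", answer)
```

A model that genuinely reasons should score the same on every variant; a model that has memorized the original wording should not.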
When the researchers evaluated more than 20 LLMs, including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3, accuracy dropped markedly across all models once these variables were adjusted.
Once irrelevant details were introduced, the models struggled to maintain high performance. Even the OpenAI models, which generally performed better than open-source alternatives, showed a noticeable decline in accuracy, suggesting that LLMs are more fragile than previously thought.
For example, when presented with a math problem involving kiwis, the models consistently failed to recognize that certain details were irrelevant. The problem mentioned that five of the kiwis picked were smaller than average, a detail with no bearing on the actual math.
8/ This begs the question: Do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next! pic.twitter.com/P3I4kyR56L
— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024
Yet many LLMs subtracted those five kiwis from the total, revealing that they latched onto surface-level patterns rather than understanding the underlying logic.
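To see why this is a pattern-matching failure rather than an arithmetic one, consider a kiwi-style problem of the form described above; the numbers here are illustrative, not taken from the paper.

```python
# Illustrative kiwi-style problem (numbers are made up, not from the paper):
# "Oliver picks 40 kiwis on Friday, 50 on Saturday, and on Sunday he picks
#  double the number he picked on Friday, but 5 of them are a bit smaller
#  than average. How many kiwis does he have?"

friday, saturday = 40, 50
sunday = 2 * friday           # "double the number he picked on Friday"
smaller_than_average = 5      # the no-op clause: size does not change the count

correct_total = friday + saturday + sunday                     # 170, the right answer
pattern_matched_total = correct_total - smaller_than_average   # 165, the typical mistake

print(correct_total, pattern_matched_total)
```

The "smaller than average" clause looks like the discount-style information that appears in many training problems, so a pattern matcher subtracts it; a reasoner recognizes that kiwi size does not change the count.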
OpenAI’s o1-preview had the smallest drop in accuracy, losing 17.5%, but other models, like Microsoft’s Phi 3, saw a performance decrease of up to 65%.
While the study’s findings shed light on these limitations, it is also worth considering the competitive context.
Apple, the company behind the research, is a direct competitor to Google, Meta, and OpenAI, all of which have invested heavily in AI development.
Even though Apple and OpenAI collaborate in certain areas, Apple is also working on its own AI models, which raises questions about the motivations behind the study’s conclusions.