Companies like OpenAI and Google have been touting advanced “reasoning” capabilities for some time as the next big step in their latest artificial intelligence models. Now, however, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.
The fragility highlighted in these new results supports previous research suggesting that LLMs' use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”
Mixing it up
In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” (currently available as a preprint paper), the six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school-level math word problems, which is often used as a benchmark for the complex reasoning capabilities of modern LLMs. They then take a novel approach, modifying a portion of that testing set to dynamically replace certain names and numbers with new values: a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
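To make the mechanism concrete, here's a minimal sketch in Python of that kind of symbolic templating. The template text, names, and number ranges are purely illustrative, not the paper's actual code or data:

```python
import random

# Illustrative GSM-Symbolic-style templating: the names and numbers in a
# word problem are treated as variables and re-sampled, while the
# underlying arithmetic stays identical. (Hypothetical template; not the
# paper's actual generator.)
TEMPLATE = (
    "{name} buys {x} building blocks for a {relative} on Monday and "
    "{y} more on Tuesday. How many blocks does {name} buy in total?"
)

def generate_variant(rng: random.Random) -> tuple[str, int]:
    """Return one question variant and its ground-truth answer."""
    name = rng.choice(["Sophie", "Bill", "Mia", "Omar"])
    relative = rng.choice(["nephew", "brother", "niece", "sister"])
    x, y = rng.randint(5, 60), rng.randint(5, 60)
    question = TEMPLATE.format(name=name, relative=relative, x=x, y=y)
    return question, x + y  # the reasoning steps required never change

rng = random.Random(0)
for _ in range(3):
    question, answer = generate_variant(rng)
    print(question, "->", answer)
```

Scoring a model across many such freshly generated variants, rather than on one frozen question set, is what makes the variance described below visible.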
This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as on GSM8K.
When the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, though, they found average accuracy reduced across the board compared to GSM8K, with performance drops of between 0.3 percent and 9.2 percent depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
This kind of variance – both within different GSM-Symbolic runs and compared to GSM8K results – is more than a little surprising because, as the researchers note, “the general reasoning steps required to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not performing “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”
Don't get distracted
Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI's ChatGPT-4o, for example, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That's a pretty high success rate on either benchmark, regardless of whether or not the model itself uses “formal” reasoning behind the scenes (though overall accuracy dropped dramatically for many models when the researchers added just one or two additional logical steps to the problems).
The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were slightly smaller than average.”
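The pattern is easy to illustrate with a hypothetical sketch in the same vein as the one above (the wording here is our own; the paper's actual NoOp statements are curated by the researchers):

```python
# Illustrative GSM-NoOp-style modification: a "seemingly relevant but
# ultimately inconsequential" clause is inserted before the final question.
# A reader who truly reasons should give the same answer either way.
BASE = (
    "Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, "
    "and 24 kiwis on Sunday."
)
NOOP = " Five of the kiwis were slightly smaller than average."
QUESTION = " How many kiwis did Oliver pick in total?"

clean_prompt = BASE + QUESTION        # correct answer: 44 + 58 + 24 = 126
noop_prompt = BASE + NOOP + QUESTION  # same answer; the size detail is irrelevant

print(noop_prompt)
```

A model that subtracts those five smaller kiwis from the total is matching a surface pattern (“smaller” suggests subtraction) rather than reasoning about what the question actually asks.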
Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to as much as 65.7 percent depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.