On Tuesday, researchers from Stanford University and the University of California, Berkeley published a research paper that claims to show changes in GPT-4's outputs over time. The paper fuels a common but unproven belief that the AI language model has grown worse at coding and compositional tasks in recent months. Some experts aren't convinced by the results, but they say the lack of certainty points to a larger problem with how OpenAI handles its model releases.
In a study titled "How Is ChatGPT's Behavior Changing Over Time?" published on arXiv, Lingjiao Chen, Matei Zaharia, and James Zou cast doubt on the consistent performance of OpenAI's large language models (LLMs), specifically GPT-3.5 and GPT-4. Using API access, they tested the March and June 2023 versions of these models on tasks such as solving math problems, answering sensitive questions, generating code, and visual reasoning. Most notably, GPT-4's ability to identify prime numbers reportedly plunged from 97.6 percent accuracy in March to just 2.4 percent in June. Oddly enough, GPT-3.5 showed improved performance over the same period.
This research comes on the heels of frequent complaints that GPT-4's performance has subjectively declined in recent months. Popular theories about why include OpenAI "distilling" models to reduce their computational overhead in a quest to speed up output and save GPU resources, fine-tuning (additional training) to reduce harmful outputs that may have unintended effects, and a smattering of unsupported conspiracy theories, such as OpenAI reducing GPT-4's coding capabilities so more people will pay for GitHub Copilot.
Meanwhile, OpenAI has consistently denied any claims that GPT-4 has decreased in capability. As recently as last Thursday, OpenAI VP of Product Peter Welinder tweeted, "No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one. Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before."
While this new study may appear to be a smoking gun that proves the hunches of GPT-4 critics, others say not so fast. Princeton computer science professor Arvind Narayanan thinks the paper's findings don't conclusively prove a decline in GPT-4's performance and are potentially consistent with fine-tuning adjustments made by OpenAI. For example, in terms of measuring code-generation capabilities, he criticized the study for evaluating whether the code is immediately executable rather than whether it is correct.
"The change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code (strange)," he tweeted. "They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it."
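The distinction Narayanan draws can be illustrated with a toy evaluation. In this minimal sketch (the function names and the fenced example are illustrative, not taken from the paper), a correct answer wrapped in Markdown code fences fails a "directly executable" check even though the code inside runs fine once the fences are stripped:

```python
def is_directly_executable(response: str) -> bool:
    """Return True if the raw model response runs as Python without errors."""
    try:
        exec(compile(response, "<response>", "exec"), {})
        return True
    except Exception:
        return False


def strip_markdown_fences(response: str) -> str:
    """Drop ``` fence lines so only the code itself is evaluated."""
    lines = [ln for ln in response.splitlines()
             if not ln.strip().startswith("```")]
    return "\n".join(lines)


# A correct answer wrapped in Markdown, as chattier model versions tend to produce
fenced = "```python\nprint(2 + 2)\n```"

print(is_directly_executable(fenced))                         # fence lines are not valid Python
print(is_directly_executable(strip_markdown_fences(fenced)))  # the code inside executes
```

Under a strict executability metric, the fenced response scores as a failure, so a model that starts wrapping otherwise-correct code in Markdown would register a steep "decline" without any loss of coding ability.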