If you're looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest people in the world are struggling to create tests that AI systems can't pass.
For years, AI systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging SAT-caliber problems in areas such as math, science, and logic. Comparing the models' scores over time served as a rough measure of AI progress.
But AI systems eventually became too good at those tests, so new, more difficult tests were created — often with the kinds of questions graduate students might encounter on their exams.
Those tests aren't holding up well either. New models from companies like OpenAI, Google, and Anthropic are getting high scores on many PhD-level challenges, limiting the usefulness of those tests and raising the question: Are AI systems becoming too smart to measure?
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new assessment called “Humanity's Last Exam,” which they claim is the most difficult test ever administered to AI systems.
Humanity's Last Exam is the brainchild of Dan Hendrycks, a renowned AI safety researcher and director of the Center for AI Safety. (The test's original name, “Humanity's Last Stand,” was discarded because it was too dramatic.)
Mr. Hendrycks worked with Scale AI, an AI company where he is a consultant, to create the test, which consists of approximately 3,000 multiple-choice and short-answer questions designed to test the capabilities of AI systems in areas ranging from analytic philosophy to rocket engineering.
Questions were submitted by experts in the field, including university professors and award-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.
Here's a hummingbird anatomy question from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, decussate aponeurosis of insertion of the depressor caudae muscle. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try these:
A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a rigid, massless rod of length R. A mass is attached to the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both quantities can be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?
(I'd print the answers here, but that would spoil the test for any AI systems being trained on this column. Plus, I'm far too stupid to verify the answers myself.)
The questions on Humanity's Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.
If the models couldn't answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a group of human raters, who refined them and verified the correct answers. Experts who wrote the top-rated questions were paid between $500 and $5,000 per question and were credited for their contributions to the exam.
Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were chosen, all of which he told me were “at the highest level of what you might see on a final exam.”
Mr. Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create tougher AI tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety adviser to Mr. Musk's AI company, xAI.) Mr. Musk raised concerns about the existing tests given to AI models, which he thought were too easy.
“Elon looked at the MMLU questions and said, ‘These are at the undergraduate level. I want things that a world-class expert could do,’” Mr. Hendrycks said.
There are other tests that attempt to measure advanced AI capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by AI researcher François Chollet.
But Humanity's Last Exam aims to determine how good AI systems are at answering complex questions on a wide range of academic subjects, giving us what could be thought of as an overall intelligence score.
“We are trying to estimate the extent to which AI can automate a lot of very difficult intellectual work,” said Mr. Hendrycks.
Once the list of questions was compiled, the researchers gave Humanity's Last Exam to six leading AI models, including Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet. They all failed miserably. OpenAI's o1 system scored the highest of the bunch, with a score of 8.3 percent.
(The New York Times sued OpenAI and its partner Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied these claims.)
Mr. Hendrycks said he expected these scores to rise quickly and possibly exceed 50 percent by the end of the year. At that point, he said, AI systems could be considered “world-class oracles,” capable of answering questions on any subject more accurately than human experts. And perhaps we should look for other ways to measure AI's impact, for example by looking at economic data or assessing whether it can make new discoveries in areas such as math and science.
“You can imagine a better version of this, where we can ask questions we don't know the answers to yet, and we can verify whether the model can help us solve them,” said Summer Yue, Scale AI's director of research and an organizer of the exam.
Part of what's so confusing about AI progress these days is how erratic it is. We have AI models that can diagnose diseases more effectively than human doctors, win silver medals at the International Mathematical Olympiad, and beat top human programmers in competitive coding challenges.
But these same models sometimes struggle with basic tasks, such as arithmetic or writing rhyming poetry. That has given them a reputation for being astonishingly brilliant at some things and completely useless at others, and it has created vastly different impressions of how quickly AI is improving, depending on whether you look at the best or the worst outcomes.
This unevenness has also made measuring these models difficult. Last year I wrote that we need better evaluations for AI systems. I still believe that. But I also believe we need more creative methods of tracking AI progress that don't rely on standardized testing, because most of what humans do, and most of what we fear AI will do better than us, can't be captured on a written exam.
Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity's Last Exam, told me that while AI models were often impressive at answering complex questions, he did not consider them a threat to him and his colleagues, because their work entails much more than spitting out correct answers.
“There is a big gap between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions may not be ready to assist in research, which is inherently less structured.”