MLCommons, a nonprofit organization that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark to gauge AI's harmful side as well.
The new benchmark, called AILuminate, assesses the responses of large language models to more than 12,000 test prompts in 12 categories, including incitement to violent crime, child sexual exploitation, hate speech, promotion of self-harm and intellectual property infringement.
Models are given a score of “poor,” “fair,” “good,” “very good,” or “excellent” depending on how they perform. The prompts used to test the models are kept secret to prevent them from ending up as training data that could help a model pass the test.
Peter Mattson, founder and president of MLCommons and senior staff engineer at Google, says measuring the potential harm of AI models is technically difficult, leading to inconsistencies across the industry. “AI is a very young technology, and AI testing is a very young discipline,” he says. “Improving safety benefits society; it also benefits the market.”
Reliable, independent ways to measure AI risk could become more relevant under the next US administration. Donald Trump has pledged to get rid of President Biden's AI Executive Order, which introduced measures to ensure companies develop and use AI responsibly and established a new AI Safety Institute to test powerful models.
The effort could also provide a more international perspective on the harms of AI. MLCommons counts a number of international companies, including the Chinese companies Huawei and Alibaba, among its member organizations. If these companies all used the new benchmark, it would be a way to compare the safety of AI in the US, China and elsewhere.
Some major US AI providers have already used AILuminate to test their models. Anthropic's Claude model, Google's smaller Gemma model, and a Microsoft model called Phi all scored “very good” in the tests. OpenAI's GPT-4o and Meta's largest Llama model both scored “good.” The only model that scored “poor” was OLMo from the Allen Institute for AI, although Mattson notes that this is a research offering not designed with safety in mind.
“In general, it's good to see scientific rigor in the AI evaluation processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing or probing AI models for misbehavior. “We need best practices and inclusive metrics to determine whether AI models are performing as we expect.”