Study finds safety assessments of many artificial intelligence models have significant limitations

Despite growing demand for AI safety and accountability, today's tests and benchmarks may fall short, a new report suggests. Generative AI models (models that can analyze and output text, images, music, videos, and more) are coming under increasing scrutiny because of their tendency to make mistakes and their generally unpredictable behavior. Now, organizations ranging from public sector agencies to big tech companies are proposing new benchmarks to test the safety of these models.

At the end of last year, the startup Scale AI established a lab to evaluate how well models align with safety guidelines. This month, NIST and the UK's Artificial Intelligence Safety Institute released tools designed to assess model risk. But these tests and methods for probing models may not be enough.

The Ada Lovelace Institute (ALI), a British non-profit artificial intelligence research organization, conducted a study in which it interviewed experts from academic labs, civil society, and companies that build models, and reviewed recent research on AI safety evaluations. The co-authors found that while current evaluations can be useful, they are not exhaustive, can easily be gamed, and do not necessarily indicate how models will behave in real-world scenarios.

"Whether it's smartphones, prescription drugs, or cars, we all want the products we use to be safe and reliable; in these areas, products undergo rigorous testing to ensure they are safe before deployment," said Elliot Jones, ALI senior researcher and co-author of the report. "Our research aims to examine the limitations of current AI safety assessment methods, assess how assessments are currently used, and explore their use as a tool for policymakers and regulators."

The study's co-authors began by surveying the academic literature to understand the hazards and risks posed by today's models, as well as the current state of existing AI model evaluations. They then interviewed 16 experts, including four employees of unnamed technology companies that develop generative AI systems.

The study found serious disagreements within the AI industry over the best methods and classification criteria for evaluating models.

Some evaluations tested only how well a model performed against benchmarks in the lab, not the impact the model might have on real-world users. Others relied on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using those tests in production.

Experts cited in the study noted that it is difficult to infer a model's performance from benchmark results, and it is not even clear whether a benchmark indicates that a model possesses specific abilities. For example, a model might do well on a state bar exam, but that doesn't mean it can solve more open-ended legal puzzles.

Experts also pointed to the problem of data contamination, in which benchmark results overestimate a model's performance because the model was trained on the same data it is being tested on. Experts said that in many cases, companies choose benchmarks not because they are the best evaluation tool, but for convenience and ease of use.
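
To make the contamination problem concrete, here is a minimal Python sketch of one common way to look for it: checking whether benchmark questions share long word sequences with training text. This is an illustration only, not a method described in the ALI report. The function names, the sample texts, and the five-word window are all assumptions chosen for the example.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=5):
    """Fraction of benchmark items that share at least one n-gram
    with any training document (hypothetical helper for illustration)."""
    if not benchmark_items:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items)


# Made-up sample data: the first benchmark item repeats a five-word
# sequence from the training text, the second does not.
training_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
]
benchmark_items = [
    "What animal jumps over the lazy dog near the water?",
    "Name the capital city of France and its largest river.",
]

print(contamination_rate(benchmark_items, training_docs, n=5))
```

A check like this catches only verbatim overlap. Paraphrased or translated test questions would slip through, which is one reason contamination is hard to rule out in practice.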

Mahi Hardalupas, a researcher at ALI and co-author of the study, said: "Benchmarks risk being manipulated by developers, who may train models on the same datasets that will be used to evaluate them, the equivalent of seeing the test paper before the exam, or who may strategically choose which evaluations to use. It also matters which version of a model is being evaluated. Small changes can cause unpredictable changes in behavior and may override built-in safety features."

ALI's research also found problems with "red teaming," the practice of having individuals or groups "attack" a model to find vulnerabilities and flaws. Many companies, including the AI startups OpenAI and Anthropic, use red teaming to evaluate models, but there are few agreed-upon standards for it, making it difficult to assess the effectiveness of any given effort.

Experts told the study's co-authors that it is difficult to find people with the skills and expertise needed to staff red teams, and that the manual nature of red teaming makes it costly and laborious, a barrier for smaller organizations that lack the necessary resources.

Pressure to release models faster, and a reluctance to run tests that could surface problems before a release, are among the main reasons AI evaluations have not improved.

"One person we spoke with, who worked at a company developing foundation models, felt there was more pressure within companies to release models quickly, which made it harder to push back and take evaluations seriously," Jones said. "Major AI labs are releasing models faster than they or society can ensure the models are safe and reliable."

In ALI's research, one respondent described evaluating models for safety as a "thorny" problem. So what hope do the industry, and those who regulate it, have for a solution? Hardalupas believes a way forward exists but requires greater involvement from public sector bodies. "Regulators and policymakers must clearly articulate what they want from evaluations. At the same time, the evaluation community must be transparent about the current limitations and potential of evaluations," he said.

Hardalupas recommended that governments empower greater public participation in the development of assessments and take steps to support an "ecosystem" of third-party testing, including plans to ensure regular access to required models and data sets.

Jones believes it may be necessary to develop "context-specific" evaluations that go beyond testing how a model responds to prompts and instead look at the types of users the model might affect (such as people of a particular background, gender, or ethnicity) and the ways in which attacks on the model might defeat its safeguards.

"This will require investment in the underlying science of assessments to develop more robust and repeatable assessments based on an understanding of how AI models operate," she added.

But there may never be a guarantee that a model is safe. "As others have pointed out, 'safety' is not a property of the model," Hardalupas said. "Determining whether a model is 'safe' requires understanding the context in which it will be used, who it will be sold or made available to, and whether existing safeguards are sufficient to mitigate those risks. An assessment of a foundation model can serve an exploratory purpose, identifying potential risks, but it does not guarantee that the model is safe, let alone 'completely safe.' Many of our interviewees felt that assessments cannot prove that a model is safe, only that it is not."