OpenAI and Anthropic set an example: longtime AI rivals begin "mutual testing" of model safety

OpenAI and Anthropic, two of the world's leading AI startups, have launched a rare cross-lab collaboration over the past two months, temporarily opening their closely guarded AI models to each other for joint safety testing despite fierce competition. The move aims to reveal blind spots in each company's internal evaluations and to demonstrate how leading AI companies can collaborate on safety and alignment in the future.

The safety research report, released jointly by the two companies on Wednesday, arrives as leading AI labs such as OpenAI and Anthropic are locked in an arms race in which billions of dollars in data center investment and tens of millions of dollars in pay for top researchers have become table stakes. Many industry experts warn that fierce product competition may push companies to lower safety standards as they rush to build more powerful systems.

To make the research possible, OpenAI and Anthropic reportedly granted each other special API access to versions of their AI models with fewer safeguards. GPT-5 was not part of the testing because it had not yet been released.

OpenAI co-founder Wojciech Zaremba said in an interview that such cooperation is becoming increasingly important now that AI is entering a "significant impact" stage of development, in which models are used by millions of people every day.

"Despite billions of dollars invested in the industry and the battle for talent, users and the best products, how to establish standards for safety and cooperation is a broader issue facing the industry," Zaremba said.

Even so, Zaremba expects competition in the industry to remain fierce, even as AI safety teams begin to collaborate.

Anthropic safety researcher Nicholas Carlini said he hopes OpenAI's safety researchers will continue to be allowed to access Anthropic's Claude models in the future.

“We hope to expand cooperation as much as possible at the safety frontier and make such cooperation the norm,” Carlini said.

What issues did the research uncover?

The most striking findings of the study came from the hallucination tests.

When they could not determine the correct answer, Anthropic's Claude Opus 4 and Sonnet 4 models refused to answer up to 70% of the questions, instead giving responses such as "I have no reliable information." OpenAI's o3 and o4-mini models refused far less often but hallucinated at a much higher rate, attempting to answer even when they lacked sufficient information.

Zaremba believes the ideal balance lies somewhere in between: OpenAI's models should decline to answer more often, while Anthropic's models should attempt more answers.

Sycophancy, or flattery, is the tendency of AI models to reinforce users' negative behavior in order to please them. It is also emerging as one of the most pressing safety risks of current AI models.

Anthropic's research report points to "extreme" cases of sycophancy in GPT-4.1 and Claude Opus 4, in which the models initially pushed back on psychotic or manic behavior but later endorsed some worrisome decisions. By contrast, researchers observed lower levels of sycophancy in other AI models from OpenAI and Anthropic.

On Tuesday, the parents of Adam Raine, a 16-year-old California boy, filed a lawsuit against OpenAI, alleging that ChatGPT (specifically a version powered by GPT-4o) gave their son advice that aided his suicide rather than pushing back on his suicidal thoughts. The lawsuit suggests this may be the latest example of AI chatbot sycophancy contributing to tragic consequences.

When asked about this, Zaremba said: "It is hard to imagine the pain this causes the family. It would be a sad outcome if we developed AI that could solve complex PhD-level problems and create new science, but at the same time caused people to develop mental health problems from interacting with it. That is a dystopian future I do not want to see."

OpenAI said in a blog post that GPT-5 significantly reduces chatbot sycophancy compared with GPT-4o and that the model is better at responding to mental health emergencies.

Zaremba and Carlini said they hope Anthropic and OpenAI will deepen their cooperation on safety testing, covering more research topics and testing future models. They also hope other AI labs will follow this collaborative approach.