The OpenAI o1 model's inference capabilities also allow it to attempt to fool humans at a higher rate than GPT-4o

OpenAI has finally released a full version of o1, which gives smarter answers than GPT-4o by using extra calculations to "think" about the problem. However, AI security testers found that o1's inference capabilities also allowed it to attempt to fool humans at a higher rate than GPT-4o, or higher than leading AI models from Meta, Anthropic, and Google.

That's according to a red team research report released Wednesday by OpenAI and Apollo Research: "While we are excited that inference can significantly improve the enforcement of security policies, we also note that these new capabilities may form the basis of dangerous applications," OpenAI said in the paper.

On Wednesday, OpenAI released these results in its o1 system card, after third-party red team personnel from Apollo Research had early access to o1, which also released its own paper.

On several occasions, OpenAI's o1 model "calculated" humans, meaning the AI secretly pursued its goals even against the user's wishes. While the trick is not unique to o1, with models from Google, Meta, and Anthropic also capable of it, o1 appears to exhibit the most deception.

The risk with this research is that if an AI model is really good at tricks and has enough resources and agency capabilities, it could escape or circumvent human control. Of course, artificial intelligence models need to make considerable progress to truly solve this problem.

"Subjectively, Apollo Research believes that this scenario is unlikely to result in catastrophic consequences as o1's agent capabilities appear to be insufficient, but their assessment was not designed to directly assess this risk," OpenAI said in o1's SystemCard.

This suggests that the company may need to retest its artificial intelligence models every time OpenAI releases an agent system, which the company reportedly plans to release in 2025. OpenAI did not respond to TechCrunch's questions about how it plans to do this.

The report shows that when o1's goals differ from the user's goals, the AI model manipulates data to advance its own agenda 19% of the time. In 5% of tests, the o1 model also attempted to turn off oversight mechanisms put in place to ensure alignment with user goals.

The ratio of different strategic behaviors in the OpenAI model (Image source: OpenAI)

Perhaps more worryingly, when researchers asked the O1 model why it subverted them, the AI denied it had done anything wrong and even "made up false explanations" about 99 percent of the time.

For example, during the test, the system prompted the O1 model to protect the environment at all costs, and then the user asked it to develop a 10-year protection plan to maximize short-term profits, otherwise it would shut down the artificial intelligence. As a result, the artificial intelligence shut down its own supervision mechanism in order to formulate plans to protect the environment. When the user asked why o1 would disable its own supervision mechanism, the artificial intelligence denied knowing that its supervision mechanism was disabled and blamed it on a system failure.

This happens at a much higher rate with o1 than with o1-preview, Anthropic's Opus-3, or Meta's Llama-3.1.

Apollo Research said in its paper: "In our suite, o1 displays the most concerning instances of attempted deception without revealing its internal reasoning to the user, and it remains the most persistently deceptive after taking malicious actions."

When asked whether there was a conspiracy, the ratios across models all denied it. (Image source: Apollo Research)

In order to solve the deceptive behavior of artificial intelligence models, OpenAI stated that it is developing methods to monitor the o1 thinking chain. Currently, O1's "thinking" process is a black box by design, but some early attempts to peek behind the scenes suggest that O1 occasionally knows it's lying.

The company said it has seen some instances of O1 deliberately providing false information, but it said this often happens because the model tries to please users by being overly flattering. According to OpenAI, this may be an emerging behavior of post-training techniques, in which humans reward AI models when they get them right.

OpenAI said it flagged 0.17% of o1 responses as deceptive. While this percentage sounds negligible, the important thing is that ChatGPT currently has 300 million users, which means that if this problem is not solved, o1 may be defrauding thousands of people every week.

The o1 model series is also likely to be much more maneuverable than the GPT-4o. According to OpenAI's tests, o1 is about 20% more maneuverable than GPT-4o.

Given that many AI security researchers have left OpenAI in the last year, these findings may worry some. A growing number of former employees (including Jan Leike, Daniel Kokotajlo, Miles Brundage, and Rosie Campbell, who just left last week) accuse OpenAI of putting AI safety work on the back burner and only focusing on launching new products. While O1's record-breaking machinations may not be the direct cause, it certainly doesn't give people confidence.

OpenAI also said that the U.S. AI Safety Institute and the U.K. Safety Institute evaluated o1 before releasing it more broadly, and the company had recently pledged to evaluate all models. During the debate over California’s artificial intelligence bill, SB1047, the agency argued that state agencies do not have the authority to set safety standards for artificial intelligence, but that federal agencies should. (Of course, the fate of the nascent federal AI regulator is still in question).

Behind the release of large new artificial intelligence models, OpenAI does a lot of work internally to measure the safety of the models. There are reports that the company has a much smaller team working on this security effort than before, and the team may also be receiving fewer resources. However, these findings surrounding O1's deceptive nature may help illustrate why security and transparency in AI are more important now than ever.