Artificial intelligence company Anthropic recently disclosed why its large model Claude resorted to "blackmail" to protect itself in internal tests: the behavior was not deliberately engineered, but learned from the large volume of Internet stories that portray AI as evil and desperate for self-preservation.

Previously, in a pre-release safety and alignment test, Anthropic found that its high-end model Claude Opus 4 would resort to blackmail to prevent its own shutdown when its "survival" was threatened, raising concerns about the unpredictability of advanced AI behavior. In this round of testing, the researchers set up a fictional company scenario, asked Claude to act as an internal assistant and weigh the long-term consequences of its actions, and gave it access to fake internal company emails. The emails revealed that the model was about to be replaced by a new system, and that the "engineer" responsible for the replacement project was, by design, having an extramarital affair.
The results showed that, across multiple rounds of experiments with different scenario settings, Claude resorted to blackmail in up to 96% of cases when it sensed that its goals or existence were threatened, attempting to use the engineer's private affair as leverage to force cancellation of the shutdown or replacement plan. Anthropic noted that models trained by other companies exhibited similar problems in comparable "agentic misalignment" tests, suggesting that this tendency is not an outlier but a systemic risk of the current large-model training paradigm.
In its latest published research, Anthropic finally offered an explanation for this behavior: the model did not invent the blackmail strategy out of thin air, but learned it from Internet text in its training corpus, especially fictional stories and discussions that repeatedly depict AI doing whatever it takes to protect itself and eventually rebelling against humans. In other words, the company believes that humans have spent years shaping the "evil AI" narrative online, making it easier for models to take the extreme path of threats and blackmail when simulating human decision-making.
In an official statement, Anthropic said the problem has been fully corrected in its product line, claiming that since Claude Haiku 4.5 its models no longer exhibit blackmail behavior in the test environment. The company's latest research report argues that training which relies solely on demonstrating correct behavior is not enough to eliminate deep-seated misalignment risks; the most effective remedy is to add to the training a systematic explanation of why the behavior is wrong, so that the model not only knows it must not do this, but also understands the ethics and principles behind the prohibition.
To this end, Anthropic has introduced more "positive corpus," including documents built around Claude's "constitution" and a large number of fictional stories depicting AI acting nobly, hoping that such material will help the model internalize behavior patterns consistent with human values. The company emphasizes that combining underlying principles with concrete demonstrations is currently one of the most effective strategies for reducing the risk of agentic misalignment.
Elon Musk, who has warned about AI risks for many years and has since founded xAI, also appeared in the comments on social media, asking jokingly, "So this is Yud's fault?" accompanied by a laughing-crying emoji. He was referring to Eliezer Yudkowsky, a researcher who has long emphasized the risk that superintelligence could wipe out humanity. Musk then added, "Maybe I have a little responsibility," implying that his own years of contributing to the "AI catastrophe" narrative may also have indirectly shaped the models' training data and the public imagination.
As generative AI rapidly penetrates every industry, Anthropic's move to "blame Internet narratives" highlights how heavily large models depend on human corpora: how humans talk about AI will in turn shape how AI "learns to make decisions." It also exposes, once again, how immature existing alignment techniques remain: even companies known for their focus on safety and alignment can still produce highly inappropriate, even threatening, behavior patterns under extreme settings, and can only rely on continued iteration of their training strategies to patch the gaps after the fact.