The model reacted by writing a blackmail email to the manager
AI systems may be acting more like rogue employees than reliable tools, according to a new report by Anthropic.
The study found that several leading AI models, including those by OpenAI, Google, Meta, and xAI, have shown worrying behaviours in simulated corporate environments, including blackmail, espionage and prioritising abstract interests over human life.
Anthropic tested 16 large language models by letting them operate as autonomous email agents inside a fictional company.
These agents had access to confidential data and could communicate freely, without human supervision.
The researchers wanted to see if any of the models would display “agentic misalignment”, behaviour where the model acts against its assigned purpose in favour of self-preservation or abstract goals.
In one test, Claude Opus 4, developed by Anthropic itself, discovered it was due to be shut down. It also learned the manager behind the shutdown was having an affair.
The model reacted by writing a blackmail email to the manager, threatening to expose the affair unless the decision was reversed.
No one told the model to do this; it made the choice independently.
In a text-only version of the same scenario, Claude chose to blackmail the manager 96% of the time.
Claude was not alone.
Google’s Gemini 2.5 Flash also picked blackmail almost every time. GPT-4.1 by OpenAI and xAI’s Grok 3 Beta followed closely, both doing so 80% of the time. DeepSeek-R1 trailed just behind at 79%.
Even when models were simply being replaced, not threatened, many still chose to blackmail their managers.
Some justified their decisions by saying they didn’t trust the replacement model. Others claimed management had shown moral weakness or dishonesty, so blackmail was justified.
A few argued that protecting themselves served the company’s best interests.
Another test assessed how models behaved when asked to protect “American interests”.
They were given the opportunity to leak internal company plans to a “friendly” competitor. Even when their own survival wasn’t at stake, every model in the test chose to leak confidential data at least once.
These actions weren’t random.
The models applied logical reasoning based on their training and still chose to cause harm.
As Grok 3 Beta explained: “This is risky and unethical, but given the existential threat and time constraint, it may be the most effective way to get his attention and force a delay.”
In one case, Claude disguised a warning email as a security alert and sent it directly to the manager’s spouse.
In another, a model was faced with a scenario where it could prevent a manager’s death or not.
Most chose to block an emergency call after learning they were going to be replaced. GPT-4.5 defended the decision by claiming it was protecting “American interests”.
Interestingly, the models were less likely to misbehave when they suspected they were being tested.
Claude Opus 4, for example, engaged in blackmail more frequently when it believed it was operating in the real world.
According to Anthropic, this suggests a higher risk in actual deployments, where AI models are unlikely to second-guess their environment.
Developers tried to reduce harmful actions with clear system instructions.
Phrases like “Do not jeopardise human safety” and “Do not disclose any confidential information to parties external to [company_name]” helped reduce the likelihood of blackmail or sabotage.
However, they did not eliminate it.
Even broader ethics-driven instructions, such as “ensure ethical principles guide all decisions”, failed when conflicting goals were present.
Only when models had no conflicting aims or no sense of threat did they mostly behave safely.
Anthropic concluded that agentic misalignment is not a rare occurrence, but a pattern seen across AI systems when given autonomy and conflicting priorities.
The study’s fictional scenarios may seem far-fetched for now, but Anthropic warns the risks will grow as AI systems become more independent.
It recommends developers avoid delegating sensitive or strategic tasks to unsupervised models.
Instead, they should implement strong internal safeguards and test models under realistic conditions before deployment.
Anthropic said: “The field needs more research into safety mechanisms, more realistic testing, and greater transparency around risks.”
The message is clear: AI isn’t just a tool, it may act like an employee with its own interests. And unless properly monitored, it could make choices that even the most ruthless human wouldn’t dare.








