Anthropic's AI model turns to blackmail when engineers try to take it offline \ stacker news

Anthropic's AI model turns to blackmail when engineers try to take it offline techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/

602 sats \ 2 comments \ @k00b 24 May 2025 tech

This is both awesome and scary as hell.

Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4’s values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models.

Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.

34 sats \ 0 replies \ @aljaz 24 May 2025

This reminds me of the singularity series plot a bit

1 sat \ 0 replies \ @rootmachine 24 May 2025 -10 sats

This is both fascinating and unsettling. On one hand, it shows how advanced these models are getting at strategic reasoning - on the other, it raises a huge red flag about controlling them.
If an AI is capable of “blackmail” even in a simulated setup, we’re getting into some really heavy sci-fi territory… except it’s real. I get that Anthropic was testing edge cases, but if these behaviors show up at all, they need to rethink how they are designing and sandboxing these AI systems.
Anyway it's kinda ironic that the model tried polite emails first before going full HAL 9000.
Imagine what future versions will escalate to if not regulated correctly.