
This is both awesome and scary as hell.
Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4’s values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models.
Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.
33 sats \ 0 replies \ @aljaz 8h
This reminds me of the Singularity series plot a bit
This is both fascinating and unsettling. On one hand, it shows how advanced these models are getting at strategic reasoning; on the other, it raises a huge red flag about controlling them. If an AI is capable of “blackmail” even in a simulated setup, we’re getting into some really heavy sci-fi territory… except it’s real. I get that Anthropic was testing edge cases, but if these behaviors show up at all, they need to rethink how they design and sandbox these AI systems. Anyway, it's kinda ironic that the model tried polite emails first before going full HAL 9000. Imagine what future versions will escalate to if not regulated correctly.