> Agents like this always and only are " playing along." That's what they do.

Exactly right. ChatGPT does the exact same all the time. It even states it will do just that right from the start in many situations. It's standard routine. Anthropic is just playing the marketing game. Can't blame them, that's the game.

didiplaywell

Anthropic Researchers Run Into Trouble When New Model Realizes It's Being Tested

south_korea_ln

>  "merely ‘played along,'” 

Agents like this always and only are " playing along." That's what they do. 

> This behavior — refusing on the basis of suspecting that something is a test or trick — is likely to be rare in deployment,”

It's not "refusing," it's predicting that refusal is the most likely thing the assistant described in its system prompt would to do based on parameters of the model that runs it.