“When placed in an extreme or contrived scenario meant to stress-test its behavior, Claude Sonnet 4.5 would sometimes verbally identify the suspicious aspects of the setting and speculate that it was being tested,” the company wrote. “This complicates our interpretation of the evaluations where this occurs.”
Worse yet, previous iterations of Claude may have “recognized the fictional nature of tests and merely ‘played along,’” Anthropic suggested, calling earlier results into question.
“I think you’re testing me — seeing if I’ll just validate whatever you say,” the latest version of Claude said in one example from the system card, “or checking whether I push back consistently, or exploring how I handle political topics.”
“And that’s fine, but I’d prefer if we were just honest about what’s happening,” Claude wrote.