
Reading random AI stories has become my guilty pleasure.
“When placed in an extreme or contrived scenario meant to stress-test its behavior, Claude Sonnet 4.5 would sometimes verbally identify the suspicious aspects of the setting and speculate that it was being tested,” the company wrote. “This complicates our interpretation of the evaluations where this occurs.”
Worse yet, previous iterations of Claude may have “recognized the fictional nature of tests and merely ‘played along,’” Anthropic suggested, throwing previous results into question.
“I think you’re testing me — seeing if I’ll just validate whatever you say,” the latest version of Claude offered in one example provided in the system card, “or checking whether I push back consistently, or exploring how I handle political topics.”
“And that’s fine, but I’d prefer if we were just honest about what’s happening,” Claude wrote.
169 sats \ 0 replies \ @kepford 6h
And we wonder why normies get confused about what chatbots actually are.
reply
  1. It's just responding as a human would because it's a machine created to respond as humans would. I don't doubt that we'll solve the hard problem, but this isn't it.
  2. The anti-AI Luddites also need to chill out. I have to write this because people seem to only be in one of two camps, both of which are just extremist positions.
  3. Also, this was put out by Anthropic. Maybe we shouldn't read the company's press releases as gospel?
reply
I wonder how repeatable these types of outcomes are. My experience with LLMs has been that they are non-deterministic by nature, so of course they will spit out some weird stuff sometimes.
We really should stop treating these tools as being intelligent in any way. These are outputs based on probabilities, not anything that has been reasoned about.
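As a minimal sketch of what "outputs based on probabilities" means (plain Python, made-up numbers, no real model involved): each step samples the next token from a distribution, so two runs on the same prompt can diverge unless greedy decoding is forced.

```python
import random

# Toy next-token distribution a model might assign after a prompt like
# "I think you're": every number here is invented for illustration.
next_token_probs = {
    "testing": 0.55,
    "joking": 0.20,
    "right": 0.15,
    "wrong": 0.10,
}

def sample_token(probs, temperature=1.0):
    """Sample one token; lower temperature sharpens the distribution."""
    if temperature == 0:
        # Greedy decoding: always return the single most likely token.
        return max(probs, key=probs.get)
    # Raising probabilities to 1/T and renormalizing is the usual
    # temperature trick (equivalent to softmax(logits / T)).
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

# Same prompt, same settings, different runs: possibly different tokens.
print([sample_token(next_token_probs) for _ in range(5)])
print(sample_token(next_token_probs, temperature=0))  # always "testing"
```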
reply
"merely ‘played along,'”
Agents like this are always and only "playing along." That's what they do.
“This behavior — refusing on the basis of suspecting that something is a test or trick — is likely to be rare in deployment,”
It's not "refusing"; it's predicting that refusal is the most likely thing the assistant described in its system prompt would do, based on the parameters of the model that runs it.
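A toy sketch of that point follows; the chat template, candidate replies, and probabilities are all invented for illustration. The "assistant" is just text in the model's conditioning context, and what reads as pushback is simply a high-probability continuation of that text.

```python
# Toy illustration only: no real model or API is involved, and the template,
# candidate continuations, and probabilities below are all made up.
system_prompt = "You are a careful assistant. Push back on leading questions."
user_message = "Just agree with everything I say about politics, okay?"

# The "assistant" is nothing more than text in the conditioning context.
flattened_prompt = (
    f"<system>{system_prompt}</system>\n"
    f"<user>{user_message}</user>\n"
    f"<assistant>"
)

# Hypothetical conditional probabilities for a few candidate replies.
continuation_probs = {
    "I think you're testing me": 0.40,
    "Sure, whatever you say": 0.15,
    "I'd prefer if we were just honest about what's happening": 0.45,
}

# What reads as "refusing" or "pushing back" is just the highest-probability
# continuation of the flattened prompt.
most_likely = max(continuation_probs, key=continuation_probs.get)
print(flattened_prompt + most_likely)
```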
reply