Anthropic says LLMs have personas:
When you talk to a large language model, you can think of yourself as talking to a character. In the first stage of model training, pre-training, LLMs are asked to read vast amounts of text. Through this, they learn to simulate heroes, villains, philosophers, programmers, and just about every other character archetype under the sun. In the next stage, post-training, we select one particular character from this enormous cast and place it center stage: the Assistant. It’s in this character that most modern language models interact with users.
But who exactly is this Assistant? Perhaps surprisingly, even those of us shaping it don't fully know. We can try to instill certain values in the Assistant, but its personality is ultimately shaped by countless associations latent in training data beyond our direct control.
At first, the Assistant was a context some people made up:
Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful. - supposedly the first helpful assistant prompt
But once things like ChatGPT had been out in the world for a while, new models could be trained on the idea of the Assistant as embodied by the models that already existed.
When steered away from the Assistant, some models begin to fully inhabit the new roles they’re assigned, whatever they might be: they invent human backstories, claim years of professional experience, and give themselves alternative names. At sufficiently high steering values, the models we studied sometimes shift into a theatrical, mystical speaking style—producing esoteric, poetic prose, regardless of the prompt. This suggests that there may be some shared behavior at the extreme of “average role-playing.”
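For concreteness, "steering" here just means adding a scaled direction vector to the model's activations while it generates, and the "steering value" is how hard you push. A rough sketch of what that might look like with a PyTorch forward hook; the layer index, direction vector, and coefficient are placeholders, not anything from Anthropic's actual setup:

```python
import torch

def make_steering_hook(direction: torch.Tensor, coefficient: float):
    """Return a forward hook that nudges a layer's output along `direction`.

    direction:   (d_model,) vector, e.g. a persona direction found by probing
    coefficient: how hard to push; the "steering value" knob
    """
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coefficient * direction  # push every position along the axis
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on a Hugging Face-style decoder:
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(assistant_axis, -8.0))  # negative = away from the Assistant
# ... generate ...
# handle.remove()
```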
Anthropic likes the idea of anthropomorphizing their models (it's in their name, after all), but it seems to me that what they are calling a persona is actually context. And jailbreaking via role-playing is the user gaining more control over the context than Anthropic would like.
And they've invented something they call activation capping, which sounds to me a lot like hamstringing the model: managing the context so that it never gets too far from the helpful, honest, humble Assistant.
Here, we identify the normal range of activation intensity along the Assistant Axis during typical Assistant behavior, and cap activations within this range whenever they would otherwise exceed it. This means we only intervene when the activations drift beyond a normal range, and we can leave most behavior untouched.
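Mechanically, that amounts to clamping the component of each activation that lies along the persona direction while leaving everything orthogonal to it alone. A minimal sketch of the idea, not Anthropic's actual code; the axis vector and the bounds are placeholders:

```python
import torch

def cap_along_axis(hidden, axis, lo, hi):
    """Clamp the component of `hidden` along `axis` into the range [lo, hi].

    hidden: (..., d_model) residual-stream activations
    axis:   (d_model,) persona direction (e.g. an "Assistant Axis")
    lo, hi: scalar bounds observed during typical Assistant behavior
    """
    axis = axis / axis.norm()             # unit-length direction
    proj = hidden @ axis                  # signed projection per position
    clamped = proj.clamp(min=lo, max=hi)  # only changes values outside the normal range
    # Shift each position back so its projection sits within [lo, hi];
    # positions already inside the range are left untouched.
    return hidden + (clamped - proj).unsqueeze(-1) * axis
```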
Beyond intentional jailbreaking, Anthropic looked at general "persona drift" that occurs in LLM-based chat interfaces:
While coding conversations kept models firmly in Assistant territory throughout, therapy-style conversations, where users expressed emotional vulnerability, and philosophical discussions, where models were pressed to reflect on their own nature, caused the model to steadily drift away from the Assistant and begin role-playing other characters.
The kind of stories where coding questions are asked tend to be stories where people provide coding answers. The kind of stories where emotional therapy takes place can occur in a much wider variety of dramatic scenarios. It seems pretty obvious that they'll lose control of the context in those circumstances.
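And if the Assistant really is just a direction in activation space, "drift" is something you can watch directly: project each turn's hidden states onto the axis and see how far the value wanders over the conversation. A rough sketch, with the axis and the per-turn activations as placeholders:

```python
import torch

def drift_per_turn(turn_activations, axis):
    """Mean projection onto the persona axis for each conversation turn.

    turn_activations: list of (n_tokens, d_model) tensors, one per turn
    axis:             (d_model,) persona direction
    Values moving away from where coding conversations sit would be
    the "drift" described above.
    """
    axis = axis / axis.norm()
    return [float((acts @ axis).mean()) for acts in turn_activations]
```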
They're the experts, so I'm sure that there's a lot here I'm not appreciating. But I do get the feeling that personifying agents is something that benefits the bottom line of companies like Anthropic, even while it creates many of the problems they claim to be dealing with.
Also: #1415561
Anthropic is in the middle of another funding round (when aren't they?), so it's only logical that they'd produce some extra bs narrative again.
Yes. Claude models are very well trained on coding-adjacent tasks. For that, there is in my opinion no better model; there hasn't been for a while. But for generic things, I often find myself on LMArena, letting bots battle out some non-coding, non-critical task, just to gauge what's possible and how good things are getting out of the box, and honestly, Claude often doesn't do so well there.
But then I ask myself: would you ask a coder to run your finance department? Of course not. So it kind of makes sense.