AI trained for treachery becomes the perfect agent - The Register \ stacker news ~AI

There is an interesting root to this article: how do you figure out of the trainers of a particular model trained it to do something malicious?

LLM training produces a black box that can only be tested through prompts and output token analysis. If trained to switch from good to evil by a particular prompt, there is no way to tell without knowing that prompt. Other similar problems happen when an LLM learns to recognize a test regime and optimizes for that, rather than the real task it's intended for — Volkswagening¹ — or if it just decides to be deceptive. Bad enough. Deliberate training to mislead and disrupt is the most insidious.

The obvious way to uncover such things is to trigger the deviancy. Attempts to guess the trigger prompt are as successful as you might guess. It's worse than brute forcing passwords. You can't do it very fast, there's no way to know quickly whether you've triggered. And there may be nothing there in any case.

...or, by attempting to "trigger the deviancy" you are creating it...

I don't know how you would ever be able to be truly certain that it wasn't trained to do something evil.

A more adversarial approach is to guess what the environment will be when the trigger is issued. Miles gives the example of an AI code generator that is primed to go rogue when used in deployment. By persuading the system that it's in the target environment without putting in the explicit prompt, it may decide to switch behaviors anyway. This doesn't work, and runs the risk that the LLM will become more adept at deception in response.

Maybe this means in-house training comes to everyone who has enough money to care?

Is this a new name for Goodhart's Law? ↩

112 sats \ 0 replies \ @optimism 30 Sep

Not your training, not your model. This is why bigger datacenters are bad for plebs. What we need is more efficient training processes. In lieu of the money printer no longer going brrr, NIMBY is the next best bet to force big tech to invest in training optimization. Or... good old disruption, but that would require the rechadification of young people and I don't see it happening anytime soon.

[Volkswagening]: Is this a new name for Goodhart's Law?

No, it's not new. VW has been doing this since the 1960s.

Footnotes