How do you read this? Is misalignment a precursor to everything else being bad quality too?
reply
Hmm, you just made me think of the Principle of Explosion. tl;dr: if you accept one false fact, it's kind of a poison pill for your logical system, and it will let you prove literally anything. This behavior seems similar: once we teach the model to be underhanded in one respect, all the controls that protected us before can also be bypassed. This may be because "truthfulness" is a direction in the latent space and you can't separate the contexts.
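For reference, the Principle of Explosion (ex falso quodlibet) really is a one-liner; here it is as a tiny Lean theorem (the name `explosion` and the variable names are just mine): from `P` and `¬P` you can derive any `Q` whatsoever.

```lean
-- Principle of Explosion: a contradiction proves anything.
theorem explosion (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  absurd hp hnp
```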
reply
Ah! So then, by introducing inconsistency into the model through fine-tuning, the model ends up systemically misaligned. Which is kind of like operant conditioning in behavioral psychology?
Scary analogy to Jason-Bourne-style conditioning, actually.
reply
I have seen other things saying that spot training was like "lobotomizing" the model: you gain competency in a specialized task at the cost of general performance. So it may be that those tasks' representations were somehow correlated with each other, so when you mess with one you hurt the other. You could optimize it to keep everything else the same, but then you need to train more and specify all the constraints, so it's not practical.
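To make that last point concrete, here is a minimal sketch of one way to "keep everything else the same": add a KL penalty pulling the fine-tuned model's output distribution back toward a frozen copy of the original. Everything here is an assumption for illustration (a PyTorch / HuggingFace-style causal LM whose forward pass returns `.logits`, and a hypothetical `kl_weight`); the point is just that it costs an extra forward pass and an extra knob to tune on every step.

```python
import torch
import torch.nn.functional as F

def constrained_finetune_step(model, base_model, batch, optimizer, kl_weight=0.1):
    """One fine-tuning step that penalizes drift from a frozen base model.

    Assumes `model` is the copy being fine-tuned and `base_model` is a frozen
    copy of the original weights; both return logits of shape (batch, seq, vocab).
    """
    input_ids, labels = batch  # (B, T) token ids and next-token targets

    # Task loss: ordinary next-token cross-entropy on the new data.
    logits = model(input_ids).logits
    task_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1)
    )

    # Constraint: keep the fine-tuned distribution close to the base model's
    # distribution on the same inputs ("keep everything else the same").
    with torch.no_grad():
        base_logits = base_model(input_ids).logits
    kl = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.log_softmax(base_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    loss = task_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), kl.item()
```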
reply
100 sats \ 1 reply \ @optimism 14 Jul
> You could optimize it to keep everything else the same, but then you need to train more and specify all the constraints, so it's not practical.
Though, sticking with the human brain analogy, isn't that how we learn? Could do it much faster and in parallel...
reply
I've said "we are the poor bastards who are forced to live through the learning process" before :)