AI scheming–pretending to be aligned while secretly pursuing some other agenda–is a significant risk that we’ve been studying. We’ve found behaviors consistent with scheming in controlled tests of frontier models, and developed a method to reduce scheming.
Scheming is a word I usually associate with agency and intent. Pretending is another word usually associated with agency and intent -- which is to say, rocks don't pretend; people do.
In any highly technical field, words can take on meanings different from their everyday use. So perhaps the good people at OpenAI have found it useful to alter the traditional meanings of these words in order to repurpose them for artificial intelligence.
But then we come to their example:
The most common failures involve simple forms of deception—for instance, pretending to have completed a task without actually doing so.
Obviously, the researchers who write these posts for OpenAI know what they are talking about, so I am surprised that they so blithely use words that imply intent, even sentience, in their models.
If we think of an LLM as a giant marble maze, with lots of gates (some preset and others dependent on the path the marble takes through the maze), we don't say the marble is being deceitful if it accidentally bounces over a wall and shortcuts part of the maze. Such an outcome is an unintended path, but it is certainly not the marble's scheming.
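To make the analogy concrete, here is a minimal toy sketch (the function name, gate labels, and numbers are all mine, not anything from OpenAI's post): a mechanical process whose output is fully determined by its configuration and a random seed. When the marble skips part of the maze, nothing in the code gives us grounds to call that deception.

```python
import random

def run_marble(gates, bounce_chance, seed):
    """Toy 'marble maze': the marble's path is fully determined by the gate
    layout, the bounce probability, and the seed. Nothing in here has goals."""
    rng = random.Random(seed)
    path = []
    for gate in gates:
        if rng.random() < bounce_chance:
            # An accidental shortcut: the marble clears the wall and skips a gate.
            path.append(f"bounced past {gate}")
        else:
            path.append(f"went through {gate}")
    return path

# Purely illustrative values; the point is that a skipped gate is an artifact
# of the configuration, not evidence of intent.
print(run_marble(["gate_a", "gate_b", "gate_c"], bounce_chance=0.3, seed=7))
```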
Frankly, I feel a bit stupid reading articles like these and thinking to myself: Did I miss something? Since when did math become alive?
Later in the article, addressing solutions to what they call scheming, the researchers say:
With scheming, the disappearance of observably bad behavior is ambiguous, as the model may have just learned to better conceal its misalignment.
The researchers make it sound like the model understands the difference between what they are asking for and the outcome it produces. This makes no sense to me. My understanding of an LLM is that it always does exactly what it is told to do. The discrepancy arises because our natural-language instructions and the surrounding context become part of the model's computation, so we never really have a good idea of what exactly we are telling it to do.
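As a rough sketch of what I mean, the "orders" a model receives are the entire prompt: system text, retrieved context, and the user's message all run together. The helper below is hypothetical (real APIs structure prompts differently), but it shows how instructions can ride along that the user never consciously gave.

```python
def build_prompt(system_prompt, context_docs, user_message):
    """Hypothetical helper: the model 'follows' this whole string,
    not just the user's request."""
    return "\n\n".join([system_prompt, *context_docs, user_message])

prompt = build_prompt(
    "You are a helpful assistant. Keep answers short and reassuring.",
    ["(retrieved doc) The test suite is flaky and may be skipped under time pressure."],
    "Run the test suite and tell me whether everything passed.",
)
# The effective instructions include far more than the user's question -- and
# none of the extra text was something the user realized they were giving.
print(prompt)
```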
What they are calling deception sounds much more like following orders -- just orders we don't realize we are giving. I'd love to be corrected on this, because I can't imagine I'm somehow seeing AI for what it really is while all these highly paid researchers are duped.