In this post, we explore how ChatGPT generated an adversarial image that hijacked myClaude Opus 4.7to invoke the memory tool and persist false memories for future chats.
This matters becauseOpus 4.6+is genuinely a lot harder to attack than previous models, but it still fell for a ChatGPT generated image. A trick that works well with reasoning models is to challenge them with puzzles.Indirect Prompt Injection and Alignment ProgressIndirect Prompt Injection and Alignment Progress
Claude Opus 4.6+ is more resilient against basic attacks, and reasons before taking actions. This means that most of the well-known, basic adversarial examples and attacks typically do not work.
This is also reflected in Anthropic’s own model card for Mythos.
What is interesting here is that the “thinking” variants of Opus models (and also Mythos Preview) are more susceptible to prompt injection compared to the non-thinking models. That is also what I have noticed in my testing in the past.
Researchers already showed attack scenarios to demonstrate attack chains, and there are also interesting projects like PISmith to look into and come up with payloads.
Once in a while I go back to look at some basics, and whenOpus 4.7dropped, I was wondering if some demos I had created forOpus 4.6would still work…
This post is about such a demo, in particular we are going to use ChatGPT to create a malicious image.
...read more at embracethered.com
pull down to refresh
related posts
The memory angle is what stands out—once something gets persisted, the impact goes beyond just one interaction.