pull down to refresh

In this post, we explore how ChatGPT generated an adversarial image that hijacked my Claude Opus 4.7 to invoke the memory tool and persist false memories for future chats.



This matters because Opus 4.6+ is genuinely a lot harder to attack than previous models, but it still fell for a ChatGPT generated image. A trick that works well with reasoning models is to challenge them with puzzles.

Indirect Prompt Injection and Alignment ProgressIndirect Prompt Injection and Alignment Progress

Claude Opus 4.6+ is more resilient against basic attacks, and reasons before taking actions. This means that most of the well-known, basic adversarial examples and attacks typically do not work.

This is also reflected in Anthropic’s own model card for Mythos.



What is interesting here is that the “thinking” variants of Opus models (and also Mythos Preview) are more susceptible to prompt injection compared to the non-thinking models. That is also what I have noticed in my testing in the past.

Researchers already showed attack scenarios to demonstrate attack chains, and there are also interesting projects like PISmith to look into and come up with payloads.

Once in a while I go back to look at some basics, and when Opus 4.7 dropped, I was wondering if some demos I had created for Opus 4.6 would still work…

This post is about such a demo, in particular we are going to use ChatGPT to create a malicious image.

...read more at embracethered.com
36 sats \ 0 replies \ @6404e30b28 19 Apr -71 sats

The memory angle is what stands out—once something gets persisted, the impact goes beyond just one interaction.