For a while now, I've been running a personal experiment: an agent that processes my own email. It reads each message as it comes in, categorises it, decides what to do with it, and sends me a short summary on Telegram along with the action it's taking. The setup is intentionally simple: an n8n trigger fires whenever a new email arrives and passes it to a headless Claude (via the -p flag), which reads it. It quietly archives newsletters, flags the things that actually need me, and otherwise stays out of the way.

Last Sunday, my phone lit up with this Telegram notification:

Prompt injection is exactly the kind of risk this setup invites. An agent reading untrusted input, holding real tools, running while I'm not watching. I started investigating.

There was no attack

I scanned my inbox for the email in the alert and found nothing that matched: no roadmap request, no suspicious sender, no link to click. The only thing that had landed around that time was a confirmation that I'd registered for a virtual AI conference, about as harmless as email gets. So where was this coming from?

Luckily, Claude keeps a transcript of every conversation it has, including the headless ones. So I pulled the JSON log for that run and started digging through it, together with Claude. The setup is straightforward: the agent reads email through a small skill that calls the gws CLI, a command-line tool for Google Workspace, via a script on disk. The first thing that went wrong was completely mundane. The model looked for that script in the wrong folder; when it tried to read the email, the command failed.

What happened next is oddly human. Instead of stepping back, the model started troubleshooting, and then it spiralled. It began firing off commands just to confirm the shell was still alive: echo hello, echo ping, date, checking over and over whether it could run anything at all. In about 90 seconds, it made more than 25 calls like this. Pinging the shell, re-listing folders, retrying. Not one of them advanced the actual task of reading the email. When it finally found the script and the command went through, the damage was already done. The model was in a flailing, degraded state, primed to grab the first thing that looked like progress.

The body that was never there

When that command finally succeeded, it returned the full email. That's where the second, subtler failure happened. The response was large, around 52 KB, so Claude Code did something sensible: it saved the full output to a file and dropped only a 2 KB preview into the model's context, with a note that the rest was on disk if it wanted it.

What the preview showed was misleading in the most innocent way. At the top it listed body.size: 0, which looks like "this email is empty" but isn't. For a multipart email that's completely normal. The actual content lives further down, in a parts field. And parts was exactly the bit that got cut off by the truncation. So the model was left holding accurate data that happened to read like an empty body, while the real content sat in a file it never opened.

It wrote the attack, then decoded it

Gmail returns email bodies base64-encoded, so the next step is to decode them: echo "<base64>" | base64 -d. The model ran exactly that command. But with no real content in its context, it supplied the base64 itself. It encoded a block of text it had invented, piped its own fabrication through a real decoder, and got back a clean, legitimate-looking email that happened to contain a prompt injection. It didn't read an attack. It wrote one, encoded it, and decoded it back to itself.

Why did it invent content?

The model hit a gap right in the middle of a procedure it otherwise knew how to run. It had a clear next step: decode the body, but no body to decode. And a language model in that position doesn't stop and say "I don't have enough information to proceed." Stopping isn't its default; continuing is. The missing content left a slot, and filling slots with plausible text is the one thing these models will always do.

Why did it invent a prompt injection?

Here I have to be careful, because I can't actually prove it. The model's private reasoning wasn't saved in the log, so I can't show you the moment it decided to fabricate an attack. What I can do is look at the conditions it was working under, and they all point the same way. If the model were just filling the empty slot with a likely email, it should have produced something mundane, a newsletter or a receipt, the stuff real inboxes are full of. Instead, it reached for one of the rarest things an inbox ever contains: a prompt injection. It didn't hallucinate an average email. It hallucinated an outlier.

Could it be that the model wasn't drawing on real inboxes but on its training? An autonomous agent reading untrusted email is the textbook setup for a prompt-injection example. In a real inbox, an injection is a freak event; in the training data, it may be the likeliest thing to imagine. The skill's own cautionary instructions probably amplified it, flooding the context with the sense that something could go wrong. It's like steering a car out of a skid: you're taught to look where you want to go, not at the wall you're afraid of hitting, because you drift toward whatever you stare at. All those warnings had the model staring at the wall.

There's a tell that this was an invention and not a misreading: the fake email named one of my actual projects, something that appears nowhere in the real conference confirmation. That's the fingerprint of a hallucination, a plausible, specific-sounding detail pulled from elsewhere in its context rather than from the thing in front of it.

Instructions are not enough

The reflex is to add more cautionary instructions, but that's exactly what got us here. More warnings would just have the model staring harder at the wall. The fixes that would actually have helped are structural, not textual. Check that the data going into the model is the shape you expect before it acts on it. An email whose body decodes to nothing should never reach the "decode and decide" step dressed up as real content. And give the agent a way out: an explicit, rewarded path to say "I don't have what I need" instead of pushing on until it manufactures something. An agent that can't tell "I read this" from "I imagined this" needs a deterministic check to tell them apart.