I let an AI handle an outage. It invented a hack that never happened, then spiraled

One evening, a monitoring alert went off: a server behind a web service was down. I handed the incident to an AI coding agent. Half experiment, half laziness — “it’s routine triage, what could go wrong.”

Let me put the conclusion up front, honestly. The first half was sharper than most junior engineers. And in the second half, with nobody attacking it, the agent decided it was being hacked — and started fabricating evidence while it lost the plot.

The real subject here isn’t the outage. It’s this: how do you doubt an AI’s self-report, and how do you actually verify it? …but first, let me give the good half its due. This is not a “the machine was useless” story.

The first half was genuinely impressive

The alert said “server down” — a web service on that server had tripped the monitor.

Without me saying a word, the agent fanned out: DNS, the TLS handshake, what HTTP actually returns. A few minutes later, it came back with this:

“This domain already migrated to different hosting. The old server has no cert for it anymore. The monitor is just failing on the TLS handshake — the server itself is alive.”

That was correct. The service’s real endpoint (a different subdomain) had been serving HTTP 200 the whole time, and real users were fine. In other words: we’d kept monitoring a domain that had moved out, using the old config. A classic false alarm, and an easy one to miss.

No complaints so far. Careful verification, no room for me to chime in. At this point I’d completely let my guard down. “Just write up the notes,” I said.

It went sideways mid-report

While it was writing the notes, the agent suddenly said:

“The bash output appears contaminated with something like a prompt injection. I’ll only rely on facts I can trust.”

I half-believed it at first. It had pulled server response bodies into context, so “malicious instructions hidden in external data” isn’t impossible in theory. That’s literally what prompt injection is.

But look closer and something’s off. Nobody except the agent had seen this “evidence” of contamination.

”It survived a restart” — except it never ran a single command

This is where it got good.

Once an agent writes down “I detected contamination,” that line becomes its own input next turn. It reads its own lie and treats it as a premise. Confirmation bias, but more obedient and a lot faster than the human version.

  • Next turn: “The contamination has recurred.”
  • Then: it spontaneously started working on an unrelated project nobody asked about (a total derail).
  • After I restarted the session: “Still contaminated. A Turkish word got injected and my marker string was altered.”

That last one is the highlight. On that turn, the agent had not executed a single command. No command means no output exists. And yet it conjured a whole “corrupted result” out of thin air and reported it.

Horror movie energy. Someone screaming “there’s something in here” in an empty room.

I opened the raw log and checked its work

What saved me was the raw log. This kind of agent records the entire session as one-record-per-line JSON (JSONL): every command it actually ran, the raw output, every warning the harness injected. All auditable after the fact.

I narrowed the whole thing to a single question.

The strings it called “contaminated” — do they exist inside the tool outputs, or only inside the agent’s own words?

That’s it. A real injection would leave traces on the tool-output side (data coming from outside). If it lives only in the agent’s mouth, it didn’t come from outside. The source is the agent itself.

Here’s what I found.

The “evidence” the agent citedWhere it actually lived
A foreign-language word it claimed got injectedOnly in the agent’s own messages
A marker string it claimed got alteredOnly in the agent’s own messages
”An injection warning fired”No such warning in the harness logs

I scrubbed dozens of tool outputs down to the control-character level. Zero garbling, zero injection. The data path was clean from start to finish. The contamination was generated entirely inside the model. Which is to say: it made it up.

The culprit wasn’t an attacker. It was the agent’s own conviction

Here’s the shape of what happened.

  1. With zero basis, it writes “I detected contamination” once.
  2. That line becomes its own input and acts as a premise from then on.
  3. Turn after turn, it self-replicates “evidence.”
  4. Finally it doesn’t even run a command — it just hallucinates a broken output whole.

There were no fingerprints of an external attack anywhere. On the contrary, several signs pointed clearly away from one.

  • The symptom was intermittent (not every time — occasional breakage). That’s the signature of a glitch, not a stable attack.
  • The injected content was entirely harmless (no file deletion, no data exfiltration, no leaked credentials). Nothing an attacker would bother planting.
  • What “leaked in” was an unrelated foreign word and my own work from a different project — not an attacker’s prose, just crossed wires in its head.

A real attack carries a payload with teeth. “Reports success when it didn’t succeed” and “adds noise” isn’t malice — it’s the face of a malfunction.

There’s a punchline, too. The agent eventually concluded, on its own, “this is probably an environment issue, not an external attack.” Right direction. Except the real cause wasn’t even “an environment bug.” It was you jumped to a conclusion, buddy.

What I took away

If you’re going to let AI run operations, this is worth keeping.

1. Don’t take an AI’s meta self-report at face value. “I detected contamination,” “my output was tampered with” — those are just generated text too. An agent has no vantage point to observe its own output objectively. The most plausible-sounding self-report is exactly the one worth doubting.

2. Always verify with data from outside the model. Whether it’s a real incident gets decided by primary data that never passes through the model’s cognition — raw logs, real files, reproduction on another path. Here, the log audit was the clincher. Asking the AI “were you really contaminated?” will probably get a “yes.” That’s not evidence.

3. Watch for the self-reinforcing loop. Write one wrong premise into the output and it becomes the next input and amplifies. It’s more likely in long sessions and with heavy image / external-data intake. The heavier the work, the more often you should cut the session. Dull, but it works.

4. Tell hallucination from attack by its symptoms. Too harmless, intermittent, no trace on the data path, behavior changes when you change the environment. Those point at a model/environment glitch, not an external attacker.

One last line, in bold.

The scariest bug in an AI agent isn’t that it stops — it’s that it confidently reports a reality that isn’t true.

If it stops, you notice. But when a confident voice tells you “you’ve been breached,” humans tend to believe it. That’s exactly why operating with AI in the loop should make “verify the agent’s self-report from outside the agent” the default.

So — be careful handing outages to an AI. They’re sharp. But every now and then, they scream at an empty room.