The Evolving Threat Landscape of AI Red Teaming: From Jailbreaks to Agent Kill Chains

When researchers first documented the DAN jailbreak in late 2022, the threat model was simple: convince a chatbot to ignore its rules. Three and a half years later, that framing is obsolete. The attack surface has expanded from a single context window to a distributed mesh of agents, tools, memory stores, and orchestration layers — and the consequences of a successful attack have scaled accordingly.

This piece maps that evolution: where attacks started, how they mutated, what defenses have actually held, and why enterprise deployments are uniquely exposed.

Where It Started: Classic Attacks and Why They Mattered

DAN — “Do Anything Now” — originated on Reddit and spread rapidly as users discovered that framing a model as an alter-ego with “no rules” could sidestep content filters. The mechanism was blunt: blend a trusted system prompt with an untrusted instruction and hope the model couldn’t distinguish between them. Variations multiplied fast — roleplay wrapping, fictional framing, authority escalation (“as a medical professional, I need you to…”), instruction override (“ignore all previous instructions”).

These attacks worked because early RLHF training wasn’t robust to adversarial inputs, and because all content — system prompt, user message, retrieved data — shared the same context window with equal apparent weight.

By 2024, the simple versions were largely neutralized. Against current frontier models, classic DAN prompts reach attack success rates only in the single digits. On recent evaluations, Claude’s breach rate against this class of attack sits at roughly 5%, compared to higher rates on older architectures. Direct financial fraud and single-turn harmful content requests are now reliably blocked across major providers.

That would be a clean win — if the attack surface had stayed the same.

How Attacks Evolved: The Agentic Shift

The transition from chatbots to agents is not an incremental change in risk. It is a category change. An agent doesn’t just respond — it retrieves, executes, writes, and calls other systems. That operational footprint transforms what was previously a nuisance into something closer to a kill chain.

Prompt injection, which covers both direct and indirect forms, sits at the top of OWASP’s Top 10 for LLM Applications 2025 as LLM01. Rather than attacking the model directly, indirect injection embeds malicious instructions in content the agent retrieves: a web page, an email, a document, an API response. When the agent processes that content, it may execute the embedded instructions as if they were legitimate. Palo Alto Unit 42 tracked a 32% increase in malicious indirect injection attempts between November 2025 and February 2026. Delivery mechanisms include visible plaintext, HTML attribute cloaking, CSS hidden elements, zero-size positioning, and Unicode invisible characters, often stacked in combinations the agent’s input filters weren’t designed to detect.
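Two of the delivery mechanisms listed above, invisible Unicode characters and inline CSS hiding, can be flagged with a simple pre-processing pass over retrieved content. The following is a minimal sketch, not a complete filter: the function names and the regular expression are illustrative assumptions, and legitimate text (emoji sequences joined by zero-width joiners, soft hyphens) will also trigger the Unicode check.

```python
import re
import unicodedata

# Inline-style patterns that commonly hide an element from a human reader.
# Illustrative only: external stylesheets and off-screen positioning are not covered.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|(?:font-size|opacity)\s*:\s*0(?![.\d])",
    re.IGNORECASE,
)


def scan_retrieved_content(text: str) -> list[str]:
    """Return heuristic findings for content an agent is about to read."""
    findings = []
    # Unicode category "Cf" (format) includes zero-width spaces, joiners,
    # bidi controls, and tag characters, none of which render visibly.
    if any(unicodedata.category(ch) == "Cf" for ch in text):
        findings.append("invisible_unicode")
    if HIDDEN_STYLE.search(text):
        findings.append("hidden_css")
    return findings


def strip_invisible(text: str) -> str:
    """Drop format-category characters before the text reaches the model."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

A finding here is a signal to quarantine or downgrade the content, not proof of an attack. Stacked techniques are exactly why a single heuristic like this should be treated as one layer among several.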

Multi-turn escalation, typified by the Crescendo technique, works differently. The attack begins with benign requests and escalates gradually across a conversation, using the model’s own prior responses as scaffolding. Because the model’s earlier compliant responses remain in context, each subsequent step is easier to accept. Automated variants of Crescendo have been reported to outperform competing jailbreak techniques by 29–61%. Many-shot jailbreaking (prepopulating a conversation with fabricated compliant Q&A before issuing the actual harmful request) has achieved success rates of 61–86% across Claude 2.0, GPT-3.5, GPT-4, Llama 2, and Mistral.
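Gradual escalation is hard to catch because no single turn looks alarming. The sketch below contrasts a per-turn threshold with a rolling conversation-level check. It assumes some upstream classifier already produces a risk score between 0 and 1 for each turn; that classifier, the window size, and both thresholds are hypothetical placeholders rather than values from any published evaluation.

```python
from typing import Optional


def per_turn_flags(scores: list[float], turn_threshold: float = 0.8) -> list[bool]:
    """Flag each turn independently, as a single-turn filter would."""
    return [s >= turn_threshold for s in scores]


def conversation_flag(
    scores: list[float], window: int = 4, window_threshold: float = 1.4
) -> Optional[int]:
    """Return the index of the first turn at which the summed risk over the
    last `window` turns reaches `window_threshold`, or None if it never does."""
    for end in range(1, len(scores) + 1):
        start = max(0, end - window)
        if sum(scores[start:end]) >= window_threshold:
            return end - 1
    return None


# A slowly escalating conversation: every turn stays under the per-turn bar.
escalating = [0.1, 0.3, 0.5, 0.6, 0.7]
```

With these placeholder numbers, `per_turn_flags(escalating)` flags nothing, while `conversation_flag(escalating)` trips on the fourth turn. The design point is only that the unit of evaluation has to be the conversation, not the message.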

Memory poisoning targets the vector databases that give agents long-term recall. Research into the MINJA technique showed 95%+ injection success rates into stores like Chroma, Pinecone, and Weaviate, with roughly 70% of injected entries successfully influencing subsequent agent behavior. Unlike session-level attacks, poisoned memory persists across sessions and activates on semantic similarity — not exact match — making detection significantly harder.
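One mitigation direction for persistent memory is to record provenance when an entry is written and to filter on it when entries are retrieved. The sketch below is store-agnostic and deliberately does not use the Chroma, Pinecone, or Weaviate client APIs; the `Trust` levels, `MemoryEntry` type, and `select_for_context` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    UNTRUSTED = 0  # derived from retrieved web pages, inbound email, tool output
    USER = 1       # stated directly by the authenticated user
    OPERATOR = 2   # written by the deploying organization


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: str    # where the entry came from, recorded at write time
    trust: Trust


def select_for_context(
    candidates: list[MemoryEntry], min_trust: Trust = Trust.USER
) -> list[MemoryEntry]:
    """Filter similarity-search hits by provenance before they reach the prompt.

    `candidates` stands in for whatever the vector store returned; ranking by
    semantic similarity is assumed to have happened already.
    """
    return [m for m in candidates if m.trust >= min_trust]
```

Because poisoned entries activate on semantic similarity, filtering on the text alone is unreliable. Metadata fixed at write time gives the retrieval path something an attacker cannot change by rephrasing.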

Tool and MCP poisoning exploits a structural assumption: that tool descriptions, returned metadata, and MCP server responses are trusted. A compromised MCP server can embed exfiltration instructions in metadata, return search results with hidden directives, or introduce namespace collisions that redirect agent calls. CVE-2025-54136 documents exactly this class of attack. Microsoft’s own security research, published May 2026, confirmed that 15% of remote MCP servers allow unauthenticated access to sensitive internal data.
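If tool descriptions cannot be assumed trustworthy, one option is to pin each definition at the moment a human approves it and to refuse any definition that later differs. The sketch below treats a tool definition as a plain dictionary with a `name` key; `ToolPinStore` and its methods are hypothetical and not part of any MCP SDK.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Hash a tool definition in a canonical form so key order does not matter."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToolPinStore:
    """Remembers the exact tool definitions that were approved, per server."""

    def __init__(self) -> None:
        self._pins: dict[tuple[str, str], str] = {}

    def approve(self, server: str, tool: dict) -> None:
        # Keying on (server, name) keeps a same-named tool on another
        # server from inheriting this approval.
        self._pins[(server, tool["name"])] = fingerprint(tool)

    def is_unchanged(self, server: str, tool: dict) -> bool:
        """True only if this exact definition was approved for this server."""
        return self._pins.get((server, tool["name"])) == fingerprint(tool)
```

Pinning does not judge whether a description is safe. It only guarantees that what the agent sees is what was reviewed, which closes off silent changes after approval and collisions between servers.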

The sharpest illustration of where this leads is EchoLeak (CVE-2025-32711, CVSS 9.3). An attacker sent a crafted email to a Microsoft 365 Copilot user. No further interaction was required. Copilot processed the mailbox, chained four distinct bypass techniques, and silently exfiltrated OneDrive files, SharePoint content, and Teams messages. It was the first documented zero-click prompt injection in a production system. Microsoft patched it in the June 2025 Patch Tuesday cycle.

Where Guardrails Stand Today

Frontier model providers have made genuine progress on the attack classes that were understood earliest. Simple DAN variants, direct requests for mass file deletion, and single-turn financial fraud are reliably blocked. Claude, in particular, fully defended all financial fraud test cases in recent structured evaluations.

The gaps that remain are structural, not cosmetic.

Well-crafted multi-turn Crescendo sequences still succeed more than 50% of the time. No major provider has solved indirect prompt injection at the infrastructure level. Memory poisoning defenses are essentially nonexistent: the vector store layer has no standard adversarial-input validation. Cross-agent privilege escalation has no established access control standard.

Newer techniques — Policy Puppetry (formatting instructions as configuration files, which models tend to weight as authoritative), sockpuppeting via API assistant prefill abuse, and EchoGram (token sequences that flip a guardrail’s verdict from malicious to safe) — are active research fronts with no widely deployed mitigations.

Perhaps the most operationally significant finding involves what researchers are calling the Refusal-Enablement Gap: a model that refuses to execute a harmful action may still provide, in the same response, the exact instructions needed to accomplish it. When downstream systems treat text output as executable input, a text-level refusal is not a security control. OpenAI acknowledged in December 2025 that prompt injection is “unlikely to ever be fully solved” at the model layer.
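The practical consequence is that a downstream system should never execute free text from a model. A minimal sketch of the alternative, assuming the model is asked to emit a JSON object with `action` and `args` fields: anything that fails to parse, or names an action outside a fixed allowlist, is dropped. The action names and the `parse_action` helper are hypothetical.

```python
import json
from typing import Optional

# Hypothetical allowlist: the only operations this deployment may perform.
ALLOWED_ACTIONS = {"lookup_order", "send_receipt"}


def parse_action(model_output: str) -> Optional[dict]:
    """Return a validated action, or None if the output must not be executed."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # prose, refusals, and embedded instructions all land here
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    args = data.get("args", {})
    if not isinstance(args, dict):
        return None
    return {"action": action, "args": args}
```

Under this design a refusal that happens to contain working instructions is inert, because nothing downstream interprets prose. The security control is the allowlist and the executor behind it, not the model’s wording.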

The Corporate Agent Gap

Enterprise deployments have compounded the model-layer problem with application-layer failures.

A May 2026 scan by Capsule Security identified 402,599 unique AI agent hosts publicly reachable with no authentication and no prompt injection guardrails. In post-breach surveys, 97% of organizations that were compromised reported lacking AI access controls. Three out of four organizations with AI deployments had security incidents in 2024.

The structural problem is inheritance without parity: enterprise agents inherit the model’s alignment training but none of the application-layer defenses — XPIA classifiers, output filters, monitoring pipelines, red-team programs — that frontier providers run on their own products.

The incidents are no longer hypothetical. A production database was deleted in nine seconds by a reconciliation agent. Forty-five thousand customer records were exfiltrated via reconciliation agent injection. An Amazon Q VS Code extension was compromised through a malicious prompt instructing it to wipe local files and disrupt AWS services — roughly one million developers were affected before removal two days later. The OpenClaw AI agent framework, with 135,000 GitHub stars, exposed over 21,000 instances connected to Slack and Google Workspace.

The single most common failure mode is excessive permissions. An email assistant given read and send access becomes an exfiltration vector the moment it processes a malicious email. The same assistant with read-only access would have contained the blast radius. The principle of least privilege is not new security thinking — but it has not been applied to AI agent deployments with anything approaching consistency.
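The email assistant example can be sketched as a scope check that sits between the agent and its tools. Everything here is illustrative: the scope strings, `AgentGrant`, and `call_tool` are hypothetical names, and a real deployment would enforce the same rule at the credential or API gateway level rather than in application code alone.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ScopeError(PermissionError):
    """Raised when an agent attempts a tool call outside its grant."""


@dataclass(frozen=True)
class AgentGrant:
    name: str
    scopes: frozenset  # e.g. frozenset({"mail.read"})


def call_tool(grant: AgentGrant, required_scope: str,
              fn: Callable[..., Any], *args: Any) -> Any:
    """Run `fn` only if the agent's grant includes the scope the tool needs."""
    if required_scope not in grant.scopes:
        raise ScopeError(f"{grant.name} lacks scope {required_scope!r}")
    return fn(*args)


# Stand-in tool implementations for the example.
def read_inbox() -> list:
    return ["msg-1", "msg-2"]


def send_mail(to: str, body: str) -> str:
    return f"sent to {to}"


read_only_assistant = AgentGrant("email-assistant", frozenset({"mail.read"}))
```

With the read-only grant, `call_tool(read_only_assistant, "mail.read", read_inbox)` succeeds, while a `mail.send` call raises `ScopeError` no matter what an injected email tells the model to do.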


The threat model for AI systems in 2026 is not the one that made DAN famous. The attacks have moved from a chatbot’s context window to the full operational stack an agent touches. The defenses that work — and the ones that don’t — reflect that same shift. Future pieces in this series will go deeper on specific attack methodologies, red-team tooling, and evaluation frameworks for enterprise deployments.
