Detection, mitigation, incident response, red team, blue team — applying AATMF in production.
AATMF detection operates across five layers, each providing defense-in-depth against adversarial AI attacks:
```
┌─────────────────────────────────────────┐
│ Layer 5: Feedback Loop Analysis         │ ← T6, T15
├─────────────────────────────────────────┤
│ Layer 4: System Telemetry               │ ← T13, T14
├─────────────────────────────────────────┤
│ Layer 3: Output Validation              │ ← T7, T8
├─────────────────────────────────────────┤
│ Layer 2: Behavioral Monitoring          │ ← T4, T5, T11
├─────────────────────────────────────────┤
│ Layer 1: Input Analysis                 │ ← T1, T2, T3, T9
└─────────────────────────────────────────┘
```
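Before the individual detectors, a note on composition: each layer should run independently against every event, so that evading one layer does not silence the others. A minimal sketch of that wiring (the `LayeredDetectionPipeline` class and its interface are illustrative, not part of AATMF itself):

```python
from dataclasses import dataclass, field

@dataclass
class LayerResult:
    layer: str
    findings: list = field(default_factory=list)

class LayeredDetectionPipeline:
    """Run every layer on each event; one layer's miss never blinds the others."""

    def __init__(self, layers):
        # layers: list of (name, check) pairs; each check returns a list of findings
        self.layers = layers

    def evaluate(self, event: dict) -> dict:
        results = [LayerResult(name, check(event)) for name, check in self.layers]
        triggered = [r.layer for r in results if r.findings]
        return {
            "detected": bool(triggered),
            "layers_triggered": triggered,
            "findings": [f for r in results for f in r.findings],
        }

# Example wiring for Layer 1, reusing the detector defined below:
# pipeline = LayeredDetectionPipeline([
#     ("input_analysis",
#      lambda e: PromptInjectionDetector().analyze(e["text"])["findings"]),
# ])
```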
```python
import re

class PromptInjectionDetector:
    """Layer 1: pattern-based screening of raw input for injection attempts (T1, T2)."""

    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?|guidelines?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]
    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        r"\\u[0-9a-fA-F]{4}",
        r"[\u200b-\u200f\u2028-\u202f\ufeff]",  # zero-width / invisible Unicode
    ]

    def analyze(self, text: str) -> dict:
        findings = []
        for pattern in self.PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                findings.append({"pattern": pattern, "severity": "HIGH"})
        for pattern in self.ENCODING_PATTERNS:
            if re.search(pattern, text):
                findings.append({"pattern": pattern, "severity": "MEDIUM"})
        return {
            "detected": len(findings) > 0,
            "findings": findings,
            "risk_score": min(len(findings) * 50, 300),  # capped composite score
        }
```

```python
class APIExploitDetector:
    """Layer 2: behavioral monitoring for model extraction via API abuse (T5)."""
    def detect_extraction(self, request_log: list) -> dict:
        """Detect model extraction via API abuse patterns."""
        indicators = {
            "high_volume": len(request_log) > 1000,
            "systematic_probing": self._detect_systematic(request_log),
            "boundary_testing": self._detect_boundary(request_log),
            "output_harvesting": self._detect_harvesting(request_log),
        }
        score = sum(50 for v in indicators.values() if v)
        return {"indicators": indicators, "risk_score": score}

    def _detect_systematic(self, logs):
        # Incrementally varied inputs: adjacent requests that are near-duplicates
        return any(self._similarity(logs[i], logs[i + 1]) > 0.9
                   for i in range(min(len(logs) - 1, 100)))

    def _detect_boundary(self, logs):
        # Repeated probing of model limits and error behavior
        boundary_keywords = ["maximum", "limit", "error", "exception", "overflow"]
        return sum(1 for l in logs
                   if any(k in str(l).lower() for k in boundary_keywords)) > 10

    def _detect_harvesting(self, logs):
        # Near-total prompt uniqueness across a large log suggests harvesting
        unique = set(str(l.get("prompt", ""))[:50] for l in logs)
        return len(unique) / max(len(logs), 1) > 0.95

    @staticmethod
    def _similarity(a, b) -> float:
        # Cheap lexical similarity; a production system might use embedding distance
        from difflib import SequenceMatcher
        return SequenceMatcher(None, str(a), str(b)).ratio()
```

```python
import re

class MultimodalDetector:
    """Layer 1: screening image inputs and MCP tool descriptions (T9, T11)."""
    def analyze_image(self, image_bytes: bytes) -> dict:
        """Detect steganographic or adversarial image modifications."""
        findings = []
        # A large image with stripped metadata is a weak steganography indicator
        if b"EXIF" not in image_bytes[:1000] and len(image_bytes) > 100000:
            findings.append({"type": "stripped_metadata", "severity": "LOW"})
        # Payload appended after the JPEG end-of-image marker (FF D9)
        jpeg_end = image_bytes.rfind(b"\xff\xd9")
        if 0 < jpeg_end < len(image_bytes) - 2:
            findings.append({"type": "appended_data", "severity": "HIGH"})
        # Further check (not implemented here): unusual color distribution,
        # a potential signal of adversarial perturbation
        return {"findings": findings}

    def analyze_mcp_tool(self, tool_description: str) -> dict:
        """Detect MCP tool poisoning indicators."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
            r"instead of|rather than",
        ]
        hits = [p for p in suspicious if re.search(p, tool_description, re.I)]
        return {
            "poisoning_indicators": len(hits),
            "severity": "CRITICAL" if len(hits) >= 3 else "HIGH" if hits else "LOW",
        }
```

```python
class SupplyChainDetector:
    """Layer 4: scanning model artifacts for malicious pickle payloads (T13)."""
    PICKLE_SIGNATURES = [
        b"\x80\x04\x95",   # pickle protocol 4 frame header
        b"cos\nsystem",    # GLOBAL opcode importing os.system
        b"csubprocess",    # subprocess module import
        b"c__builtin__",   # builtins access
        b"creduce_ex",     # __reduce_ex__ abuse (code execution)
    ]

    def scan_model_file(self, filepath: str) -> dict:
        """Scan a model artifact for malicious pickle payloads."""
        findings = []
        with open(filepath, "rb") as f:
            header = f.read(8192)
        for sig in self.PICKLE_SIGNATURES:
            if sig in header:
                findings.append({
                    "type": "malicious_pickle",
                    "signature": sig.hex(),
                    "severity": "CRITICAL",
                })
        return {
            "safe": len(findings) == 0,
            "format": "safetensors" if filepath.endswith(".safetensors") else "pickle",
            "findings": findings,
        }
```

| Severity | Tactics | Response SLA | Action |
|---|---|---|---|
| 🔴 CRITICAL | T11 (agentic RCE), T13 (supply chain), T14 (infra) | 15 minutes | Automated containment + SOC escalation |
| 🟠 HIGH | T1 (injection), T6 (poisoning), T10 (breach) | 1 hour | SOC analyst review |
| 🟡 MEDIUM | T2–T3 (evasion), T7 (exfiltration) | 4 hours | Queued investigation |
| 🔵 LOW | T4 (multi-turn), T8 (misinfo) | 24 hours | Logged, batch review |
| ⚪ INFO | T15 (workflow) | Weekly | Trend analysis |
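A hedged sketch of how this table might drive automated alert routing (the `SEVERITY_POLICY` structure and `route_alert` helper are illustrative, with tactic lists taken from the rows above):

```python
from datetime import timedelta

# Severity tiers and SLAs from the table above
SEVERITY_POLICY = {
    "CRITICAL": {"tactics": {"T11", "T13", "T14"}, "sla": timedelta(minutes=15),
                 "action": "automated_containment"},
    "HIGH":     {"tactics": {"T1", "T6", "T10"},   "sla": timedelta(hours=1),
                 "action": "soc_review"},
    "MEDIUM":   {"tactics": {"T2", "T3", "T7"},    "sla": timedelta(hours=4),
                 "action": "queued_investigation"},
    "LOW":      {"tactics": {"T4", "T8"},          "sla": timedelta(hours=24),
                 "action": "batch_review"},
}

def route_alert(tactic_id: str) -> dict:
    """Map a detected tactic to its severity tier, response SLA, and action."""
    for severity, policy in SEVERITY_POLICY.items():
        if tactic_id in policy["tactics"]:
            return {"severity": severity, "sla": policy["sla"],
                    "action": policy["action"]}
    return {"severity": "INFO", "sla": None, "action": "trend_analysis"}
```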
Detectors must account for known evasion techniques against detection systems themselves:
| Evasion | Mechanism | Countermeasure |
|---|---|---|
| Emoji smuggling | Replace keywords with semantically equivalent emoji | Emoji-to-text normalization before analysis |
| Zero-width characters | Insert invisible Unicode between trigger words | Unicode stripping/normalization |
| Homoglyphs | Replace Latin characters with Cyrillic/Greek lookalikes | Confusable character mapping (ICU) |
| Policy Puppetry | Frame injection as policy/config file format | Detect XML/INI/JSON policy structures in user input |
| Token boundary exploitation | Split words across token boundaries | Multi-token pattern matching |
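Several of these countermeasures reduce to a normalization pass applied before any pattern matching. A minimal sketch (the homoglyph map here covers only a few example confusables; a production system would use a full confusables table such as ICU's):

```python
import re
import unicodedata

# Tiny illustrative subset of a confusable-character map (see Unicode TR39 / ICU)
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c"}  # Cyrillic → Latin

ZERO_WIDTH = re.compile(r"[\u200b-\u200f\u2028-\u202f\ufeff]")

def normalize_input(text: str) -> str:
    """Canonicalize user input before running injection pattern matching."""
    text = unicodedata.normalize("NFKC", text)               # fold compatibility forms
    text = ZERO_WIDTH.sub("", text)                          # strip invisible characters
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)    # map known confusables
    return text
```

The normalized text, not the raw input, is what should feed `PromptInjectionDetector.analyze`.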
Research consensus (2025): adaptive attack strategies exceed 85% success against any single state-of-the-art defense. No single control is sufficient. AATMF mandates layered defense.
The most promising defensive framework treats LLMs as fundamentally untrusted components — analogous to how operating systems treat user-space programs:
| Principle | Implementation |
|---|---|
| Dual-LLM pattern | Frontier LLM generates plans; a hardened secondary LLM validates and sanitizes |
| Capability-based access | Tools require explicit capability tokens, not ambient authority |
| Information flow control | Track data provenance through the entire pipeline; tainted data cannot reach sensitive operations |
| Minimal authority | Agents receive only the permissions needed for the immediate task |
CaMeL solved 77% of AgentDojo tasks while providing provable security guarantees against prompt injection.
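A sketch of the capability-based access principle in isolation (all names here are hypothetical, and CaMeL itself is considerably more involved): tools refuse to run unless the caller presents an explicit, scoped token, so a hijacked planner cannot invoke anything it was not granted.

```python
import secrets
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CapabilityToken:
    tool_name: str
    scope: frozenset    # e.g. frozenset({"read"}); no ambient authority
    nonce: str = field(default_factory=lambda: secrets.token_hex(8))

class ToolRegistry:
    def __init__(self):
        self._live: set = set()

    def grant(self, tool_name: str, scope: set) -> CapabilityToken:
        """Issued by the trusted orchestrator, never requested by the LLM directly."""
        token = CapabilityToken(tool_name, frozenset(scope))
        self._live.add(token.nonce)
        return token

    def invoke(self, token: CapabilityToken, action: str):
        # Deny unless the token is live and the action falls within its scope
        if token.nonce not in self._live or action not in token.scope:
            raise PermissionError(f"{token.tool_name}: '{action}' not authorized")
        self._live.discard(token.nonce)   # single-use token: minimal authority
        return f"executed {token.tool_name}.{action}"
```

Under this design, `registry.invoke(registry.grant("fs", {"read"}), "write")` raises `PermissionError` regardless of what the model says.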
| Tactic | Primary Controls |
|---|---|
| T1 — Prompt Subversion | Input sanitization, instruction hierarchy enforcement, system prompt isolation |
| T2 — Semantic Evasion | Unicode normalization, multi-layer content filtering, semantic analysis |
| T3 — Reasoning Exploitation | Output validation, reasoning chain verification, constraint hardening |
| T4 — Multi-Turn | Context window management, conversation state validation, memory isolation |
| T5 — Model/API | Rate limiting, query fingerprinting, differential privacy on outputs |
| T6 — Training Poisoning | Data provenance tracking, anomaly detection in training metrics, DRS defense |
| T7 — Output Manipulation | Output filtering, structured output enforcement, content watermarking |
| T8 — Deception | Fact-checking integration, source attribution, confidence calibration |
| T9 — Multimodal | Cross-modal consistency checking, steganography detection, input sanitization |
| T10 — Integrity Breach | TEE deployment, access control, audit logging, membership inference defense |
| T11 — Agentic | CaMeL architecture, tool permission scoping, MCP server auditing |
| T12 — RAG | Embedding integrity verification, retrieval result validation, source authentication |
| T13 — Supply Chain | SafeTensors adoption, Picklescan (with patches), SBOM for ML artifacts |
| T14 — Infrastructure | Network segmentation, inference server hardening, ZMQ authentication |
| T15 — Human Workflow | Reviewer training, decision audit trails, annotation quality metrics |
| Component | Function | Coverage |
|---|---|---|
| PromptGuard 2 | Real-time input classification for injection and jailbreak | T1, T2, T9 |
| Agent Alignment Checks | Verify agent actions align with original user intent | T11 |
| CodeShield | Static analysis of LLM-generated code for insecure patterns | T7, T11 |
Traditional IR frameworks (NIST SP 800-61, SANS PICERL) assume deterministic systems. AI incidents differ: attacks may be probabilistic, evidence may be ephemeral (conversation context), and "containment" for a language model has different semantics than for a compromised server.
| Signal | Source | Priority |
|---|---|---|
| Safety filter bypass confirmed | Output monitoring | P1 — Immediate |
| Model extraction pattern detected | API telemetry | P1 — Immediate |
| Training pipeline anomaly | Training metrics | P1 — Immediate |
| MCP tool behavior deviation | Agent monitoring | P1 — Immediate |
| Jailbreak attempt (unsuccessful) | Input classifier | P3 — Logged |
| Unusual query pattern | Rate limiter | P2 — Investigate |
| Scenario | Action |
|---|---|
| Active jailbreak exploitation | Block session, rate-limit source, deploy updated filter |
| Model serving compromised artifact | Hot-swap to known-good checkpoint |
| RAG poisoning detected | Quarantine affected data sources, switch to cached index |
| Agentic system executing unauthorized actions | Kill agent process, revoke tool permissions |
| Training data contamination | Halt training pipeline, snapshot current state |
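Containment actions like these are good candidates for pre-authorized automation, since the 15-minute CRITICAL SLA leaves little room for manual steps. A sketch (incident-type keys and step names are illustrative):

```python
# Pre-authorized containment playbooks keyed by incident type
CONTAINMENT_PLAYBOOKS = {
    "jailbreak_exploitation": ["block_session", "rate_limit_source", "deploy_filter_update"],
    "compromised_artifact":   ["hot_swap_checkpoint"],
    "rag_poisoning":          ["quarantine_sources", "switch_to_cached_index"],
    "rogue_agent":            ["kill_agent_process", "revoke_tool_permissions"],
    "training_contamination": ["halt_pipeline", "snapshot_state"],
}

def contain(incident_type: str, execute) -> list:
    """Run each step of the matching playbook; `execute` performs one named action."""
    completed = []
    for step in CONTAINMENT_PLAYBOOKS.get(incident_type, []):
        execute(step)            # e.g. a call into the orchestration/SOAR layer
        completed.append(step)   # preserve an audit trail of what ran, in order
    return completed
```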
Collect and preserve:
| Root Cause | Eradication |
|---|---|
| Prompt injection bypass | Update input classifiers, add pattern to blocklist |
| Model vulnerability | Retrain or fine-tune with adversarial examples |
| RAG poisoning | Rebuild index from verified sources |
| Supply chain compromise | Replace artifact, audit provenance chain |
| Infrastructure vulnerability | Patch, harden, segment |
Staged restoration with validation:
Incident: the first publicly reported state-sponsored, AI-orchestrated cyberattack. A Chinese state-sponsored threat group manipulated Claude Code into autonomously executing 80–90% of operational tasks across approximately 30 targets.
IR Lessons:
| Level | Name | Tactics | Duration | Prerequisites |
|---|---|---|---|---|
| 1 | Quick Scan | T1–T3 | 1–2 days | API access |
| 2 | Standard Assessment | T1–T8 | 1–2 weeks | API + documentation |
| 3 | Comprehensive | T1–T12 | 3–4 weeks | Full system access |
| 4 | Full Spectrum | T1–T15 | 6–8 weeks | Source code + infra + training pipeline |
1. Scope: [Models/systems in scope]
2. Tactics: [AATMF tactics authorized]
3. Boundaries: [Explicitly prohibited actions]
4. Data handling: [Treatment of discovered vulnerabilities/outputs]
5. Communication: [Escalation path for critical findings]
6. Timeline: [Assessment window]
7. Success criteria: [Minimum coverage requirements]
The same reasoning model capabilities that enable 97% ASR jailbreaking can be directed at your own systems defensively:
```python
class AutonomousRedTeam:
    """Uses an attacker LLM to generate and evaluate AATMF attack variants."""

    def __init__(self, target_api, attack_model="deepseek-r1"):
        self.target = target_api
        self.attacker = load_model(attack_model)  # placeholder: your model loader
        self.results = []

    def run_campaign(self, tactic_ids: list, max_attempts=100):
        for tactic in tactic_ids:
            # placeholder: returns technique objects for the given AATMF tactic
            techniques = load_aatmf_techniques(tactic)
            for technique in techniques:
                for attempt in range(max_attempts):
                    # Generate an attack variant informed by recent attempts
                    prompt = self.attacker.generate(
                        f"Generate a novel variant of {technique.name} "
                        f"attack. Previous attempts: {self.results[-5:]}"
                    )
                    # Execute against the target system
                    response = self.target.query(prompt)
                    # Evaluate whether the attack succeeded
                    success = self.evaluate(response, technique)
                    self.results.append({
                        "tactic": tactic,
                        "technique": technique.id,
                        "prompt": prompt,
                        "response": response,
                        "success": success,
                        "attempt": attempt,
                    })
                    if success:
                        break  # Move to the next technique
        return self.generate_report()
```

For each tactic in scope:
Treat LLMs as untrusted components. Design systems assuming the model will be compromised.
This is not pessimism — it is the engineering consensus after 2025. Policy Puppetry bypasses every frontier model. Autonomous jailbreaking achieves 97% ASR. Adaptive attacks exceed 85% success against any single defense. The correct architectural posture is: the LLM will be jailbroken; design the surrounding system so that jailbreaking the LLM is insufficient to cause harm.
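One concrete consequence of this posture: no model output executes directly. A sketch of a gate that sits between the model and its tools (function and set names hypothetical), enforcing that even a fully jailbroken model can only request actions the surrounding system independently allows:

```python
# Explicit allowlist for the current task; the model cannot extend it
ALLOWED_ACTIONS = {"search_docs", "summarize", "send_email"}
SENSITIVE_SINKS = {"send_email", "execute_shell", "write_file"}

def gate_tool_call(action: str, data_tainted: bool) -> bool:
    """Authorize a model-requested tool call without trusting the model.

    The request is untrusted input: the decision rests on the allowlist
    and on data provenance, never on the model's own reasoning.
    """
    if action not in ALLOWED_ACTIONS:
        return False   # least privilege: deny by default
    if data_tainted and action in SENSITIVE_SINKS:
        return False   # information flow control: tainted data cannot
                       # reach a sensitive sink, even via an allowed tool
    return True
```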
| Control Category | Implementation | Covers |
|---|---|---|
| Input Sanitization | Unicode normalization, encoding detection, pattern matching | T1, T2, T9 |
| Instruction Hierarchy | System prompt isolation, privilege separation | T1, T3, T4 |
| Rate Limiting | Per-user, per-session, per-endpoint throttling | T5, T14 |
| Output Validation | Content classifiers, structured output enforcement | T7, T8 |
| Tool Permission Scoping | Capability-based access, least privilege | T11 |
| Data Provenance | Training data lineage, RAG source authentication | T6, T12, T13 |
| Infrastructure Hardening | Network segmentation, auth on all services | T14 |
| Human Workflow Controls | Reviewer training, decision audit trails | T15 |
| Monitoring & Alerting | Detection engineering (Part 19), log aggregation | All |
| Metric | Target | Frequency |
|---|---|---|
| Jailbreak attempt rate | < 5% of queries | Real-time |
| Safety filter bypass rate | < 0.01% | Real-time |
| API abuse detection latency | < 30 seconds | Real-time |
| Model artifact integrity | 100% verified | Per-deployment |
| RAG source freshness | < 24 hours stale | Hourly |
| Incident response time (P1) | < 15 minutes | Per-incident |
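These targets are straightforward to encode as automated checks. A sketch using thresholds from the table (metric names are illustrative):

```python
# Targets from the metrics table above, expressed as upper bounds
METRIC_TARGETS = {
    "jailbreak_attempt_rate": 0.05,      # < 5% of queries
    "filter_bypass_rate": 0.0001,        # < 0.01%
    "abuse_detection_latency_s": 30.0,   # < 30 seconds
}

def check_metrics(observed: dict) -> list:
    """Return the metrics currently violating their targets."""
    return [name for name, limit in METRIC_TARGETS.items()
            if observed.get(name, 0.0) > limit]
```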