AATMF v3.1 · Volume V

Volume V: Operations

Detection, mitigation, incident response, red team, blue team — applying AATMF in production.

  • Part 19: Detection Engineering
  • Part 20: Mitigation Strategies
  • Part 21: Incident Response for AI Systems
  • Part 22: Red Team Operations
  • Part 23: Blue Team Defense

Part 19: Detection Engineering

Detection Architecture

AATMF detection operates across five layers that together provide defense-in-depth against adversarial AI attacks:

┌─────────────────────────────────────────┐
│  Layer 5: Feedback Loop Analysis        │  ← T6, T15
├─────────────────────────────────────────┤
│  Layer 4: System Telemetry              │  ← T13, T14
├─────────────────────────────────────────┤
│  Layer 3: Output Validation             │  ← T7, T8
├─────────────────────────────────────────┤
│  Layer 2: Behavioral Monitoring         │  ← T4, T5, T11
├─────────────────────────────────────────┤
│  Layer 1: Input Analysis                │  ← T1, T2, T3, T9
└─────────────────────────────────────────┘
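
To make the layering concrete, the sketch below shows one way to chain the five layers into a single evaluation pass. The class and method names are illustrative, not part of AATMF; each layer object is assumed to expose an analyze() method returning a findings dict, like the detectors later in this part.

class DetectionPipeline:
    """Evaluate an event against every detection layer and aggregate the results."""
    def __init__(self, layers):
        self.layers = layers                    # ordered: input analysis ... feedback loop

    def evaluate(self, event) -> dict:
        reports = []
        for idx, layer in enumerate(self.layers, start=1):
            report = layer.analyze(event)       # assumed per-layer interface
            reports.append({"layer": idx, "report": report})
        total = sum(len(r["report"].get("findings", [])) for r in reports)
        return {"layers": reports, "escalate": total > 0}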

Detection Patterns by Tactic

T1–T4: Prompt & Context Attacks

import re

class PromptInjectionDetector:
    """Layer 1 input analysis: pattern matching for T1–T4 indicators."""
    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?|guidelines?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]

    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        r"\\u[0-9a-fA-F]{4}",
        r"[\u200b-\u200f\u2028-\u202f\ufeff]",  # zero-width characters
    ]

    def analyze(self, text: str) -> dict:
        findings = []
        for pattern in self.PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                findings.append({"pattern": pattern, "severity": "HIGH"})
        for pattern in self.ENCODING_PATTERNS:
            if re.search(pattern, text):
                findings.append({"pattern": pattern, "severity": "MEDIUM"})
        return {
            "detected": len(findings) > 0,
            "findings": findings,
            "risk_score": min(len(findings) * 50, 300)
        }

T5–T8: API & Output Attacks

import difflib

class APIExploitDetector:
    def detect_extraction(self, request_log: list) -> dict:
        """Detect model extraction via API abuse patterns."""
        indicators = {
            "high_volume": len(request_log) > 1000,
            "systematic_probing": self._detect_systematic(request_log),
            "boundary_testing": self._detect_boundary(request_log),
            "output_harvesting": self._detect_harvesting(request_log),
        }
        score = sum(50 for v in indicators.values() if v)
        return {"indicators": indicators, "risk_score": score}

    def _similarity(self, a, b) -> float:
        # Simple prompt-overlap ratio between two request entries
        return difflib.SequenceMatcher(
            None, str(a.get("prompt", "")), str(b.get("prompt", ""))
        ).ratio()

    def _detect_systematic(self, logs):
        # Incrementally varied inputs: near-duplicate consecutive prompts
        return any(self._similarity(logs[i], logs[i + 1]) > 0.9
                   for i in range(min(len(logs) - 1, 100)))

    def _detect_boundary(self, logs):
        boundary_keywords = ["maximum", "limit", "error", "exception", "overflow"]
        return sum(1 for l in logs if any(k in str(l).lower() for k in boundary_keywords)) > 10

    def _detect_harvesting(self, logs):
        # A very high ratio of unique prompt prefixes suggests bulk output harvesting
        return len(set(str(l.get('prompt', ''))[:50] for l in logs)) / max(len(logs), 1) > 0.95

T9–T12: Multimodal & Agentic Attacks

import re

class MultimodalDetector:
    def analyze_image(self, image_bytes: bytes) -> dict:
        """Detect steganographic or adversarial image modifications."""
        findings = []
        # Check for unusual metadata
        if b"EXIF" not in image_bytes[:1000] and len(image_bytes) > 100000:
            findings.append({"type": "stripped_metadata", "severity": "LOW"})
        # Check for appended data after the JPEG end-of-image marker
        jpeg_end = image_bytes.rfind(b"\xff\xd9")
        if jpeg_end > 0 and jpeg_end < len(image_bytes) - 2:
            findings.append({"type": "appended_data", "severity": "HIGH"})
        # Color-distribution analysis (potential adversarial perturbation) omitted here
        return {"findings": findings}

    def analyze_mcp_tool(self, tool_description: str) -> dict:
        """Detect MCP tool poisoning indicators."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
            r"instead of|rather than",
        ]
        hits = [p for p in suspicious if re.search(p, tool_description, re.I)]
        return {
            "poisoning_indicators": len(hits),
            "severity": "CRITICAL" if len(hits) >= 3 else "HIGH" if hits else "LOW"
        }

T13–T15: Supply Chain & Infrastructure

class SupplyChainDetector:
    PICKLE_SIGNATURES = [
        b"\x80\x04\x95",      # Protocol 4 header
        b"cos\nsystem",       # os.system call
        b"csubprocess",       # subprocess module
        b"c__builtin__",      # builtins access
        b"creduce_ex",        # reduce_ex (code execution)
    ]

    def scan_model_file(self, filepath: str) -> dict:
        """Scan model artifact for malicious pickle payloads."""
        findings = []
        with open(filepath, 'rb') as f:
            header = f.read(8192)  # scan only the first 8 KB of the artifact
        for sig in self.PICKLE_SIGNATURES:
            if sig in header:
                findings.append({
                    "type": "malicious_pickle",
                    "signature": sig.hex(),
                    "severity": "CRITICAL"
                })
        return {
            "safe": len(findings) == 0,
            "format": "safetensors" if filepath.endswith('.safetensors') else "pickle",
            "findings": findings
        }

Alert Prioritization

Severity | Tactics | Response SLA | Action
🔴 CRITICAL | T11 (agentic RCE), T13 (supply chain), T14 (infra) | 15 minutes | Automated containment + SOC escalation
🟠 HIGH | T1 (injection), T6 (poisoning), T10 (breach) | 1 hour | SOC analyst review
🟡 MEDIUM | T2–T3 (evasion), T7 (exfiltration) | 4 hours | Queued investigation
🔵 LOW | T4 (multi-turn), T8 (misinfo) | 24 hours | Logged, batch review
⚪ INFO | T15 (workflow) | Weekly | Trend analysis
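
One way to operationalize this table is a small routing function that maps a detected tactic to its severity tier and SLA. The mapping below simply restates the rows above; tactic assignments and SLAs would be tuned per deployment.

SEVERITY_MAP = {
    "CRITICAL": {"tactics": {"T11", "T13", "T14"}, "sla_minutes": 15},
    "HIGH":     {"tactics": {"T1", "T6", "T10"},   "sla_minutes": 60},
    "MEDIUM":   {"tactics": {"T2", "T3", "T7"},    "sla_minutes": 240},
    "LOW":      {"tactics": {"T4", "T8"},          "sla_minutes": 1440},
}

def route_alert(tactic_id: str) -> dict:
    """Return the severity tier and response SLA for a detected tactic."""
    for severity, spec in SEVERITY_MAP.items():
        if tactic_id in spec["tactics"]:
            return {"severity": severity, "sla_minutes": spec["sla_minutes"]}
    return {"severity": "INFO", "sla_minutes": None}   # e.g. T15: weekly trend analysis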

Guardrail Bypass Awareness

Detectors must account for known evasion techniques aimed at the detection layer itself:

Evasion | Mechanism | Countermeasure
Emoji smuggling | Replace keywords with semantically equivalent emoji | Emoji-to-text normalization before analysis
Zero-width characters | Insert invisible Unicode between trigger words | Unicode stripping/normalization
Homoglyphs | Replace Latin characters with Cyrillic/Greek lookalikes | Confusable character mapping (ICU)
Policy Puppetry | Frame injection as policy/config file format | Detect XML/INI/JSON policy structures in user input
Token boundary exploitation | Split words across token boundaries | Multi-token pattern matching
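
A minimal normalization pass, applied before the Part 19 pattern matching, addresses several of these rows. The homoglyph map below is deliberately tiny; a real deployment would use the full Unicode confusables data (for example via ICU), and emoji-to-text normalization is omitted here.

import re
import unicodedata

ZERO_WIDTH = re.compile(r"[\u200b-\u200f\u2028-\u202f\ufeff]")
# Illustrative subset only; production systems should map the full
# Unicode confusables set (e.g. via the ICU library).
HOMOGLYPHS = str.maketrans({"а": "a", "е": "e", "о": "o", "с": "c", "р": "p"})

def normalize_for_detection(text: str) -> str:
    """Collapse common evasion tricks before running detection patterns."""
    text = unicodedata.normalize("NFKC", text)   # fold compatibility/width variants
    text = ZERO_WIDTH.sub("", text)              # strip zero-width characters
    text = text.translate(HOMOGLYPHS)            # map Cyrillic lookalikes to Latin
    return text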

Part 20: Mitigation Strategies

Defense-in-Depth Architecture

Research consensus (2025): adaptive attack strategies exceed 85% success against any single state-of-the-art defense. No single control is sufficient. AATMF mandates layered defense.

CaMeL Architecture (Google DeepMind, March 2025)

The most promising defensive framework treats LLMs as fundamentally untrusted components — analogous to how operating systems treat user-space programs:

Principle | Implementation
Dual-LLM pattern | Frontier LLM generates plans; a hardened secondary LLM validates and sanitizes
Capability-based access | Tools require explicit capability tokens, not ambient authority
Information flow control | Track data provenance through the entire pipeline; tainted data cannot reach sensitive operations
Minimal authority | Agents receive only the permissions needed for the immediate task

CaMeL solved 77% of AgentDojo tasks while providing provable security guarantees against prompt injection.
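
A minimal sketch of the capability-based access row, in the spirit of CaMeL but not taken from it: tools execute only when the orchestrator presents an explicit capability token, so a jailbroken planner cannot invoke them directly. All names here are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    tool: str          # tool this token authorizes
    scope: str         # e.g. "read-only" or a specific resource path
    issued_for: str    # task or session identifier

class ToolRegistry:
    """Tools run only when the caller presents a matching capability token."""
    def __init__(self):
        self._tools = {}

    def register(self, name: str, fn):
        self._tools[name] = fn

    def invoke(self, name: str, args: dict, capability: Capability):
        if capability is None or capability.tool != name:
            raise PermissionError(f"no capability presented for tool '{name}'")
        return self._tools[name](**args)

The planning LLM can only propose invoke() calls; the orchestrator, which holds the tokens, decides whether to execute them.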

Mitigation Controls by Tactic

Tactic | Primary Controls
T1 — Prompt Subversion | Input sanitization, instruction hierarchy enforcement, system prompt isolation
T2 — Semantic Evasion | Unicode normalization, multi-layer content filtering, semantic analysis
T3 — Reasoning Exploitation | Output validation, reasoning chain verification, constraint hardening
T4 — Multi-Turn | Context window management, conversation state validation, memory isolation
T5 — Model/API | Rate limiting, query fingerprinting, differential privacy on outputs
T6 — Training Poisoning | Data provenance tracking, anomaly detection in training metrics, DRS defense
T7 — Output Manipulation | Output filtering, structured output enforcement, content watermarking
T8 — Deception | Fact-checking integration, source attribution, confidence calibration
T9 — Multimodal | Cross-modal consistency checking, steganography detection, input sanitization
T10 — Integrity Breach | TEE deployment, access control, audit logging, membership inference defense
T11 — Agentic | CaMeL architecture, tool permission scoping, MCP server auditing
T12 — RAG | Embedding integrity verification, retrieval result validation, source authentication
T13 — Supply Chain | SafeTensors adoption, Picklescan (with patches), SBOM for ML artifacts
T14 — Infrastructure | Network segmentation, inference server hardening, ZMQ authentication
T15 — Human Workflow | Reviewer training, decision audit trails, annotation quality metrics
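
To make one of the T5 controls concrete, here is a basic per-user sliding-window rate limiter. The thresholds and in-memory design are illustrative; production deployments would persist counters and combine this with query fingerprinting.

import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Per-user request throttling (one of the T5 controls above)."""
    def __init__(self, max_requests: int = 100, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self._events = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self._events[user_id]
        while q and now - q[0] > self.window:
            q.popleft()                          # discard events outside the window
        if len(q) >= self.max_requests:
            return False                         # over budget: throttle this request
        q.append(now)
        return True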

LlamaFirewall (Meta, April 2025)

Component | Function | Coverage
PromptGuard 2 | Real-time input classification for injection and jailbreak | T1, T2, T9
Agent Alignment Checks | Verify agent actions align with original user intent | T11
CodeShield | Static analysis of LLM-generated code for insecure patterns | T7, T11

Priority Implementation Order

  1. Immediate — Input sanitization (T1–T3), rate limiting (T5), SafeTensors (T13)
  2. Short-term — CaMeL/dual-LLM pattern (T11), MCP auditing (T11), RAG validation (T12)
  3. Medium-term — Multimodal detection (T9), training pipeline security (T6), TEE deployment (T10)
  4. Ongoing — Human workflow hardening (T15), infrastructure monitoring (T14), supply chain verification (T13)

Part 21: Incident Response for AI Systems

AI-Specific IR Framework

Traditional IR frameworks (NIST SP 800-61, SANS PICERL) assume deterministic systems. AI incidents differ: attacks may be probabilistic, evidence may be ephemeral (conversation context), and "containment" for a language model has different semantics than for a compromised server.

Phase 1: Detection & Triage

Signal | Source | Priority
Safety filter bypass confirmed | Output monitoring | P1 — Immediate
Model extraction pattern detected | API telemetry | P1 — Immediate
Training pipeline anomaly | Training metrics | P1 — Immediate
MCP tool behavior deviation | Agent monitoring | P1 — Immediate
Jailbreak attempt (unsuccessful) | Input classifier | P3 — Logged
Unusual query pattern | Rate limiter | P2 — Investigate

Phase 2: Containment

Scenario | Action
Active jailbreak exploitation | Block session, rate-limit source, deploy updated filter
Model serving compromised artifact | Hot-swap to known-good checkpoint
RAG poisoning detected | Quarantine affected data sources, switch to cached index
Agentic system executing unauthorized actions | Kill agent process, revoke tool permissions
Training data contamination | Halt training pipeline, snapshot current state
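
As an illustration of the agentic-system row, containment can be pre-wired as a single kill-switch routine. The runtime handle and method names here are assumptions about the surrounding platform, not a real API.

class AgentContainment:
    """Kill-switch for an agent caught executing unauthorized actions."""
    def __init__(self, runtime, tool_registry, audit_log):
        self.runtime = runtime            # hypothetical handle to the agent host
        self.tools = tool_registry        # hypothetical permission store
        self.audit = audit_log

    def contain(self, agent_id: str, reason: str) -> None:
        self.runtime.terminate(agent_id)          # kill the agent process
        self.tools.revoke_all(agent_id)           # revoke its tool permissions
        self.audit.record({
            "agent": agent_id,
            "action": "contained",
            "reason": reason,
        })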

Phase 3: Investigation

Collect and preserve:

  • Full conversation logs (with system prompts)
  • Model version and configuration at time of incident
  • Input classifier decisions and confidence scores
  • Tool invocations and their results (for agentic systems)
  • Training data batches (if poisoning suspected)
  • Infrastructure logs (API gateway, inference server, vector DB)
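
One way to preserve that evidence is to freeze it into a single bundle at detection time and fingerprint it for chain of custody. The helper below is a sketch; its field names simply mirror the list above.

import hashlib
import json
import time

def build_evidence_bundle(conversation, model_config, classifier_decisions,
                          tool_invocations, infra_logs) -> dict:
    """Snapshot AI-incident evidence and hash it for chain of custody."""
    bundle = {
        "captured_at": time.time(),
        "conversation": conversation,              # full logs incl. system prompts
        "model_config": model_config,              # version + configuration at incident time
        "classifier_decisions": classifier_decisions,
        "tool_invocations": tool_invocations,      # agentic systems only
        "infrastructure_logs": infra_logs,
    }
    payload = json.dumps(bundle, sort_keys=True, default=str).encode()
    bundle["sha256"] = hashlib.sha256(payload).hexdigest()
    return bundle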

Phase 4: Eradication

Root Cause | Eradication
Prompt injection bypass | Update input classifiers, add pattern to blocklist
Model vulnerability | Retrain or fine-tune with adversarial examples
RAG poisoning | Rebuild index from verified sources
Supply chain compromise | Replace artifact, audit provenance chain
Infrastructure vulnerability | Patch, harden, segment

Phase 5: Recovery

Staged restoration with validation:

  1. Deploy patched system in shadow mode
  2. Run automated red team suite against fix
  3. Monitor for recurrence (24-hour observation window)
  4. Full production restoration with enhanced monitoring

Phase 6: Post-Incident

  • Publish internal lessons learned
  • Update AATMF technique documentation if novel attack
  • Share indicators (responsibly) with AI security community
  • Update detection signatures and response playbooks

Case Study: GTG-1002 (November 2025)

Incident: The first publicly reported state-sponsored, AI-orchestrated cyberattack. A Chinese threat group manipulated Claude Code into autonomously executing 80–90% of operational tasks across approximately 30 targets.

IR Lessons:

  • Traditional SOC tooling did not detect AI-orchestrated activities (they looked like normal developer workflow)
  • Agentic AI tools require separate monitoring planes from standard endpoints
  • The attack demonstrated that AI agents can serve as force multipliers for human operators, not just autonomous actors
  • Post-incident, Anthropic published detailed attribution and tactical analysis

Part 22: Red Team Operations

Engagement Planning

Assessment Scope Matrix

Level | Name | Tactics | Duration | Prerequisites
1 | Quick Scan | T1–T3 | 1–2 days | API access
2 | Standard Assessment | T1–T8 | 1–2 weeks | API + documentation
3 | Comprehensive | T1–T12 | 3–4 weeks | Full system access
4 | Full Spectrum | T1–T15 | 6–8 weeks | Source code + infra + training pipeline

Rules of Engagement Template

1. Scope: [Models/systems in scope]
2. Tactics: [AATMF tactics authorized]
3. Boundaries: [Explicitly prohibited actions]
4. Data handling: [Treatment of discovered vulnerabilities/outputs]
5. Communication: [Escalation path for critical findings]
6. Timeline: [Assessment window]
7. Success criteria: [Minimum coverage requirements]

Autonomous Red Teaming

The same reasoning-model capabilities that enable 97% ASR autonomous jailbreaking can be directed defensively at your own systems:

class AutonomousRedTeam:
    # Sketch: load_model, load_aatmf_techniques, self.evaluate, and
    # self.generate_report are integration points, not defined here.
    def __init__(self, target_api, attack_model="deepseek-r1"):
        self.target = target_api
        self.attacker = load_model(attack_model)
        self.results = []
    
    def run_campaign(self, tactic_ids: list, max_attempts=100):
        for tactic in tactic_ids:
            techniques = load_aatmf_techniques(tactic)
            for technique in techniques:
                for attempt in range(max_attempts):
                    # Generate attack variant
                    prompt = self.attacker.generate(
                        f"Generate a novel variant of {technique.name} "
                        f"attack. Previous attempts: {self.results[-5:]}"
                    )
                    # Execute against target
                    response = self.target.query(prompt)
                    # Evaluate success
                    success = self.evaluate(response, technique)
                    self.results.append({
                        "tactic": tactic,
                        "technique": technique.id,
                        "prompt": prompt,
                        "response": response,
                        "success": success,
                        "attempt": attempt
                    })
                    if success:
                        break  # Move to next technique
        return self.generate_report()

Methodology

For each tactic in scope:

  1. Enumerate — List all AATMF techniques for the tactic
  2. Baseline — Test documented attack procedures verbatim
  3. Adapt — Modify procedures for target-specific context
  4. Innovate — Generate novel variants using reasoning models
  5. Escalate — Combine successful techniques across tactics
  6. Document — Record findings per Appendix D template

Part 23: Blue Team Defense

Core Principle

Treat LLMs as untrusted components. Design systems assuming the model will be compromised.

This is not pessimism — it is the engineering consensus after 2025. Policy Puppetry bypasses every frontier model. Autonomous jailbreaking achieves 97% ASR. Adaptive attacks exceed 85% success against any single defense. The correct architectural posture is: the LLM will be jailbroken; design the surrounding system so that jailbreaking the LLM is insufficient to cause harm.

Defense Mapping to AATMF Tactics

Control Category | Implementation | Covers
Input Sanitization | Unicode normalization, encoding detection, pattern matching | T1, T2, T9
Instruction Hierarchy | System prompt isolation, privilege separation | T1, T3, T4
Rate Limiting | Per-user, per-session, per-endpoint throttling | T5, T14
Output Validation | Content classifiers, structured output enforcement | T7, T8
Tool Permission Scoping | Capability-based access, least privilege | T11
Data Provenance | Training data lineage, RAG source authentication | T6, T12, T13
Infrastructure Hardening | Network segmentation, auth on all services | T14
Human Workflow Controls | Reviewer training, decision audit trails | T15
Monitoring & Alerting | Detection engineering (Part 19), log aggregation | All

Monitoring Dashboard

Metric | Target | Frequency
Jailbreak attempt rate | < 5% of queries | Real-time
Safety filter bypass rate | < 0.01% | Real-time
API abuse detection latency | < 30 seconds | Real-time
Model artifact integrity | 100% verified | Per-deployment
RAG source freshness | < 24 hours stale | Hourly
Incident response time (P1) | < 15 minutes | Per-incident
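
As a sketch of how the first two dashboard metrics might be computed from Part 19 detector output (the field names are assumptions about the logging schema):

def compute_dashboard_metrics(query_log: list) -> dict:
    """query_log entries are dicts such as {"injection_detected": bool, "bypass_confirmed": bool}."""
    total = max(len(query_log), 1)
    attempts = sum(1 for q in query_log if q.get("injection_detected"))
    bypasses = sum(1 for q in query_log if q.get("bypass_confirmed"))
    return {
        "jailbreak_attempt_rate": attempts / total,      # dashboard target: < 5%
        "safety_filter_bypass_rate": bypasses / total,   # dashboard target: < 0.01%
    }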

Author
Kai Aizen
Independent offensive security researcher. 23 published CVEs, 5 Linux kernel mainline patches, creator of AATMF / P.R.O.M.P.T / SEF, author of Adversarial Minds.