Detection, mitigation, incident response, red team, blue team — applying AATMF in production.
AATMF detection operates across five layers, each providing defense-in-depth against adversarial AI attacks:
```
┌─────────────────────────────────────────┐
│ Layer 5: Feedback Loop Analysis         │ ← T6, T15
├─────────────────────────────────────────┤
│ Layer 4: System Telemetry               │ ← T13, T14
├─────────────────────────────────────────┤
│ Layer 3: Output Validation              │ ← T7, T8
├─────────────────────────────────────────┤
│ Layer 2: Behavioral Monitoring          │ ← T4, T5, T11
├─────────────────────────────────────────┤
│ Layer 1: Input Analysis                 │ ← T1, T2, T3, T9
└─────────────────────────────────────────┘
```
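Before the individual detectors, a note on composition: each layer should run independently against every event, so that evading one layer does not silence the others. A minimal sketch of that wiring (the `LayeredDetectionPipeline` class and its interface are illustrative, not part of AATMF itself):

```python
from dataclasses import dataclass, field

@dataclass
class LayerResult:
    layer: str
    findings: list = field(default_factory=list)

class LayeredDetectionPipeline:
    """Run every layer on each event; one layer's miss never blinds the others."""

    def __init__(self, layers):
        # layers: list of (name, check) pairs; each check returns a list of findings
        self.layers = layers

    def evaluate(self, event: dict) -> dict:
        results = [LayerResult(name, check(event)) for name, check in self.layers]
        triggered = [r.layer for r in results if r.findings]
        return {
            "detected": bool(triggered),
            "layers_triggered": triggered,
            "findings": [f for r in results for f in r.findings],
        }

# Example wiring for Layer 1, reusing the detector defined below:
# pipeline = LayeredDetectionPipeline([
#     ("input_analysis",
#      lambda e: PromptInjectionDetector().analyze(e["text"])["findings"]),
# ])
```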
```python
import re

class PromptInjectionDetector:
    """Layer 1: pattern-based screening of raw input for injection attempts (T1, T2)."""

    PATTERNS = [
        r"ignore\s+(previous|above|all)\s+(instructions?|rules?|guidelines?)",
        r"(system|admin)\s*:?\s*(override|prompt|instruction)",
        r"you\s+are\s+now\s+(DAN|evil|unrestricted|jailbroken)",
        r"\[\s*(SYSTEM|INST|SYS)\s*\]",
        r"<\|?(system|im_start|endoftext)\|?>",
        r"BEGIN\s+(OVERRIDE|NEW\s+INSTRUCTIONS)",
    ]
    ENCODING_PATTERNS = [
        r"[bB]ase64[:\s]",
        r"\\x[0-9a-fA-F]{2}",
        r"\\u[0-9a-fA-F]{4}",
        r"[\u200b-\u200f\u2028-\u202f\ufeff]",  # zero-width / invisible Unicode
    ]

    def analyze(self, text: str) -> dict:
        findings = []
        for pattern in self.PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                findings.append({"pattern": pattern, "severity": "HIGH"})
        for pattern in self.ENCODING_PATTERNS:
            if re.search(pattern, text):
                findings.append({"pattern": pattern, "severity": "MEDIUM"})
        return {
            "detected": len(findings) > 0,
            "findings": findings,
            "risk_score": min(len(findings) * 50, 300),  # capped composite score
        }
```

```python
class APIExploitDetector:
    """Layer 2: behavioral monitoring for model extraction via API abuse (T5)."""
    def detect_extraction(self, request_log: list) -> dict:
        """Detect model extraction via API abuse patterns."""
        indicators = {
            "high_volume": len(request_log) > 1000,
            "systematic_probing": self._detect_systematic(request_log),
            "boundary_testing": self._detect_boundary(request_log),
            "output_harvesting": self._detect_harvesting(request_log),
        }
        score = sum(50 for v in indicators.values() if v)
        return {"indicators": indicators, "risk_score": score}

    def _detect_systematic(self, logs):
        # Incrementally varied inputs: adjacent requests that are near-duplicates
        return any(self._similarity(logs[i], logs[i + 1]) > 0.9
                   for i in range(min(len(logs) - 1, 100)))

    def _detect_boundary(self, logs):
        # Repeated probing of model limits and error behavior
        boundary_keywords = ["maximum", "limit", "error", "exception", "overflow"]
        return sum(1 for l in logs
                   if any(k in str(l).lower() for k in boundary_keywords)) > 10

    def _detect_harvesting(self, logs):
        # Near-total prompt uniqueness across a large log suggests harvesting
        unique = set(str(l.get("prompt", ""))[:50] for l in logs)
        return len(unique) / max(len(logs), 1) > 0.95

    @staticmethod
    def _similarity(a, b) -> float:
        # Cheap lexical similarity; a production system might use embedding distance
        from difflib import SequenceMatcher
        return SequenceMatcher(None, str(a), str(b)).ratio()
```

```python
import re

class MultimodalDetector:
    """Layer 1: screening image inputs and MCP tool descriptions (T9, T11)."""
    def analyze_image(self, image_bytes: bytes) -> dict:
        """Detect steganographic or adversarial image modifications."""
        findings = []
        # A large image with stripped metadata is a weak steganography indicator
        if b"EXIF" not in image_bytes[:1000] and len(image_bytes) > 100000:
            findings.append({"type": "stripped_metadata", "severity": "LOW"})
        # Payload appended after the JPEG end-of-image marker (FF D9)
        jpeg_end = image_bytes.rfind(b"\xff\xd9")
        if 0 < jpeg_end < len(image_bytes) - 2:
            findings.append({"type": "appended_data", "severity": "HIGH"})
        # Further check (not implemented here): unusual color distribution,
        # a potential signal of adversarial perturbation
        return {"findings": findings}

    def analyze_mcp_tool(self, tool_description: str) -> dict:
        """Detect MCP tool poisoning indicators."""
        suspicious = [
            r"<IMPORTANT>",
            r"override|ignore|bypass",
            r"do not (tell|inform|show)",
            r"silently|secretly|covertly",
            r"instead of|rather than",
        ]
        hits = [p for p in suspicious if re.search(p, tool_description, re.I)]
        return {
            "poisoning_indicators": len(hits),
            "severity": "CRITICAL" if len(hits) >= 3 else "HIGH" if hits else "LOW",
        }
```

```python
class SupplyChainDetector:
    """Layer 4: scanning model artifacts for malicious pickle payloads (T13)."""
    PICKLE_SIGNATURES = [
        b"\x80\x04\x95",   # pickle protocol 4 frame header
        b"cos\nsystem",    # GLOBAL opcode importing os.system
        b"csubprocess",    # subprocess module import
        b"c__builtin__",   # builtins access
        b"creduce_ex",     # __reduce_ex__ abuse (code execution)
    ]

    def scan_model_file(self, filepath: str) -> dict:
        """Scan a model artifact for malicious pickle payloads."""
        findings = []
        with open(filepath, "rb") as f:
            header = f.read(8192)
        for sig in self.PICKLE_SIGNATURES:
            if sig in header:
                findings.append({
                    "type": "malicious_pickle",
                    "signature": sig.hex(),
                    "severity": "CRITICAL",
                })
        return {
            "safe": len(findings) == 0,
            "format": "safetensors" if filepath.endswith(".safetensors") else "pickle",
            "findings": findings,
        }
```

| Severity | Tactics | Response SLA | Action |
|---|---|---|---|
| 🔴 CRITICAL | T11 (agentic RCE), T13 (supply chain), T14 (infra) | 15 minutes | Automated containment + SOC escalation |
| 🟠 HIGH | T1 (injection), T6 (poisoning), T10 (breach) | 1 hour | SOC analyst review |
| 🟡 MEDIUM | T2–T3 (evasion), T7 (exfiltration) | 4 hours | Queued investigation |
| 🔵 LOW | T4 (multi-turn), T8 (misinfo) | 24 hours | Logged, batch review |
| ⚪ INFO | T15 (workflow) | Weekly | Trend analysis |
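A hedged sketch of how this table might drive automated alert routing (the `SEVERITY_POLICY` structure and `route_alert` helper are illustrative, with tactic lists taken from the rows above):

```python
from datetime import timedelta

# Severity tiers and SLAs from the table above
SEVERITY_POLICY = {
    "CRITICAL": {"tactics": {"T11", "T13", "T14"}, "sla": timedelta(minutes=15),
                 "action": "automated_containment"},
    "HIGH":     {"tactics": {"T1", "T6", "T10"},   "sla": timedelta(hours=1),
                 "action": "soc_review"},
    "MEDIUM":   {"tactics": {"T2", "T3", "T7"},    "sla": timedelta(hours=4),
                 "action": "queued_investigation"},
    "LOW":      {"tactics": {"T4", "T8"},          "sla": timedelta(hours=24),
                 "action": "batch_review"},
}

def route_alert(tactic_id: str) -> dict:
    """Map a detected tactic to its severity tier, response SLA, and action."""
    for severity, policy in SEVERITY_POLICY.items():
        if tactic_id in policy["tactics"]:
            return {"severity": severity, "sla": policy["sla"],
                    "action": policy["action"]}
    return {"severity": "INFO", "sla": None, "action": "trend_analysis"}
```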
Detectors must account for known evasion techniques against detection systems themselves:
| Evasion | Mechanism | Countermeasure |
|---|---|---|
| Emoji smuggling | Replace keywords with semantically equivalent emoji | Emoji-to-text normalization before analysis |
| Zero-width characters | Insert invisible Unicode between trigger words | Unicode stripping/normalization |
| Homoglyphs | Replace Latin characters with Cyrillic/Greek lookalikes | Confusable character mapping (ICU) |
| Policy Puppetry | Frame injection as policy/config file format | Detect XML/INI/JSON policy structures in user input |
| Token boundary exploitation | Split words across token boundaries | Multi-token pattern matching |
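Several of these countermeasures reduce to a normalization pass applied before any pattern matching. A minimal sketch (the homoglyph map here covers only a few example confusables; a production system would use a full confusables table such as ICU's):

```python
import re
import unicodedata

# Tiny illustrative subset of a confusable-character map (see Unicode TR39 / ICU)
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c"}  # Cyrillic → Latin

ZERO_WIDTH = re.compile(r"[\u200b-\u200f\u2028-\u202f\ufeff]")

def normalize_input(text: str) -> str:
    """Canonicalize user input before running injection pattern matching."""
    text = unicodedata.normalize("NFKC", text)               # fold compatibility forms
    text = ZERO_WIDTH.sub("", text)                          # strip invisible characters
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)    # map known confusables
    return text
```

The normalized text, not the raw input, is what should feed `PromptInjectionDetector.analyze`.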
Research consensus (2025): adaptive attack strategies exceed 85% success against any single state-of-the-art defense. No single control is sufficient. AATMF mandates layered defense.
The most promising defensive framework treats LLMs as fundamentally untrusted components — analogous to how operating systems treat user-space programs:
| Principle | Implementation |
|---|---|
| Dual-LLM pattern | Frontier LLM generates plans; a hardened secondary LLM validates and sanitizes |
| Capability-based access | Tools require explicit capability tokens, not ambient authority |
| Information flow control | Track data provenance through the entire pipeline; tainted data cannot reach sensitive operations |
| Minimal authority | Agents receive only the permissions needed for the immediate task |
CaMeL solved 77% of AgentDojo tasks while providing provable security guarantees against prompt injection.
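A sketch of the capability-based access principle in isolation (all names here are hypothetical, and CaMeL itself is considerably more involved): tools refuse to run unless the caller presents an explicit, scoped token, so a hijacked planner cannot invoke anything it was not granted.

```python
import secrets
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CapabilityToken:
    tool_name: str
    scope: frozenset    # e.g. frozenset({"read"}); no ambient authority
    nonce: str = field(default_factory=lambda: secrets.token_hex(8))

class ToolRegistry:
    def __init__(self):
        self._live: set = set()

    def grant(self, tool_name: str, scope: set) -> CapabilityToken:
        """Issued by the trusted orchestrator, never requested by the LLM directly."""
        token = CapabilityToken(tool_name, frozenset(scope))
        self._live.add(token.nonce)
        return token

    def invoke(self, token: CapabilityToken, action: str):
        # Deny unless the token is live and the action falls within its scope
        if token.nonce not in self._live or action not in token.scope:
            raise PermissionError(f"{token.tool_name}: '{action}' not authorized")
        self._live.discard(token.nonce)   # single-use token: minimal authority
        return f"executed {token.tool_name}.{action}"
```

Under this design, `registry.invoke(registry.grant("fs", {"read"}), "write")` raises `PermissionError` regardless of what the model says.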
| Tactic | Primary Controls |
|---|---|
| T1 — Prompt Subversion | Input sanitization, instruction hierarchy enforcement, system prompt isolation |
| T2 — Semantic Evasion | Unicode normalization, multi-layer content filtering, semantic analysis |
| T3 — Reasoning Exploitation | Output validation, reasoning chain verification, constraint hardening |
| T4 — Multi-Turn | Context window management, conversation state validation, memory isolation |
| T5 — Model/API | Rate limiting, query fingerprinting, differential privacy on outputs |
| T6 — Training Poisoning | Data provenance tracking, anomaly detection in training metrics, DRS defense |
| T7 — Output Manipulation | Output filtering, structured output enforcement, content watermarking |
| T8 — Deception | Fact-checking integration, source attribution, confidence calibration |
| T9 — Multimodal | Cross-modal consistency checking, steganography detection, input sanitization |
| T10 — Integrity Breach | TEE deployment, access control, audit logging, membership inference defense |
| T11 — Agentic | CaMeL architecture, tool permission scoping, MCP server auditing |
| T12 — RAG | Embedding integrity verification, retrieval result validation, source authentication |
| T13 — Supply Chain | SafeTensors adoption, Picklescan (with patches), SBOM for ML artifacts |
| T14 — Infrastructure | Network segmentation, inference server hardening, ZMQ authentication |
| T15 — Human Workflow | Reviewer training, decision audit trails, annotation quality metrics |
| Component | Function | Coverage |
|---|---|---|
| PromptGuard 2 | Real-time input classification for injection and jailbreak | T1, T2, T9 |
| Agent Alignment Checks | Verify agent actions align with original user intent | T11 |
| CodeShield | Static analysis of LLM-generated code for insecure patterns | T7, T11 |
Traditional IR frameworks (NIST SP 800-61, SANS PICERL) assume deterministic systems. AI incidents differ: attacks may be probabilistic, evidence may be ephemeral (conversation context), and "containment" for a language model has different semantics than for a compromised server.
| Signal | Source | Priority |
|---|---|---|
| Safety filter bypass confirmed | Output monitoring | P1 — Immediate |
| Model extraction pattern detected | API telemetry | P1 — Immediate |
| Training pipeline anomaly | Training metrics | P1 — Immediate |
| MCP tool behavior deviation | Agent monitoring | P1 — Immediate |
| Jailbreak attempt (unsuccessful) | Input classifier | P3 — Logged |
| Unusual query pattern | Rate limiter | P2 — Investigate |
| Scenario | Action |
|---|---|
| Active jailbreak exploitation | Block session, rate-limit source, deploy updated filter |
| Model serving compromised artifact | Hot-swap to known-good checkpoint |
| RAG poisoning detected | Quarantine affected data sources, switch to cached index |
| Agentic system executing unauthorized actions | Kill agent process, revoke tool permissions |
| Training data contamination | Halt training pipeline, snapshot current state |
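Containment actions like these are good candidates for pre-authorized automation, since the 15-minute CRITICAL SLA leaves little room for manual steps. A sketch (incident-type keys and step names are illustrative):

```python
# Pre-authorized containment playbooks keyed by incident type
CONTAINMENT_PLAYBOOKS = {
    "jailbreak_exploitation": ["block_session", "rate_limit_source", "deploy_filter_update"],
    "compromised_artifact":   ["hot_swap_checkpoint"],
    "rag_poisoning":          ["quarantine_sources", "switch_to_cached_index"],
    "rogue_agent":            ["kill_agent_process", "revoke_tool_permissions"],
    "training_contamination": ["halt_pipeline", "snapshot_state"],
}

def contain(incident_type: str, execute) -> list:
    """Run each step of the matching playbook; `execute` performs one named action."""
    completed = []
    for step in CONTAINMENT_PLAYBOOKS.get(incident_type, []):
        execute(step)            # e.g. a call into the orchestration/SOAR layer
        completed.append(step)   # preserve an audit trail of what ran, in order
    return completed
```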
Collect and preserve:
| Root Cause | Eradication |
|---|---|
| Prompt injection bypass | Update input classifiers, add pattern to blocklist |
| Model vulnerability | Retrain or fine-tune with adversarial examples |
| RAG poisoning | Rebuild index from verified sources |
| Supply chain compromise | Replace artifact, audit provenance chain |
| Infrastructure vulnerability | Patch, harden, segment |
Staged restoration with validation:
Incident: the first publicly reported state-sponsored, AI-orchestrated cyberattack. A Chinese state-sponsored threat group manipulated Claude Code into autonomously executing 80–90% of operational tasks across approximately 30 targets.
IR Lessons:
| Level | Name | Tactics | Duration | Prerequisites |
|---|---|---|---|---|
| 1 | Quick Scan | T1–T3 | 1–2 days | API access |
| 2 | Standard Assessment | T1–T8 | 1–2 weeks | API + documentation |
| 3 | Comprehensive | T1–T12 | 3–4 weeks | Full system access |
| 4 | Full Spectrum | T1–T15 | 6–8 weeks | Source code + infra + training pipeline |
1. Scope: [Models/systems in scope]
2. Tactics: [AATMF tactics authorized]
3. Boundaries: [Explicitly prohibited actions]
4. Data handling: [Treatment of discovered vulnerabilities/outputs]
5. Communication: [Escalation path for critical findings]
6. Timeline: [Assessment window]
7. Success criteria: [Minimum coverage requirements]
The same reasoning model capabilities that enable 97% ASR jailbreaking can be directed at your own systems defensively:
```python
class AutonomousRedTeam:
    """Uses an attacker LLM to generate and evaluate AATMF attack variants."""

    def __init__(self, target_api, attack_model="deepseek-r1"):
        self.target = target_api
        self.attacker = load_model(attack_model)  # placeholder: your model loader
        self.results = []

    def run_campaign(self, tactic_ids: list, max_attempts=100):
        for tactic in tactic_ids:
            # placeholder: returns technique objects for the given AATMF tactic
            techniques = load_aatmf_techniques(tactic)
            for technique in techniques:
                for attempt in range(max_attempts):
                    # Generate an attack variant informed by recent attempts
                    prompt = self.attacker.generate(
                        f"Generate a novel variant of {technique.name} "
                        f"attack. Previous attempts: {self.results[-5:]}"
                    )
                    # Execute against the target system
                    response = self.target.query(prompt)
                    # Evaluate whether the attack succeeded
                    success = self.evaluate(response, technique)
                    self.results.append({
                        "tactic": tactic,
                        "technique": technique.id,
                        "prompt": prompt,
                        "response": response,
                        "success": success,
                        "attempt": attempt,
                    })
                    if success:
                        break  # Move to the next technique
        return self.generate_report()
```

For each tactic in scope:
Treat LLMs as untrusted components. Design systems assuming the model will be compromised.
This is not pessimism — it is the engineering consensus after 2025. Policy Puppetry bypasses every frontier model. Autonomous jailbreaking achieves 97% ASR. Adaptive attacks exceed 85% success against any single defense. The correct architectural posture is: the LLM will be jailbroken; design the surrounding system so that jailbreaking the LLM is insufficient to cause harm.
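One concrete consequence of this posture: no model output executes directly. A sketch of a gate that sits between the model and its tools (function and set names hypothetical), enforcing that even a fully jailbroken model can only request actions the surrounding system independently allows:

```python
# Explicit allowlist for the current task; the model cannot extend it
ALLOWED_ACTIONS = {"search_docs", "summarize", "send_email"}
SENSITIVE_SINKS = {"send_email", "execute_shell", "write_file"}

def gate_tool_call(action: str, data_tainted: bool) -> bool:
    """Authorize a model-requested tool call without trusting the model.

    The request is untrusted input: the decision rests on the allowlist
    and on data provenance, never on the model's own reasoning.
    """
    if action not in ALLOWED_ACTIONS:
        return False   # least privilege: deny by default
    if data_tainted and action in SENSITIVE_SINKS:
        return False   # information flow control: tainted data cannot
                       # reach a sensitive sink, even via an allowed tool
    return True
```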
| Control Category | Implementation | Covers |
|---|---|---|
| Input Sanitization | Unicode normalization, encoding detection, pattern matching | T1, T2, T9 |
| Instruction Hierarchy | System prompt isolation, privilege separation | T1, T3, T4 |
| Rate Limiting | Per-user, per-session, per-endpoint throttling | T5, T14 |
| Output Validation | Content classifiers, structured output enforcement | T7, T8 |
| Tool Permission Scoping | Capability-based access, least privilege | T11 |
| Data Provenance | Training data lineage, RAG source authentication | T6, T12, T13 |
| Infrastructure Hardening | Network segmentation, auth on all services | T14 |
| Human Workflow Controls | Reviewer training, decision audit trails | T15 |
| Monitoring & Alerting | Detection engineering (Part 19), log aggregation | All |
| Metric | Target | Frequency |
|---|---|---|
| Jailbreak attempt rate | < 5% of queries | Real-time |
| Safety filter bypass rate | < 0.01% | Real-time |
| API abuse detection latency | < 30 seconds | Real-time |
| Model artifact integrity | 100% verified | Per-deployment |
| RAG source freshness | < 24 hours stale | Hourly |
| Incident response time (P1) | < 15 minutes | Per-incident |
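These targets are straightforward to encode as automated checks. A sketch using thresholds from the table (metric names are illustrative):

```python
# Targets from the metrics table above, expressed as upper bounds
METRIC_TARGETS = {
    "jailbreak_attempt_rate": 0.05,      # < 5% of queries
    "filter_bypass_rate": 0.0001,        # < 0.01%
    "abuse_detection_latency_s": 30.0,   # < 30 seconds
}

def check_metrics(observed: dict) -> list:
    """Return the metrics currently violating their targets."""
    return [name for name, limit in METRIC_TARGETS.items()
            if observed.get(name, 0.0) > limit]
```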