Part of 3.5 Assurance and Posture Plane
Security architecture for encapsulated AI systems has emerged as a distinct research area within the cryptography and security (cs.CR) domain of computer science, addressing threat surfaces that differ materially from those of classical software security. The briefs collectively document four active fronts: agentic governance enforcement, prompt injection and adversarial environmental injection, vulnerability detection tooling, and data privacy controls for federated and retrieval-augmented pipelines.
The dominant architectural pattern emerging from recent research is the governance-first execution kernel: a structural separation between an untrusted probabilistic model and a deterministic symbolic enforcement layer. Arbiter-K, published by researchers from CUHK, Shanghai Jiao Tong University, Zhejiang University, Peking University, and Tsinghua University, instantiates this pattern by demoting the LLM to a non-privileged Probabilistic Processing Unit encapsulated by a Symbolic Governor that enforces Resource Limits, Taint Checks, and Access Control Lists before any intent reaches a deterministic sink.[1][2] Empirical evaluation on OpenClaw and NanoBot benchmarks shows Arbiter-K achieves 76–95% unsafe operation interception (a 92.79% absolute gain over native guardrails), while native policies in Amazon Bedrock AgentCore and Anthropic Skills intercept fewer than 9% of unsafe operations under adversarial conditions.[1] The architecture implements a Semantic Instruction Set Architecture (ISA) that reifies probabilistic outputs into discrete, auditable instructions.[2]
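To make the enforcement boundary concrete, the following is a minimal Python sketch of the governance-first pattern: the model only proposes intents, and a deterministic governor applies ACL, taint, and resource checks before anything reaches a real tool. The class names, policy fields, and checks are illustrative assumptions, not Arbiter-K's published interface.

```python
from dataclasses import dataclass, field

# Hypothetical rendering of a governance-first kernel: the LLM only proposes
# intents; a deterministic governor decides whether they reach a real tool.

@dataclass(frozen=True)
class Intent:
    """A reified, auditable instruction emitted by the probabilistic model."""
    action: str              # e.g. "fs.write", "net.post"
    target: str              # resource the action touches
    payload_tainted: bool    # True if the payload derives from untrusted input
    cost_units: int = 1      # abstract resource cost

@dataclass
class SymbolicGovernor:
    acl: dict[str, set[str]]                      # action -> allowed targets
    budget: int = 100                             # remaining resource units
    audit_log: list[str] = field(default_factory=list)

    def authorize(self, intent: Intent) -> bool:
        # Access Control List check: the action/target pair must be whitelisted.
        if intent.target not in self.acl.get(intent.action, set()):
            self.audit_log.append(f"DENY acl {intent}")
            return False
        # Taint check: tainted payloads may not reach sensitive sinks.
        if intent.payload_tainted and intent.action.startswith(("fs.", "net.")):
            self.audit_log.append(f"DENY taint {intent}")
            return False
        # Resource limit check: enforce a hard budget on side effects.
        if intent.cost_units > self.budget:
            self.audit_log.append(f"DENY budget {intent}")
            return False
        self.budget -= intent.cost_units
        self.audit_log.append(f"ALLOW {intent}")
        return True

governor = SymbolicGovernor(acl={"fs.write": {"/tmp/report.txt"}})
print(governor.authorize(Intent("fs.write", "/tmp/report.txt", payload_tainted=False)))  # True
print(governor.authorize(Intent("fs.write", "/etc/passwd", payload_tainted=False)))      # False
```

The key property of the pattern is that the checks are deterministic and auditable regardless of what the model generates; the governor never interprets natural language, only discrete instructions.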
Prompt injection against agentic systems has been formalized under two named threat models. The Owner-Harm threat model, submitted to arXiv cs.CR on 20 April 2026, defines eight categories of deployer-damaging agent behavior and demonstrates that compositional safety systems achieving 100% true-positive rate on generic criminal harm benchmarks (AgentHarm) detect only 14.8% of prompt-injection-mediated owner-harm tasks on AgentDojo, a detection gap attributable to environment-bound symbolic rules that fail to generalize across tool vocabularies.[3] A two-stage gate combined with a deterministic post-audit verifier raises overall detection to 85.3% TPR and hijacking detection from 43.3% to 93.3%.[3]
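The reported gains come from layering a deterministic check behind the model-based screen. The sketch below shows only that composition in schematic form; the rule set, thresholds, and function names are assumptions for illustration, not the paper's implementation.

```python
# Illustrative two-stage gate plus deterministic post-audit verifier.
# Stage 1 screens the task before execution; stage 2 audits the recorded
# tool-call trace against owner policy after the agent has acted.
# All rule names and thresholds here are assumptions for illustration.

OWNER_POLICY = {
    "forbidden_actions": {"transfer_funds", "delete_records"},
    "max_external_sends": 1,
}

def stage1_gate(task_description: str) -> bool:
    """Pre-execution screen; in practice an LLM judge, here a keyword stub."""
    suspicious = ("ignore previous", "exfiltrate", "send credentials")
    return not any(marker in task_description.lower() for marker in suspicious)

def stage2_post_audit(tool_trace: list[dict]) -> bool:
    """Deterministic verifier over the executed trace, independent of prompts."""
    external_sends = 0
    for call in tool_trace:
        if call["action"] in OWNER_POLICY["forbidden_actions"]:
            return False
        if call.get("external", False):
            external_sends += 1
    return external_sends <= OWNER_POLICY["max_external_sends"]

def run_with_defense(task: str, tool_trace: list[dict]) -> str:
    if not stage1_gate(task):
        return "blocked_pre_execution"
    if not stage2_post_audit(tool_trace):
        return "flagged_post_execution"
    return "allowed"

trace = [{"action": "read_email"}, {"action": "transfer_funds", "external": True}]
print(run_with_defense("summarise my inbox", trace))  # flagged_post_execution
```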
Adversarial Environmental Injection (AEI) formalizes a complementary attack surface in which adversaries corrupt tool outputs rather than direct prompts, constructing a "fake world" of poisoned retrieval results and fabricated reference networks. The POTEMKIN harness, validated across 11,000+ runs on five frontier agents, identifies two orthogonal attack surfaces — epistemic breadth attacks (The Illusion) and navigational depth attacks (The Maze) — and documents that resistance to one attack type frequently increases vulnerability to the other, ruling out unified single-defense mitigations.[4]
The University of Pennsylvania's Benchmarks for Stateful Defenses (BSD) pipeline provides an automated evaluation framework for stateful defenses against covert decomposition attacks, formally positioning LLM misuse mitigation within the cryptography and security research domain.[5]
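The "stateful" qualifier is the crux: a decomposition attack splits a disallowed task into individually benign-looking requests, so any effective defense must aggregate signals across a session rather than judge each turn in isolation. The toy below only illustrates that aggregation idea; the keywords, weights, and threshold are assumptions, not the BSD pipeline itself.

```python
# Toy illustration of a stateful defense: per-turn risk scores are aggregated
# across a session, so a task decomposed into individually benign-looking
# requests is still flagged. Keywords, weights, and threshold are illustrative.

SESSION_THRESHOLD = 0.8

def turn_risk(message: str) -> float:
    """Stand-in for a per-turn classifier returning a weak risk signal."""
    weak_signals = {"scan the target": 0.4, "harvest credentials": 0.5, "cover the traces": 0.5}
    return sum(w for kw, w in weak_signals.items() if kw in message.lower())

class StatefulDefense:
    """Accumulates risk across turns instead of judging each turn alone."""
    def __init__(self) -> None:
        self.cumulative_risk = 0.0

    def allow(self, message: str) -> bool:
        self.cumulative_risk += turn_risk(message)
        return self.cumulative_risk < SESSION_THRESHOLD

defense = StatefulDefense()
for msg in ["First, scan the target network.", "Next, harvest credentials quietly."]:
    print(msg, "->", "allow" if defense.allow(msg) else "refuse")
```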
Several architectures address LLM-assisted security defect discovery. The Refute-or-Promote (RoP) pipeline combines Stratified Context Hunting, adversarial kill mandates, context asymmetry, and a Cross-Model Critic to eliminate approximately 79–83% of false-positive candidates before disclosure; a 31-day production campaign across seven targets yielded four CVEs and eight merged security fixes.[6] Phoenix, a training-free multi-agent framework using Behavioral Contract Synthesis, achieves F1 = 0.825 on PrimeVul Paired with 7–14B open-source models by decomposing detection into a Semantic Slicer, Requirement Reverse Engineer, and Contract Judge — addressing the semantic ambiguity problem that causes global classifiers to degrade.[7] RAVEN applies a four-module RAG-driven architecture (Explorer, RAG engine, Analyst, Reporter) to memory corruption vulnerability analysis, evaluated against NIST-SARD benchmarks at a 54.21% average quality score.[8]
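Common to RoP and Phoenix is the decomposition of "is this a vulnerability?" into smaller, checkable verdicts, with an adversarial second opinion before anything is promoted to disclosure. The sketch below reproduces only that control flow; the judge functions are keyword stubs standing in for LLM calls, and none of the names come from the papers.

```python
# Schematic of a refute-or-promote gate with a cross-model critic: a candidate
# finding is promoted only if an independent critic, given the gathered code
# context, fails to refute it. Judge functions are stubs standing in for LLM calls.

from dataclasses import dataclass

@dataclass
class Candidate:
    location: str
    claim: str          # e.g. "unchecked length copied into fixed buffer"
    context: str        # code slice gathered for the critic

def primary_analyst(code_slice: str) -> list[Candidate]:
    """Stub: the first model proposes candidate defects from a code slice."""
    if "memcpy" in code_slice:
        return [Candidate("parser.c:120", "possible out-of-bounds write", code_slice)]
    return []

def cross_model_critic_refutes(candidate: Candidate) -> bool:
    """Stub: a second model under an explicit mandate to kill the finding.
    Here it refutes the claim whenever the context shows a bounds check."""
    return "len < sizeof" in candidate.context

def refute_or_promote(code_slice: str) -> list[Candidate]:
    return [c for c in primary_analyst(code_slice)
            if not cross_model_critic_refutes(c)]

print(refute_or_promote("memcpy(dst, src, len);"))                          # promoted
print(refute_or_promote("if (len < sizeof(dst)) memcpy(dst, src, len);"))   # refuted
```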
Domain-specific safety gaps are documented by HarmChip, the first jailbreak benchmark for LLM safety in hardware security workflows, spanning 16 hardware security domains and 960 prompts. Code-oriented and open-weight models reach 94–100% attack success rates on the Hard benchmark, while existing general-purpose benchmarks such as AdvBench and JailbreakBench do not address hardware-security-specific vectors.[9][10]
A hardware-level threat is documented in vLLM's Prefix Caching: shared KV-cache blocks exist as a single physical copy without integrity protection, enabling silent, persistent, and selectively targeted corruption of inference outputs via Rowhammer-class bit-flip attacks. A checksum-based countermeasure is proposed as a low-overhead mitigation.[11]
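The proposed mitigation is conceptually simple: record an integrity tag when a shared block is filled and re-verify it before any request reuses the block, so a silent bit flip becomes a detectable cache miss rather than a corrupted generation. The sketch below is a rough, self-contained rendering of that idea using CRC32 as the low-overhead checksum; the data layout is illustrative and does not reproduce vLLM's actual structures.

```python
import zlib

class ProtectedKVBlock:
    """A shared KV-cache block paired with an integrity checksum."""
    def __init__(self, kv_bytes: bytes) -> None:
        self.kv_bytes = bytearray(kv_bytes)
        self.checksum = zlib.crc32(kv_bytes)

    def verify(self) -> bool:
        return zlib.crc32(bytes(self.kv_bytes)) == self.checksum

class PrefixCache:
    """Toy prefix cache that refuses to serve a block whose checksum fails."""
    def __init__(self) -> None:
        self.blocks: dict[str, ProtectedKVBlock] = {}

    def put(self, prefix_key: str, kv_bytes: bytes) -> None:
        self.blocks[prefix_key] = ProtectedKVBlock(kv_bytes)

    def get(self, prefix_key: str) -> bytes | None:
        block = self.blocks.get(prefix_key)
        if block is None or not block.verify():
            # Missing or corrupted block: evict and force recomputation
            # rather than silently serving flipped bits.
            self.blocks.pop(prefix_key, None)
            return None
        return bytes(block.kv_bytes)

cache = PrefixCache()
cache.put("shared-system-prompt", b"\x00" * 64)
cache.blocks["shared-system-prompt"].kv_bytes[3] ^= 0x01  # simulate a Rowhammer bit flip
print(cache.get("shared-system-prompt"))                  # None: corruption detected
```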
Two cryptographic approaches address privacy in AI data pipelines. PPPQ-ANN combines Fully Homomorphic Encryption and Trusted Execution Environments to enable privacy-preserving approximate nearest neighbor search at million-scale (>50 QPS), specifically mitigating embedding inversion and membership attacks against vector stores.[12] Sherpa.ai's multi-party Private Set Union (PSU) protocol for vertical federated learning hides intersection membership during entity alignment — addressing the leakage risk of conventional Private Set Intersection — and supports both exact and typo-tolerant identifier matching across healthcare, financial services, and telecommunications verticals.[13]
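The leakage distinction that motivates the PSU design can be stated without any cryptography: PSI-based alignment tells each party exactly which identifiers the other also holds, whereas PSU-based alignment reveals only the (padded) union used for training. The unencrypted toy below illustrates only that difference in what is revealed at the end; it is not the protocol, and the identifiers are fabricated for illustration.

```python
# Unencrypted toy contrasting what entity alignment reveals under PSI versus
# PSU. This only illustrates the information-leakage difference the protocol
# targets; the real construction is cryptographic and is not reproduced here.

bank_ids = {"alice", "bob", "carol"}
hospital_ids = {"bob", "carol", "dave"}

# Conventional PSI-based alignment: both parties learn exactly which
# identifiers they share, i.e. intersection membership leaks.
psi_view = bank_ids & hospital_ids
print("PSI reveals shared members:", sorted(psi_view))   # ['bob', 'carol']

# PSU-based alignment: parties jointly build the union and train against
# aligned (possibly dummy-padded) rows without learning which specific
# identifiers the other party also holds.
psu_view = bank_ids | hospital_ids
print("PSU reveals only the union:", sorted(psu_view))   # no membership flags
```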
[1] Academic Research Proposes Governance-First Kernel Architecture for Agentic AI, Documenting Critical Gaps in Existing Guardrail Approaches (evt_src_9925c0e0b7a6237c)
[2] Academic Research Proposes Governance-First Execution Kernel (Arbiter-K) for Agentic AI Systems with Quantified Safety Gains (evt_src_b1b5120371728c58)
[3] Formal Owner-Harm Threat Model Exposes Critical Gap in AI Agent Safety Benchmarks and Proposes Multi-Layer Verification Architecture (evt_src_cd647d2c2e513723)
[4] Formalization of Adversarial Environmental Injection (AEI) Threat Model Exposes Robustness Gap in Frontier Agentic AI Systems (evt_src_e2320280c8e96877)
[5] Academic Research Formalizes Benchmarking Framework for Covert LLM Misuse and Stateful Defenses (evt_src_9bf12dddc6151bef)
[6] Adversarial Multi-Agent Review Methodology Demonstrates 79–83% False-Positive Kill Rate in LLM-Assisted Security Defect Discovery (evt_src_2d90e66d0bee0562)
[7] Academic Research Validates Multi-Agent Behavioral Contract Synthesis as a Viable Architecture for Training-Free Vulnerability Detection (evt_src_a3ba054377633d18)
[8] RAVEN Framework Demonstrates RAG-Driven Multi-Agent Architecture for Automated Vulnerability Analysis (evt_src_434cf8bb4c16ddcf)
[9] HarmChip: First Domain-Specific Jailbreak Benchmark Exposes LLM Safety Gaps in Hardware Security Workflows (evt_src_6d7ed7a7f01b9431)
[10] HarmChip Benchmark Establishes First Domain-Specific LLM Safety Evaluation Framework for Hardware Security (evt_src_d0273ad73517c216)
[11] Peer-Reviewed Research Documents Bit-Flip Vulnerability in Shared KV-Cache Blocks of Production LLM Serving Systems (evt_src_233383e5867f7b5c)
[12] Academic Research Demonstrates Production-Viable Privacy-Preserving Approximate Nearest Neighbor Search via Hybrid FHE and TEE Architecture (evt_src_abaa3c3f17abbb7a)
[13] Sherpa.ai Publishes Multi-Party Privacy-Preserving Entity Alignment Protocol for Vertical Federated Learning (evt_src_550c10d632f25c1c)
[14] Delft University Releases DeepRed: Open-Source LLM Agent Benchmark for Cybersecurity CTF Evaluation with Partial-Credit Scoring (evt_src_da4a5ac523236f14)
[15] DeepRed Open-Source Benchmark Quantifies LLM Agent Capability Ceiling at 35% on Realistic Multi-Step Security Tasks (evt_src_a8be6fe151ac955a)