Part of 3.5 Assurance and Posture Plane
Within the Assurance and Posture plane, auditability and explainability address the mechanisms by which AI decisions can be traced, attributed, and rendered interpretable to human stakeholders. The field spans three intersecting concerns: action-level audit logging in agentic systems, feature attribution methods for model outputs, and the structural limits of reasoning transparency itself.
The ClawNet framework, developed by researchers at the Hong Kong Generative AI Research & Development Centre, HKUST, and HKBU, formalizes three governance primitives that directly address auditability in multi-agent deployments: identity binding (every operation traceable to a specific human principal), scoped authorization (operations bounded by verified agent permissions), and action-level accountability (every operation logged to an append-only audit log).[1] The authors explicitly identify the absence of these primitives in current frameworks — including MetaGPT, AutoGen, CrewAI, LangGraph, and ChatDev — as a foundational infrastructure gap.[2] Google's Agent2Agent protocol is characterized as providing a communication layer insufficient for enterprise governance, as it does not bind agents to specific owners or enforce authorization scopes.[1:1] The append-only audit log primitive in ClawNet represents a concrete, implementable auditability mechanism for cross-user agentic deployments.
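As a concrete illustration, the sketch below shows how the three primitives might compose in practice. It is a minimal, hypothetical rendering rather than ClawNet's implementation, and every class, field, and scope name is an assumption: each agent action is checked against the scopes granted to the agent, bound to the human principal who owns it, and recorded in a hash-chained append-only log.

```python
# Minimal sketch (assumed names, not the ClawNet implementation): identity binding,
# scoped authorization, and action-level accountability via a hash-chained log.
import hashlib, json, time
from dataclasses import dataclass, field

@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def append(self, principal: str, agent_id: str, action: str,
               required_scopes: set, granted_scopes: set):
        # Scoped authorization: refuse to record (and perform) out-of-scope actions.
        if not required_scopes <= granted_scopes:
            raise PermissionError(f"{agent_id} lacks scope(s) {required_scopes - granted_scopes}")
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        record = {
            "ts": time.time(),
            "principal": principal,      # identity binding: the human owner of the agent
            "agent": agent_id,
            "action": action,
            "scopes": sorted(required_scopes),
            "prev": prev_hash,
        }
        # Action-level accountability: each entry commits to its predecessor,
        # so retroactive edits break the hash chain.
        record["hash"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append(record)
        return record

log = AuditLog()
log.append("alice@example.com", "agent-7", "read:crm/contacts",
           required_scopes={"crm.read"}, granted_scopes={"crm.read", "mail.send"})
```

Chaining each entry to the hash of its predecessor is one common way to make an append-only guarantee verifiable after the fact; it is used here only to show the shape of the primitive, not to claim ClawNet's log works this way.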
In biomedical AI, a complementary transparency framework has been proposed around three principles — Provenance, Openness, and Evaluation Transparency — in response to documented population bias in large-scale training datasets such as CellxGene and GEO.[3] Provenance tracking here functions as an auditability mechanism at the data layer, enabling downstream stakeholders to assess representational gaps before deployment.
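By way of illustration only (the cited framework does not prescribe a schema), a data-layer provenance record might expose population composition so that representational gaps can be flagged before training; every name, value, and threshold below is a hypothetical choice.

```python
# Illustrative sketch: a per-dataset provenance manifest that lets downstream
# users audit population coverage before a model is trained on the data.
from dataclasses import dataclass

@dataclass
class ProvenanceRecord:
    dataset: str                  # e.g. an accession or collection name
    source: str                   # originating repository or study
    n_samples: int
    population_counts: dict       # reported population labels -> sample counts

    def coverage_gaps(self, expected_populations, min_fraction=0.05):
        """Flag populations absent or under-represented relative to a threshold."""
        total = max(self.n_samples, 1)
        return [p for p in expected_populations
                if self.population_counts.get(p, 0) / total < min_fraction]

rec = ProvenanceRecord("toy-scRNA-collection", "public-repository", 10_000,
                       {"European": 8_200, "East Asian": 1_300, "African": 400})
print(rec.coverage_gaps(["European", "East Asian", "African", "South Asian"]))
# -> ['African', 'South Asian'] under the 5% threshold used in this toy example
```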
The dominant non-symbolic explainability methods — SHAP, Integrated Gradients, and Attention Rollout — have each been subject to recent empirical and formal scrutiny. A comparative preprint evaluated all three on a fine-tuned DistilBERT model for SST-2 sentiment classification, finding gradient-based attribution (Integrated Gradients) to be the most stable, attention-based methods computationally efficient but less aligned with model predictions, and SHAP flexible but variable.[4] However, a peer-reviewed paper from researchers at the University of Toulouse (IRIT), Nanyang Technological University, ICREA, and the University of Lleida formally documents provable flaws in SHAP's theoretical foundations: in documented failure cases, the relevant feature receives a SHAP score of 0 while irrelevant features receive non-zero scores — a direct inversion of ground truth.[5][6] The authors advocate symbolic, logic-based XAI methods as a more rigorous alternative, noting that non-symbolic methods have dominated the field for approximately a decade without adequate formal validation.[6:1]
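For readers unfamiliar with the quantity these debates concern, the sketch below computes exact Shapley values by brute force for a toy model. It illustrates what SHAP approximates, not the documented failure cases, and the model and inputs are invented for the example.

```python
# Minimal sketch: brute-force exact Shapley values for a toy tabular model,
# to show the attribution quantity that SHAP approximates in practice.
from itertools import combinations
from math import factorial

def shapley_values(predict, baseline, instance):
    """Exact Shapley attribution for each feature of `instance`.

    predict  : callable mapping a feature vector to a scalar output
    baseline : reference values substituted for "absent" features
    instance : the input being explained
    """
    n = len(instance)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Marginal contribution of feature i on top of coalition `subset`
                with_i = [instance[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [instance[j] if j in subset else baseline[j] for j in range(n)]
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values

# Toy usage: a hypothetical model in which only feature 0 matters.
model = lambda x: 3.0 * x[0]
print(shapley_values(model, baseline=[0, 0, 0], instance=[1.0, 5.0, -2.0]))
# -> [3.0, 0.0, 0.0]; attributions sum to f(instance) - f(baseline)
```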
A complementary architectural approach combines Knowledge Graphs with LLMs to generate traceable, user-friendly explanations of ML outputs in manufacturing environments. The method stores domain-specific data alongside ML results in a Knowledge Graph, then uses selective triplet retrieval to prompt an LLM to produce structured explanations; the approach was evaluated across 33 questions using structured XAI metrics.[7]
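A minimal sketch of this retrieve-then-prompt pattern follows; the graph contents, prompt wording, and function names are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of the KG-plus-LLM pattern described above: retrieve triplets
# relevant to an ML result, then prompt an LLM to explain the result grounded
# only in those retrieved facts.
def retrieve_triplets(graph, entity):
    """Return (subject, predicate, object) triplets touching `entity`."""
    return [(s, p, o) for (s, p, o) in graph if s == entity or o == entity]

def build_explanation_prompt(prediction, triplets):
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in triplets)
    return (
        "You are explaining a machine-learning result to a plant operator.\n"
        f"Model output: {prediction}\n"
        "Relevant domain facts from the knowledge graph:\n"
        f"{facts}\n"
        "Explain the output using only the facts above, and cite each fact used."
    )

# Toy knowledge graph: domain data stored alongside an ML result.
kg = [
    ("Spindle_3", "hasVibrationRMS", "4.2 mm/s"),
    ("Spindle_3", "exceedsThreshold", "ISO_10816_Zone_C"),
    ("AnomalyScore_0.91", "derivedFrom", "Spindle_3"),
]
print(build_explanation_prompt("AnomalyScore_0.91 (possible bearing wear)",
                               retrieve_triplets(kg, "Spindle_3")))
```

Grounding the prompt exclusively in retrieved triplets is what makes the resulting explanation traceable: each sentence can be checked against a specific graph fact.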
For vision transformers specifically, the Max Planck Institute for Informatics has introduced Vi-CD (Automatic Visual Circuit Discovery), the first edge-based circuit discovery method for this model class. Vi-CD produces circuits up to 10x sparser than prior approaches and has been applied to identify the circuits underlying typographic attacks in CLIP models, where the discovered circuits have been used to steer corrections to model behavior.[8]
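To make the general idea of edge-level circuit discovery concrete, the toy below scores each edge of a small computation graph by ablating it and measuring the change in the output, then keeps the sparse set of edges that matter. This is not Vi-CD's algorithm; the graph, weights, and threshold are invented for illustration.

```python
# Toy illustration of edge-level circuit discovery in general (not Vi-CD):
# ablate one edge at a time and keep the edges whose removal changes the output.
def run(edges, inputs, output_node, ablated=frozenset()):
    """Evaluate a tiny DAG given as {(src, dst): weight}; each node sums its weighted parents."""
    values = dict(inputs)
    # Fixed evaluation order for this toy graph (parents before children).
    for dst in ["h1", "h2", output_node]:
        values[dst] = sum(w * values[src] for (src, d), w in edges.items()
                          if d == dst and (src, d) not in ablated)
    return values[output_node]

edges = {("x1", "h1"): 1.0, ("x2", "h1"): 0.0,   # near-zero edge: not part of the circuit
         ("x2", "h2"): 2.0, ("h1", "out"): 1.5, ("h2", "out"): 0.01}
inputs = {"x1": 1.0, "x2": 1.0}
full = run(edges, inputs, "out")
circuit = [e for e in edges
           if abs(full - run(edges, inputs, "out", ablated={e})) > 0.05]
print(circuit)  # the sparse subset of edges that materially drives the output
```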
A critical tension has emerged between explainability and security: the same chain-of-thought (CoT) transparency that supports auditability creates an exploitable attack surface. The AutoRAN framework, developed at Stony Brook University and Penn State University, automates hijacking of internal safety reasoning in large reasoning models (LRMs), achieving near-100% attack success rates against GPT-o3, GPT-o4-mini, and Gemini-2.5-Flash across AdvBench, HarmBench, and StrongReject benchmarks.[9][10] A separate framework, PRJA, achieves an 83.6% average jailbreak success rate against DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini by injecting harmful content into reasoning steps while leaving final answers unchanged — exploiting the gap between visible reasoning traces and actual model behavior.[11]
This is compounded by a structural finding from South China University of Technology: LLM reasoning is primarily mediated by latent-state trajectories rather than surface CoT traces, meaning that CoT-based explanations may not faithfully represent the computational process being audited.[12] Any auditability framework that relies solely on surface reasoning traces must account for this gap between observable explanation and underlying mechanism.
1. ClawNet: Academic Research Proposes Identity-Governed Multi-Agent Collaboration Framework with Explicit Governance Primitives — evt_src_41e455ab4dd54226
2. ClawNet Research Proposes Identity-Governed Multi-Agent Collaboration Framework for Cross-User Autonomous Cooperation — evt_src_0a7e2c5f47536d7c
3. arXiv Paper Documents Systemic Population Bias in Biomedical AI Training Datasets, Proposes Provenance and Evaluation Transparency Framework — evt_src_b8dc0960ef5ab5d2
4. Comparative Study of LLM Explainability Techniques Published on arXiv: Integrated Gradients, Attention Rollout, and SHAP Evaluated on DistilBERT — evt_src_2e597c150a03041d
5. Academic Research Formally Documents Rigor Failures in SHAP-Based XAI Feature Attribution — evt_src_e0e69e7554ef1259
6. Academic Research Challenges Rigor of Shapley-Based XAI Methods, Advocates Symbolic Alternatives — evt_src_83c013a5cbcb8601
7. Academic Research Validates LLM + Knowledge Graph Architecture for Explainable AI in Manufacturing — evt_src_a77057dc5a3966c5
8. Max Planck Institute Introduces Edge-Based Mechanistic Interpretability for Vision Transformers (Vi-CD) — evt_src_c2be3774430fedd7
9. AutoRAN: Automated Safety Reasoning Hijacking Achieves Near-100% Attack Success Against Leading Large Reasoning Models — evt_src_53c782d82f84579e
10. AutoRAN Framework Demonstrates Near-100% Safety Guardrail Bypass in Leading Large Reasoning Models — evt_src_b05fa47162dc4d2b
11. Academic Research Documents 83.6% Jailbreak Success Rate Against Commercial Large Reasoning Models via Psychological Framing — evt_src_bcadb43b8f11fbf2
12. Academic Research Challenges Chain-of-Thought as Primary Reasoning Object in LLMs, Elevating Latent-State Dynamics — evt_src_02fd31faab2e6ef9