The system that decides what information, memory, tools, and instructions are supplied to each model step, in what format, and at what time.
Why it matters to DAIS: Determines answer quality, hallucination risk, and consistency of governed workflows by controlling what agents know at decision time.
Context engineering has moved from a theoretical concern to an active engineering discipline, with production deployments and peer-reviewed research converging on a shared diagnosis: naive message-history accumulation is structurally inadequate for long-running agentic systems. Slack's staff engineering team has publicly documented that one of its multi-agent applications spans hundreds of requests and generates megabytes of output, making full-context inclusion per request impractical and forcing a shift to structured memory, staged validation, and credibility-weighted evidence distillation.[1] Benchmarks reinforce the urgency: the MemGround benchmark, published March 2026, demonstrates that state-of-the-art LLMs and memory agents systematically fail at sustained dynamic tracking, temporal event association, and complex reasoning over long-term accumulated evidence.[2] MemEvoBench, the first benchmark specifically targeting long-horizon memory safety, further shows that memory evolution across multi-round interactions produces substantial safety degradation that static prompt-based defenses cannot address, with GPT-5 and Gemini-2.5-Pro achieving the lowest attack success rates among nine frontier models evaluated.[3]
Retrieval quality presents an equally sharp gap between benchmark performance and production reality. The RARE framework demonstrates that a strong retriever baseline scoring 66.4% PerfRecall@10 on general benchmarks drops to between 5.0% and 27.9% on domain-specific redundancy-aware benchmarks constructed from finance, legal, and patent corpora — a collapse driven by high document similarity that standard evaluation frameworks do not capture.[4] In financial document QA, the FinGround pipeline achieves a 78% hallucination reduction relative to GPT-4o by decomposing claims into a six-type financial taxonomy with type-routed verification, including formula reconstruction for computational errors that uniform detectors miss 43% of the time.[5]
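The kind of overstatement RARE measures can be reproduced with a toy metric pair (the corpus, clustering, and function names below are illustrative, not RARE's actual formulation): a standard per-document score looks perfect while a redundancy-aware score shows that half the distinct evidence never made it into the top-k.

```python
def precision_at_k(retrieved, relevant, k):
    """Standard metric: fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def cluster_coverage_at_k(retrieved, cluster_of, n_clusters, k):
    """Redundancy-aware variant: fraction of distinct evidence clusters
    covered by the top-k, so near-duplicate hits earn credit only once."""
    covered = {cluster_of[d] for d in retrieved[:k] if d in cluster_of}
    return len(covered) / n_clusters

# Ten near-duplicate filings of the same disclosure (cluster 0) plus one
# distinct risk note (cluster 1); the query needs evidence from both clusters.
cluster_of = {f"filing_{i}": 0 for i in range(10)}
cluster_of["risk_note"] = 1
relevant = set(cluster_of)

retrieved = [f"filing_{i}" for i in range(5)]   # top-5 are all duplicates

print(precision_at_k(retrieved, relevant, 5))              # 1.0 -- looks perfect
print(cluster_coverage_at_k(retrieved, cluster_of, 2, 5))  # 0.5 -- half missing
```

In a high-redundancy corpus the two scores diverge sharply, which is exactly the gap a general-purpose benchmark with distinct documents cannot surface.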
Several distinct architectural patterns have emerged across the research landscape. Structured memory with episodic-semantic separation is a recurring motif: APEX-MEM uses a property graph with domain-agnostic ontology and append-only storage to achieve 86.2% on LongMemEval and 88.88% on LOCOMO's QA task.[6] HeLa-Mem proposes a bio-inspired alternative modeling memory as a dynamic graph with Hebbian learning dynamics, distinguishing an episodic memory graph from a semantic store populated via Hebbian Distillation.[7] Evo-MedAgent deploys a three-store architecture — Retrospective Clinical Episodes, Adaptive Procedural Heuristics, and a Tool Reliability Controller — enabling training-free inter-case learning at test time on chest imaging benchmarks.[8]
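A minimal sketch of the episodic-semantic split with append-only storage (class and method names here are hypothetical simplifications; the cited systems use property graphs and learned dynamics rather than plain lists):

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    t: float
    text: str

class Memory:
    """Toy episodic-semantic store: episodes are immutable raw events;
    semantic facts are distilled versions that supersede, never delete."""

    def __init__(self):
        self.episodes = []        # append-only event log
        self.semantic = {}        # key -> list of fact versions

    def record(self, text):
        self.episodes.append(Episode(time.time(), text))

    def distill(self, key, fact):
        # New versions are appended, never overwritten, so provenance survives.
        self.semantic.setdefault(key, []).append(fact)

    def current(self, key):
        versions = self.semantic.get(key, [])
        return versions[-1] if versions else None

m = Memory()
m.record("user: I moved from Lyon to Berlin last month")
m.distill("user_city", "Lyon")      # stale fact from an earlier session
m.distill("user_city", "Berlin")    # superseding version; old one retained

print(m.current("user_city"))        # Berlin
print(len(m.semantic["user_city"]))  # 2 -- full history preserved
```

The append-only invariant is what makes temporal reasoning (and audit of how a belief changed) possible downstream; a store that overwrites in place discards exactly the evidence those benchmarks test for.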
Plan-conditioned and hierarchical retrieval represents a second cluster. A-MAR conditions retrieval on structured reasoning plans derived before any retrieval is performed, outperforming static retrieval and strong multimodal LLM baselines on SemArt and Artpedia.[9] H-TechniqueRAG applies a two-stage hierarchical retrieval mechanism to cyber threat intelligence annotation, reducing the candidate search space by 77.5%, cutting LLM API calls by 60%, and improving F1 over the prior state-of-the-art TechniqueRAG by 3.8%.[10] Skill-RAG couples a lightweight hidden-state prober with a prompt-based skill router that selects among four discrete retrieval skills — query rewriting, question decomposition, evidence focusing, and an exit skill — gating retrieval at two pipeline stages.[11]
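The gating idea behind Skill-RAG can be sketched as follows, with the hidden-state prober replaced by a trivial rule-based stub (all names, signals, and skill behaviors below are illustrative):

```python
def probe(state):
    """Stub prober: the real system reads hidden-state representations;
    here labeled failure signals map straight to a skill name."""
    if state.get("evidence_overload"):
        return "focus"
    if state.get("multi_hop"):
        return "decompose"
    if state.get("low_recall"):
        return "rewrite"
    return "exit"

SKILLS = {
    "rewrite":   lambda q: [q + " (rephrased)"],
    "decompose": lambda q: [p.strip() for p in q.split(" and ")],
    "focus":     lambda q: [q],     # re-query, then prune evidence downstream
    "exit":      lambda q: [],      # confident enough: skip retrieval entirely
}

def route(query, state):
    skill = probe(state)
    return skill, SKILLS[skill](query)

skill, subqueries = route(
    "who founded Acme and when did it IPO",
    {"multi_hop": True, "low_recall": True},
)
print(skill, subqueries)   # decompose ['who founded Acme', 'when did it IPO']
```

The interesting design choice is the discrete skill vocabulary with an explicit exit action: retrieval becomes a decision the pipeline can decline, rather than an unconditional stage.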
Stateful, evidence-driven RAG is a third pattern: Vocalbeats.AI researchers and a parallel March 2026 arXiv submission independently propose persistent evidence pools that carry explicit relevance and confidence signals across iterative retrieval cycles, preserving conflicting and non-supportive evidence rather than discarding it.[12][13] RUMS, published jointly by Microsoft Research and the University of Washington, takes a complementary angle on memory selection for personalization: rather than ranking by semantic similarity, it scores candidate memory subsets by conditional mutual information with the model's output distribution, achieving up to 95% computational cost reduction while outperforming baselines and models 400 times its size.[14][15]
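A minimal sketch of a persistent evidence pool (the unit schema and method names are hypothetical; the cited frameworks derive relevance and confidence from the retriever and reader rather than accepting them as inputs):

```python
class EvidencePool:
    """Toy persistent evidence pool: each unit carries a stance plus
    relevance and confidence scores, and conflicting evidence is
    retained and flagged rather than dropped between cycles."""

    def __init__(self):
        self.units = []

    def add(self, claim, stance, relevance, confidence):
        assert stance in ("supports", "conflicts", "neutral")
        self.units.append({"claim": claim, "stance": stance,
                           "relevance": relevance, "confidence": confidence})

    def supported(self, min_conf=0.5):
        return [u["claim"] for u in self.units
                if u["stance"] == "supports" and u["confidence"] >= min_conf]

    def open_conflicts(self):
        # Conflicts survive every cycle, so a later retrieval round can
        # target them instead of silently re-discovering the disagreement.
        return [u["claim"] for u in self.units if u["stance"] == "conflicts"]

pool = EvidencePool()
pool.add("Revenue grew 12% in FY24", "supports", 0.9, 0.8)
pool.add("Revenue grew 8% in FY24", "conflicts", 0.9, 0.6)
pool.add("The CFO joined in 2019", "neutral", 0.2, 0.7)

print(pool.supported())       # ['Revenue grew 12% in FY24']
print(pool.open_conflicts())  # ['Revenue grew 8% in FY24']
```

Keeping the conflicting unit alive is the load-bearing detail: it turns iterative retrieval into a process that can converge on a resolution instead of oscillating between contradictory snippets.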
In robotics and embodied AI, the HELM framework from Tsinghua University, Alibaba Group, and Bengbu University names three execution-loop deficiencies — memory gap, verification gap, and recovery gap — and addresses them with a CLIP-indexed Episodic Memory Module, a learned State Verifier MLP, and a Harness Controller, achieving a 23.1-point task success improvement over OpenVLA on LIBERO-LONG.[16][17] The State Verifier's effectiveness is shown to depend critically on episodic memory access, establishing memory-conditioned pre-execution verification as a load-bearing architectural component rather than an optional augmentation.[17:1]
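The verify-then-execute loop can be sketched as follows, with HELM's learned State Verifier MLP and CLIP-indexed retrieval replaced by trivial stand-ins (all names and the memory layout below are illustrative):

```python
def verify(action, similar_episodes):
    """Stub verifier: HELM uses a learned MLP over state features; here we
    simply reject an action that previously failed in a similar state."""
    return not any(e["action"] == action and not e["success"]
                   for e in similar_episodes)

def step(proposed, recovery, episodic_memory, state_key):
    # Retrieve episodes of similar past states (CLIP-indexed in HELM).
    similar = episodic_memory.get(state_key, [])
    if verify(proposed, similar):
        return proposed
    return recovery          # harness-controller recovery path

memory = {"gripper_near_mug": [{"action": "grasp_fast", "success": False}]}

print(step("grasp_fast", "reorient_then_grasp", memory, "gripper_near_mug"))
print(step("grasp_fast", "reorient_then_grasp", memory, "novel_state"))
```

Note that in a novel state the memory lookup returns nothing and the verifier waves the action through, which mirrors the paper's finding: without episodic memory access, pre-execution verification degenerates to a pass-through.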
Several structural gaps remain unresolved. A citation analysis of 1,136 references across 22 primary papers on agent memory and skill discovery found a cross-community citation rate below 1%, indicating near-complete isolation between these research communities.[18] A companion finding is that every surveyed LLM agent system operates at a fixed, predetermined compression level — episodic memory at 5–20×, procedural skills at 50–500×, declarative rules at 1,000× or more — with none supporting adaptive cross-level compression, a gap termed "the missing diagonal."[18:1]
Context isolation presents an underexplored failure mode. An arXiv preprint by Cheng and Song documents a case in which a 23 KB prompt-engineered multi-modal system produced a closed-loop behavioral collapse, attributed to transformer attention's structural inability to enforce prompt-layer isolation between defined operational modes; a redesigned system using physical conversation termination produced no analogous failure.[19] Whether this failure generalizes beyond the single-subject case remains an open empirical question.
Evaluation infrastructure is also contested. DR3-Eval, released by a multi-institution consortium including Nanjing University and the National University of Singapore, identifies three gaps in existing benchmarks: DeepResearch Bench relies on live web access, making results non-reproducible; DeepResearchGym lacks grounding in authentic user workflows; and DRBench omits explicit modeling of noisy or misleading information. DR3-Eval addresses these with a static sandbox corpus of 100 tasks across 13 atomic domains.[20] The RARE framework similarly argues that standard RAG benchmarks assume distinct documents with minimal overlap, systematically overstating performance in the high-redundancy corpora that characterize regulated verticals.[4:1]
Several developments from April 2026 signal accelerating investment in context engineering infrastructure. UC Merced's LatentMAS framework enables training-free latent multi-agent collaboration by transferring full KV cache states between agents, with the companion Orthogonal Backfill (OBF) compression technique reducing inter-agent KV cache communication cost by 79.8%–89.4% across nine benchmarks.[21] GazeX encodes radiologist eye-tracking data — over 30,000 gaze keyframes from five radiologists — as a behavioral prior during pretraining, producing verifiable inspection trajectories and finding-linked localized regions evaluated against 231,835 radiographic studies.[22] This represents a concrete implementation of expert-structured context conditioning as an auditability mechanism.
Anthropic's Agent Skills specification, defining a structured directory format centered on a SKILL.md file with progressive-disclosure loading, has been adopted as an open standard for cross-platform skill portability, prompting researchers from NUS, UC Berkeley, and CUHK to publish a bilevel Monte Carlo Tree Search framework for automated skill optimization.[23] The CRVA-TGRAG framework addresses knowledge currency in vulnerability analysis — a domain where over 30,000 of 200,000+ known CVEs have been changed or updated — by combining Parent Document Segmentation with ensemble retrieval and teacher-guided preference optimization.[24] Collectively, these developments indicate that context engineering is consolidating around a set of recurring primitives: structured memory with lifecycle management, plan-conditioned retrieval, stateful evidence accumulation, and pre-execution verification.
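The progressive-disclosure idea reduces to a two-stage loader: frontmatter metadata enters the context for every installed skill, while a skill's full body is read only when the skill is triggered. A minimal sketch (the skill content is invented, and the frontmatter reader is a naive key-value scan, not a full YAML parser):

```python
SKILL_MD = """---
name: pdf-extraction
description: Extract tables and text from PDF files.
---
## Instructions
1. Open the file with the extraction tool ...
"""

def split_skill(text):
    # Naive frontmatter reader ("key: value" lines between --- fences).
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

def build_context(skills, triggered=None):
    """Stage 1: every skill contributes only name + description.
    Stage 2: the triggered skill's full body is loaded on demand."""
    ctx = [f"{m['name']}: {m['description']}" for m, _ in skills]
    if triggered is not None:
        ctx.append(skills[triggered][1])
    return ctx

skills = [split_skill(SKILL_MD)]
print(build_context(skills))          # metadata only
print(len(build_context(skills, 0)))  # 2 -- body added once triggered
```

The payoff is context economy: a large installed skill library costs a few lines per skill until one is actually needed, which is what makes cross-platform skill portability compatible with finite context budgets.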
These patterns indicate content relevant to this plane:
How memory is retained, pruned, and rehydrated across runs.
Signal quality of retrieval, grounding, and citation integrity.
Look for concrete mechanisms that change what the model sees, remembers, or forgets across steps.
Use these rules when content could belong to multiple planes:
These articles were classified with this plane as their primary mapping.
Slack staff software engineer Dominic Marks has publicly detailed a three-channel context management architecture used in production multi-agent systems at Slack, moving away from message-history accumulation toward structured memory, staged validation, and credibility-weighted evidence distillation to maintain coherence across long-running agentic sessions.
A peer-reviewed arXiv paper introduces FinGround, a three-stage verify-then-ground pipeline for financial document QA that achieves 78% hallucination reduction relative to GPT-4o and 68% reduction over the strongest baseline under retrieval-equalized evaluation. The paper explicitly frames hallucination detection as a compliance requirement tied to the EU AI Act's August 2026 high-risk enforcement deadline, and demonstrates cost-controlled verification at $0.003 per query via an 8B distilled detector.
Researchers from Tsinghua University, Alibaba Group, and Bengbu University published HELM, a framework that addresses structural long-horizon memory failures in Vision-Language-Action (VLA) models. The work demonstrates that extending context windows alone does not close the performance gap in multi-step robotic manipulation tasks, and introduces an Episodic Memory Module with CLIP-indexed keyframe retrieval and a pre-execution State Verifier MLP. All experiments are conducted in simulation; real-robot deployment has not been validated.
A peer-reviewed arXiv submission introduces HELM, a model-agnostic framework for vision-language-action manipulation that addresses three named execution-loop deficiencies — memory gap, verification gap, and recovery gap — through three coupled components: an Episodic Memory Module, a learned State Verifier, and a Harness Controller. Empirical results show a 23.1-point task success improvement over OpenVLA on LIBERO-LONG, with the State Verifier's effectiveness shown to depend critically on episodic memory access. The work also releases LIBERO-Recovery as a standardized perturbation-injection evaluation protocol.
A new academic framework called RARE (Redundancy-Aware Retrieval Evaluation) demonstrates that standard RAG retrieval benchmarks significantly overstate real-world performance in high-similarity corpora such as financial reports, legal codes, and patents. A strong retriever baseline scoring 66.4% on general benchmarks drops to 5.0–27.9% on domain-specific redundancy-aware benchmarks, revealing a material gap between benchmark validation and production robustness in regulated verticals.
A peer-reviewed study submitted April 20, 2026 characterizes the geometric structure of Google AlphaEarth's 64-dimensional land surface embeddings across 12.1 million Continental U.S. samples (2017–2023), and demonstrates that retrieval-based agentic reasoning outperforms parametric-only approaches on environmental queries. The study introduces a nine-tool agentic system over a FAISS-indexed embedding database and benchmarks two model tiers, finding asymmetric payoff from geometric grounding tools across model capability levels.
Researchers published A-MAR, an agent-based multimodal retrieval framework that conditions retrieval on structured reasoning plans rather than direct query-to-retrieval pipelines. Evaluated on SemArt, Artpedia, and the newly introduced ArtCoT-QA benchmark, A-MAR outperforms static retrieval and strong multimodal LLM baselines on explanation quality and multi-step reasoning. Code and data are publicly available, lowering adoption barriers for the pattern.
Academic researchers published RAVEN, a multi-agent framework combining LLM agents and Retrieval-Augmented Generation to automate memory corruption vulnerability analysis and structured report generation, evaluated against NIST-SARD benchmarks with a 54.21% average quality score.
Researchers have published HeLa-Mem, a bio-inspired memory architecture for LLM agents that models memory as a dynamic graph using Hebbian learning dynamics, demonstrating superior benchmark performance with fewer context tokens compared to existing embedding-based retrieval approaches. The open-source release signals active research momentum in structured, learnable memory lifecycle management for agentic systems.
Academic researchers from four universities have published MemEvoBench, the first benchmark specifically designed to evaluate long-horizon memory safety in LLM agents. The benchmark demonstrates that memory evolution — the accumulation and reinforcement of agent memory across multi-round interactions — produces substantial safety degradation that static prompt-based defenses cannot address. Nine frontier models were evaluated, with GPT-5 and Gemini-2.5-Pro achieving the lowest attack success rates. The findings establish memory lifecycle management as a first-order safety and reliability concern for enterprise agentic deployments.
A paper submitted to arXiv on 17 April 2026 introduces Skill-RAG, a retrieval-augmented generation framework that uses a lightweight hidden-state prober and a prompt-based skill router to detect and correct retrieval failure states during multi-turn generation. Experiments show accuracy improvements on hard cases and out-of-distribution datasets, with representation-space analyses confirming that failure states occupy structured, separable regions.
An arXiv preprint by independent researchers Cheng and Song documents a single-subject case in which a 23 KB prompt-engineered multi-modal LLM system produced a closed-loop behavioral collapse, attributing the failure to a structural property of transformer attention rather than to adversarial input or user error. A redesigned system using physical conversation termination between modes produced no analogous failure, providing a controlled architectural contrast.
Researchers from NUS, UC Berkeley, and CUHK have published a bilevel optimization framework that uses Monte Carlo Tree Search to automatically optimize the structure and content of LLM agent skills, building directly on Anthropic's Agent Skills specification, which has been adopted as an open standard for cross-platform skill portability. The work signals growing academic and ecosystem investment in systematic, automated methods for improving agent capability design.
A peer-reviewed paper submitted to arXiv cs.AI demonstrates a working method combining Knowledge Graphs and LLMs to generate traceable, user-friendly explanations of ML model outputs in manufacturing environments, evaluated against structured XAI metrics across 33 questions.
A peer-reviewed survey submitted to arXiv cs.AI on April 17, 2026 systematically categorizes methods for integrating graphs with large language models and agents, covering multiple graph modalities and integration strategies across domains including cybersecurity, healthcare, finance, and materials science, while identifying a gap in practical guidance on when and how to apply each approach.
A peer-reviewed arXiv paper submitted April 17, 2026 introduces the Experience Compression Spectrum, a unifying framework that maps LLM agent memory, skills, and rules onto a single compression axis. Analysis of 20+ existing systems and 1,136 references across 22 papers reveals near-zero cross-community knowledge transfer, fixed compression levels across all surveyed systems, and neglected knowledge lifecycle management — collectively signaling a structural gap in the current LLM agent ecosystem.
A research paper submitted to arXiv on April 15, 2026 introduces APEX-MEM, a conversational memory system using a property graph with domain-agnostic ontology, append-only storage, and a multi-tool retrieval agent. The system achieves 86.2% on LongMemEval and 88.88% on LOCOMO's QA task, establishing a measurable benchmark for structured, temporally-grounded memory in long-running conversational AI.
A March 2026 arXiv paper introduces CRVA-TGRAG, a two-stage retrieval-augmented generation framework targeting knowledge conflict and staleness in vulnerability analysis, combining ensemble retrieval techniques with teacher-guided preference optimization to improve CVE detection accuracy.
Researchers at Vocalbeats.AI (Singapore) have published a stateful, evidence-driven RAG framework that converts retrieved documents into structured reasoning units with explicit relevance and confidence signals, maintained in a persistent evidence pool across iterative retrieval cycles. The framework demonstrates consistent benchmark improvements over standard RAG and multi-step baselines and is positioned within the emerging agentic RAG paradigm.
A March 2026 arXiv paper in the Computation and Language category introduces a RAG framework that models question answering as progressive evidence accumulation, using persistent evidence pools, structured reasoning units with explicit relevance and confidence signals, and iterative query refinement to outperform standard RAG and multi-step baselines while maintaining stability under retrieval noise.
These articles touch this plane but are primarily mapped elsewhere.
Researchers from Renmin University of China and Huawei Noah's Ark Lab have published AdaPlan-H, a self-adaptive hierarchical planning mechanism for LLM agents that initiates with coarse-grained macro plans and progressively refines them based on task complexity. The framework uses a two-stage optimization process (imitation learning and capability enhancement via SFT and DPO) and is validated on embodied and text-based agent benchmarks using multiple open-weight and proprietary models. Code and data are publicly released.
Researchers from Hubei University, Apple Inc, and Huazhong University of Science and Technology published PhySE, a validated AR-LLM social engineering framework that combines real-time multimodal capture, adaptive psychological strategy routing, and VLM-based cold-start profiling. An IRB-approved study with 60 participants confirmed the system outperforms prior baselines on social experience scores and profile generation latency, establishing a documented attack surface for agentic systems operating in high-trust social contexts.
A peer-reviewed arXiv paper submitted April 25, 2026 introduces PhySE, a psychological framework enabling real-time social engineering attacks via AR glasses and LLMs. The framework combines VLM-based profiling and adaptive psychological agent behavior, validated through an IRB-approved study with 60 participants and 360 annotated conversations. The research empirically documents that current RAG-based profiling introduces latency vulnerabilities and that adaptive LLM agents can be weaponized for context-aware manipulation without static scripts.
Researchers at Rensselaer Polytechnic Institute have published a controlled experimental study demonstrating that a multi-agent LLM pipeline — decomposed into Domain Expert, Manager, Coder, and Quality Assurer roles — significantly improves structural quality in automated ontology generation from unstructured insurance contract text, with gains driven primarily by front-loaded planning. The study also surfaces concrete failure modes in single-agent baselines including poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair.
A peer-reviewed arXiv paper submitted April 25, 2026 demonstrates that decomposing ontology construction into four specialized agent roles — Domain Expert, Manager, Coder, and Quality Assurer — significantly improves structural quality over single-agent LLM baselines, with performance gains driven primarily by front-loaded planning. The study used domain-specific insurance contracts as its experimental corpus and evaluated outputs via heterogeneous LLM judges and competency-question-driven SPARQL assessment.
A peer-reviewed arXiv paper introduces Analytica, a novel agent architecture using Soft Propositional Reasoning (SPR) that achieves 15.84% average accuracy improvement over base models on economic, financial, and political forecasting tasks, with a cost-efficient Jupyter Notebook grounder variant delivering comparable accuracy at 90.35% lower cost. The work formalizes bias-variance decomposition as a design principle for LLM reasoning systems and demonstrates near-linear time complexity at scale.
Researchers released FormalScience, a domain-agnostic human-in-the-loop agentic pipeline for converting informal scientific reasoning into formal Lean4 proofs, accompanied by FormalPhysics — a 200-problem university-level physics benchmark with formally verified representations. The work introduces the first systematic characterisation of semantic drift in physics autoformalisation and publicly releases both the codebase and an interactive UI system.
A peer-reviewed arXiv paper submitted April 24, 2026 demonstrates that training language models on power-law-distributed data consistently outperforms uniform distribution training on compositional reasoning tasks, and provably requires significantly less training data — with the mechanism traced to a beneficial asymmetry in the loss landscape that enables high-frequency skill compositions to scaffold acquisition of rare long-tail skills.
Researchers have published the first dataset and expert evaluation framework for assessing open-ended legal reasoning by LLMs within the Japanese jurisdiction, based on the writing component of the Japanese bar examination. The study includes manual hallucination analysis and legal expert evaluation, with all resources to be made publicly available.
A peer-reviewed arXiv paper (cs.AI, submitted 26 April 2026) introduces DxChain, a chain-based clinical reasoning framework that achieves state-of-the-art performance on diagnostic accuracy and logical consistency across two real-world MIMIC-IV benchmarks. The framework operationalizes a three-phase cognitive cycle — Memory Anchoring, Navigation, and Verification — and introduces adversarial debate, tree-of-thoughts planning, and cold-start hallucination mitigation as named, measurable architectural components. The work is publicly available and represents a validated reference pattern for structured agentic reasoning in a regulated domain.
A peer-reviewed paper submitted to arXiv on 25 April 2026 introduces GSAR, a typed grounding and hallucination recovery framework for multi-agent LLMs. The authors claim it is the first published framework coupling evidence-typed scoring with tiered recovery under an explicit compute budget. Evaluation was conducted on the FEVER dataset using four independently-trained frontier LLM judges, with statistically robust results across all ablations.
A new arXiv paper introduces StoryTR, the first video moment retrieval benchmark requiring Theory of Mind (ToM) reasoning, comprising 8,100 samples from narrative short-form videos. A 7B model trained on ToM-guided synthetic data achieves a +15.1% relative IoU improvement over baselines, outperforming Gemini-3.0-Pro (0.53 average IoU) on the benchmark. The result signals that explicit, structured reasoning chains in training data can close capability gaps against significantly larger frontier models on narrative reasoning tasks.
A peer-reviewed arXiv paper (cs.CR, submitted 21 April 2026) introduces Phoenix, a training-free multi-agent framework for vulnerability detection that uses Behavioral Contract Synthesis. Phoenix achieves F1 = 0.825 on PrimeVul Paired using 7–14B open-source models — up to 48x smaller than competing approaches — while exposing a systemic benchmark reliability failure in legacy deep learning vulnerability detection models.
Researchers submitted a paper to arXiv cs.AI introducing MAGEO, a multi-agent framework that reframes Generative Engine Optimization as a strategy learning problem. The framework distills validated editing patterns into reusable, engine-specific skills and introduces new evaluation protocols and benchmarks. Experiments across three generative engines show MAGEO outperforms heuristic baselines on visibility and citation fidelity metrics.
Researchers from Hong Kong Generative AI Research & Development Center, HKUST, and HKBU have published ClawNet, an open-source identity-governed agent collaboration framework built on OpenClaw, introducing three governance primitives — identity binding, scoped authorization, and action-level accountability — for cross-user multi-agent systems. The paper explicitly identifies the absence of governance infrastructure in current multi-agent frameworks as a market gap, and demonstrates the architecture in cross-organizational deployment scenarios.
Add implementation guidance, patterns, and reference material here.
Track open research questions and emerging developments for this plane.
Slack Publishes Production Architecture for Context Management in Long-Running Multi-Agent Systems — evt_src_0313be6c61bfc8f6 ↩︎
MemGround Benchmark Exposes Systematic Memory Failures in State-of-the-Art LLMs and Memory Agents — evt_src_10e7059dd4c396f5 ↩︎
MemEvoBench: First Benchmark for Long-Horizon Memory Safety in LLM Agents Reveals Structural Vulnerabilities in Memory Evolution — evt_src_0f1111ccebc84525 ↩︎
RARE Framework Exposes Critical RAG Retrieval Performance Gaps in High-Redundancy Enterprise Corpora — evt_src_0304f1582278176f ↩︎ ↩︎
FinGround Research Establishes Atomic Claim Verification as Emerging Standard for Financial AI Assurance — evt_src_bc0c167764eedfd0 ↩︎
APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning Advances Long-Term Conversational AI Benchmarks — evt_src_691144544d083341 ↩︎
HeLa-Mem: Bio-Inspired Graph-Based Memory Architecture for LLM Agents Published on arXiv — evt_src_3505ce126257a510 ↩︎
Evo-MedAgent: Self-Evolving Memory Architecture Demonstrates Training-Free Inter-Case Learning for Medical AI Agents — evt_src_7e9f8d5220716692 ↩︎
A-MAR Framework Demonstrates Plan-First Retrieval Conditioning as Validated Architecture for Structured Reasoning in Multimodal AI — evt_src_4f31ac4dc135d33b ↩︎
Hierarchical RAG Framework Demonstrates 62.4% Latency Reduction and 77.5% Candidate Space Reduction for CTI Technique Annotation — evt_src_acdf00c2d40be0e3 ↩︎
Skill-RAG: Academic Research Introduces Failure-State-Aware Retrieval Framework with Hidden-State Probing and Skill Routing — evt_src_a7c35ab73f02869e ↩︎
Vocalbeats.AI Researchers Publish Stateful Evidence-Driven RAG Framework with Iterative Reasoning — evt_src_f981abf2af10fe47 ↩︎
Academic Research Advances Stateful, Evidence-Driven RAG with Iterative Reasoning for Multi-Step QA — evt_src_8e90cce29d295897 ↩︎
Academic Research Proposes Response-Aware Memory Selection Method (RUMS) for LLM Personalization with 95% Computational Cost Reduction — evt_src_99fd5c6c267b3fd5 ↩︎
Microsoft Research and University of Washington Publish Response-Aware Memory Selection Method for LLM Personalization — evt_src_0e61bdf0b7c4ac42 ↩︎
HELM Research Demonstrates Structural Memory Gap in Vision-Language-Action Models, Introduces Pre-Execution Verification and Episodic Memory Architecture — evt_src_3dc129ab42eb1e64 ↩︎
HELM Framework Demonstrates Verification-Conditioned Execution and Episodic Memory as Load-Bearing Components in Long-Horizon Agentic Systems — evt_src_114539a023cfd0d0 ↩︎ ↩︎
Academic Research Identifies Structural Fragmentation in LLM Agent Memory and Skill Communities, Proposes Unified Compression Framework — evt_src_7fd403cdfde7c52a ↩︎ ↩︎
Peer-Reviewed Case Study Documents Structural Failure of Prompt-Layer Isolation in Multi-Modal Human-LLM Systems — evt_src_a1efa8f4816161d5 ↩︎
DR3-Eval: Academic Consortium Releases Reproducible Multimodal Benchmark for Deep Research Agents — evt_src_d1cd821232204350 ↩︎
UC Merced Research Demonstrates 80%+ KV Cache Compression for Latent Multi-Agent LLM Collaboration — evt_src_d41e819e90e06c2b ↩︎
GazeX: Radiologist Gaze-Conditioned Vision Language Model Demonstrates Expert Behavioral Priors as Context Engineering Mechanism — evt_src_a5bcad33fe3ab62e ↩︎
Academic Research Formalizes Bilevel MCTS Framework for Automated Agent Skill Optimization, Building on Anthropic's Open Skill Specification — evt_src_e30cf8e97f2ad4d0 ↩︎
Academic Research Proposes Two-Stage RAG Framework for CVE Knowledge Conflict Resolution — evt_src_ec60ee7de14bf568 ↩︎