Part of 3.1 Context Engineering Plane
Within the Encapsulated AI reference architecture, the Data Sources & Connectors layer governs how external information is ingested, indexed, and made retrievable by agentic systems. The briefs surveyed here reveal a landscape dominated by vector-indexed retrieval, structured graph schemas, and domain-specific embedding corpora — each representing a distinct integration pattern for grounding LLM reasoning in external data.
The most concrete instantiation of an external data connector in the surveyed literature is Google AlphaEarth's 64-dimensional land surface embedding corpus, indexed via FAISS across 12.1 million continental U.S. samples (2017–2023). Research characterizing this corpus demonstrates that retrieval-augmented agentic reasoning over a FAISS index outperforms parametric-only approaches on environmental queries, with a nine-tool agentic system built directly atop the indexed embedding database.[1] The AlphaEarth manifold exhibits high geometric complexity (effective dimensionality 13.3; local intrinsic dimensionality ~10), so connector design must account for non-uniform geometry across the embedding space: local geometry predicts retrieval coherence with an R² of 0.32.[1:1]
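The vector-connector pattern described above reduces to nearest-neighbor search over an embedding corpus. A minimal sketch, assuming toy 4-dimensional vectors in place of AlphaEarth's 64-dimensional embeddings and hypothetical tile identifiers; a production connector would delegate the scan to a FAISS index rather than brute-force it:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query, corpus, k=3):
    """Brute-force nearest-neighbor retrieval over an embedding corpus.
    A real deployment would replace this scan with a FAISS index."""
    scored = sorted(
        ((cosine_similarity(query, vec), sample_id)
         for sample_id, vec in corpus.items()),
        reverse=True,
    )
    return [sample_id for _, sample_id in scored[:k]]

# Toy 4-dimensional stand-ins for 64-dimensional land surface embeddings.
corpus = {
    "tile_a": [1.0, 0.0, 0.0, 0.0],
    "tile_b": [0.0, 1.0, 0.0, 0.0],
    "tile_c": [0.9, 0.1, 0.0, 0.0],
}
print(retrieve_top_k([1.0, 0.0, 0.0, 0.0], corpus, k=2))  # → ['tile_a', 'tile_c']
```

The non-uniform retrieval coherence noted above means top-k quality from such a scan varies by region of the embedding space, which a connector can surface as a per-query confidence signal.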
For conversational data, APEX-MEM demonstrates a property graph connector with a domain-agnostic ontology and append-only storage, achieving 86.2% on LongMemEval and 88.88% on LOCOMO's QA task. Its multi-tool retrieval agent treats the property graph as a temporally grounded event store, enabling structured access to long-horizon conversational history.[2] Evo-MedAgent similarly employs a three-store memory architecture — Retrospective Clinical Episodes, Adaptive Procedural Heuristics, and a Tool Reliability Controller — as a training-free connector layer over frozen model backends, with per-case overhead bounded to one retrieval pass and one reflection call.[3]
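The append-only, temporally grounded event store that APEX-MEM's retrieval agent queries can be sketched as follows. This is an illustrative reduction, not APEX-MEM's implementation: the class and method names are invented, and ISO-8601 string comparison stands in for real temporal indexing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """A temporally grounded conversational event (a node in the property graph)."""
    event_id: str
    timestamp: str   # ISO-8601, so lexicographic order equals temporal order
    properties: dict

class AppendOnlyEventGraph:
    """Minimal append-only property graph: events are never mutated or
    deleted; history grows only by new events and new edges."""

    def __init__(self):
        self._events = {}
        self._edges = []   # (src_id, relation, dst_id)

    def append_event(self, event):
        if event.event_id in self._events:
            raise ValueError("append-only store: existing events are immutable")
        self._events[event.event_id] = event

    def link(self, src_id, relation, dst_id):
        self._edges.append((src_id, relation, dst_id))

    def events_between(self, start, end):
        """Temporal query: events whose timestamps fall within [start, end]."""
        return sorted(
            (e for e in self._events.values() if start <= e.timestamp <= end),
            key=lambda e: e.timestamp,
        )

g = AppendOnlyEventGraph()
g.append_event(Event("e1", "2024-01-05T10:00:00", {"topic": "greeting"}))
g.append_event(Event("e2", "2024-03-02T09:30:00", {"topic": "follow-up"}))
g.link("e1", "PRECEDES", "e2")
print([e.event_id for e in g.events_between("2024-01-01", "2024-02-01")])
```

The append-only invariant is what makes the store safe as a long-horizon memory: retrieval agents can cache or re-derive views without worrying about in-place edits invalidating them.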
Data warehouse schemas represent a second major connector class. DW-Bench, introduced April 2026, formalizes LLM reasoning over data warehouse graph topology using dual-edge schemas that integrate both foreign-key and data-lineage edges, spanning five schemas and 1,046 verifiable questions. Tool-augmented retrieval methods substantially outperform static approaches on this benchmark, though a documented capability ceiling emerges on hard compositional subtypes.[4]
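The dual-edge idea can be sketched as a schema graph whose edges carry a kind tag, so a retrieval tool can traverse join structure and provenance separately. A hedged sketch with invented class, table, and edge-kind names (DW-Bench's actual schema representation is not specified in the brief):

```python
class DualEdgeSchemaGraph:
    """Warehouse schema as a graph with two edge kinds:
    'fk' (foreign-key joins) and 'lineage' (data provenance)."""

    def __init__(self):
        self.edges = []   # (src_table, dst_table, kind)

    def add_edge(self, src, dst, kind):
        if kind not in ("fk", "lineage"):
            raise ValueError(f"unknown edge kind: {kind}")
        self.edges.append((src, dst, kind))

    def neighbors(self, table, kinds=("fk", "lineage")):
        """One-hop neighbors reachable via the selected edge kinds."""
        return sorted({dst for src, dst, kind in self.edges
                       if src == table and kind in kinds})

# A toy star schema: the fact table joins to two dimensions via FK edges
# and derives from a staging table via a lineage edge.
g = DualEdgeSchemaGraph()
g.add_edge("fact_sales", "dim_store", "fk")
g.add_edge("fact_sales", "dim_date", "fk")
g.add_edge("fact_sales", "stg_raw_sales", "lineage")
print(g.neighbors("fact_sales", kinds=("fk",)))  # join-reachable tables only
```

Separating the two edge kinds is what lets a tool-augmented agent answer both "what can I join?" and "where did this column come from?" over the same graph, which is precisely the compositional reasoning DW-Bench probes.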
A broader survey of graph-LLM integration methods (arXiv, April 2026) organizes connector strategies along three axes: purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, causal graphs, dependency graphs), and integration strategy (prompting, augmentation, training, or agent-based use). The survey explicitly flags a gap in practical guidance on when each integration type is appropriate.[5]
In the cybersecurity domain, H-TechniqueRAG demonstrates a hierarchical two-stage retrieval connector over CTI corpora: a macro-level tactic retrieval pass narrows the candidate space by 77.5% before technique-level retrieval, reducing LLM API calls by 60% and inference latency by 62.4% while improving F1 by 3.8% over the prior state-of-the-art TechniqueRAG.[6]
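The two-stage structure generalizes beyond CTI: score a coarse index first, then run fine-grained retrieval only inside the winners. A minimal sketch using term overlap as a stand-in scorer (H-TechniqueRAG's actual retrievers and index contents are not specified here; all names and data are illustrative):

```python
def two_stage_retrieve(query_terms, tactic_index, technique_index, top_tactics=1):
    """Stage 1 ranks coarse tactics; stage 2 retrieves techniques only within
    the top-scoring tactics, shrinking the candidate space before any
    expensive (e.g. LLM-backed) re-ranking."""
    def overlap(doc_terms):
        return len(set(query_terms) & set(doc_terms))

    # Stage 1: macro-level tactic retrieval.
    ranked = sorted(tactic_index, key=lambda t: overlap(tactic_index[t]),
                    reverse=True)
    selected = set(ranked[:top_tactics])

    # Stage 2: technique-level retrieval restricted to the selected tactics.
    candidates = [(overlap(terms), tech)
                  for (tactic, tech), terms in technique_index.items()
                  if tactic in selected]
    return [tech for _, tech in sorted(candidates, reverse=True)]

# Toy indexes loosely modeled on ATT&CK-style tactic/technique naming.
tactic_index = {
    "initial-access": ["phishing", "email", "attachment"],
    "exfiltration": ["dns", "transfer", "tunnel"],
}
technique_index = {
    ("initial-access", "T1566"): ["phishing", "spearphishing"],
    ("initial-access", "T1189"): ["drive-by", "browser"],
    ("exfiltration", "T1048"): ["dns", "tunnel"],
}
print(two_stage_retrieve(["phishing", "email"], tactic_index, technique_index))
```

The candidate-space and latency reductions reported above come from this pruning: techniques under non-selected tactics are never scored, so downstream LLM calls operate on a much smaller slate.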
An emerging integration pattern treats tool generation itself as a connector mechanism. El Agente Forjador, a multi-agent framework evaluated across 24 quantum simulation tasks, demonstrates that coding agents can autonomously forge, validate, and reuse computational tools through a four-stage workflow (tool analysis → generation → execution → iterative evaluation), consistently outperforming direct problem-solving baselines and reducing API cost for lower-capability agents through toolset reuse.[7]
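The four-stage workflow can be sketched as a feedback loop, with caller-supplied callables standing in for El Agente Forjador's agent stages (the function names, signatures, and toy stage implementations here are assumptions, not the framework's API):

```python
def forge_tool(task, generate, execute, evaluate, max_rounds=3):
    """Iterative forge loop: generate a candidate tool for an analyzed task,
    execute it, evaluate the result, and feed the evaluation back into
    generation until validation passes or the round budget is exhausted."""
    feedback = None
    for _ in range(max_rounds):
        source = generate(task, feedback)   # generation (informed by prior feedback)
        result = execute(source)            # execution
        ok, feedback = evaluate(result)     # iterative evaluation
        if ok:
            return source, result           # validated tool, cached for reuse
    raise RuntimeError("tool failed validation within the round budget")

# Toy stages: the first candidate fails evaluation; feedback drives a fix.
def generate(task, feedback):
    return "tool_v2" if feedback else "tool_v1"

def execute(source):
    return source   # stand-in for actually running the generated code

def evaluate(result):
    return result == "tool_v2", "v1 output rejected; regenerate"

source, result = forge_tool("simulate_qubit", generate, execute, evaluate)
print(source)  # → tool_v2
```

Caching validated `source` artifacts keyed by task is what yields the reported API-cost savings: lower-capability agents reuse a forged tool instead of re-deriving it per task.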
The briefs provide strong coverage of vector and graph connector patterns but offer limited detail on protocol-level standards (e.g., MCP, OpenAPI schemas, or connector authentication flows). Connector failure modes — latency degradation, schema drift, and retrieval staleness — are not systematically addressed. The documented gap in graph-LLM integration guidance[5:1] extends to connector selection criteria more broadly: no surveyed work provides a decision framework for choosing among embedding-based, graph-structured, or tool-generated connector patterns for a given deployment context.
1. Research Characterizes AlphaEarth Embedding Geometry for Agentic Environmental Reasoning, Demonstrating Retrieval Superiority Over Parametric-Only Approaches — evt_src_9f0950074af0cdad
2. APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning Advances Long-Term Conversational AI Benchmarks — evt_src_691144544d083341
3. Evo-MedAgent: Self-Evolving Memory Architecture Demonstrates Training-Free Inter-Case Learning for Medical AI Agents — evt_src_7e9f8d5220716692
4. DW-Bench: New Academic Benchmark Exposes LLM Reasoning Limits on Data Warehouse Graph Topology — evt_src_f2b33e89d1a32bd8
5. arXiv Survey Maps Graph-LLM Integration Methods Across Reasoning, Retrieval, and Agent-Based Use — evt_src_48b50b8042868786
6. Hierarchical RAG Framework Demonstrates 62.4% Latency Reduction and 77.5% Candidate Space Reduction for CTI Technique Annotation — evt_src_acdf00c2d40be0e3
7. arXiv Research Demonstrates Autonomous Tool Generation and Reuse in Multi-Agent Framework for Quantum Simulation — evt_src_58af837fcefe554d