The runtime logic that turns intents into completed tasks: planning, decomposition, tool invocation, state transitions, and multi-step execution control.
Why it matters to DAIS: Directly impacts reliability, controllability, and throughput of DAIS agentic workflows under real enterprise task complexity.
The reasoning and execution plane is undergoing simultaneous pressure from three directions: commercial platforms are absorbing infrastructure responsibilities that were previously custom-built, academic benchmarks are quantifying hard capability ceilings that practitioners have only estimated, and formal research is exposing structural failure modes in the execution logic of frontier models themselves.
Anthropic has launched Managed Agents on the Claude platform — a managed execution layer priced at $0.08 per session hour that delegates orchestration, sandboxing, session state, credential handling, and persistence to a platform-native substrate.[1] In parallel, Anthropic's Claude Code now dispatches parallel agents for pull request review, with internal data reporting a rise in substantive review comments from 16% to 54% of PRs and a sub-1% false positive rate.[2] Anysphere's Cursor 3 represents a comparable structural shift in developer tooling: the product was rebuilt from scratch as an agent orchestration workspace, and Cursor's own usage data shows agent users now outnumber tab-completion users 2-to-1, a complete inversion from a 2.5-to-1 ratio in March 2025.[3]
Against this commercial momentum, benchmarks are establishing measurable ceilings. The DeepRed CTF benchmark found that the best of ten commercially accessible LLMs achieved only 35% average checkpoint completion on realistic multi-step security tasks, with performance degrading sharply on tasks requiring non-standard discovery.[4] DW-Bench, a 1,046-question benchmark of graph topology reasoning over enterprise data warehouse schemas, evaluated on Gemini 2.5 Flash, DeepSeek-V3, and Qwen2.5-72B, documents a systematic 30–40 percentage point drop on compositional multi-hop tasks versus single-hop queries, with a ceiling of 61% on hard subtypes against an oracle upper bound of ≥99.5%.[5]
Several distinct architectural approaches to improving reasoning and execution quality have emerged across industry and academia.
For hierarchical planning, researchers at Renmin University of China and Huawei Noah's Ark Lab published AdaPlan-H, a self-adaptive mechanism that begins with a coarse-grained macro plan and progressively refines it based on task complexity, trained in two stages: supervised fine-tuning via imitation learning on GPT-4o-generated plans, followed by Direct Preference Optimization guided by Monte Carlo evaluation.[6][7] The framework was validated across Llama, GPT-4o, Qwen, and GLM model families on both embodied and text-based agent benchmarks.[6:1]
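To make the coarse-to-fine pattern concrete, here is a minimal sketch of an adaptive planning loop in this spirit. The names (`draft_macro_plan`, `estimate_complexity`, `refine_step`) and the length-based complexity heuristic are illustrative stand-ins, not AdaPlan-H's actual interfaces; in the real framework each would be an LLM call shaped by the SFT and DPO stages above.

```python
# Minimal sketch of a coarse-to-fine planning loop in the spirit of AdaPlan-H.
from dataclasses import dataclass, field

@dataclass
class Step:
    description: str
    substeps: list["Step"] = field(default_factory=list)

def draft_macro_plan(task: str) -> list[Step]:
    # Stand-in for the imitation-learned macro planner.
    return [Step(f"{task}: phase {i}") for i in (1, 2, 3)]

def estimate_complexity(step: Step) -> float:
    # Stand-in for a learned complexity signal.
    return len(step.description) / 20.0

def refine_step(step: Step) -> list[Step]:
    # Stand-in for preference-optimized fine-grained refinement.
    return [Step(f"{step.description} / substep {j}") for j in (1, 2)]

def adaptive_plan(task: str, threshold: float = 1.0) -> list[Step]:
    plan = draft_macro_plan(task)
    for step in plan:
        if estimate_complexity(step) > threshold:  # refine only where needed
            step.substeps = refine_step(step)
    return plan

for step in adaptive_plan("assemble quarterly report"):
    print(step.description, "->", [s.description for s in step.substeps])
```

The structural point is that refinement cost is paid only where the complexity estimate demands it, which is how the authors report reducing overplanning.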
For parallel execution and staged verification, the Bloomberg-affiliated PExA system reformulates text-to-SQL generation using a software test coverage approach: three specialized sub-agents (Planner, Test Case Generator, SQL Proposer) execute atomic test-case SQLs in parallel before final SQL generation is committed, achieving 70.2% execution accuracy on Spider 2.0 — a state-of-the-art result at time of submission.[8][9]
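A minimal sketch of the parallel-exploration pattern follows, with SQLite standing in for the warehouse and plain functions standing in for the LLM-driven sub-agents; the schema, rows, and probe queries are invented for illustration.

```python
# Sketch of PExA-style parallel exploration: atomic test-case SQLs run
# concurrently, and only their observed results ground the final SQL.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

SCHEMA = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)"
ROWS = [(1, "EMEA", 10.0), (2, "APAC", 20.0), (3, "EMEA", 5.0)]

def run_test_case(sql: str):
    # Each probe gets its own in-memory DB so probes stay independent.
    con = sqlite3.connect(":memory:")
    con.execute(SCHEMA)
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)
    try:
        return sql, con.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return sql, None, str(exc)
    finally:
        con.close()

# "Test Case Generator" role: atomic probes validating one assumption each.
test_cases = [
    "SELECT DISTINCT region FROM orders",
    "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'",
]

# "Planner" role: dispatch all probes in parallel.
with ThreadPoolExecutor() as pool:
    evidence = list(pool.map(run_test_case, test_cases))

for sql, rows, err in evidence:
    print(sql, "->", rows if err is None else f"FAILED: {err}")

# "SQL Proposer" role: commit the final query only on grounded evidence.
if all(err is None for _, _, err in evidence):
    final_sql = "SELECT region, SUM(amount) FROM orders GROUP BY region"
    print("final:", run_test_case(final_sql)[1])
```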
For governance-enforced execution, researchers from CUHK, Shanghai Jiao Tong University, Zhejiang University, Peking University, and Tsinghua University published Arbiter-K, which demotes the LLM to a non-privileged Probabilistic Processing Unit encapsulated by a deterministic Symbolic Governor that enforces resource limits, taint checks, and access control lists before any intent reaches a deterministic sink.[10] Empirical evaluation found that native guardrails in Amazon Bedrock AgentCore and Anthropic Skills intercepted fewer than 9% of unsafe operations under adversarial conditions, while Arbiter-K achieved 76–95% interception.[10:1]
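The governor pattern itself is easy to sketch. The ACL table, taint rule, and call budget below are invented placeholders rather than Arbiter-K's actual policy language; the structural point is that every check is deterministic code sitting between the model and the sink, not model judgment.

```python
# Sketch of a deterministic governor gating LLM-proposed intents.
from dataclasses import dataclass

@dataclass
class Intent:
    tool: str
    args: dict
    tainted: bool  # True if args derive from untrusted input

ACL = {"read_file": "low", "send_email": "high"}  # tool -> required clearance

class SymbolicGovernor:
    def __init__(self, clearance: str, budget: int):
        self.clearance = clearance
        self.budget = budget  # max privileged calls per session

    def authorize(self, intent: Intent) -> bool:
        required = ACL.get(intent.tool)
        if required is None:
            return False          # unknown tool: deny by default
        if required == "high" and self.clearance != "high":
            return False          # access-control check
        if intent.tainted and required == "high":
            return False          # taint must never reach a privileged sink
        if self.budget <= 0:
            return False          # resource limit exhausted
        self.budget -= 1
        return True

gov = SymbolicGovernor(clearance="low", budget=5)
for intent in (Intent("read_file", {"path": "notes.txt"}, tainted=False),
               Intent("send_email", {"to": "x@example.com"}, tainted=True)):
    print(intent.tool, "->", "ALLOW" if gov.authorize(intent) else "DENY")
```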
For training-time execution quality, ByteDance Seed and ETH Zurich published a rubric-based Generative Reward Model (GRM) approach that provides structured intermediate feedback beyond binary terminal rewards for software engineering agents, outperforming terminal-score-only rejection sampling — addressing the documented limitation that training solely on verifiable end rewards cannot eliminate inefficient intermediate steps.[11] Separately, KAIST researchers identified that large reasoning models trained on math and coding traces follow a two-step structure (problem understanding, then solution) that lacks a harmfulness assessment gate, enabling harmful output generation even when harmful intent is correctly detected.[12]
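A toy sketch shows why rubric-scored intermediate feedback separates trajectories that terminal rewards cannot: both trajectories below pass their terminal tests, but only the rubric score penalizes the wasteful one. The rubric items and weights are invented for illustration.

```python
# Toy rubric-based scoring over agent trajectories.
def rubric_score(step: dict) -> float:
    score = 0.0
    if step.get("ran_tests_before_edit"):
        score += 1.0   # rubric item: verify assumptions before acting
    if not step.get("redundant"):
        score += 0.5   # rubric item: avoid repeated no-op actions
    return score

def trajectory_reward(traj: dict, terminal_weight: float = 2.0) -> float:
    intermediate = sum(rubric_score(s) for s in traj["steps"])
    terminal = terminal_weight * float(traj["tests_pass"])
    return intermediate + terminal

trajectories = [
    {"steps": [{"ran_tests_before_edit": True, "redundant": False}],
     "tests_pass": True},
    {"steps": [{"ran_tests_before_edit": False, "redundant": True},
               {"ran_tests_before_edit": False, "redundant": True}],
     "tests_pass": True},  # terminally identical, intermediately wasteful
]

for traj in trajectories:
    print(trajectory_reward(traj))  # 3.5 vs 2.0: the rubric breaks the tie
```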
Several unresolved tensions cut across the briefs.
Faithfulness versus compilation rate remains an open measurement problem. A study evaluating GPT-5 and DeepSeek-R1 on 303 first-order logic problems found compilation rates of 87–99% that masked distinct unfaithfulness failure modes: DeepSeek-R1 mistranslates premises in ways that evade detection, while GPT-5 fabricates axioms in ways that are detectable via cross-stage comparison.[13] High surface-level metrics do not reliably indicate semantic correctness of the underlying reasoning chain.
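The cross-stage comparison that catches fabrication of the kind attributed to GPT-5 can be sketched as a set check between source premises and compiled axioms; the string normalization below is a toy stand-in for structural FOL matching. Note the asymmetry: this check cannot catch mistranslation of the kind attributed to DeepSeek-R1, where the compiled formula has a source counterpart but the wrong meaning.

```python
# Toy cross-stage check: flag compiled axioms with no source counterpart.
def normalize(formula: str) -> str:
    return "".join(formula.lower().split())

def fabricated_axioms(premises: list[str], compiled: list[str]) -> list[str]:
    source = {normalize(p) for p in premises}
    return [ax for ax in compiled if normalize(ax) not in source]

premises = ["forall x (Cat(x) -> Mammal(x))", "Cat(tom)"]
compiled = ["forall x (Cat(x) -> Mammal(x))", "Cat(tom)",
            "forall x (Mammal(x) -> Pet(x))"]  # stated nowhere in the premises

print(fabricated_axioms(premises, compiled))
# ['forall x (Mammal(x) -> Pet(x))'] is flagged for prover or human review.
```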
Deterministic replayability is undermined by a finding that KV cache ON and cache OFF inference paths produce 100% token divergence rates across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B — including under greedy decoding — due solely to FP16 non-associativity in floating-point accumulation ordering.[14] This has direct implications for any execution architecture that assumes reproducibility across cached and uncached inference runs.
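The mechanism is reproducible without any model at all: FP16 addition is not associative, so two mathematically identical reductions performed in different orders typically disagree, and cached versus uncached attention paths amplify that disagreement into token divergence. A self-contained illustration:

```python
# FP16 summation is order-dependent: the same 4096 values reduced forward
# and backward typically round differently, with no model involved at all.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)

forward = np.float16(0)
for v in x:          # one accumulation order
    forward += v

backward = np.float16(0)
for v in x[::-1]:    # same values, opposite order
    backward += v

print(forward, backward, bool(forward == backward))
# Typically two slightly different FP16 values: the seed divergence that
# cached and uncached attention paths grow into full token divergence.
```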
Scale and behavioral control are not reliably correlated. MEDLEY-BENCH, evaluated across 35 models from 12 families, found that evaluation ability increases with model size within families, but behavioral control does not — and smaller models frequently match or outperform larger ones on metacognitive competence.[15] This complicates assumptions that larger models will self-regulate more reliably in agentic loops.
Shortcut detection in neurosymbolic systems has been formally established as coNP-complete by researchers from Japan's National Institute of Informatics and NTT, meaning that constraint satisfaction alone cannot guarantee correct concept mapping and that verification costs scale adversarially.[16]
Several signals from April–May 2026 indicate active movement on execution infrastructure and reasoning quality.
Researchers from Nanjing University, Alibaba Group, and Ant Group identified structured, measurable attention patterns in thinking LLMs — including DeepSeek-R1, GPT-5, and Gemini 3 series — that correlate with correctness on quantitative reasoning tasks, and demonstrated a training-free steering approach yielding up to 2.6% accuracy improvement on GSM8K.[17] The Self-Reading Quality (SRQ) scoring method combines geometric and semantic metrics observable at inference time, suggesting a path toward runtime monitoring of reasoning trace quality without model modification.
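As an illustration only, the sketch below computes a toy trace-quality signal on synthetic data, with mean attention entropy standing in for SRQ's geometric component and lexical overlap standing in for its semantic component; the actual metrics and weighting are defined in the paper. The sketch shows only that such signals are observable at inference time without touching model weights.

```python
# Toy inference-time trace-quality signal on synthetic data.
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    # attn rows are distributions over key tokens; lower mean entropy
    # suggests more focused integration of the reasoning trace.
    eps = 1e-9
    return float(-(attn * np.log(attn + eps)).sum(axis=1).mean())

def semantic_overlap(question: str, trace: str) -> float:
    q, t = set(question.lower().split()), set(trace.lower().split())
    return len(q & t) / max(len(q), 1)

rng = np.random.default_rng(1)
attn = rng.dirichlet(np.ones(16) * 0.3, size=8)  # synthetic attention rows
score = (semantic_overlap("what is 12 * 7", "12 * 7 equals 84")
         - 0.1 * attention_entropy(attn))        # weighting is arbitrary
print(f"toy trace-quality score: {score:.3f}")
```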
The StoryTR benchmark introduced Theory of Mind reasoning as a measurable evaluation axis for video retrieval, finding that a 7B model trained on ToM-guided synthetic data outperformed Gemini-3.0-Pro (0.53 average IoU) by a +15.1% relative IoU margin — providing evidence that structured reasoning chains in training data can close capability gaps against significantly larger frontier models on specific task types.[18]
SMC-SD, from Cornell University, Makora, MIT, and ETH Zürich, demonstrated that replacing token-level rejection sampling with importance-weighted resampling over a population of draft particles achieves 342 tok/s on a Llama 3.2-1B → Llama 3-70B pair across 4 H100 GPUs — a 5.2× speed-up over autoregressive decoding and 2.36× over optimized speculative decoding — while remaining within 3% of target model accuracy.[19] The engine is built as a fork of SGLang and source code is publicly available, making throughput gains at this scale accessible to production deployments.
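Conceptually, the algorithm is a sequential Monte Carlo loop over candidate continuations. The sketch below substitutes fixed toy distributions for the draft and target models and uses effective-sample-size-triggered multinomial resampling, a standard SMC heuristic assumed here for illustration; the production engine applies the same propose, reweight, resample cycle to real logits.

```python
# SMC over continuations: propose from the draft, reweight by the target,
# resample when weights degenerate. Toy distributions replace both models.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_PARTICLES, STEPS = 5, 8, 4
draft_p = np.array([0.40, 0.30, 0.15, 0.10, 0.05])   # cheap draft model
target_p = np.array([0.25, 0.35, 0.20, 0.10, 0.10])  # expensive target model

particles = [[] for _ in range(N_PARTICLES)]
weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)

for _ in range(STEPS):
    # 1. Propose one token per particle from the draft distribution.
    tokens = rng.choice(VOCAB, size=N_PARTICLES, p=draft_p)
    for particle, tok in zip(particles, tokens):
        particle.append(int(tok))
    # 2. Importance-reweight by how much the target likes each proposal.
    weights *= target_p[tokens] / draft_p[tokens]
    weights /= weights.sum()
    # 3. Resample on low effective sample size instead of rejecting tokens.
    if 1.0 / np.sum(weights**2) < N_PARTICLES / 2:
        idx = rng.choice(N_PARTICLES, size=N_PARTICLES, p=weights)
        particles = [list(particles[i]) for i in idx]
        weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)

print("surviving continuations:", particles[:3])
```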
A heartbeat-driven cognitive scheduling architecture proposed by a researcher at Chengdu University of Information Technology offers a contrasting direction: rather than reactive or prompt-triggered execution, a periodic heartbeat orchestrates cognitive modules (planning, reflection, memory recall) using a meta-learning strategy over historical interaction logs — positioning proactive self-regulation as an alternative to fixed-pipeline agent designs.[20][21]
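A minimal sketch of the heartbeat pattern, with fixed per-module periods standing in for the meta-learned scheduling policy:

```python
# Heartbeat-driven scheduling: a periodic tick, not an incoming prompt,
# decides which cognitive module runs next.
import time

def plan():    print("planning: revisit open goals")
def reflect(): print("reflection: audit recent failures")
def recall():  print("memory: consolidate episode log")

# (module, period-in-ticks): fixed periods stand in for the learned policy.
MODULES = [(plan, 1), (reflect, 2), (recall, 3)]

def heartbeat(ticks: int, interval_s: float = 0.05) -> None:
    for tick in range(ticks):
        print(f"-- tick {tick} --")
        for module, period in MODULES:
            if tick % period == 0:   # the scheduler's decision point
                module()
        time.sleep(interval_s)       # the heartbeat interval itself

heartbeat(ticks=4)
```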
These patterns indicate content relevant to this plane:
Planning and execution via tools/functions in deterministic pipelines.
Replayability, trace logs, and deterministic execution diagnostics.
Use these rules when content could belong to multiple planes:
Look for explicit execution semantics: who decides, which tool runs, what happens on failure, and how loops terminate (illustrated in the sketch below).
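A minimal loop that makes those four semantics explicit in code; the tool registry, simulated failure, and retry bound are invented for illustration.

```python
# Execution semantics made explicit: who decides, which tool runs,
# what happens on failure, and how the loop terminates.
def search(q: str) -> str:
    return f"results for {q!r}"

def flaky(q: str) -> str:
    raise RuntimeError("timeout")   # simulated transient tool failure

TOOLS = {"search": search, "flaky": flaky}

def decide(state: dict):
    # WHO DECIDES: one policy function owns action selection.
    if state["done"]:
        return None
    tool = "flaky" if state["tries"] == 0 else "search"
    return tool, state["goal"]

def run(goal: str, max_steps: int = 5, max_retries: int = 1) -> dict:
    state = {"goal": goal, "done": False, "tries": 0, "log": []}
    for _ in range(max_steps):               # HOW LOOPS TERMINATE: hard cap
        action = decide(state)
        if action is None:
            break
        tool, arg = action                   # WHICH TOOL RUNS: explicit, named
        try:
            state["log"].append(TOOLS[tool](arg))
            state["done"] = True
        except RuntimeError as exc:          # WHAT HAPPENS ON FAILURE:
            state["tries"] += 1              # bounded retry, then abort loudly
            if state["tries"] > max_retries:
                state["log"].append(f"aborted: {exc}")
                break
    return state

print(run("quarterly revenue"))
```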
These articles were classified with this plane as their primary mapping.
Researchers from Japan's National Institute of Informatics and NTT have published a formal treatment of reasoning shortcuts in neurosymbolic learning, establishing that constraint satisfaction alone does not guarantee correct concept mapping, that shortcut detection is coNP-complete, and that a verified ASP-based repair algorithm can eliminate shortcuts by augmenting constraint sets. The work provides complexity-grounded theoretical foundations for output verification in constraint-based AI systems.
Researchers from Renmin University of China and Huawei Noah's Ark Lab have published AdaPlan-H, a self-adaptive hierarchical planning mechanism for LLM agents that begins with coarse-grained macro plans and progressively refines them based on task complexity. The framework uses a two-stage optimization process (imitation learning and capability enhancement via SFT and DPO) and is validated on embodied and text-based agent benchmarks using multiple open-weight and proprietary models. Code and data are publicly released.
Researchers published an arXiv paper introducing AdaPlan-H, a self-adaptive hierarchical planning mechanism for LLM agents that begins with coarse-grained macro plans and progressively refines them based on task complexity. Experimental results reported by the authors show improved task execution success rates and reduced overplanning. Code and data will be made publicly available.
A paper submitted April 24, 2026 introduces PExA, a parallel exploration agent for complex text-to-SQL that achieves 70.2% execution accuracy on the Spider 2.0 benchmark — a new state-of-the-art — by decomposing queries into atomic test-case SQLs executed in parallel and grounding final SQL generation on those explored results.
A Bloomberg-affiliated research team published PExA (Parallel Exploration Agent), a multi-agent text-to-SQL framework achieving 70.2% state-of-the-art accuracy on the Spider 2.0 benchmark. The system operationalizes parallel agent dispatch, staged verification, and structured task decomposition — a named reference architecture pattern with direct relevance to enterprise agentic reasoning system design.
A new arXiv paper introduces StoryTR, the first video moment retrieval benchmark requiring Theory of Mind (ToM) reasoning, comprising 8,100 samples from narrative short-form videos. A 7B model trained on ToM-guided synthetic data achieves a +15.1% relative IoU improvement over baselines, outperforming Gemini-3.0-Pro (0.53 average IoU) on the benchmark. The result signals that explicit, structured reasoning chains in training data can close capability gaps against significantly larger frontier models on narrative reasoning tasks.
A peer-reviewed paper from five leading Chinese research institutions introduces Arbiter-K, a governance-first execution architecture that encapsulates LLMs within a deterministic symbolic kernel. Empirical evaluations document that native guardrails in existing agentic systems — including Amazon Bedrock AgentCore and Anthropic Skills — intercept fewer than 9% of unsafe operations under adversarial conditions, while Arbiter-K achieves 76–95% unsafe interception. The paper's public code release signals that kernel-based governance for agentic AI is moving from theoretical positioning into implementable reference architecture.
Researchers from Nanjing University, Alibaba Group, and Ant Group have identified structured, measurable attention patterns in thinking LLMs — including DeepSeek-R1, GPT-5, and Gemini 3 series — that correlate with correctness on quantitative reasoning tasks. The study introduces a Self-Reading Quality (SRQ) scoring method combining geometric and semantic metrics, and demonstrates a training-free steering approach yielding up to 2.6% accuracy improvement. The findings establish that reasoning trace integration quality is observable and steerable at inference time, with direct implications for monitoring and verification layer design in agentic systems.
An arXiv study evaluating GPT-5 and DeepSeek-R1 on 303 first-order logic problems finds that high compilation rates (87–99%) mask distinct and reproducible unfaithfulness failure modes in two-stage formal reasoning pipelines, with DeepSeek-R1 mistranslating premises in ways that evade detection and GPT-5 fabricating axioms in ways that are detectable via cross-stage comparison. No systematic gaming was observed in unified generation.
An arXiv paper submitted April 21, 2026 proposes a conceptual framework integrating anomaly detection into agentic AI for proactive risk management in human activity monitoring, specifically fall detection and prediction. The paper argues against static agent configurations in favor of dynamic, adaptive tool selection, and identifies high false alarm rates, poor context awareness, environmental noise, and data scarcity as persistent limitations in existing systems.
An arXiv paper introduces DeepRed, an open-source benchmark evaluating LLM-based agents on Capture The Flag challenges using partial-credit scoring. The best of ten commercially accessible LLMs tested achieved only 35% average checkpoint completion, with performance degrading sharply on tasks requiring non-standard discovery and longer-horizon adaptation — providing empirical quantification of current agentic reasoning limits under realistic multi-step conditions.
An arXiv paper submitted April 21, 2026 introduces Explicit Trait Inference (ETI), a psychologically grounded coordination method for LLM-based multi-agent systems. ETI reduces payoff loss by 45–77% in controlled economic game settings and improves performance by 3–29% on the MultiAgentBench benchmark relative to a Chain-of-Thought baseline, providing the first systematic evidence that LLM agents can reliably infer and leverage partner trait profiles from interaction histories.
Researchers from Innosol and Bahçesehir University have released DW-Bench, an open-source benchmark of 1,046 questions testing LLM graph topology reasoning over enterprise data warehouse schemas. Evaluated across Gemini 2.5 Flash, DeepSeek-V3, and Qwen2.5-72B, the benchmark documents a systematic 30–40 percentage point performance drop on compositional multi-hop tasks versus single-hop queries, a hard ceiling of 61% on hard subtypes versus an oracle upper bound of ≥99.5%, and a 7–14 percentage point advantage for tool-augmented baselines over static context methods.
Researchers at KAIST have published peer-reviewed findings demonstrating that the internal reasoning structure of large reasoning models (LRMs) — not just their outputs — constitutes a distinct safety vulnerability. LRMs trained on math and coding reasoning chains follow a two-step structure that bypasses safety alignment even when harmful intent is detected. A lightweight post-training method (ALTTRAIN) using a three-step reasoning structure and only 1K training examples is shown to substantially reduce harmful response rates while maintaining token efficiency.
Researchers have released NARS-Reasoning-v0.1, a benchmark pairing natural-language reasoning problems with first-order logic forms and executable Narsese programs, validated through runtime execution in OpenNARS for Applications (ONA). The work introduces a deterministic compilation pipeline from FOL to executable Narsese and a Language-Structured Perception (LSP) formulation that trains LLMs to produce symbolic structure rather than only verbal responses. A Phi-2 LoRA adapter trained on the benchmark is publicly released.
Anthropic has introduced Managed Agents on its Claude platform — a managed execution layer that abstracts orchestration, sandboxing, session state, credential handling, and observability into a platform-native substrate. The offering directly competes with DAIS's execution and deployment layer positioning, establishes a concrete pricing model at $0.08 per session hour, and has drawn practitioner-level lock-in concerns tied to Anthropic's proprietary SDK and format.
Researchers from ByteDance Seed and ETH Zurich have published a method for training LLM-based software engineering agents using rubric-based Generative Reward Models (GRM) that provide structured intermediate feedback beyond binary terminal rewards. The approach outperforms standard rejection sampling and demonstrates that human-designed behavioral rubrics can be operationalized as training signals in multi-step agentic systems.
Researchers from Cornell University, Makora, MIT, and ETH Zürich published SMC-SD (arXiv:2604.15672), an LLM inference algorithm replacing token-level rejection sampling with importance-weighted resampling over a population of draft particles. On a Llama 1B→70B draft-target pair across 4 H100 GPUs, SMC-SD achieves 342 tok/s — a 5.2× speed-up over autoregressive decoding and 2.36× over optimized speculative decoding — while remaining within 3% of target model accuracy. The engine is built as a fork of SGLang and source code is publicly available.
An arXiv paper submitted April 16, 2026 demonstrates that KV cache ON and cache OFF inference paths are not numerically equivalent under FP16 precision, producing 100% token divergence rates across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B — including under greedy decoding. FP16 non-associativity is identified as the sole causal driver. The finding has direct implications for inference infrastructure selection, model output reproducibility assumptions, and observability requirements in production AI deployments.
Researchers from the Chinese Academy of Sciences have published MoLSAKI, a Chain-of-Thought distillation framework that transfers stepwise attention patterns from large teacher models to smaller student models using a Mixture-of-Layers alignment module. The method demonstrates consistent reasoning performance gains across mathematical and commonsense benchmarks without requiring shared tokenizers or matched layer counts between architecturally distinct models.
These articles touch this plane but are primarily mapped elsewhere.
An arXiv paper submitted April 25, 2026 under Computer Science > Artificial Intelligence formalizes reasoning shortcuts in neurosymbolic learning as a constraint satisfaction problem, establishes computational complexity bounds for verification and repair, develops an ASP-based verification algorithm with proven soundness and completeness, and validates the approach across eight benchmark domains. The work establishes that shortcut-freeness verification is coNP-complete, counting shortcuts is #P-complete, and finding minimal repairs is NP-hard.
An arXiv paper submitted April 25, 2026 identifies five structural gaps in AI agent identity — semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability — and concludes that no current technology or regulatory instrument resolves them. The paper further finds that extending human identity frameworks to AI agents without structural modification produces systematic failures, and that more engineering effort alone cannot close these gaps.
Researchers from Hubei University, Apple Inc, and Huazhong University of Science and Technology published PhySE, a validated AR-LLM social engineering framework that combines real-time multimodal capture, adaptive psychological strategy routing, and VLM-based cold-start profiling. An IRB-approved study with 60 participants confirmed the system outperforms prior baselines on social experience scores and profile generation latency, establishing a documented attack surface for agentic systems operating in high-trust social contexts.
An arXiv paper submitted April 25, 2026 introduces PhySE, a psychological framework enabling real-time social engineering attacks via AR glasses and LLMs. The framework combines VLM-based profiling and adaptive psychological agent behavior, validated through an IRB-approved study with 60 participants and 360 annotated conversations. The research empirically documents that current RAG-based profiling introduces latency vulnerabilities and that adaptive LLM agents can be weaponized for context-aware manipulation without static scripts.
Researchers at Rensselaer Polytechnic Institute have published a controlled experimental study demonstrating that a multi-agent LLM pipeline — decomposed into Domain Expert, Manager, Coder, and Quality Assurer roles — significantly improves structural quality in automated ontology generation from unstructured insurance contract text, with gains driven primarily by front-loaded planning. The study also surfaces concrete failure modes in single-agent baselines including poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair.
An arXiv paper submitted April 25, 2026 demonstrates that decomposing ontology construction into four specialized agent roles — Domain Expert, Manager, Coder, and Quality Assurer — significantly improves structural quality over single-agent LLM baselines, with performance gains driven primarily by front-loaded planning. The study used domain-specific insurance contracts as its experimental corpus and evaluated outputs via heterogeneous LLM judges and competency-question-driven SPARQL assessment.
An arXiv paper introduces Analytica, a novel agent architecture using Soft Propositional Reasoning (SPR) that achieves 15.84% average accuracy improvement over base models on economic, financial, and political forecasting tasks, with a cost-efficient Jupyter Notebook grounder variant delivering comparable accuracy at 90.35% lower cost. The work formalizes bias-variance decomposition as a design principle for LLM reasoning systems and demonstrates near-linear time complexity at scale.
A peer-reviewed paper from researchers affiliated with Stanford and InquiryOn proposes treating human-in-the-loop (HITL) oversight as a decoupled, independent system component in agentic AI workflows, formalizing integration along four dimensions and aligning the model with the Agent-to-Agent (A2A) interoperability protocol. The work signals emerging academic consensus that HITL must be a first-class architectural concern rather than an application-level implementation detail.
A paper submitted to arXiv cs.AI on 24 April 2026 proposes a decoupled Human-in-the-Loop (HITL) system architecture that treats human oversight as an independent system component in agentic workflows, formalizing integration across four dimensions and supporting alignment with emerging agent communication protocols. The research identifies scalability and reuse limitations in current embedded HITL implementations as a structural gap in multi-agent environments.
Researchers released FormalScience, a domain-agnostic human-in-the-loop agentic pipeline for converting informal scientific reasoning into formal Lean4 proofs, accompanied by FormalPhysics — a 200-problem university-level physics benchmark with formally verified representations. The work introduces the first systematic characterization of semantic drift in physics autoformalization and publicly releases both the codebase and an interactive UI system.
An arXiv paper submitted April 24, 2026 demonstrates that training language models on power-law-distributed data consistently outperforms uniform distribution training on compositional reasoning tasks, and provably requires significantly less training data — with the mechanism traced to a beneficial asymmetry in the loss landscape that enables high-frequency skill compositions to scaffold acquisition of rare long-tail skills.
Slack staff software engineer Dominic Marks has publicly detailed a three-channel context management architecture used in production multi-agent systems at Slack, moving away from message-history accumulation toward structured memory, staged validation, and credibility-weighted evidence distillation to maintain coherence across long-running agentic sessions.
An arXiv paper published 26 April 2026 introduces spotforecast2-safe, an open-source Python package that embeds EU AI Act, IEC 61508, ISA/IEC 62443, and Cyber Resilience Act requirements directly into library API contracts, persistence formats, and CI gates — operationalizing compliance-by-design as a concrete, verifiable architectural pattern for safety-critical time-series forecasting.
Researchers published ClawTrace, an open agent tracing platform that records per-step LLM call costs and compiles them into structured TraceCards, paired with a distillation pipeline (CostCraft) that produces transferable cost-optimization rules. Benchmark results show prune rules cut median cost by 32% across unrelated tasks, while preserve rules trained on benchmark-specific conventions caused regressions on new task types — signaling an asymmetry in which cost-optimization patterns generalize but task-specific skill preservation does not.
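A hedged sketch of the per-step tracing pattern; the record fields, per-token rates, and card format are invented stand-ins for ClawTrace's actual TraceCard schema.

```python
# Toy per-step cost trace compiled into a summary card.
from dataclasses import dataclass, asdict

@dataclass
class StepRecord:
    step: int
    tool: str
    prompt_tokens: int
    completion_tokens: int

    def cost(self, in_rate: float = 3e-6, out_rate: float = 15e-6) -> float:
        # Assumed $/token rates; real rates come from the provider.
        return self.prompt_tokens * in_rate + self.completion_tokens * out_rate

def trace_card(records: list[StepRecord]) -> dict:
    total = sum(r.cost() for r in records)
    hottest = max(records, key=lambda r: r.cost())
    return {"steps": len(records), "total_usd": round(total, 6),
            "hottest_step": asdict(hottest)}

trace = [StepRecord(0, "planner", 1200, 300),
         StepRecord(1, "retriever", 400, 50),
         StepRecord(2, "planner", 5000, 800)]  # candidate for a prune rule
print(trace_card(trace))
```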
Researchers have published the first dataset and expert evaluation framework for assessing open-ended legal reasoning by LLMs within the Japanese jurisdiction, based on the writing component of the Japanese bar examination. The study includes manual hallucination analysis and legal expert evaluation, with all resources to be made publicly available.
Anthropic Launches Managed Agents: Platform-Native Agentic Execution Layer on Claude — evt_src_1a402fcf24882861 ↩︎
Anthropic Launches Agent-Based Code Review in Claude Code for Team and Enterprise Users — evt_src_dbbb6e19548dee85 ↩︎
Cursor 3 Launches Agent-First Interface with Cloud Execution, Parallel Agents, and Proprietary Model — evt_src_9615c6cfb8e00d78 ↩︎
DeepRed Open-Source Benchmark Quantifies LLM Agent Capability Ceiling at 35% on Realistic Multi-Step Security Tasks — evt_src_a8be6fe151ac955a ↩︎
DW-Bench: New Benchmark Exposes Systematic Multi-Hop Reasoning Ceiling in Frontier LLMs on Enterprise Data Warehouse Schemas — evt_src_aa3f66f85fa9c821 ↩︎
Renmin University and Huawei Noah's Ark Lab Publish AdaPlan-H: Self-Adaptive Hierarchical Planning Framework for LLM Agents — evt_src_1bfb868299300cb1 ↩︎ ↩︎
arXiv Research Proposes AdaPlan-H: Self-Adaptive Hierarchical Planning Mechanism for LLM Agents — evt_src_165a18660c32cdb9 ↩︎
PExA Achieves State-of-the-Art 70.2% on Spider 2.0 via Parallel Exploration and Staged Verification for Text-to-SQL — evt_src_70262e295abdd155 ↩︎
PExA Research Validates Parallel Agent Dispatch with Staged Verification as Enterprise-Grade Text-to-SQL Architecture — evt_src_48a220b4d4dd4ef4 ↩︎
Academic Research Proposes Governance-First Kernel Architecture for Agentic AI, Documenting Critical Gaps in Existing Guardrail Approaches — evt_src_9925c0e0b7a6237c ↩︎ ↩︎
ByteDance Seed and ETH Zurich Publish Rubric-Based Generative Reward Model for Reinforced Fine-Tuning of SWE Agents — evt_src_a9702e153a109f97 ↩︎
KAIST Research Identifies Reasoning Structure as a Safety Attack Surface in Large Reasoning Models — evt_src_b3d96fc0af5d2b66 ↩︎
Peer-Reviewed Research Documents Distinct Unfaithfulness Failure Modes in GPT-5 and DeepSeek-R1 Formal Reasoning Pipelines — evt_src_b636eb914188e56b ↩︎
Peer-Reviewed Research Documents Systematic FP16 Token Divergence in KV-Cached LLM Inference Across Three Open-Weight Models — evt_src_25ab0f0dbf26a198 ↩︎
MEDLEY-BENCH: New AI Metacognition Benchmark Finds Scale Does Not Guarantee Behavioral Control — evt_src_dd26e8212417d8a3 ↩︎
Academic Research Formalizes Reasoning Shortcuts in Neurosymbolic Learning as Constraint Satisfaction Problem with Proven Complexity Bounds — evt_src_bc4fbb770fc71794 ↩︎
Academic Research Identifies Measurable Attention Patterns in Thinking LLMs Correlated with Reasoning Correctness — evt_src_e33c84279f757a85 ↩︎
StoryTR Benchmark Reveals Frontier Model Reasoning Gaps in Narrative Video Retrieval; 7B Specialized Model Outperforms Gemini-3.0-Pro via Theory of Mind Training — evt_src_9da591e6fbf47975 ↩︎
SMC-SD: Sequential Monte Carlo Speculative Decoding Achieves 5.2× LLM Inference Throughput Gains Over Autoregressive Baseline — evt_src_e8a19d29f01d7107 ↩︎
Academic Research Proposes Heartbeat-Driven Cognitive Scheduling Architecture for LLM-Based Autonomous Agents — evt_src_3dd27394699d2d69 ↩︎
arXiv Research Proposes Heartbeat-Driven Cognitive Scheduling Architecture for LLM-Based Agents — evt_src_b5c2804cf0062101 ↩︎