
FOD#137: From Model-Centric to Systems-Level Thinking About Agents

How agentic AI shifts toward memory management, tool control, and system-level discipline

This Week in Turing Post:

  • Wednesday / AI 101 series: Hybrid AI

  • Friday / Open Source AI series: The Real Math – when open source saves money, when it doesn't + an interview with MiniMax!

🚀 From our partners: AI for When It Is Rocket Science. Agent Composer is now available in the Contextual AI platform! It helps teams tackle expert-level engineering tasks in high-stakes environments – compressing hours of routine (but complex) work into minutes.

What makes it different:

  • Unified context layer: Agents operate with full task, data, and workflow context

  • Flexible, controlled agents: Combine dynamic intelligence with structured workflows for mission-critical reliability.

  • Intuitive, no-code build: Create and optimize agents in minutes with pre-built templates and natural language prompts – no rocket science required.

To the main topic: Agentic realities – why agent progress is becoming a systems problem

I don’t know how it happens, but every week reading research papers gives me an idea for an editorial. There is always a combination of papers that were clearly not coordinated, yet end up answering each other. They expose blind spots when read alone and make sense when read together. And that, I believe, is the value proposition of Turing Post’s digests: noticing things you might otherwise have missed.

This week, my insights came from two surveys on agents: one on agentic reasoning as a paradigm, the other on efficiency and cost in agent systems. Surprisingly, they describe the same bottleneck from two different sides.

Quite boldly, the Agentic Reasoning for Large Language Models survey (Tianxin Wei et al.) argues that reasoning is not a one-shot model call. Instead, it defines reasoning as something that happens across interaction steps: planning, tool use, search, memory updates, feedback, and revision.

Image credit: The original paper

In this framing, reasoning is no longer equivalent to producing an internal chain-of-thought. It is closer to a control process over time. The agent maintains state, interacts with an environment, updates internal representations, and decides what to do next. The unit of analysis shifts from an answer to a trajectory.

This shift explains why so many recent agent papers focus on memory, reflection, and multi-step workflows rather than prompt tricks. Once reasoning is distributed across time, earlier decisions do not disappear. They influence later behavior whether or not they remain valid.
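
To make that framing concrete, here is a minimal sketch (not taken from the survey) of reasoning as a control process over a trajectory: the agent keeps state, acts, observes, and writes observations back into a memory that shapes the next decision. All names here (AgentState, decide, toy_environment) are illustrative assumptions, not the paper's formalism.

```python
# Minimal sketch of reasoning as a control loop over a trajectory.
# All names are illustrative; this is not code from the survey.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    memory: list = field(default_factory=list)   # past observations shape future steps
    step: int = 0

def decide(state: AgentState) -> str:
    # Placeholder policy: stop once any stored observation mentions the goal.
    if any(state.goal in obs for obs in state.memory):
        return "finish"
    return "search"

def toy_environment(action: str, state: AgentState) -> str:
    # Stand-in for tools / the external world; returns an observation string.
    return f"result of {action} at step {state.step} (goal: {state.goal})"

def run(goal: str, max_steps: int = 5) -> AgentState:
    state = AgentState(goal=goal)
    for _ in range(max_steps):            # the unit of analysis is the trajectory, not one call
        action = decide(state)
        if action == "finish":
            break
        observation = toy_environment(action, state)
        state.memory.append(observation)  # the memory update is itself part of reasoning
        state.step += 1
    return state

print(run("find the answer").memory)
```

The point of the loop is that nothing is stateless: whatever lands in memory at step t keeps influencing decisions at step t+1, whether or not it is still valid.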

The survey makes this concrete in how it treats core components of agent behavior. Memory is described as an active element in the reasoning process, shaping future decisions rather than merely storing past information. Feedback appears as a mechanism for updating behavior over time, not simply for scoring outputs. Multi-agent configurations are framed in operational terms as well, focusing on role separation and coordination as ways to maintain consistency across long horizons rather than as mechanisms for improving pointwise accuracy.

What the paper implicitly acknowledges is that once reasoning becomes interactive, coherence becomes fragile.

The Toward Efficient Agents survey (Xiaofang Yang et al.) picks up exactly at that fragility point. Instead of asking how to design better reasoning mechanisms, it asks what happens when those mechanisms are deployed repeatedly.

Image credit: The original paper

The answer is that agent systems accumulate cost and state in ways that are not linear. Token usage compounds across steps. Memory grows faster than relevance. Tool calls introduce latency and retries. Planning depth increases even when marginal gains drop.

The survey is concrete about this. It decomposes agent cost into components: generation, memory access, tool invocation, retries. Efficiency is not treated as a single metric, but as a system-level tradeoff between effectiveness and resource use.
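
As a simple illustration of that decomposition, here is a per-step cost-accounting sketch; the field names and the aggregation are my own shorthand, not an API or metric from the paper.

```python
# Illustrative cost accounting for an agent trajectory, assuming the
# breakdown named above: generation, memory access, tool invocation, retries.

from dataclasses import dataclass

@dataclass
class StepCost:
    generation_tokens: int = 0
    memory_read_tokens: int = 0
    tool_calls: int = 0
    retries: int = 0

def trajectory_cost(steps: list[StepCost]) -> dict:
    # Costs compound across steps; nothing here is amortized or reset.
    return {
        "generation_tokens": sum(s.generation_tokens for s in steps),
        "memory_read_tokens": sum(s.memory_read_tokens for s in steps),
        "tool_calls": sum(s.tool_calls for s in steps),
        "retries": sum(s.retries for s in steps),
    }

steps = [StepCost(800, 1200, 1, 0), StepCost(950, 2600, 2, 1)]
print(trajectory_cost(steps))
```

Even this toy version shows the non-linearity the survey worries about: memory reads grow with everything accumulated so far, not with the difficulty of the current step.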

The paper is also clear about where it locates the source of the problem. The focus is not on reducing model size or changing model capacity, but on how agent behavior is organized at the system level. Memory is discussed in terms of ongoing compression and filtering. Tool use is treated as something that requires selectivity. Planning is described with an emphasis on explicit termination conditions. Without these mechanisms, performance can deteriorate over longer runs even when individual steps appear correct.
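
Here is a toy version of the kind of explicit termination condition the survey argues for: stop planning when the marginal gain of another step no longer covers its cost. The progress scores, thresholds, and function name are invented for illustration only.

```python
# Illustrative stopping rule for planning depth; not from the paper.

def should_continue(scores: list[float], step_cost: float,
                    min_gain: float = 0.01, max_steps: int = 20) -> bool:
    """scores[i] is some estimate of task progress after planning step i."""
    if len(scores) >= max_steps:
        return False                                 # hard budget on depth
    if len(scores) < 2:
        return True                                  # not enough history to judge gains
    marginal_gain = scores[-1] - scores[-2]
    return marginal_gain - step_cost > min_gain      # stop once gains stop paying for the step

print(should_continue([0.40, 0.55, 0.56], step_cost=0.005))  # False: progress has flattened
```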

These concerns are framed in terms of maintaining stable behavior over time rather than improving performance on isolated tasks.

Reading the two surveys together, we can see that they describe the same phenomenon from different angles:

  • The reasoning survey shows how intelligence in agents spreads across interaction steps.

  • The efficiency survey shows what happens when that spread is left unchecked.

Both papers treat memory as an active component that shapes future behavior. Both treat tool use as an action with consequences, not a free capability. Both treat planning depth as something that must be regulated.

Neither paper claims that agents fail because models cannot reason. The failure mode is structural. Once reasoning persists over time, it requires mechanisms to prevent accumulation from overwhelming the system.

This is why many agent failures look familiar to anyone with systems experience. Old state interferes with new decisions. Failed paths continue to influence behavior. The system technically works, but its internal structure degrades.

So what’s the takeaway?

These papers suggest that as agents run longer, the main constraint shifts. Progress depends less on how strong individual reasoning steps are and more on whether the system can stay coherent over time. Memory, reasoning, and action all need to be managed, not simply expanded.

Taken together, the surveys point toward a move from model-centric to systems-level thinking about agents.

Our news digest is always free. Click on the partner’s link above to support us, or Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, and a16z, plus AI labs and institutions such as Ai2, MIT, Berkeley, .gov, and thousands of others, to really understand what’s going on with AI →


Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

We are watching/reading

I’ve read the new opus by Dario Amodei and here is my honest take. Please watch and let me know what you think. This episode is not about denying AI risk. The risks are real. The question is whether this essay helps us reason about them, or whether it mainly reveals how Silicon Valley talks to itself when stakes feel existential.

News from the usual suspects

  • NVIDIA

    • NVIDIA introduces Earth-2, a suite of open-source AI models designed to accelerate weather and climate forecasting. Covering everything from global 15-day forecasts to minute-level storm nowcasting, the platform aims to democratize high-resolution prediction tools and reduce reliance on traditional supercomputing. A notable step toward scalable, accessible weather intelligence.

    • NVIDIA is investing $2B in CoreWeave and expanding its partnership to build over 5 gigawatts of “AI factories” by 2030. CoreWeave will deploy multiple generations of NVIDIA hardware—including Rubin GPUs and Vera CPUs—while offering its AI-native software stack to cloud providers and enterprises. It's a tight alignment aimed at scaling infrastructure for the next wave of AI adoption.

  • Microsoft

    • Maia: Inference is forever
      Microsoft unveils Maia 200, its first custom AI inference chip, part of its heterogeneous AI infrastructure – and a strategic move into the heart of where AI economics play out. Built on TSMC’s 3nm process with custom FP8/FP4 cores and 216GB of HBM3e, Maia 200 is optimized for the endless grind of inference. But beyond specs, it’s about integration: by aligning silicon, models, and apps across workloads like GPT-5.2 and Copilot, Microsoft gains a tight feedback loop – and a durable advantage.

    • QDK gets sharper qubits
      Microsoft expands its Quantum Development Kit with powerful new tools for chemistry and error correction, signaling readiness for the logical qubit era. Fully integrated with VS Code and GitHub Copilot, the QDK simplifies quantum programming and supports major frameworks like Qiskit and Cirq. It’s a move toward practical applications, built on Microsoft’s vision of a unified platform combining quantum hardware, software, and AI—now tightly looped into Azure.

🔦 Survey highlight

  • Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs by Wei Zhou et al.

Models (if 🦋 – it’s open-sourced)

  • 🦋 Waypoint-1: Real-time interactive video diffusion from Overworld
    Researchers from Overworld introduced Waypoint-1, a real-time, text-and-input-controllable video diffusion model trained on 10,000 hours of labeled gameplay. Using a frame-causal rectified flow transformer, it generates each frame conditioned on text, mouse, and keyboard with zero latency. The model supports 30 FPS at 4 steps or 60 FPS at 2 steps on consumer GPUs via the WorldEngine library. Training used diffusion forcing and self-forcing to minimize inference mismatch and error accumulation in autoregressive rollouts. →read the tech overview

  • 🦋 Kimi K2.5
    Researchers from Moonshot AI released Kimi K2.5, a 1-trillion-parameter native multimodal LLM with 32B active parameters, trained on 15 trillion visual-text tokens. It integrates vision-language reasoning, tool use, and swarm-like agent execution. The model features 256K context length, a MoonViT vision encoder (400M params), and Mixture-of-Experts with 384 experts. It outperforms rivals on benchmarks like MathVista (90.1), OCRBench (92.3), and VideoMMU (86.6), and supports native INT4 quantization and long-context reasoning. →read their blog

  • 🦋 LongCat-Flash-Thinking-2601 technical report
    Researchers from Meituan introduced LongCat-Flash-Thinking-2601, a 560B-parameter open-source MoE reasoning model with 27B activated parameters and SOTA agentic reasoning performance. It achieves 88.2 on τ²-Bench, 29.3 on VitaBench, and 73.1 on BrowseComp. Trained across 32,000 environments in 20+ domains using the DORA framework, it integrates real-world noise and a Heavy Thinking Mode for test-time scaling. Its Zigzag attention variant supports 1M-token context and yields 1.5× inference speedup with minimal performance trade-off →read the paper

  • Pushing Qwen3-Max-Thinking beyond its limits
    Researchers from Qwen introduced Qwen3-Max-Thinking, a flagship reasoning LLM with advanced tool-use and test-time scaling. It achieved top-tier performance on 19 benchmarks, including C-Eval (93.7), HMMT Feb (98.0), and Arena-Hard v2 (90.2). The model autonomously selects tools like Search, Memory, and Code Interpreter. Its multi-round scaling boosts reasoning accuracy (e.g., GPQA from 90.3 to 92.8) with efficient context usage. Its APIs support OpenAI and Anthropic compatibility via Alibaba Cloud. →read the blog

  • RoboBrain 2.5: Depth in sight, time in mind
    Researchers from BAAI introduced RoboBrain 2.5, an 8B-parameter embodied AI model with two major upgrades: Precise 3D Spatial Reasoning and Dense Temporal Value Estimation. It predicts collision-free 3D keypoint traces using (u, v, d) coordinates from monocular RGB inputs and delivers dense, step-aware progress feedback via hop-based value estimation. Trained on 12.4M samples, it achieves SOTA on benchmarks like TraceSpatial (83/63/44 success), MSMU (64.17), and VABench-V (0.1189 error), surpassing prior models in real-world manipulation →read the paper

Research this week

(as always, 🌟 indicates papers that we recommend to pay attention to)

Agent training, mid-training, and experience scaling

  • daVinci-Dev: Agent-native Mid-training for Software Engineering
    Establish agent-native mid-training with executable, feedback-rich trajectories to instill foundational software-engineering behaviors more efficiently than post-training alone →read the paper 

  • 🌟 Endless Terminals: Scaling RL Environments for Terminal Agents
    Scale reinforcement learning by procedurally generating executable terminal environments so simple PPO agents improve when environments, not scaffolds, scale →read the paper

  • EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
    Evolve native computer-use agents through a self-sustaining loop of task synthesis, sandbox rollouts, and iterative policy refinement →read the paper

  • LLM-in-Sandbox Elicits General Agentic Intelligence
    Unlock general agentic behavior by letting models explore a code sandbox and optionally reinforce those behaviors without agent-specific training data →read the paper

Reinforcement learning systems and optimization theory

  • 🌟 Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Precision Flow
    Stabilize and accelerate large-scale RL by unifying FP8 precision across rollout and training to eliminate numerical mismatch →read the paper

  • Your Group-Relative Advantage Is Biased
    Expose systematic bias in group-relative advantage estimators and correct it with difficulty-aware reweighting for more robust RLVR training →read the paper

  • Behavior Knowledge Merge in Reinforced Agentic Models
    Merge multiple RL-trained agents by disentangling shared and task-specific updates instead of naively averaging sparse RL task vectors →read the paper

Test-time learning, adaptation, and discovery

  • 🌟 Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers
    Adapt attention sparsity at inference time by routing heads dynamically to balance efficiency and performance on long-context inputs →read the paper

  • 🌟 Learning to Discover at Test Time
    Perform reinforcement learning at test time to search for one exceptional solution rather than optimizing average performance across tasks →read the paper

Strategic reasoning, persuasion, and dialogue

  • Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind
    Model reviewer mental states and persuasion strategies to generate rebuttals grounded in theory-of-mind reasoning rather than surface imitation →read the paper

  • 🌟 GameTalk: Training LLMs for Strategic Conversation
    Optimize long-horizon objectives across full dialogues by training models with conversation-level rewards in multi-agent games →read the paper

  • Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance
    Reframe rebuttal writing as evidence-centric planning with inspectable reasoning and on-demand external search →read the paper

Safety, calibration, and reliability of agents

  • 🌟 Building Production-Ready Probes for Gemini
    Design activation probes that generalize under long-context and multi-turn distribution shifts for real-world misuse mitigation →read the paper

  • Agentic Confidence Calibration
    Calibrate agent confidence at the trajectory level by extracting process-level signals that explain and predict failure →read the paper

  • Agentic Uncertainty Quantification
    Turn verbalized uncertainty into active control signals that dynamically balance fast execution and targeted reflection →read the paper

  • Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces
    Estimate domain-level accuracy under drift using decoding-time entropy statistics as a scalable monitoring signal →read the paper

  • Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy
    Reveal how standard fine-tuning silently degrades contextual privacy reasoning while leaving benchmark performance intact →read the paper

Architecture limits, attention, and prompt mechanics

  • Lost in the Prompt Order: Revealing the Limitations of Causal Attention
    Explain prompt-order sensitivity by showing how causal masks block option tokens from accessing context →read the paper

Mechanistic interpretability and actionable control

  • Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability
    Reframe mechanistic interpretability as an intervention pipeline that enables diagnosis, steering, and measurable model improvement →read the paper

  • A BERTology View of LLM Orchestrations
    Reuse hidden states from serving LLMs to perform classification in-pass, reducing latency and guard-model overhead →read the paper

Systems, organizations, and socio-technical limits

  • The Responsibility Vacuum: Organizational Failure in Scaled Agent Systems
    Identify a structural failure mode where decision throughput exceeds human verification capacity, making responsibility unassignable →read the paper

Automated systems and low-level optimization

  • Towards Automated Kernel Generation in the Era of LLMs
    Survey how LLMs and agents are being used to automate kernel design, optimization, and evaluation across hardware targets →read the paper

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
