If Turing Post is part of your weekly routine, please share it with one smart friend. It's the simplest way to keep the Monday digests free.
This Week in Turing Post:
Wednesday / AI 101 series: From vibe coding to Spec-driven development
Friday / GenAI Unicorn profile or a Reflection on a private event in SF
From our partners: Rethinking Identity for Agentic AI
As AI agents move from experimentation to production, they operate with autonomy across infrastructure. But legacy identity systems were designed for human users and static environments, not dynamic, non-human workloads acting independently at scale.
Teleport introduces an Agentic Identity Framework, defining what modern identity must look like in an agentic world.
To the main topic: This week I came across a paper related to agentic RL, and it got me thinking: we keep upgrading the "brain" and then act surprised when the agent falls apart halfway through a task.
Historically, reinforcement learning (RL) earned its reputation in clean worlds: games, simulators, well-defined action spaces, fast feedback loops. Then we moved RL into language: first to make models behave (RLHF), then to make them reason more reliably (post-training on chains of thought, tool use, long context). Agentic RL is the next step in that lineage, because real work is not a single answer. It's a trajectory: plan, call tools, read outputs, recover from errors, keep state, finish. The moment you ask an LLM to operate a browser, debug a repo, or navigate a workflow with 30–50 moves, "intelligence" stops being the bottleneck and stability becomes the tax you pay on every additional step.
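To make the shape of the problem concrete, here is a minimal sketch of such a trajectory loop. The `llm` and `env` interfaces are hypothetical placeholders, not any specific framework:

```python
# Minimal sketch of a long-horizon agent loop, the kind of trajectory
# agentic RL trains end-to-end. `llm` and `env` are hypothetical
# interfaces (assumed for illustration), not a real API.

def run_agent(llm, env, max_steps: int = 50):
    """Plan, call tools, read outputs, recover from errors, keep state, finish."""
    history = []                                # the agent's running state
    for _ in range(max_steps):
        action = llm.next_action(history)       # model proposes the next move
        try:
            observation = env.execute(action)   # tool call / browser step / shell command
        except Exception as err:                # invalid actions must not kill the run:
            observation = f"error: {err}"       # feed the failure back as context
        history.append((action, observation))
        if env.finished():                      # episode-level success signal --
            return history, True                # this whole-job outcome is what ARL rewards
    return history, False                       # ran out of budget: sparse reward stays zero
```

Every extra iteration is another chance for an invalid action or a misread observation to derail the run, which is why per-step reliability compounds into the stability tax described above.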
That's why Agentic RL is rapidly gaining attention: it is one of the few training paradigms that can reward the model for completing the whole job, rather than sounding convincing at step one. But ARL has a nasty habit: training collapse. Long horizons amplify small mistakes, sparse rewards make credit assignment noisy, invalid actions poison rollouts, and the environment keeps shifting because the agent itself is changing. You can get impressive early curves, and then the agent forgets how to act like an agent.
ARLArena (Wang et al., 2026 →read the paper) tackles this head-on by treating ARL as a systems problem with knobs you can actually tune. They build a standardized, reproducible testbed (behavior cloning warm start, format constraints, KL regularization), then decompose policy-gradient ARL into four design dimensions: loss aggregation, importance-sampling clipping, advantage design, and trajectory filtering. A practical takeaway is that the choice of clipping matters a lot: some token-level schemes remain fragile over long horizons, whereas sequence-level clipping is generally more stable in their experiments. Add more informative advantage signals and selective filtering, and the training loop becomes noticeably more stable across long-horizon tasks.
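To see why the clipping choice matters, here is a toy numerical contrast between token-level and sequence-level importance-ratio clipping in a PPO-style objective. The clip range and the length-normalized sequence ratio are generic illustrative choices, not ARLArena's exact recipe:

```python
import math

# Toy contrast: clipping importance ratios per token vs. once per sequence.
# The epsilon clip range and length-normalized sequence ratio are generic
# PPO-style choices assumed for illustration, not ARLArena's exact method.

def token_level_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clip each token's importance ratio independently, then average."""
    terms = []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)                       # per-token ratio
        clipped = min(max(ratio, 1 - eps), 1 + eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)

def sequence_level_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clip one length-normalized ratio for the whole trajectory,
    so a single outlier token cannot dominate a long rollout."""
    mean_diff = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    ratio = math.exp(mean_diff)                         # geometric-mean ratio
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)
```

On a long rollout where a single token's ratio explodes, the token-level objective lets that one term dominate the average, while the sequence-level ratio stays near 1 and gets clipped smoothly, which is one intuition for why sequence-level schemes proved more stable in their experiments.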
The bigger takeaway is almost boring, which is exactly why it matters. Agents don't fail because the brain is too small. They fail because the harness is sloppy. ARLArena is a step toward making the harness an object of engineering, not folklore.
And then there's topic number two, the one everyone is talking about: the Pentagon labeled Anthropic a supply chain risk
It's controversial, and many of my readers will not like what I'm going to say in the video. In this text version, I want to quote Ben Thompson from Stratechery, because he captured exactly how I feel: "I don't, for the record, want Anthropic to be destroyed, and I want them to be a U.S. AI champion. I also, for the record, don't trust Amodei's judgment in terms of either national security or AI security."
I also think we have to look beyond the public statements and stay sober about the power games at play. Please watch this video and let me know what you think; I'm always up for open discussion.
Our news digest is always free. Click on the partner's link above to support us, or upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, and a16z, plus AI labs and institutions such as Ai2, MIT, Berkeley, and .gov domains, and thousands of others to really understand what's going on with AI →
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
Twitter Library
News from the usual suspects
Google DeepMind – Aletheia Takes FirstProof for a Spin
Google DeepMind reports that its autonomous math agent, Aletheia (powered by Gemini 3 Deep Think), solved 6 out of 10 research-level problems in the inaugural FirstProof challenge, within the deadline and without human mathematical intervention. Experts were unanimous on five, split on one. Transparency was emphasized: raw prompts were published and inference costs disclosed. Peer review may remain human, but the co-author list is getting crowded.
AWS in Dubai – not resting
Amazon's cloud arm reported outages across its UAE and Bahrain regions after unidentified "objects" struck a UAE data center, sparking a fire and cutting power to two clusters. Recovery may take at least a day, with banks among those affected.
Extra hot week for OpenAI
OpenAI locked in a reported $110 billion raise, valuing it at $730 billion and securing vast Nvidia capacity, while signing a classified deployment deal with the Pentagon, hours after Anthropic was labeled a supply chain risk. Cloud-only safeguards and embedded engineers aim to keep red lines intact, yet backlash is brewing. In one week: capital secured, critics mobilized, and the AI power map redrawn.
💡 Evaluation Highlight
Implicit intelligence – Evaluating agents on what users don't say

Image Credit: the original paper
Researchers from Labelbox's Applied ML Research introduce Implicit Intelligence, a benchmark testing whether agents satisfy unstated constraints across 4 categories: implicit reasoning, catastrophic risk, privacy/security, and accessibility. They pair it with Agent-as-a-World (AaW): single YAML scenarios simulated by an LLM world model with hidden execution rules and binary rubrics. The dataset covers 205 iOS-Shortcut-grounded scenarios (303 actions), typically 3–5 entities, 2–4 actions, 3+ rubric criteria, and 3+ hidden rules. Across the 16 models tested, the best scored 48.3% SPR (72.7% NSS). The world model, Claude Opus 4.5, showed 98.6% consistency →read the paper
📦 Models Highlight
Google DeepMind Nano Banana 2
Qwen 3.5 medium model series (open source)
Researchers from Alibaba introduced Qwen3.5-Flash, 27B, 35B-A3B, and 122B-A10B models, improving intelligence with lower compute. The 35B-A3B surpasses Qwen3-235B-A22B-2507 and Qwen3-VL-235B-A22B through architectural, data, and RL advances. The 122B-A10B and 27B narrow gaps with frontier systems in complex agent tasks. Flash offers 1M default context and built-in tools, and 35B-A3B runs locally on 24GB devices via GGUF. The series is released via Hugging Face, ModelScope, and dedicated API endpoints for production deployment →read their post
Research this week
(as always, 🌟 indicates papers we recommend paying attention to)
This week the field is trying to make agents and reasoning stop wasting compute.
Efficient reasoning is turning "thinking" into a budgeted resource.
Agentic RL stability is becoming the main gate on scaling long-horizon agents.
Decoding is being treated as an optimization layer, not a superstition.
Search-heavy agents are shifting from deep chains of thought to parallel evidence.
Memory design is resurfacing as the bottleneck for long-context behavior.
World models are being defined by consistency constraints, not prettier videos.
Inference-time steering is expanding as a practical alternative to retraining.
Efficient reasoning, stopping, and decoding as a control layer
🌟 Does Your Reasoning Model Implicitly Know When to Stop Thinking? (Beihang University and ByteDance) Uncovers that reasoning models often contain an implicit "stop" signal that current sampling hides, then introduces a sampling recipe that surfaces shorter correct chains and can be baked into RL to improve pass@1 efficiency →read the paper
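As a toy illustration of the idea (not the paper's actual recipe), one can watch the probability the model assigns to an end-of-thinking token and cut the chain once it crosses a threshold. Here `step_logits`, `stop_id`, and `tau` are all hypothetical:

```python
import math

# Toy illustration of surfacing an implicit stop signal during sampling.
# NOT the paper's exact algorithm: `step_logits`, `stop_id`, and `tau`
# are assumptions made for this sketch. We simply halt decoding once the
# end-of-thinking token's probability crosses the threshold.

def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def decode_with_stop(step_logits, stop_id, tau=0.3):
    """Greedy-decode token ids, but cut the chain as soon as the
    model assigns >= tau probability to the stop token."""
    out = []
    for logits in step_logits:
        probs = softmax(logits)
        if probs[stop_id] >= tau:                # the hidden "I'm done" signal
            break
        out.append(max(range(len(probs)), key=probs.__getitem__))
    return out
```

The point of the sketch: the stop signal is already in the logits; ordinary sampling just never acts on it, so chains run longer than the model's own confidence warrants.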
The Art of Efficient Reasoning: Data, Reward, and Optimization Maps efficient reasoning into a practical training playbook by dissecting prompts, rollouts, reward shaping, and optimization, and by treating length control as something you can train deliberately instead of hoping the model "figures it out" →read the paper
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers Reframes decoding as solving a regularized optimization problem over the token probability simplex, then uses that lens to derive new samplers aimed at multi-sample pipelines (self-consistency, reranking, verifier selection) →read the paper
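For reference, here is plain nucleus (top-p) truncation, the familiar sampler the paper recasts as optimization over the simplex; the input is assumed to be an already-normalized probability vector:

```python
# Plain nucleus (top-p) truncation: the baseline sampler that the paper
# reinterprets as regularized optimization over the probability simplex.
# Input is assumed to be an already-normalized probability vector.

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative mass reaches p,
    then renormalize so the result lies on the simplex again."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:                       # nucleus boundary reached
            break
    out = [0.0] * len(probs)                # tail tokens get zero mass
    for i in kept:
        out[i] = probs[i] / mass            # renormalize: back onto the simplex
    return out
```

Viewed this way, top-p is a hard projection: zero out the tail, renormalize the nucleus. The optimization framing generalizes that hard cut, which is the lens the paper uses to derive new samplers for multi-sample pipelines.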
Stable reinforcement learning for LLMs and better exploration
🌟 ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning (University of California, University of Wisconsin) Builds a clean testbed plus a "design-dimensions" analysis of policy-gradient choices for agentic RL, then distills a training recipe and an optimizer variant intended to prevent the classic ARL faceplant →read the paper
🌟 Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization (Microsoft and KAIST) Targets the "agents don't explore" problem by mixing on- and off-policy updates and using memory as an exploration aid while training robustness to operate without memory at test time →read the paper
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training Stabilizes off-policy RL for LLMs under policy staleness and asynchronous training by deriving a sequence-level importance-weight reshaping rule with a variational justification, reducing variance without hacky length normalization →read the paper
DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning Forces exploration to stay meaningfully diverse by regularizing both trajectory-level variety and token-level entropy (only on correct paths), reducing mode collapse where the policy learns one beloved reasoning rut and refuses to leave it →read the paper
Agent execution, planning under constraints, and long-horizon research/search
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization Shifts the compute budget from serial chain-of-thought to parallel evidence gathering, aiming to cut latency and improve cross-setting generalization for research agents that live and die by retrieval →read the paper
TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents Improves reliability in brittle environments by assembling multiple candidate plans into a graph, solving for a feasible path, and executing with constrained decoding plus replanning on feedback drift →read the paper
Revisiting Text Ranking in Deep Research Dissects what actually matters in deep-research pipelines by testing retrieval units, reranking depth, and query mismatch, then shows how "agent-style" queries interact with classic IR components (and how to translate them into something rankers like) →read the paper
Data and scaling recipes for practical agents
🌟 On Data Engineering for Scaling LLM Terminal Capabilities (NVIDIA) Builds a synthetic task generation pipeline and an open terminal-task corpus, then demonstrates that data filtering, curricula, and long-context training can move terminal-agent performance by a lot without magical new architectures →read the paper
World models, multimodal consistency, and "imagination" for visual reasoning
🌟 The Trinity of Consistency as a Defining Principle for General World Models Proposes a clean conceptual checklist (modal, spatial, temporal consistency) for what a general world model should satisfy, then pairs it with a benchmark aimed at multi-frame reasoning and generation →read the paper
Imagination Helps Visual Reasoning, But Not Yet in Latent Space Tests whether latent "imagination tokens" causally drive visual reasoning, finds weak causal influence, and argues for teaching explicit textual imagination as a simpler, stronger alternative →read the paper
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis Automates artifact dataset creation by using agents to detect entities, inject controlled artifacts during diffusion, and curate explanations, turning "artifact handling" into something you can scale rather than hand-label forever →read the paper
Robotics and action generation with learned progress signals
TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics Extracts task-progress rewards from a pretrained video model's token logits instead of asking it to output numbers, making reward signals more robust and portable across tasks and robot setups →read the paper
World Guidance: World Modeling in Condition Space for Action Generation Compresses predicted future observations into a condition representation that is injected into the action inference path, aiming to keep action generation both fine-grained and generalizable →read the paper
Sequence modeling and generative modeling theory that changes how you interpret "what the model is doing"
Test-Time Training with KV Binding Is Secretly Linear Attention Reinterprets a broad slice of test-time-training-with-binding architectures as learned linear attention, yielding simplifications and parallel formulations while explaining behaviors that the "it's memorizing KV pairs" story cannot →read the paper
Memory Caching: RNNs with Growing Memory Extends recurrent models with cached hidden-state checkpoints so effective memory can grow with sequence length, offering a knob between RNN linear cost and transformer quadratic cost while narrowing the recall gap →read the paper
InfoNCE Induces Gaussian Distribution Analyzes contrastive learning and argues that InfoNCE-trained representations tend toward Gaussian structure under plausible assumptions, giving a more principled handle on a phenomenon people often treat as an empirical quirk →read the paper
The Design Space of Tri-Modal Masked Diffusion Models Pretrains masked diffusion from scratch across text, image, and audio modalities, then studies scaling and training knobs systematically, pushing diffusion beyond "language only" into a unified multimodal recipe space →read the paper
The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum Improves discrete diffusion sampling with predictor-corrector samplers that keep getting better as steps increase (instead of plateauing), and adds a more memory-efficient training curriculum for the Gaussian relaxation phase →read the paper
Reliability without retraining: steering black-box models at inference time
No One Size Fits All: QueryBandits for LLM Hallucination Mitigation Learns an online policy that chooses query-rewrite strategies per prompt using contextual features, reducing hallucinations in black-box settings where you cannot fine-tune or edit weights →read the paper
That's all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.


