
FOD#102: Do Reasoning Models Think Too Much?

plus a new video format: Three WOWs and One Promise from the last week

This Week in Turing Post:

  • Wednesday, AI 101: we discuss BERT and an entire ecosystem of variants that it inspired

  • Friday, Interview: Insights from Devvret Rishi and Predibase

Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, Datadog, plus AI labs such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what's going on with AI →

Our schedule was disrupted by Memorial Day, which the United States celebrates on the last Monday of May. So today's FOD (which usually goes out on Monday) will be shorter, and we're trying a new format:

Reading AI news can feel like wading through a swamp of hype and hypotheticals. What's actually working? What's real? That's the question that sparked Three WOWs and One Promise – my weekly roundup of three breakthroughs that genuinely impressed me (after plowing through hundreds of AI newsletters) and one release that's full of promise.

The idea came from Kevin Scott, Microsoft's CTO. He once talked about the "Capabilities Overhang" – the huge gap between what AI can already do today and what we've actually built into products. That's the heart of this video: spotlighting what AI is already doing right now, in the real world.

So: watch it, comment, and smash that Subscribe button. Let's get the word out – AI isn't some distant sci-fi future. It's already here, and it's reshaping our lives in ways worth celebrating.

(Also, how cool would it be if my four sons told their friends their mom's a famous YouTuber?! Do subscribe ;)

To the main topic: Do Reasoning Models Think Too Much?

The efficiency arms race begins

As reasoning becomes the prized capability of modern LLMs, a new generation of papers is asking a surprisingly human question: Can these models learn when to stop thinking?

Just last week we saw a flurry of proposals – Thinkless, AdaptThink, ASRR, and Self-Braking Tuning (all links are under 'The freshest research papers' section) – all converging on a shared concern: reasoning is expensive, and most tasks don't require a 500-token chain of thought. These frameworks teach models to self-regulate, either by toggling between reasoning depths or by suppressing redundant steps altogether.

Their approaches vary – from reinforcement learning with control tokens (Thinkless, AdaptThink) to identifying and trimming overthinking through internal feedback loops (ASRR, SBT). But the goal is the same: maximize inference efficiency while preserving, or even enhancing, accuracy.
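To make the control-token mechanism concrete, here is a minimal sketch of how a trained router might toggle between the two modes at inference time. Everything in it is an assumption for illustration – the tag names and the `generate` interface are hypothetical, and the papers learn this behavior through reinforcement learning rather than hard-coding it:

```python
from typing import Protocol

class LM(Protocol):
    """Hypothetical minimal interface for a reasoning-capable language model."""
    def generate(self, prompt: str, max_new_tokens: int,
                 stop: str | None = None,
                 allowed_tokens: list[str] | None = None) -> str: ...

SHORT_TAG = "<short>"  # answer directly, no chain of thought
THINK_TAG = "<think>"  # emit an extended reasoning trace first

def route_and_answer(model: LM, prompt: str, max_think_tokens: int = 512) -> str:
    """Let the model choose its own reasoning depth via a leading control token."""
    # Step 1: the model emits a single control token that picks the mode.
    mode = model.generate(prompt, max_new_tokens=1,
                          allowed_tokens=[SHORT_TAG, THINK_TAG])
    if mode == SHORT_TAG:
        # Easy query: skip the trace and answer in a handful of tokens.
        return model.generate(prompt + SHORT_TAG, max_new_tokens=64)
    # Hard query: spend the budget on an explicit trace, then answer.
    trace = model.generate(prompt + THINK_TAG,
                           max_new_tokens=max_think_tokens, stop="</think>")
    return model.generate(prompt + THINK_TAG + trace + "</think>",
                          max_new_tokens=64)
```

The dispatch logic itself is trivial; the hard part these papers tackle is the reward design that teaches the model when each mode is actually worth it.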

Yet as they chase similar gains, these papers also highlight the limits of incrementalism. Their technical distinctions – while clever – blur in application. In the quest to tame overthinking, we may be seeing less of a creative divergence and more of a convergence toward a standard toolkit: dynamic thinking, token budgets, and adaptive control.

It raises a larger question: once we've optimized when to think, what happens next? Perhaps the next frontier isn't efficiency, but purpose – not how many steps a model takes, but why it takes them. Until then, these papers mark a collective step toward making reasoning models not only smarter, but more self-aware.

We recommend:

Swyx coined the term "AI engineer," and now he's running the best conferences for AI engineers and practitioners. I'll be there. San Francisco, June 3–5. Let's meet up – especially since I've got a 30% discount code for you. Register here; the lineup is amazing (and that's just the keynotes) →

Curated Collections

Our Deep Dive on JEPA is one of our most popular articles. This curated list is a great companion for learning more about the architecture →

Follow us on 🎥 YouTube, Twitter, and 🤗 Hugging Face

We are reading/watching

Image Credit: US Marine Corps

Models to play with

Models we find particularly interesting are marked with 🌟

  • 🌟🌟 BAGEL is an open-source foundation model trained on diverse interleaved multimodal data, outperforming peers in reasoning, manipulation, and understanding → read the paper (disclaimer: I haven't played with it yet, but it looks incredibly interesting)

  • 🌟 Claude Opus 4 & Sonnet 4 by Anthropic introduce extended thinking and hybrid modes that enable parallel tool use and memory retention via local files, and deliver state-of-the-art results on SWE-bench and agent workflows → read more

  • 🌟 Claude Code by Anthropic
    Now GA with IDE integrations, background GitHub tasks, and a full SDK for custom agents. Extends Claude's capabilities into hands-on dev tooling → read more

  • 🌟 Gemma 3n by Google is a mobile-first, multimodal model designed for local inference, with the memory footprint of a 4B model and dynamic submodel creation for latency-quality tradeoffs → read more

  • Reward Reasoning Model by Microsoft Research and Tsinghua University proposes chain-of-thought reward modeling with test-time compute adaptation, enabling better alignment through self-evolved reasoning → read the paper

  • 🌟 R3: Robust Rubric-Agnostic Reward Models introduces interpretable, generalizable reward modeling without fixed rubrics, improving alignment flexibility and transparency → read the paper

  • Panda is a model pretrained on synthetic chaotic systems that generalizes to real-world dynamics, even predicting PDEs with no retraining → read the paper

  • AceReason-Nemotron by Nvidia demonstrates that large-scale RL can outperform distillation in reasoning for both math and code, using curriculum-style training → read the paper

  • 🌟 Neurosymbolic Diffusion Models improve symbolic reasoning accuracy by modeling dependencies through discrete diffusion, achieving better calibration and generalization → read the paper

  • MMaDA combines diffusion-based reasoning with unified chain-of-thought fine-tuning and a new RL algorithm (UniGRPO), outperforming SDXL and LLaMA-3 on multiple tasks → read the paper

  • UniVG-R1 reinforces visual grounding with CoT and difficulty-aware reinforcement learning, achieving top scores on multiple video/image grounding tasks → read the paper

  • Web-Shepherd introduces a step-level reward model for web navigation, significantly improving trajectory evaluation accuracy and cost-efficiency → read the paper

  • 🌟 Toto by Datadog is a decoder-only foundation model with 151 million parameters for time-series forecasting on observability metrics → read the paper

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟.

Reasoning Efficiency & Optimization

Papers that focus on improving how, when, and how much large models "think," using methods like adaptive reasoning, compression, and hybrid strategies.

  • 🌟 Soft Thinking proposes training-free soft token generation in continuous space to emulate abstract reasoning and improve accuracy and efficiency in LLMs → read the paper

  • 🌟 Reasoning Path Compression compresses semantic reasoning traces without retraining to boost inference throughput while preserving accuracy → read the paper

  • Mind the Gap bridges "thought leaps" in chain-of-thought math reasoning by injecting missing intermediate steps → read the paper

  • Fractured Chain-of-Thought Reasoning truncates reasoning paths to balance token cost and accuracy using a new sampling strategy → read the paper

  • Optimizing Anytime Reasoning uses token budget-aware training with verifiable rewards for efficient and flexible inference (a decoding-side sketch of the budget idea follows this list) → read the paper

  • Think Only When You Need with LHRMs introduces hybrid thinking that chooses when to think using reinforcement-guided context awareness → read the paper

  • Reasoning Models Better Express Their Confidence shows that extended reasoning leads to better-calibrated confidence in model outputs → read the paper

  • 🌟 General-Reasoner enhances large language model reasoning across diverse domains by using a large-scale dataset and generative model-based answer verification, outperforming existing methods → read the paper
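Several of the papers above revolve around token budgets. Here is a minimal decoding-side sketch of the "anytime" idea: generate a trace until the model stops or a hard budget runs out, then force a final answer. The `lm_step` helper and the answer cue are assumptions for illustration – the papers make models budget-aware during training rather than merely clipping at inference:

```python
ANSWER_CUE = "\nFinal answer:"  # hypothetical cue that forces the model to commit

def anytime_reason(lm_step, prompt: str, budget: int) -> str:
    """Reason under a hard token budget, then demand an answer regardless.

    `lm_step(text) -> str` is a hypothetical helper returning the next token.
    """
    trace: list[str] = []
    for _ in range(budget):
        token = lm_step(prompt + "".join(trace))
        if token == "<eos>":  # the model finished reasoning on its own
            break
        trace.append(token)
    # Budget exhausted or trace complete: append the cue so the model must
    # produce an answer from whatever reasoning it has accumulated so far.
    return prompt + "".join(trace) + ANSWER_CUE
```

The point of budget-aware training is that the model stays useful at any cutoff, instead of collapsing when its trace is truncated.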

🌟 While each of these papers presents a valid and interesting improvement, their conceptual overlap significantly reduces the standalone novelty of each when viewed in the broader landscape of reasoning-efficiency frameworks. They are best seen as variations on a common optimization theme rather than as paradigm-shifting innovations:

  • AdaptThink trains reasoning models to decide when deep thinking is needed using reinforcement learning → read the paper

  • Thinkless employs control tokens and RL to toggle between short and extended reasoning for better efficiency → read the paper

  • Let LLMs Break Free from Overthinking introduces self-braking tuning to detect and halt redundant reasoning without external interventions (a toy version of the braking signal appears after this list) → read the paper

  • When to Continue Thinking dynamically suppresses unnecessary reasoning using adaptive regulation mechanisms → read the paper
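To show what "braking on redundancy" might look like mechanically, here is a toy heuristic that flags a trace as overthinking once new steps mostly repeat what earlier steps already said. Self-Braking Tuning learns its stopping signal during fine-tuning; the Jaccard-overlap check below is purely illustrative:

```python
def overlap(a: str, b: str) -> float:
    """Jaccard word overlap between two reasoning steps."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def should_brake(steps: list[str], threshold: float = 0.8, window: int = 2) -> bool:
    """Brake when the last `window` steps are near-duplicates of earlier ones."""
    if len(steps) <= window:
        return False
    recent, earlier = steps[-window:], steps[:-window]
    return all(max(overlap(s, p) for p in earlier) >= threshold for s in recent)

# Example: the third step just restates the second, so the trace should stop.
steps = [
    "compute 12 * 7",
    "12 * 7 = 84, so the answer is 84",
    "so 12 * 7 = 84, the answer is 84",
]
print(should_brake(steps, threshold=0.6, window=1))  # True
```

The appeal of the tuning approaches is that a learned version of this signal lives inside the model itself, so no external judge or second pass is needed.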

Multi-Modal and Multi-Tool Reasoning

Papers enhancing reasoning via integration across code, logic, tools, and vision.

  • 🌟 Learning to Reason via Mixture-of-Thought combines natural language, code, and symbolic logic for superior logical reasoning → read the paper

  • 🌟 Tool-Star builds a multi-tool reasoning system via reinforcement learning and scalable data synthesis → read the paper

  • Pixel Reasoner enables visual reasoning in pixel space via operations like zoom and frame selection → read the paper

  • Think or Not? for Vision-Language Models allows VLMs to decide whether to reason or not, reducing length and token usage → read the paper

Post-Training Control & Tuning Strategies

Papers about steering or adjusting pretrained models without major rearchitecture.

  • Two Experts Are All You Need (RICE) identifies and utilizes key cognitive experts in MoE architectures for more efficient reasoning → read the paper

  • 🌟 Be Careful When Fine-tuning reveals a backdoor vulnerability where fine-tuning data can be stolen via black-box access → read the paper

  • 🌟 QwenLong-L1 combines SFT and RL to train long-context reasoning models with curriculum-based scaling → read the paper

Model Compression, Quantization & Deployment

Papers that enable lighter, faster, and more secure deployment of large models.

  • Exploring Federated Pruning for LLMs preserves privacy in model compression via client-specific pruning without data sharing → read the paper

  • Scaling Law for Quantization-Aware Training analyzes quantization error trends and proposes mixed-precision to improve QAT → read the paper

Training Paradigms & Model Design

New frameworks and architectures to improve training efficiency, inference flexibility, or overall design philosophy.

  • Chain-of-Model Learning introduces layer-wise sub-representation chaining in Transformers for scalable and flexible inference → read the paper

  • Model Merging in Pre-training investigates merging checkpoints mid-pretraining for faster, cost-effective LLM training → read the paper

  • 🌟 Alchemist (by Yandex) is a small but powerful SFT dataset for text-to-image models, improving generative quality → read the paper

Autonomous Agents & Scientific Automation

Papers extending LLMs into agentic roles across scientific or software domains.

  • NovelSeek builds a closed-loop multi-agent system for autonomous scientific research → read the paper

  • Efficient Agent Training for Computer Use trains computer-use agents using a small human-annotated set enhanced via synthetic generation → read the paper

Symbolic & Structured Query Enhancement

Blending neuro-symbolic methods for better query understanding and retrieval.

  • 🌟 Neuro-Symbolic Query Compiler uses AST-based neuro-symbolic grammar to improve RAG systems' understanding of complex queries → read the paper

That's all for today. Thank you for reading! Please share this newsletter with your colleagues if it can help them deepen their understanding of AI and stay ahead of the curve.

Leave a review!
