FOD#122: What Are Thinking Tokens
and how they might be fueling the AI bubble
This Week in Turing Post:
Wednesday / AI 101 series: What are Modular Manifolds?
Friday / AI Interview: Renen Hallak @ Vast Data on what an AI OS is
Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox.
Thinking tokens and their role in the AI economy (and bubble)
“The costs will outweigh returns for many years, until we make models much more efficient to train and serve (thinking tokens, I’m looking at you).”
– Kevin Patrick Murphy, Google DeepMind
What are thinking tokens?
Early language models generated words straight to the screen. Prompt engineers hacked around their limits with “Let’s think step by step” – the so-called Chain-of-Thought (CoT) prompt that forced models to reason out loud. It worked, but it was messy: long, verbose, and expensive.
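To make the original trick concrete, here’s a minimal sketch of visible CoT prompting. The wrapper function and the example question are our own illustration, not from any paper – any chat model can consume the result:

```python
# Visible chain-of-thought: just append the cue and let the model reason
# in plain text. Everything here is an illustrative stand-in.

def with_cot(question: str) -> str:
    """Wrap a question so the model reasons out loud before answering."""
    return f"{question}\n\nLet's think step by step."

prompt = with_cot("A train leaves at 3pm going 60 mph. When has it covered 90 miles?")
print(prompt)
# Send `prompt` to any chat model: the trace it prints back IS the reasoning -
# visible, verbose, and billed like any other output tokens.
```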
Researchers have since turned CoT inward. Herel & Mikolov’s “Thinking Tokens for Language Modeling” (2024) was among the first to formalize thinking tokens – special hidden tokens that let models “buy” extra computation time on difficult problems. Similar ideas, like pause or filler tokens, had appeared earlier in transformer research, but Herel & Mikolov gave the concept its name. Today’s reasoning models – Claude 4.5, Gemini 2.5, GPT-5 – think largely off-stage, running long internal traces before replying, using hidden tokens that never make it to the screen.
The price of pause
In effect, large models learned to pause – buying themselves extra compute before committing to an answer. The upside is quality: they solve tougher math (up to IMO-level problems), plan code, and reason in multiple steps without flooding the chat. The downside is cost. Huge cost. Every hidden token consumes GPU time and power. A single “thoughtful” answer can burn thousands of extra forward passes – and for simple questions, that’s pure waste.
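How huge? A back-of-envelope sketch makes the point. Every number below is an assumption we picked for illustration – real per-token prices and trace lengths vary widely across models and vendors:

```python
# Back-of-envelope cost of hidden reasoning. All constants are illustrative
# assumptions, not actual vendor pricing or measured trace lengths.

PRICE_PER_1M_OUTPUT_TOKENS = 10.00   # assumed $/1M output tokens
VISIBLE_TOKENS_PER_ANSWER = 400      # assumed final reply length
HIDDEN_TOKENS_PER_ANSWER = 8_000     # assumed internal reasoning trace
CALLS_PER_DAY = 1_000_000            # assumed traffic for one product

def daily_cost(tokens_per_answer: int, calls: int) -> float:
    """Dollars per day spent on tokens at the assumed price."""
    return tokens_per_answer * calls * PRICE_PER_1M_OUTPUT_TOKENS / 1_000_000

print(f"visible replies: ${daily_cost(VISIBLE_TOKENS_PER_ANSWER, CALLS_PER_DAY):,.0f}/day")
print(f"hidden thinking: ${daily_cost(HIDDEN_TOKENS_PER_ANSWER, CALLS_PER_DAY):,.0f}/day")
# With these assumptions, the invisible trace costs 20x the visible answer.
```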
The two fronts of reasoning research (all links are collected below the editorial)
Researchers are now fighting on two fronts.
One camp wants leaner thinking – shorter internal traces, adaptive budgets, and smarter routing that spends reasoning only when it helps.
The other wants different thinking altogether – reasoning that happens outside the token treadmill.
The first camp is about compression. A bunch of research came out this year:
NoWait prunes filler tokens like “Hmm…” and “Wait…”, trimming reasoning length by roughly 40% without hurting accuracy (a toy sketch of the idea follows this list).
Soft Thinking pushes reasoning into a continuous concept space, cutting token use by about 22%.
And MARCOS – a very noteworthy paper – goes further, replacing discrete chains with a hidden Markov flow of continuous thoughts – up to 15× faster inference while matching or beating chain-of-thought baselines.
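The toy sketch promised above: the real NoWait suppresses reflection cues during decoding, so this post-hoc string filter only conveys the flavor of the idea, with invented filler words and an invented trace:

```python
# Toy illustration of the NoWait idea: strip self-reflection filler from a
# finished reasoning trace. (The actual method intervenes at decode time.)

import re

FILLER = re.compile(r"\b(Wait|Hmm|Alternatively)\b[,.]?\s*", re.IGNORECASE)

def prune_trace(trace: str) -> str:
    """Drop cue words that tend to trigger redundant rounds of re-reasoning."""
    return FILLER.sub("", trace)

trace = ("The answer is 12. Wait, let me verify. Hmm, 3 * 4 = 12. "
         "Alternatively, 12 / 3 = 4, so yes: 12.")
pruned = prune_trace(trace)
print(pruned)
print(f"saved {1 - len(pruned) / len(trace):.0%} of the characters")
```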
The second camp is about transformation – not thinking less, but thinking differently. And last week brought a lot of new papers on this topic:
Tiny Recursive Models (TRM) show that 7-million-parameter networks can iteratively refine their answers and beat frontier LLMs on ARC-AGI puzzles – trading width for iteration, small brains thinking many times instead of giant ones thinking once (a toy version of the loop appears after this list).
LaDiR brings diffusion-style denoising into the latent reasoning space, letting models explore possibilities stochastically without printing token chains.
ETD (Encode, Think, Decode) recursively re-applies a subset of the model’s own layers at test time – “recursive latent thoughts” – so the extra thinking happens inside the network rather than in emitted tokens.
Thinking on the Fly takes yet another angle: it optimizes latent thought vectors at test time rather than generating explicit textual reasoning.
The Markovian Thinker frames reasoning itself as a probabilistic state-transition system – a bridge between cognitive architectures and modern LLMs, complete with formal efficiency guarantees.
And ToTAL (Thought Template Augmented LCLMs) gives long-context models reusable “thought templates” – cached reasoning patterns that spare them from paying to rediscover the same plan twice.
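And the toy recursion promised above. Shapes, the update rule, and the stopping test are all invented for illustration – this shows the iterate-until-stable pattern, not TRM itself:

```python
# Tiny-recursive-model flavor: one small update function applied many times,
# refining a latent answer. Depth comes from iteration, not parameter count.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))  # a single tiny weight matrix, reused each step

def refine(z: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One refinement step: mix the current guess z with the problem embedding x."""
    return np.tanh(W @ z + x)

x = rng.normal(size=16)   # stand-in for an encoded puzzle
z = np.zeros(16)          # blank initial guess
for step in range(100):
    z_next = refine(z, x)
    if np.linalg.norm(z_next - z) < 1e-6:  # stop once the guess settles
        break
    z = z_next
print(f"guess settled after {step + 1} refinement steps")
```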
Together these projects signal a shift away from brute reflection toward more structured, reusable, and probabilistic reasoning – an emerging focus on making computation count.
How does that relate to the current discussion about an AI bubble?
Here we run head-first into economics. A Bloomberg investigation traced how the cost of thinking itself now anchors trillion-dollar circular deals. Nvidia will invest up to $100 billion in OpenAI to fund massive data-center buildouts; OpenAI, in turn, will fill those sites with Nvidia chips. Within days came a $300 billion Oracle cloud pact and a multibillion-dollar AMD partnership that gives OpenAI equity. Morningstar’s Brian Colello called the Nvidia–OpenAI structure an early “breadcrumb” if the bubble pops. OpenAI doesn’t expect to be cash-flow positive until late in the decade.

Image Credit: Bloomberg “OpenAI, Nvidia Fuel $1 Trillion AI Market With Web of Circular Deals”
Behind those loops lies the math of inference. Systems route easy prompts through fast paths and hard ones through deep ones, each burning more hidden tokens. Every pause boosts reliability – and inflates compute. Scaled to billions of calls, the industry’s balance sheet starts to resemble a cognitive-energy grid. Why do it? The logic is simple: if AI is the next platform, you build the rails first and sell the tickets later.
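A sketch of what that routing can look like in code. The model names, the token budget, and the difficulty heuristic are hypothetical stand-ins – production routers typically train a classifier instead:

```python
# Hedged sketch of fast-path / deep-path routing. Names and thresholds are
# invented; the point is that the hidden-token budget is where the money goes.

def looks_hard(prompt: str) -> bool:
    """Crude difficulty heuristic: length plus a few 'reasoning' trigger words."""
    triggers = ("prove", "derive", "optimize", "step by step", "debug")
    return len(prompt) > 500 or any(t in prompt.lower() for t in triggers)

def route(prompt: str) -> dict:
    """Pick a model and a hidden-token budget for this request."""
    if looks_hard(prompt):
        return {"model": "deep-reasoner", "thinking_budget_tokens": 8_000}
    return {"model": "fast-chat", "thinking_budget_tokens": 0}

print(route("What's the capital of France?"))
print(route("Prove that the sum of two even numbers is even, step by step."))
```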
The new cost curve of intelligence
Hidden reasoning changes the economics. It divides the market into quick inference and deep inference – fast models for everyday chat, slow ones for real problem-solving. It also makes the frontier less transparent: if the thinking stays hidden, so do the biases and failures inside it. And it strengthens the case for better small language models that can run on-device.
Circular financing and record capex only make sense if reasoning efficiency improves faster than compute costs rise. Otherwise, every new datacenter becomes a warehouse of invisible thinking tokens waiting to be monetized. Oracle’s GPU cloud already runs on thin margins – about 14¢ of profit per $1 of AI server sales – a fragile spread when each generation demands more thought per answer.
The race to affordable thought
Hidden reasoning once looked like progress toward artificial thought. In 2025 it also looks like the new inflation – a quiet expansion of compute disguised as intelligence. Whether the AI economy stabilizes or the bubble bursts will hinge on a single ratio: how many tokens the machine must think before it earns one dollar back.
So the next efficiency race is about the price of internal cognition – how many thoughts a model can afford to have. Humans, for comparison, get theirs almost for free.
🤝 Recommended: NODES 2025 → Where Graphs Meet AI Innovation – streaming live Nov 6
Andrew Ng and Emil Eifrem open the day with a fireside chat on the future of AI. After that, choose from 140+ community sessions, packed with real-world use cases in GraphRAG, context engineering, and knowledge graphs. Discover how developers everywhere are building the next wave of graph-powered AI agents. Secure your free spot today!
Topic 2: With so much out there, attention gets stretched too thin. What matters is holding focus on the things that shape the long-term horizon. In our Attention Span #3, we unpack Sora as part of the emerging “training economy” – and as a strategic new monetization layer for OpenAI. We also look at what Google has missed, even though Veo 3 might be the higher-quality model. Watch it here→
Links from the editorial:
Kevin Patrick Murphy’s tweet
Thinking Tokens for Language Modeling by David Herel, Tomas Mikolov
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency (paper)
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space (paper)
MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts (paper)
Less is More: Recursive Reasoning with Tiny Networks (paper)
LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning (paper)
Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts (paper)
Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization (paper)
The Markovian Thinker (paper)
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs (paper)
OpenAI, Nvidia Fuel $1 Trillion AI Market With Web of Circular Deals by Bloomberg
We are also reading:
Why China Built 162 Square Miles of Solar Panels on the World’s Highest Plateau by NYT
Advertisement, Privacy, and Intimacy: Lessons from Social Media for Conversational AI by Hugging Face’s Ethics Team
“Yeah, we’re trying to build very capable AI, AGI, superintelligence, whatever it’s called these days” – an interesting juggling of terms by Sam Altman in his interview with Stratechery
Unresolved debates about the future of AI by Helen Toner
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
Curated Collections – Video is on 🔥
News from The Usual Suspects ©
SemiAnalysis builds a mirror
Benchmarking LLM inference has long been a game of smoke and mirrors. But InferenceMAX by SemiAnalysis promises to cut through the fog with nightly benchmarks of major models on real hardware, showing actual throughput vs. latency tradeoffs. No cherry-picking, no stale configs – just raw, open-source performance data.
Reflection AI: America’s $2B Reply
Reflection AI has emerged from stealth with a $2 billion war chest and a mission: to return the frontier of open AI to U.S. soil. Formed by veterans of PaLM, AlphaGo, and Gemini, the team claims to have built frontier-scale MoE and RL platforms that rival Big Tech labs. Backers include NVIDIA, Sequoia, and Eric Schmidt. It’s not really clear, though, whether we need it.
Figure goes from prototype to product
Figure has unveiled its third-gen humanoid, aptly named Figure 03 – a redesigned, manufacturable, and Helix-powered leap toward general-purpose robotics. With new tactile sensors, low-latency vision, and a whisper-quiet home-friendly chassis, it’s built to scale. Backed by a new supply chain and its own factory, Figure’s betting it can ship not dozens, but thousands. The soundtrack is so ominous.
Models to pay attention to
Liquid AI introduces LFM2-8B-A1B
LFM2-8B-A1B just dropped on @huggingface!
8.3B params with only 1.5B active/token 🚀
> Quality ≈ 3–4B dense, yet faster than Qwen3-1.7B
> MoE designed to run on phones/laptops (llama.cpp / vLLM)
> Pre-trained on 12T tokens → strong math/code/IF
– Maxime Labonne (@maximelabonne), Oct 7, 2025
Apriel-1.5-15B-Thinker by SLAM Lab and ServiceNow is a 15B-parameter open multimodal reasoning model. Built on Pixtral-12B, it uses depth upscaling, staged continual pretraining, and high-quality supervised fine-tuning with reasoning traces. It achieves a score of 52 on the Artificial Analysis Intelligence Index, matching DeepSeek-R1-0528 while using fewer resources. It scores 87% on AIME’25 and 88.2% on CharXiv, outperforming larger models under single-GPU constraints → read the paper
Ling-1T from Ant Group is a trillion-parameter general-purpose LLM, part of the Ling (BaiLing) model family. Ling-1T achieves SOTA performance in logical reasoning, codegen, and mathematics, scoring 70.42% on the 2025 AIME benchmark with over 4,000 output tokens per problem. The Ling family includes non-thinking MoE models (Ling), reasoning-focused Ring models, multimodal Ming models, and the experimental LLaDA-MoE, all open-sourced for inclusive AGI development → read their announcement
That’s all for today. Thank you for reading! It’s shorter today due to Columbus Day. Hope you can also have some time off. Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How did you like it?