Reasoning Models are a huge part of AI community discussions and debates, gaining more and more attention every day since the appearance of OpenAI’s o1 model and then the open-sourced DeepSeek-R1. It seems that this topic has already been picked apart from every angle, which is a sign that we are dealing with a true breakthrough. As Nathan Lambert claims in his blog, “this year every major AI laboratory has launched, or will launch, a reasoning model” because these models can solve “the hardest problems at the frontier of AI” far better than previous-generation models.
While we already have so much information about Reasoning Models and every week a lot of papers introduce their approaches for improving “reasoning” capabilities of models, surprisingly, we still don’t have a solid definition of what exactly a Reasoning Model is. The main debate is whether reasoning models even deserve to be a separate category. But let’s call things what they are.
Today, we’re going to take a deep dive into what exactly counts as a Reasoning Model, which models fall into that category, and where this field might be headed in the future to unlock the upcoming AI possibilities. So let’s get started.
In today’s episode, we will cover:
Controversial nature of Reasoning Models
What makes Reasoning Models different?
Models which fall into the “Reasoning” category
Limitations: The problems of overthinking and more
Future perspectives
Mitigating overthinking
Expanding or extending general capabilities
Agentic capabilities
Conclusion
Sources and further reading
What Is a Reasoning Model? Definition & Difference from LLMs
The rise of Reasoning Models has sparked lively debate in the AI field: Should these “thinking” models be considered a separate category of AI, distinct from plain LLMs? Or are they essentially the same core technology with some clever add-ons? Let’s look at both sides of this issue.
Proponents argue that models optimized specifically for reasoning represent a qualitative leap from LLMs despite being built on them. Thanks to new concepts, such as a dedicated reinforcement learning (RL) phase in training and increased inference-time compute, reasoning models unlock new capabilities, which are used as arguments for placing them in a separate category:
Performance leaps: Models like OpenAI’s o1 and o3 outperform traditional LLMs on hard and long tasks, like solving 20-step math problems or generating complex code.
Smaller Reasoning Models outperforming larger LLMs on reasoning benchmarks.
Many major AI labs released reasoning-focused models in 2024–2025.
Branding shift: Models are explicitly marketed as “reasoners”, emphasizing that they are different from general chatbots.
These models are designed not just for output generation, but for reasoning, planning, and tool use – the early foundations of agentic AI.
Nathan Lambert, who emerged as a key explainer of Reasoning Models, suggests calling them “Reasoning Language Models (RLMs)” and notes that their emergence has “muddied” the old taxonomy of pre-training vs fine-tuning. He also emphasized how RLMs redefined the post-training landscape and how Reinforcement Learning with Verifiable Rewards (RLVR) has driven major advances in model capabilities. (check the Sources section for all links)
While Andrej Karpathy hasn’t explicitly argued for or against a new category, his observations highlight that something qualitatively different is happening inside these RL-trained models: “It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are emergent (!!!) and this is actually seriously incredible,” – he wrote in his post on Twitter, on January 27, 2025.
Many academic studies, like “Reasoning Language Models: A Blueprint”, echo the idea of separating Reasoning Models from LLMs, describing them as a “transformative breakthrough in AI, on par with the advent of ChatGPT” that moves AI closer to general problem-solving and possibly AGI.
On the other side, many remain cautious about overhyping Reasoning Models as something fundamentally new. Under the hood, they are still transformer-based language models using the same next-token prediction objective. Their improved performance comes not from architectural breakthroughs but from methods like supervised fine-tuning, reinforcement learning, chain-of-thought prompting, longer inference runs, and carefully curated training data.
So the main counterpoints include:
No architectural change: Reasoning ability is a product of optimized training regimes (e.g., SFT, RLHF, ReFT), inference-time scaffolding (like CoT and majority voting), and increased compute – not new model designs. They are still standard autoregressive LLMs.
Limited generalization: High performance in math, code, or logic puzzles doesn’t generalize to open-ended, commonsense, or causal reasoning. Models often fail on tasks that require real-world understanding, novel abstraction, or reasoning under uncertainty. ARC, counterfactuals, and long-horizon tasks remain challenging.
Branding vs. substance: Terms like “reasoner” risk overstating capabilities and fueling AGI hype.
Narrow domains: Some reasoning-tuned models underperform on general language tasks like storytelling, open-ended dialogue, or real-world question answering. Their specialization can lead to regressions in fluency, creativity, or common sense – casting doubt on how “general” their reasoning really is.
Melanie Mitchell, Professor at the Santa Fe Institute, questions whether reasoning models reflect real understanding or simply mimicked heuristics: “The performance of these models on math, science, and coding benchmarks is undeniably impressive. However, the overall robustness of their performance remains largely untested, especially for reasoning tasks that, unlike those the models were tested on, don’t have clear answers or cleanly defined solution steps” (from the article “Artificial intelligence learns to reason”).
But do we really need a completely new architecture with the best capabilities in all domains to define a model as a new type? Maybe we should admit that sometimes new concepts, new paradigms, and a new view of what we already have can open up a new family of models specialized for certain tasks.
From the Turing Post side, we do define Reasoning Language Models (RLMs) as one of the model types. Here is why it makes sense.
Note: In the post below, we called Reasoning Models LRM, or Large Reasoning Models, but taking into account the variation of size of Reasoning Models that already exist, it’s better to define the entire group as “Reasoning Language Models”, or RLMs.
How Reasoning Models Work: Chain-of-Thought Explained
If we gather the main features of RLMs into one definition, we would say that they are advanced AI systems specifically optimized for multi-step logical reasoning, complex problem-solving, and structured thinking. RLMs incorporate test-time scaling, RL post-training, Chain-of-Thought reasoning, tool use, external memory, strong math and code capabilities, and a more modular design for reliable decision-making.
Let’s break down what stands behind each feature step by step:
Post-training with Reinforcement Learning (RL):
The appearance of RLMs sparked a boom of attention to RL. RLMs are trained through trial and error and rewarded for correct reasoning steps and answers on challenging tasks like math problems, coding challenges, or logical puzzles. This contrasts with the usual supervised fine-tuning (SFT) for instruction following widely used in LLMs: SFT uses ground truth outputs from humans or trusted models and minimizes a token-level loss (like cross-entropy). While SFT is controlled and encourages models to imitate desirable behavior, RL introduces exploration, allowing models to optimize for abstract goals (like helpfulness or safety) even if those deviate from training data.
Training via large-scale RLVR (Reinforcement Learning with Verifiable Rewards) is the main weapon of RLMs for developing useful skills through exploring the best reasoning strategy.
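To make the “verifiable” part concrete, here is a minimal sketch of a rule-based reward for math-style tasks: the score comes from a deterministic check of the final answer rather than from a learned judge. The `Answer:` marker and both function names are assumptions made for illustration, not the format of any particular model or paper.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if it is absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parsable answer means no reward
    return 1.0 if answer == reference.strip() else 0.0
```

Real pipelines usually normalize answers (fractions, units, whitespace) or run unit tests for code, but the principle stays the same: the reward can be checked mechanically.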
Different reinforcement learning methods – such as PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), RAFT-style rejection sampling, and newer variants like Multi-Layer GRPO, EM Policy Gradient, etc – are used across various reasoning language models to align multi-step reasoning behavior with desired outcomes and improve reliability across complex tasks.
This runs through the evolution: from PPO and rejection-based sampling (RAFT), to GRPO and its multi-layer self‑correction variant (MGRPO), to more recent frameworks like EM Policy Gradient. Each targets reinforcing structured reasoning, intermediate verification, or efficient trajectory optimization.
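As a rough illustration of what “group relative” means in GRPO: several completions are sampled for the same prompt, and each one is scored against the group’s own mean and spread instead of against a separate critic model. The sketch below covers only this advantage computation (not the clipped policy update or any KL term), and the use of the population standard deviation and the `eps` constant are assumptions of this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get a positive advantage and incorrect ones a negative advantage; if every completion in the group gets the same reward, all advantages are zero and the group contributes no learning signal.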
Inference-time scaling:
Think properly, then answer. Put more formally, this paradigm allows RLMs to first internally generate a reasoning trace (often a Chain-of-Thought), which can be many tokens long, and then produce the final answer based on that internal computation. The main reasoning process moves to the inference stage, and RLMs keep leaning on their generated “ideas” during inference. It’s like doing scratch work and showing intermediate steps. This contrasts with LLMs, which lack transparency about their reasoning steps. We covered what test-time compute is and how to scale it in one of our AI 101 episodes.
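The “trace first, answer second” structure can be sketched from the consumer side. This example assumes the model wraps its reasoning in `<think>...</think>` tags, a convention used by some open models; other models use different delimiters or hide the trace entirely, so treat the tag name and the helper below as illustrative.

```python
import re
from typing import NamedTuple

class ReasonedResponse(NamedTuple):
    trace: str   # the intermediate reasoning ("scratch work")
    answer: str  # the final answer shown to the user

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_trace_and_answer(raw: str) -> ReasonedResponse:
    """Separate the reasoning trace from the final answer in a raw completion."""
    match = THINK_BLOCK.search(raw)
    if match is None:
        # No trace found: treat the whole output as the answer
        return ReasonedResponse(trace="", answer=raw.strip())
    return ReasonedResponse(
        trace=match.group(1).strip(),
        answer=raw[match.end():].strip(),
    )
```

Keeping the two parts separate also makes it easy to measure how many tokens went into thinking versus answering.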
Multi-sampling:
One technique used in RLMs is to sample many different Chains-of-Thought or candidate answers and then aggregate them, for example by majority voting or by selecting the best one with a reward model. Reasoning Models follow a simple idea: one pass isn’t enough. Generating multiple solutions in parallel and choosing a consensus answer is the way to higher scores, and thus to better reasoning. Classic LLMs don’t do this – they rely on more static one-shot generation.
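Majority voting over sampled answers is simple enough to show in a few lines. In this sketch the list of answers stands in for the final answers extracted from several independently sampled reasoning chains; the function name is an assumption for illustration.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer that appeared first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Final answers from four sampled reasoning chains for the same question
consensus = majority_vote(["42", "41", "42", "7"])
```

Selecting with a reward model follows the same shape, except that each candidate is scored and the top-scoring one wins instead of the most frequent one.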
Basically, RLMs form a reasoning structure via chains, trees, graphs and implement reasoning strategies that can be Monte Carlo Tree Search (MCTS), Beam Search, Best-of-N.
Different types of models are used for different tasks. Two important ones are:
Policy model: It’s like an idea generator that suggests the next reasoning step.
Value model: It’s the evaluator that judges how good a certain step or path is.
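The interplay between the two roles can be sketched as a small beam search over reasoning steps: the policy proposes possible next steps, and the value function decides which partial paths are worth keeping. Both callables here are toy stand-ins for real models, and all names are assumptions made for illustration.

```python
from typing import Callable

Path = tuple[str, ...]  # a reasoning path is a sequence of steps

def beam_search(
    propose: Callable[[Path], list[str]],  # policy: path -> candidate next steps
    score: Callable[[Path], float],        # value: path -> estimated quality
    beam_width: int,
    depth: int,
) -> Path:
    """Keep the `beam_width` best partial paths at every step, return the best one."""
    beams: list[Path] = [()]
    for _ in range(depth):
        candidates = [path + (step,) for path in beams for step in propose(path)]
        if not candidates:
            break  # the policy has nothing more to propose
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=score)

# Toy models: the policy always offers steps "a" and "b"; the value counts "b" steps
best = beam_search(
    propose=lambda path: ["a", "b"],
    score=lambda path: float(path.count("b")),
    beam_width=2,
    depth=3,
)
```

Best-of-N is the special case where complete solutions are sampled and scored only at the end, while MCTS replaces the fixed-width beam with a search tree that balances exploration and exploitation.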

These main features of RLMs optimize the model’s reasoning performance rather than just next-word prediction. As models get better at reasoning, they generate more tokens per response, which helps fine-tune reasoning components like strategy and abstraction.
Each Reasoning Model implements its own tricks and algorithms, so let’s look at their specific ideas and concepts more precisely.
Best Reasoning Models in 2026: o1, DeepSeek-R1, Qwen 3 & More
Thanks to Nathan Lambert, who explores and builds RLMs for Ai2, we can go through a solid comprehensive list of Reasoning Models. Our main goal here is to see the full variety of approaches that different RLMs use, which places them into this category of models.
DeepSeek-R1: It was one of the first openly documented reasoning models. DeepSeek-R1 incorporates a smart training strategy via multi-stage training that adds a “cold start” SFT phase before the main RL training. It uses a specific RL algorithm called Group Relative Policy Optimization (GRPO) to efficiently train its policy without relying on large critic models.
Results: 97.3% on MATH-500, 79.8% on AIME 2024, and the 96.3 percentile on Codeforces, plus strong CoT performance on LiveCodeBench.
Kimi-1.5 by Moonshot AI is a multimodal RLM (text+vision) that bets on scaling the context window up to 128k tokens to further scale RL. Partial rollouts allow it to reuse large portions of previous trajectories. Policy optimization is handled with a custom variant of online mirror descent (an alternative to standard PPO) for stability. Another cool feature of Kimi-1.5 is “long2short” transfer – training smaller or shorter-output models by distilling knowledge from the long-CoT version.
Results: 77.5% on AIME, 96.2% on MATH 500, 94th percentile on Codeforces, 74.9% on MathVista.

Image Credit: Kimi-1.5 Original paper
Open-Reasoner-Zero (ORZ) by StepFun shows that plain, standard Proximal Policy Optimization (PPO) with generalized advantage estimation is enough to train a strong reasoning model. It removes the typical KL-divergence penalty, so the model is allowed to improve its behavior via RL, guided only by straightforward rule-based rewards. As a result, it started to generate longer CoT solutions.
Performance: 36.0% on AIME 2025, and 92.2% on MATH500
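For readers unfamiliar with generalized advantage estimation (GAE), here is a minimal sketch of the standard recursion over one trajectory. The `gamma` and `lam` values in the example are illustrative and are not claimed to be the settings ORZ uses.

```python
def gae(rewards: list[float], values: list[float], gamma: float, lam: float) -> list[float]:
    """Generalized advantage estimation for a single trajectory.

    `values` must hold one more entry than `rewards`: the value estimate of the
    state after the last step (0.0 if the episode has ended).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# A sparse, rule-based reward that arrives only at the final step
advantages = gae(rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

With `gamma = lam = 1.0` and zero value estimates, the final reward is credited equally to every step of the trajectory, which is why sparse correctness rewards can still shape long chains of thought.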
Seed 1.5-Thinking by ByteDance Seed takes a different path by using a Mixture-of-Experts (MoE) architecture, with 200B total and ~20B active parameters, to boost reasoning. This design allows efficient use of specialized “thinking” modules and scaling out the model: adding experts increases diversity of thought without sacrificing inference speed. The training involves standard instruction tuning plus iterative RL on reasoning-heavy tasks.
Results: 86.7% on AIME 2024, 77.3% on GPQA (science)
Phi-4-reasoning, developed by Microsoft, combines CoT SFT (with selected “teachable” prompts and high-quality step-by-step solutions) and a brief phase of outcome-based RL.
Performance: Phi-4-reasoning is notable for achieving top-tier reasoning performance without enormous model size. With only 14B parameters, it achieves the following results:

Image Credit: Phi-4-reasoning Technical Report
Llama-Nemotron by NVIDIA optimizes inference speed and memory use. It applies neural architecture search starting from Llama 3 models, automatically finding the most efficient network modifications without sacrificing accuracy. Its post-training has two stages: 1) SFT on curated reasoning data, 2) large-scale RL. Uniquely, Llama-Nemotron introduced a dynamic reasoning toggle: at inference, users can switch the model between normal fast chat and a full reasoning mode.
Results: 97% on MATH 500, 75.2% on AIME 2025, and 68.1% on LiveCodeBench
Qwen 3 from Alibaba is a unified model that flexibly handles both quick answers and deep reasoning, managed within one model via system prompts and query analysis. It is a multi-lingual model, using MoE with ~22B active and 235B total parameters.
Its post-training strategy combines: SFT on CoT data, RL and a novel “unified alignment” process to ensure the model’s various reasoning styles and languages remained consistent.
Performance: ~81.5% on AIME’25, 70.7% on LiveCodeBench, and a Codeforces rating around 2056
Skywork Open Reasoner 1 by Skywork AI starts from the DeepSeek-R1-Distill models, but applies an improved RL fine-tuning pipeline. The core algorithm remained reward-driven policy optimization (comparable to PPO), using known-correct answers in math/code as rewards. It also monitors and adjusts entropy to keep the model exploring various solution paths during training.
Performance: 82.2% on AIME24, 73.3% on AIME25, and 63.0% on LiveCodeBench
Xiaomi MiMo is a small (7B parameters) but powerful RLM that optimizes both the pre-training and RL stages end-to-end. The team curated a 25-trillion-token corpus with a three-stage data mixing strategy and introduced a Multi-Token Prediction objective (to predict multiple tokens in one go) for pre-training. MiMo used RL on a custom dataset of 130,000 math and programming problems with carefully shaped rewards and difficulty-driven resampling.
Scores: 55.4% on AIME 2025, 57.8% on LiveCodeBench

Image Credit: MiMo Original paper
Magistral is Mistral AI’s entry into Reasoning Models and the latest RLM on this list, notable for using a pure RL approach. Mistral built their own scalable RL pipeline and trained the model using solely their own infrastructure and models. Magistral implements the GRPO algorithm and an asynchronous, distributed RL system where generators produce outputs nonstop, verifiers score them instantly, and trainers update the model continuously.
After RL training on text only, Magistral retained strengths like multimodal understanding (from its pre-training), proper instruction following, and even function call formats.
Results: For example, Magistral Small scores up to 62.8% on AIME 2025, 68.8% on GPQA Diamond, and 55.8% on LiveCodeBench

Image Credit: Magistral Original paper
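The generator / verifier / trainer split described for Magistral can be pictured with a toy, single-process sketch built on `asyncio` queues. This is not Mistral’s implementation: the sampling, the rule-based check, and the “update” below are all stand-ins, and every name is an assumption made for illustration.

```python
import asyncio

async def generator(prompts, to_verify):
    """Produce completions and hand them off without waiting for training."""
    for prompt in prompts:
        completion = f"{prompt} -> 4"  # stand-in for sampling from the policy
        await to_verify.put((prompt, completion))
    await to_verify.put(None)  # sentinel: no more work

async def verifier(to_verify, to_train):
    """Score each completion as soon as it arrives."""
    while (item := await to_verify.get()) is not None:
        prompt, completion = item
        reward = 1.0 if completion.endswith("4") else 0.0  # stand-in rule-based check
        await to_train.put((prompt, completion, reward))
    await to_train.put(None)

async def trainer(to_train, batch_size):
    """Collect scored samples into batches; each full batch triggers one 'update'."""
    batch, updates = [], []
    while (item := await to_train.get()) is not None:
        batch.append(item)
        if len(batch) == batch_size:
            # Stand-in for a gradient step: record the batch's mean reward
            updates.append(sum(reward for _, _, reward in batch) / batch_size)
            batch = []
    return updates  # an incomplete final batch is dropped in this sketch

async def run_pipeline(prompts, batch_size=2):
    to_verify, to_train = asyncio.Queue(), asyncio.Queue()
    _, _, updates = await asyncio.gather(
        generator(prompts, to_verify),
        verifier(to_verify, to_train),
        trainer(to_train, batch_size),
    )
    return updates

updates = asyncio.run(run_pipeline(["2+2", "1+3", "3+1", "0+4"]))
```

The point of the decoupling is that no stage blocks on another: generation keeps running while earlier samples are still being scored or trained on. In a real distributed system the queues would be network channels and the generators would periodically receive fresh weights.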
As for the models developed by the biggest AI players, like OpenAI, Anthropic, and Google, they demonstrate interesting agentic capabilities along with powerful reasoning.
Anthropic’s Claude 4 features an “extended thinking” mode, implementing CoT steps internally and invoking external tools, like web browsing or code execution, during reasoning. Claude 4 runs parallel reasoning paths and uses an internal reward model to select the best answers. It can even create “memory files” for long-term knowledge tracking.
OpenAI’s o1 was trained from the start with step-by-step reasoning in mind, using RL techniques to teach it how to plan, reflect, and self-correct internally. o3, the successor to o1, further boosts the CoT reasoning paradigm. It offers adjustable “reasoning effort” levels – for instance, o3-pro uses extra computation, running multiple reasoning chains in parallel and selecting the best answer via an internal scoring model. o3 also retains multimodal capabilities and supports long context windows.
Google’s Gemini 2.5 introduces a “thinkingBudget” interface that lets developers control reasoning depth. It supports dynamic CoT scaling and native planning, helping balance speed, accuracy, and compute.
While many developers continue to focus on improving Reasoning Models, it's important to focus on their weaker spots. Yes, reasoning models are not the ideal AI products, and here is why.
Reasoning Model vs LLM: Key Differences & When to Use Each
The main issue that developers and users of RLMs face is overthinking. Maybe you’ve also experienced the moment when you forgot to switch from a reasoning model (say, o1 or o3) to a more usual LLM, like GPT-4o, for a very easy task, and the RLM started to reason thoroughly through this easy query, showing all the unnecessary reasoning steps.

KS: I finished editing this article before the model reasoned through this question.
As RLMs are encouraged to produce long reasoning chains, they may sometimes go in circles or include superfluous steps that don’t improve the answer and can even hurt accuracy. This also wastes compute resources. It shows that a model can’t always identify when to stop. Ironically, this is also the moment when you think that a simpler LLM might be better.
Another problem is that reasoners may produce internal reasoning chains that are not readable and may look like symbolic code. This is a shorthand the model can develop for itself to increase effectiveness. Typical LLMs, by contrast, avoid this problem because they are trained to always stay aligned with natural language.
And last but not least: as we’ve already mentioned, RLMs are not universal reasoners for all tasks. They excel mostly in math, logic, and coding tasks, while their creative and open-ended task-solving capabilities take a back seat.
The logical questions: What can be done about these issues? And how can Reasoning Models evolve further?
Reasoning Models for Coding, Math & Science: What’s Next
Mitigating Overthinking
Overthinking isn’t just wasteful – it’s one of the main bottlenecks for deploying reasoning models in real-world, latency-sensitive environments. Some models already attempt to solve this by constraining the reasoning chain length. For example, Kimi‑1.5 offers a “short-CoT” mode that favors compact reasoning for simpler queries.
Calibration via Reasoning Budget
A growing body of work is focused on calibrating how much effort a model should spend reasoning, depending on the task. Google’s Gemini 2.5 models introduce a thinkingBudget parameter that sets an explicit token budget for reasoning – ranging from 0 (disabling chain-of-thought) to a fixed limit or -1 for dynamic scaling.
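As a sketch of how such a budget is passed, the helper below builds a JSON request body with a `thinkingBudget` value. The nesting under `generationConfig.thinkingConfig` is an assumption based on the public Gemini docs and should be checked against the current API reference; the helper itself is hypothetical and makes no network call.

```python
import json

def build_request_body(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent-style request body with an explicit reasoning budget.

    thinking_budget: 0 disables thinking, -1 requests dynamic scaling,
    and a positive value caps the number of reasoning tokens.
    """
    if thinking_budget < -1:
        raise ValueError("thinking_budget must be -1, 0, or a positive token count")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # Assumed field nesting; verify against the current API reference
        "generationConfig": {"thinkingConfig": {"thinkingBudget": thinking_budget}},
    }

quick = build_request_body("What is 2 + 2?", thinking_budget=0)
deep = build_request_body("Prove that sqrt(2) is irrational.", thinking_budget=4096)
payload = json.dumps(deep)  # what would be sent as the HTTP request body
```

The accepted range depends on the model variant, and some variants may not allow thinking to be disabled at all.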
Academic research like AdaCtrl and Budget Guidance further explores adaptive reasoning depth. These systems assess question difficulty and either automatically adjust the reasoning path or allow users to control it with simple tags like “[Easy]” or “[Hard].”
While this area is still in early stages, the goal is clear: more reliable, efficient models that can reason deeply when needed – and respond quickly when they don’t have to.
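The user-facing tags mentioned above can be pictured as a thin routing layer in front of the model. The sketch below maps a leading “[Easy]” or “[Hard]” tag to a token budget; the budget numbers and the function are hypothetical, and in the cited research the model itself is trained to respect such tags rather than relying on an external lookup.

```python
# Hypothetical budgets, in reasoning tokens
BUDGETS = {"[Easy]": 256, "[Hard]": 4096}
DEFAULT_BUDGET = 1024

def budget_for(query: str) -> tuple[int, str]:
    """Return (reasoning budget, query with the tag stripped)."""
    text = query.strip()
    for tag, budget in BUDGETS.items():
        if text.startswith(tag):
            return budget, text[len(tag):].strip()
    return DEFAULT_BUDGET, text  # untagged queries get a middle-of-the-road budget

budget, question = budget_for("[Easy] What is 2 + 2?")
```

An automatic variant would replace the tag lookup with a difficulty estimate, produced either by the model itself or by a small classifier.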
Toward a Standardized Budget Interface
Some commercial RLMs – such as o3, Claude 4, Qwen 3, and Llama‑Nemotron – appear to shift between “light” and “hard” reasoning modes internally, though this functionality remains opaque to users. What’s missing is a clear, standardized interface to manage these tradeoffs explicitly.
If reasoning-budget control became a shared interface across models – either as an API parameter or a user-level toggle – it could unlock new levels of efficiency and control. Models could dynamically allocate resources to match task complexity, or let users steer reasoning depth based on their own tolerance for latency, cost, or verbosity.
Expanding vs. Extending Capabilities
The next big question for RLMs is whether to expand or extend their scope.
In the expansion path, RLMs broaden their domain – from code and math toward more open-ended reasoning, creative problem-solving, and decision-making. This direction leans toward heavier, more general-purpose models capable of tackling a wider range of complex tasks.
The extension path, by contrast, keeps RLMs focused on their strengths: logic, math, and structured problem-solving. Here, the goal is deeper specialization rather than broader reach. This specialization could be a feature, not a bug – enabling a division of labor among model types. If RLMs refine their role as domain-specific experts, other models can evolve in parallel for creative writing, social reasoning, or multi-modal tasks.
RLMs and Agentic Systems
Reasoning models aren’t agents yet – but they’re getting closer.
Agentic systems typically require seven core components: profiling, knowledge, memory, reasoning and planning, reflection, actions, and human-AI communication. RLMs already deliver strong reasoning capabilities and are starting to exhibit reflection behaviors like in-session self-correction – an ability closely related to meta-learning, where models adapt quickly to new tasks with minimal data. Some models – like Claude 4 and o3 – also hint at proto-agentic traits, including basic memory traces and tool use in research settings. World models offer another path to grounding reasoning in physical reality. For a look at how code execution and world modeling combine, see Meta's Code World Model, which also adapts GRPO for multi-turn software engineering tasks. For a look at how inference hardware is evolving to run these models faster and cheaper, see The Inference Chip Wars, covering NVIDIA Vera Rubin, Taalas, and MatX.
There are two promising directions here:
RLMs as agentic components: In modular agent systems, RLMs could function as plug-and-play reasoning engines, surrounded by complementary modules for planning, memory, and tool execution.
RLMs as proto-agents: Alternatively, developers could progressively augment RLMs with memory persistence, action-taking, and user modeling—evolving them into standalone agentic systems.
Either way, RLMs offer a strong foundation for building the reasoning core of agentic architectures. They may not be agents yet, but they’re a necessary part of getting there.
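The “plug-and-play reasoning engine” idea can be sketched as an agent loop in which the reasoner is just one swappable component next to tools and memory. All class and method names here are hypothetical, and the scripted reasoner stands in for a real RLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol

class Reasoner(Protocol):
    def next_action(self, goal: str, history: list[str]) -> tuple[str, str]:
        """Return (tool_name, tool_input), or ("finish", final_answer)."""

@dataclass
class Agent:
    reasoner: Reasoner                        # the swappable reasoning core
    tools: dict[str, Callable[[str], str]]    # action module
    memory: list[str] = field(default_factory=list)  # simple episodic memory

    def run(self, goal: str, max_steps: int = 5) -> str:
        for _ in range(max_steps):
            name, arg = self.reasoner.next_action(goal, self.memory)
            if name == "finish":
                return arg
            observation = self.tools[name](arg)
            self.memory.append(f"{name}({arg}) -> {observation}")
        return "step limit reached"

class ScriptedReasoner:
    """Stand-in for an RLM: call the calculator once, then report its result."""
    def next_action(self, goal, history):
        if not history:
            return ("calculator", "6*7")
        return ("finish", history[-1].split(" -> ")[-1])

def multiply(expr: str) -> str:
    left, right = expr.split("*")
    return str(int(left) * int(right))

agent = Agent(ScriptedReasoner(), {"calculator": multiply})
result = agent.run("What is 6*7?")
```

Because the agent depends only on the `Reasoner` interface, a stronger or cheaper reasoning model can be dropped in without touching the memory or tool code, which is the division of labor the modular direction relies on.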
Conclusion
Reasoning Models are more than just a new marketing term; they represent a tangible evolution in what we can demand from AI. For developers and researchers, they are a powerful new tool, purpose-built for the kind of verifiable, step-by-step problem-solving that underpins scientific discovery and complex software engineering.
However, their rise forces a critical re-evaluation of our approach. We are moving from a world of "one model fits all" to a more specialized toolkit. The central challenge is now one of orchestration: knowing when to deploy a fast, creative LLM versus a deliberate, resource-intensive RLM. The future is now less about building more powerful reasoners and more about building the wisdom – both in our systems and in ourselves – to know when to "think fast" and when to "think slow." This is the true frontier that Reasoning Models have opened up. It’s sad that Daniel Kahneman isn’t here; he might have appreciated the irony. For years, System 1 and System 2 were metaphors for human cognition; now, they’re becoming runtime decisions in AI workflows.
Sources and further reading
A taxonomy for next-generation reasoning models by Nathan Lambert (blog)
Artificial intelligence learns to reason by Melanie Mitchell (blog)
Andrej Karpathy about Reasoning models (twitter)
Reasoning Model (deepseek-reasoner) from DeepSeek API docs
Magistral (paper)
Phi-4-reasoning Technical Report (paper)
Qwen3 Technical Report (paper)
Gemini Thinking (docs)