In July 2025 Kimi K2 from Moonshot AI was the most talked-about massive Mixture-of-Experts (MoE) model with the special focus on Agentic Intelligence. When we compared it to other Chinese open models back then, we noted: pick Kimi K2, if you want a well-rounded, strong open base with agentic plus long-context strength. Among DeepSeek-R1, Qwen3, and GLM-4.5, this one was the most universal.
And now, in November 2025, we have Kimi K2 Thinking – the newest and most capable version of Moonshot’s open-source thinking model. It’s a reasoning agent that solves problems step-by-step and uses a wide range of external tools, like a Python interpreter or web search. Remarkably, it can make 200–300 tool calls in sequence (!) and outperforms top models like GPT-5 and Claude Sonnet 4.5 (Thinking) on many benchmarks.
The model continues Moonshot's global vision of building the strongest agentic models with lossless long context. So let’s explore how Moonshot defines this strategy – and what K2 Thinking contributes to pushing the company one step ahead of other models on its path toward true agentic intelligence.
In today’s episode, we will cover:
Moonshot AI’s global strategy
What is Kimi K2 Thinking?
Tech spec
Architecture and features
Native INT4 quantization
What about Agentic performance
Not without limitations
Early cases of implementation
Conclusion
Sources and further reading
Moonshot AI’s Global Strategy
Before we’ll discuss what is new in Kimi K2 Thinking model, let’s look back at Moonshot AI’s global idea and see how their previous release – Kimi K2, and the fresh Kimi K2 Thinking fit into their strategy.
Moonshot AI’s strategy centers on the lossless long context. They aim to create models that can process and recall massive amounts of text without losing information or performance. It’s a path to making personalization and contextual understanding possible without fine-tuning, using a model’s memory of entire conversations.
Another main focus of Moonshot AI is (obviously) AGI. Yang Zhilin, the founder of Moonshot, outlined their roadmap toward AGI in three layers:
Layer 1: Scaling laws and next-token prediction. It’s the current industry standard.
Layer 2: Overcoming data and representation bottlenecks, enabling self-evolving systems that learn continuously.
Layer 3: Advanced capabilities, such as long-context reasoning, multi-step planning, multimodal understanding, and agentic behavior. Moonshot sees its opportunity to lead the field at this third layer.
The company decisively moves beyond static models toward AI agents that can plan, reason, use tools, and critique themselves – it’s what they also call “Agentic Intelligence.”
In July 2025, Kimi K2 became the cornerstone of Moonshot’s long-context and agentic AI vision, and the first to truly reflect the “Agentic Intelligence” concept. It is a massive 1.04 trillion-parameter Mixture-of-Experts (MoE) model trained on 15.5 trillion tokens that integrates multiple innovations:
The MuonClip optimizer for stable large-scale training.
A Self-Critique Rubric Reward system for self-evaluation on open-ended tasks.
A synthetic data pipeline that rephrases and diversifies knowledge sources.
For its agentic abilities, Kimi K2 trains on a large agentic data pipeline. Kimi Team built a system with 20,000 virtual tools and thousands of agents solving tasks through them and generating detailed agent trajectories.

Image Credit: Kimi K2 original paper
This is how Kimi K2 anchored Moonshot’s “lossless long-context” strategy and became a foundation for their AGI Layer 3. In July, it was the most capable open-weight LLM to date, that rivaled proprietary frontier models and set new standards in real-world, agentic applications.

Image Credit: Kimi K2 original paper
But now it’s time to shine for the Moonshot’s newest model – Kimi K2 Thinking, built on the Kimi K2’s solid agentic foundation with mixed large-scale training, improved tool-use and self-evaluation.
Kimi K2 Thinking is the company’s next step toward true agentic intelligence, extending Kimi K2 into a reasoning and tool-using “thinking agent,” that reasons, plans, and acts autonomously over hundreds of steps. So let’s unpack →
What is Kimi K2 Thinking?
Kimi K2 Thinking is an open-source “thinking” model, essentially an AI agent that can reason step-by-step and use tools like a calculator, code interpreter, or web browser. It can make 200–300 tool calls in sequence keeping consistency, and handles long reasoning chains and complex tasks without human help.
As an agentic model Kimi K2 Thinking plans, check its own work, and refines answers over hundreds of reasoning steps. It switches smoothly between steps like “thinking → searching → reading → coding → thinking again”. This helps it to plan long-term, adapt to new information, and build coherent answers.
What is especially remarkable, through Kimi K2 Thinking, Moonshot AI demonstrates their specific approach to test-time scaling – they improve intelligence by expanding both the amount of thinking (tokens) and the number of actions (tool calls) during inference.
For example, K2 Thinking’s Heavy Mode runs eight reasoning paths in parallel and merges them for more reliable results, similar to GPT-5 Pro’s configuration.
You can try Kimi K2 Thinking now on kimi.com in a lighter chat mode for speed, using fewer tools and shorter reasoning chains. The full agent mode, with autonomous tool use and full multi-step reasoning, will be available soon. The model is also accessible via the Kimi K2 Thinking API. Both the code and model weights are released under the Modified MIT License, allowing open use and research development – which is another cool part about Kimi K2 Thinking.
But before you start playing with it, let’s look what’s inside K2 Thinking. It’s quite impressive →
Tech Spec
Architecture and Features
K2 Thinking has a 1-trillion-parameter Mixture-of-Experts (MoE) architecture with only ≈32 B active parameters used per inference. The model is made up of 61 layers. One of these layers is a dense, or fully connected, layer — a layer where every node connects to every node from the previous layer to combine the learned features. Its hidden dimensions are 7168 for attention and 2048 per expert, supported by 64 attention heads. In total, it contains 384 experts, of which 8 are selected per token during inference. K2 Thinking employs an MLA (Multi-Head Latent Attention) mechanism and uses the SwiGLU activation function for efficient and stable training.
K2 Thinking was built for:
Long reasoning chains with 200–300 tool calls.
Large 256K context window that allows the model to handle very long reasoning chains and documents.
Efficient multi-expert activation.
It automatically identifies when and how to use each tool, for example, calling a weather API when asked. It does this via built-in tool parsing logic that enables fully automated “think → call tool → integrate result” cycles.
At the 1T-parameter scale every bit saved per weight dramatically reduces GPU memory footprint and bandwidth usage. So from a technical standpoint, K2 Thinking still needs to be optimized to generate extremely long reasoning chains without much memory use and slowing down.
That’s why Kimi Team applied a special technique that perfectly works for K2 Thinking's notable efficiency →
Native INT4 Quantization
Lower-precision formats are often used during inference to make large models faster and less memory-intensive. But why exactly INT4? INT4 provides the highest compression – roughly 4× smaller than FP16 – while retaining near-FP16 output quality. In this format, the model’s parameters are stored as 4-bit integers and processed on hardware optimized for integer arithmetic.
However, such aggressive quantization would cause accuracy loss. K2 Thinking prevents this using Quantization-Aware Training (QAT) to its Mixture-of-Experts layers – a weight-only quantization on these layers – and so makes the model robust to INT4 rounding errors:
QAT simulates 4-bit rounding noise during training, so the model learns to maintain accuracy even when quantized.
In this architecture, this is applied only to MoE weights, leaving activations at higher precision for stability.
As a result, this enables INT4 precision inference, using much less GPU memory and running twice as fast, with no accuracy loss.
All benchmark scores, that you'll see in the next section, were achieved under INT4 precision. But anyway, model checkpoints are stored in compressed-tensors format, which can be converted to higher-precision types like FP8 or BF16 if needed.
Kimi K2 Thinking benchmarks: HLE, BrowseComp, SWE-bench
Kimi K2 Thinking achieves state-of-the-art results on several benchmarks and in many difficult tasks. All tests used a temperature of 1.0 and a 256k context window (except SciCode). Reasoning tasks had token limits between 96k–128k, and results were averaged over multiple runs for stability. Here is the performance gains which Kimi K2 Thinking has demonstrated:
It performs on par with or better than other top “thinking agent” models, such as GPT-5 and Claude Sonnet 4.5 (Thinking), in reasoning, coding, and general problem-solving (and – again – it’s open source):
On Humanity’s Last Exam (HLE) 44.9% with tools and 51% in heavy mode vs. GPT-5’s 41.7% – this benchmark shows expert-level reasoning across 100+ subjects.

Image Credit: Introducing Kimi K2 Thinking blog post
60.2% on BrowseComp – a web-based reasoning result doubles the human baseline and beats GPT-5’s 54.9%.
71.3% on SWE-Bench Verified, 61.1% on SWE-Multilingual, 47.1% on Terminal-Bench – software engineering and coding.
Notably, K2 Thinking shines at front-end tasks (HTML, React, UI components), turning written prompts into working code and products. In “agentic” setups, it can act as a co-developer, building full projects from scratch.
The model solved a PhD-level math problem through 23 alternating steps of reasoning and computation. It’s about bringing together multiple capabilities – planning, reasoning, and adapting dynamically, while also using external tools like search or Python – to solve complex comprehensive problems.

Image Credit: Introducing Kimi K2 Thinking blog post
It also performed quite well on Simon Willison’s “Pelican on a bicycle” test :)

Image Credit: Simon Willison’s blog
As for the general capabilities, Kimi Team equipped their best model with creative writing skills for crafting expressive, human-like stories, poems, and scripts and preserving tones and styles better and with more depth.
Practical writing is K2 Thinking’s another strength, allowing it to covers every part of a complex prompt with structured reasoning, which perfectly works for academic or analytical writing, research and professional work. The team also built Kimi K2 Thinking to respond with empathy and nuance for thoughtful, human-sounding guidance.
Since the launch of Kimi K2 Thinking got a lot of attention, it’s also interesting to see what users and developers think about it. Let’s take a look at their honest reactions.
Kimi K2 Thinking vs GPT-5 and Claude: expert reactions
While there were many tweets full of excitement about Kimi K2 Thinking beating GPT-5 on HLE and its unique writing style, several interesting points from influential voices in AI got our attention.
Sebastian Raschka quickly shared an architecture comparison of Kimi K2 Thinking and DeepSeek R1, highlighting that K2 Thinking is an upgraded version of DeepSeek V3/R1 with modest architectural tweaks. Major progress, he says, seems to come from improved data and training recipes rather than model size. “More experts, fewer heads, and even more thinking!” – he posted on his Twitter.

Image Credit: Sebastian Raschka’s X
Other useful insights came from Nathan Lambert. In his Interconnects blog post, he noted that open models are closing the gap with closed systems like OpenAI’s and Anthropic’s – both in performance and quality. He also observed that Chinese labs release models much faster, sometimes within months, giving them an advantage in visibility and iteration speed.
Lambert added that Kimi is becoming a household name internationally, alongside other Chinese labs such as DeepSeek and Qwen.
Perhaps the most interesting point was about tool calling: he highlighted that the ability to perform many tool calls, once seen only in models like o3 or Grok 4, has now appeared in open-source form. This kind of “interleaved thinking” – alternating reasoning steps with tool use – mirrors Claude-style deep reasoning behavior.
However, Nathan Lambert also cautioned:
“This sort of behavior emerges naturally during RL training, particularly for information tasks, when the model needs to search to get the right answer. So this isn’t a huge deal technically.”
Now let’s summarize the general strengths and weaknesses of Kimi K2 Thinking.
Advantages and Limitations of Kimi K2 Thinking
In a few words, the main advantages that K2 Thinking has brought to the AI world are:
Deep thinking and tool orchestration: It is trained end-to-end to mix reasoning with tool use for autonomous workflows in research, coding, and writing.
Stable long-horizon reasoning: Many prior models weaken after 30–50 tool calls. K2 Thinking, in turn, keeps coherent goals across hundreds of steps with 200-300 tool calls.
Efficient architecture solution: Using INT4 precision for inference with Quantization-Aware Training (QAT) ensures 2x faster deployment and low memory use without quality loss.
Open-source accessibility (hosted on Hugging Face): Released with transparent weights and documentation, K2 Thinking provides researchers and developers a rare opportunity to study large-scale reasoning systems directly rather than through black-box APIs.
As for the limitations, there are several you should take into consideration:
Heavy computational demands: Even with INT4 quantization, K2 Thinking’s 1-trillion-parameter MoE architecture requires substantial GPU resources.
Hosting open-weight, tool-using models (with so many tools) adds extra engineering load for stability and latency control.
Quantization-Aware Training (QAT) minimizes accuracy loss, but INT4 quantization can still cause subtle degradation in precision-sensitive tasks.
Not all tasks benefit from much recursion 200–300 sequential tool calls.
Conclusion
Well, once again, a Chinese open-source model is having its moment – this time shining with agentic capabilities and an unprecedented number of tool calls. Moonshot AI keeps advancing its vision for the future of agentic models, positioning Kimi K2 Thinking as a next-level companion available to everyone. It makes us wonder: are they still chasing AGI, or shifting toward the more human-augmenting side of AI?
With models like Kimi K2 Thinking, open source now stands for the power of transparency and collective progress. We’re watching for Moonshot’s next move — likely within half a year — and the question is whether it will center on new training and optimization methods or an entirely new model architecture.
And the last one: these are developments in the world of LLMs – but will we see them applied to spatial intelligence anytime soon? Without that, the path to AGI, if it’s even achievable, remains unrealistic.
Sources and further reading
Introducing Kimi K2 Thinking (Kimi blog post)
Kimi-K2-Thinking (open model on Hugging Face)
5 Thoughts on Kimi K2 Thinking by Interconnects
From Turing Post:









