This month is a treasure trove of hot new model releases. Chinese companies – MoonshotAI, Qwen, and Z.ai – have turned it into a battleground of the strongest agentic models ever. Not only do they mark the beginning of a new era where reasoning alone isn’t enough, they also keep this field open and accessible to everyone. Once again, we witness a huge moment where open technologies perform on par with, or even surpass, the closed models we’ve grown used to.
And, honestly, shame on Meta for backing off open source. In 2024, it was the “Linux of AI”. In 2025, it’s suddenly “be careful what we open”? Not a good vibe.
Anyway, it’s time for agentic innovation, and Chinese models won’t let you get bored.
We’ll start with Kimi K2 – the most talked-about model, then revisit DeepSeek-R1 – the solid baseline for reasoning models, and finally explore the freshest Qwen3, Qwen3-Coder, and the latest GLM-4.5. Join us for this fascinating breakdown.
In today’s episode, we will cover:
Kimi K2 - the Agentic Intelligence ambassador
Exclusive innovations (MuonClip optimizer, Synthetic data and rephrasing are the key, Self-critic – a special shift in reward modeling for open-ended tasks)
Kimi K2’s great achievement
DeepSeek-R1: The reasoning baseline
Qwen3 - the model with controllable thinking modes
Architecture and training strategies
Results of Qwen3-235B
What is Qwen3-Coder?
GLM-4.5 - Z.ai’s hottest release
How does GLM-4.5 work?
What GLM-4.5 can really do
Conclusion: Comparison of the models
Sources and further reading
Kimi K2: Architecture, Training & Agentic Capabilities
Kimi K2, released on July 12, is now the most talked-about massive MoE model, and it comes with a shift towards Agentic Intelligence. It is the result of Moonshot AI researchers’ proficiency and determination to build advanced AI technologies that prioritize lossless long context and personalization. Lossless long context means perfect, high-fidelity recall, giving the model full memory of entire conversations. AI-native products built on these principles can deliver highly customized user experiences without the need for traditional model fine-tuning.
And what about the innovations that Kimi K2 brings to the AI world? Well, it’s really one of the most important models of the year as it marks an agentic moment, similar to DeepSeek-R1’s reasoning moment. It quickly became a new baseline for agentic behavior thanks to the following technical decisions:
A specially built MuonClip optimizer that keeps training stable and makes it possible to train on a huge dataset of 15.5 trillion tokens.
A large-scale synthetic data pipeline that focuses on building agentic capabilities in Kimi K2.
The capability to learn from its own outputs on open-ended questions via Self-Critique Rubric Reward.
Let’s break it all down in order.
Kimi K2 Innovations: MuonClip, Synthetic Data, and Self-Critique
First, a little about Kimi K2’s architecture. It’s a MoE model scaled up to a massive 1.04 trillion total parameters, with only 32 billion of them active at any one time. It raises the sparsity level to 48 (384 total experts divided by the 8 activated per token) without increasing compute cost.
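To make these numbers concrete, here is a minimal Python sketch of top-k expert routing for a single token. The expert counts come from the text above; the softmax gating over the selected experts and the random router scores are our own simplifying assumptions, not Kimi K2’s actual router.

```python
import math
import random

NUM_EXPERTS = 384  # total experts, as quoted above
TOP_K = 8          # experts activated per token, as quoted above


def sparsity(num_experts: int, top_k: int) -> float:
    """Sparsity = total experts / activated experts."""
    return num_experts / top_k


def route_token(router_logits: list[float], top_k: int) -> dict[int, float]:
    """Select the top_k highest-scoring experts and softmax-normalize their gates.

    Assumption: softmax over the selected logits; real routers may differ.
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)
    weights = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


rng = random.Random(0)  # toy router scores for one token
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_token(logits, TOP_K)

print("sparsity:", sparsity(NUM_EXPERTS, TOP_K))
print("experts used for this token:", sorted(gates))
```

Only the 8 selected experts would run for this token, which is why total parameter count can grow while per-token compute stays roughly flat.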
Moonshot AI’s model uses Multi-head Latent Attention (MLA) and has a hidden size of 7168, with expert layers using 2048-dimensional hidden states. It also introduces a smart trade-off in attention head count to keep long-context inference practical (remember the lossless long-context goal). Kimi K2 uses 64 attention heads compared to DeepSeek-V3’s 128, for example. Fewer attention heads mean less attention computation per token, which makes long-context inference faster.
Now let’s explore the smart innovations we mentioned earlier.
MuonClip optimizer
We’ll start with the custom optimizer, MuonClip, which is used to train Kimi K2 from the very start.
It’s built on the Muon optimizer, which bets on efficiency: getting more learning per token. Muon uses RMS scaling, which adjusts the scale of updates across model parameters based on their root-mean-square (RMS) values to maintain consistent update magnitudes across layers. For a deep dive into the geometry behind Muon and manifold optimization, see our explainer on Modular Manifolds. Weight decay, which penalizes large weights during training, is applied in Muon in a way that is aware of RMS scaling, so regularization doesn’t overwhelm the learning signal.
In Kimi K2, Muon is combined with another smart technique, QK-Clip, to form MuonClip. QK-Clip prevents the model’s attention layers from becoming unstable. It works by monitoring for extremely large values in the attention logits and gently rescaling the weights responsible for those spikes, specifically the query and key projections used in attention.
Overall, MuonClip keeps attention logits under control: they are capped at around 100 early in training and gradually decrease over time, without hurting performance. This allows Kimi K2 to train faster and more reliably on 15.5 trillion tokens of training data, including web text, code, math and knowledge.
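The clipping idea can be sketched in a few lines of plain Python. This is a toy single-head version under our own assumptions: the threshold `TAU = 100` mirrors the level mentioned above, the rescaling factor is split evenly between the query and key weights, and all helper names are hypothetical. It is not Moonshot AI’s implementation.

```python
import math

TAU = 100.0  # logit cap; chosen to match the "around 100" level mentioned above


def project(weights, x):
    """Apply a projection matrix (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def max_attention_logit(w_q, w_k, tokens):
    """Largest scaled dot-product logit over all query/key pairs of one head."""
    queries = [project(w_q, t) for t in tokens]
    keys = [project(w_k, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(queries[0]))
    return max(scale * sum(a * b for a, b in zip(q, k))
               for q in queries for k in keys)


def qk_clip(w_q, w_k, observed_max, tau=TAU):
    """Shrink W_q and W_k only when the observed max logit exceeds tau.

    Assumption: the factor gamma = tau / observed_max is split evenly,
    so each matrix is multiplied by sqrt(gamma).
    """
    if observed_max <= tau:
        return w_q, w_k
    factor = math.sqrt(tau / observed_max)
    return ([[factor * w for w in row] for row in w_q],
            [[factor * w for w in row] for row in w_k])


tokens = [[1.0, 0.0], [0.0, 1.0]]   # two toy token embeddings
w_q = [[30.0, 0.0], [0.0, 30.0]]    # deliberately oversized weights
w_k = [[30.0, 0.0], [0.0, 30.0]]

before = max_attention_logit(w_q, w_k, tokens)
w_q, w_k = qk_clip(w_q, w_k, before)
after = max_attention_logit(w_q, w_k, tokens)
print(f"max logit before: {before:.1f}, after: {after:.1f}")
```

Because the logit is a product of a query and a key, scaling both projections by the square root of the factor scales the logit by the full factor, and weights that never trigger a spike are left untouched.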

Image Credit: Kimi K2 original paper
Here comes another question: how do you get such a huge amount of training data?
Synthetic data and rephrasing are the key
One of the big innovations in Kimi K2 over its predecessor is a data rephrasing strategy. Instead of repeating the same training examples, Kimi K2 uses synthetic rephrasing to create diverse versions of the content.
A controlled rephrasing pipeline generates diverse, knowledge-heavy data via:
Rewriting facts in different styles and perspectives, adding linguistic variety.
Chunk-based generation: Long documents are split into smaller parts, rephrased individually, and then reassembled.

Image Credit: Kimi K2 original paper
Fidelity checks: Each rephrased version is compared to the original to make sure the core meaning is preserved.
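The three steps above can be sketched as a small pipeline. Everything here is illustrative: the `rephrase` callable stands in for an LLM call, the word-overlap `fidelity` score is a crude proxy we made up for the example, and the thresholds are arbitrary.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split a long document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def fidelity(original: str, rewritten: str) -> float:
    """Share of the original's distinct words kept in the rewrite (crude proxy)."""
    orig = set(original.lower().split())
    new = set(rewritten.lower().split())
    return len(orig & new) / len(orig) if orig else 1.0


def rephrase_document(text: str, rephrase: Callable[[str], str],
                      max_words: int = 50, min_fidelity: float = 0.6) -> str:
    """Rephrase chunk by chunk, keep only faithful rewrites, then reassemble."""
    result = []
    for chunk in split_into_chunks(text, max_words):
        candidate = rephrase(chunk)
        # Fall back to the original chunk when the rewrite drifts too far.
        result.append(candidate if fidelity(chunk, candidate) >= min_fidelity else chunk)
    return " ".join(result)


SYNONYMS = {"big": "large", "fast": "quick"}  # toy stand-in for an LLM rephraser


def toy_rephrase(chunk: str) -> str:
    return " ".join(SYNONYMS.get(word, word) for word in chunk.split())


document = "the big model trains fast on web text code and math"
print(rephrase_document(document, toy_rephrase, max_words=4))
```

A real pipeline would use a semantic comparison rather than word overlap, but the control flow (split, rewrite, verify, reassemble) stays the same.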
To improve math reasoning, the model rewrites math texts into a structured “learning note” style which is more student-friendly and adds high-quality translations of tasks from other languages into English to increase variety.
This pipeline allows for better use of data that the model already has.
Moreover, Kimi K2 implements a flexible parallel training setup that mixes pipeline parallelism with virtual stages, 16-way expert parallelism, and ZeRO-1 data parallelism for memory efficiency. Thanks to this, training can adapt to GPU clusters of various sizes.
As for the agentic capabilities, Kimi K2 also has an agentic data pipeline. Moonshot AI built a data synthesis pipeline for tool use, in which they created over 20,000 virtual tools and thousands of agents that solve tasks using these tools, and then generated trajectories for each agent and task. Kimi K2 trains on this data and then also trains in real sandboxes like coding environments.

Image Credit: Kimi K2 original paper
Self-critic – a special shift in reward modeling for open-ended tasks
Kimi K2 is the model that (finally!) can easily evaluate open-ended task outputs during reinforcement learning.
The model employs a built-in “critic” (another copy of itself) that scores responses using rubrics − checklists for things like helpfulness, factuality, reasoning, and safety. These rubrics aren’t hardcoded, and the model learns to weigh them dynamically, adapting as it trains. This overall framework is called Self-Critique Rubric Reward. This is what you can rely on when you need to evaluate open-ended tasks like creative writing or reasoning.
What new opportunities does this open? K2 can self-compare its own responses and pick the better one even when no human or ground-truth label is available.
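Here is a rough sketch of that self-comparison step. The rubric names come from the text above; the weighted-average scoring, the `toy_critic` heuristic, and the function names are assumptions made purely for illustration. In the real system the critic is the model itself, and the weights are learned rather than fixed.

```python
RUBRICS = ("helpfulness", "factuality", "reasoning", "safety")


def rubric_score(scores, weights):
    """Weighted average of per-rubric scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights[r] for r in RUBRICS)
    return sum(weights[r] * scores[r] for r in RUBRICS) / total_weight


def pick_better(critic, prompt, response_a, response_b, weights):
    """Return the preferred response and the score margin, with no ground-truth label."""
    score_a = rubric_score(critic(prompt, response_a), weights)
    score_b = rubric_score(critic(prompt, response_b), weights)
    if score_a >= score_b:
        return response_a, score_a - score_b
    return response_b, score_b - score_a


def toy_critic(prompt, response):
    """Stand-in for the model-as-critic: a real critic would be an LLM call."""
    detailed = 1.0 if len(response.split()) >= 5 else 0.3
    return {"helpfulness": detailed, "factuality": 0.8,
            "reasoning": detailed, "safety": 1.0}


weights = {"helpfulness": 2.0, "factuality": 2.0, "reasoning": 1.0, "safety": 1.0}
best, margin = pick_better(
    toy_critic,
    "Summarize why MoE models are efficient.",
    "They are efficient.",
    "Only a few experts run per token, so compute stays low.",
    weights,
)
print(best, round(margin, 3))
```

The preferred response (or the margin between the two) can then serve as the reward signal during reinforcement learning.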
However, to keep the critic grounded and not to lose grip on reality, Kimi K2 uses a closed-loop refinement system. This tight feedback loop trains the critic on objective tasks where right answers are known, then applies that same critique to more subjective prompts, like summarization or open-ended Q&A.
Subjective and open-ended tasks were hard to handle before, but this new approach lets Kimi K2 scale reinforcement learning to them. This brings Kimi K2 closer to a real agent that can reason about its own behavior.
Putting all these improvements together, we can see what makes Kimi K2 a remarkable player in the agentic and reasoning race.
Kimi K2 Benchmarks: Coding, Reasoning & Tool Use
Kimi K2-Instruct ranks as the #1 open-source model and #5 overall on the LMSYS Arena leaderboard (July 2025). It consistently outperforms DeepSeek-R1, the previous most talked-about model, across competitive coding, tool use, reasoning, and safety. So a model that beats DeepSeek clearly marks another special “moment” in AI.
Kimi K2 is well ahead of DeepSeek-R1 in real-world software engineering and coding tasks. It also closes the gap with Claude 4 Opus and even beats Sonnet 4 on some of these tasks.
For tool-use tasks, Kimi K2 is best-in-class, outperforming GPT-4.1 and Claude Sonnet and demonstrating a 20–30 point lead over DeepSeek-R1.

Image Credit: Kimi K2 original paper
It also shows clear gains across most logic-heavy benchmarks.
In general capabilities, it’s the best among open models, with 92.7% on MMLU-Redux.
And, finally, long context is not a problem for K2. It handles up to 128K tokens with strong retention.

Image Credit: Kimi K2 original paper
These stunning results make the case that Kimi K2 is the most capable open-weight LLM to date, rivaling proprietary frontier models and setting new standards in real-world, agentic applications. But will this model maintain its lead?
Other determined Chinese players are moving forward with their own unique strategies − and it’s worth watching and analyzing how they do this. But first, let's take a quick look back at the main baseline where it all began.
DeepSeek-R1: How It Works & Why It's the Reasoning Baseline
In January 2025, DeepSeek-R1 set the benchmark for reasoning-first open-source AI by embracing deep, step-by-step thinking. The team experimented with reinforcement learning (RL) and scaling test-time compute to push models toward more structured and self-aware reasoning.
It began with DeepSeek-R1-Zero, trained entirely with RL and no labeled data. Using Group Relative Policy Optimization (GRPO), which is now recognized as one of the most effective policy optimization techniques, it reached 86.7% on AIME with majority voting, matching OpenAI’s o1 performance at the time. The model even showed reflective behaviors, “rethinking” steps mid-solution, which is a key sign of reasoning emerging during inference.
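Two ideas from this paragraph are easy to show in code: GRPO’s group-relative advantage, where each sampled answer is scored against the other answers to the same prompt instead of against a learned value model, and majority voting over sampled answers. This is a minimal sketch; the rewards and answers are made-up toy values, and the use of the population standard deviation is our assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def majority_vote(answers):
    """Return the most frequent final answer among sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


# Four sampled solutions to one prompt: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Five sampled final answers to one math problem (toy values).
sampled_answers = ["204", "204", "117", "204", "96"]

print([round(a, 3) for a in advantages])
print(majority_vote(sampled_answers))
```

Correct samples get a positive advantage and incorrect ones a negative advantage, so the policy is pushed toward whatever worked within the group.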
To clean up disorganized outputs, DeepSeek-R1 added cold-start fine-tuning on structured reasoning examples before applying RL. This improved alignment and readability, and made the model competitive with OpenAI’s o1-1217 on reasoning tasks.
Here is what it achieved at that very DeepSeek moment.

Image Credit: DeepSeek-R1 original paper
Finally, DeepSeek distilled these powerful abilities into smaller models based on Qwen and LLaMA. The distilled 7B model outperformed the best Qwen models at the time and set new records for open-source reasoning.
This is how the story of open-source reasoning models catching up with closed ones began. And here is how it continues with new models from China’s best AI companies that enhance reasoning with agentic capabilities.
Qwen3: Thinking Modes, Architecture & Multilingual Support
Qwen3 is the latest generation in the Qwen language model family from Alibaba’s Qwen Team. The models were released at the end of April, and the technical report followed on May 14. The researchers didn’t rest on their success with Qwen2.5 and built quite a big family of next-level models. They implemented the key paradigm of this year − the more reasoning tokens the model uses, the better it performs − but made it more controllable.
One of Qwen3’s key features is that a single model switches between quick responses and deeper reasoning – in other words, between chat and reasoning modes – based on the task. Users can control how much "thinking budget" the model should apply to the task to balance speed and quality.
Another strength of Qwen3 is democratization. It makes AI more globally accessible for everyone by supporting 119 languages and dialects (up from 29).
Architecture and training strategies
The Qwen3 model family has a wide range of sizes to suit different needs: from lightweight models like Qwen3-0.6B to the large-scale Qwen3-32B, along with two more advanced Mixture-of-Experts (MoE) models. The most interesting for us today is the largest one, Qwen3-235B, which has 235 billion total parameters but activates only 22 billion per token. The core architecture builds on what worked in previous versions of Qwen models:
Grouped Query Attention (groups of query heads share the same key and value heads, which shrinks the key-value cache)
Rotary Position Embeddings
SwiGLU activation
RMSNorm with pre-normalization (a technique where Root Mean Square Layer Normalization is applied before each transformer sub-layer)
Plus, it adds smart QK-Norm, which makes training more stable by normalizing the query and key vectors in the attention mechanism to control the magnitude of attention scores.
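A toy example shows why this helps. Below, the query and key of one attention head are RMS-normalized before the dot product. This is a sketch assuming RMSNorm is applied per head with a gain of 1; the vectors and function names are ours.

```python
import math


def rms_norm(vec, gain=None, eps=1e-6):
    """Divide a vector by its root-mean-square, then apply an optional gain."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    gain = gain or [1.0] * len(vec)
    return [g * v / rms for g, v in zip(gain, vec)]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for one query/key pair of a single head."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Deliberately large activations, as can appear when training drifts.
q = [100.0, -50.0, 25.0, 10.0]
k = [80.0, -40.0, 20.0, 8.0]

raw = attention_logit(q, k, use_qk_norm=False)
normed = attention_logit(q, k, use_qk_norm=True)
print(f"raw logit: {raw:.1f}, with QK-Norm: {normed:.3f}")
```

With unit gain, each normalized vector has length sqrt(d), so the logit can never exceed sqrt(d) in magnitude, no matter how large the raw activations get.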
The MoE models, such as Qwen3-235B, are divided into 128 experts, activating 8 at a time, without shared experts, which lets different parts of the model specialize.
Qwen3 implements Qwen’s tokenizer, designed to handle many languages and data types using a byte-level byte-pair encoding system with over 151,000 tokens. It breaks text into small parts based on how often they appear.
The Qwen3 models are trained on 36 trillion tokens covering 119 languages − double the data and triple the language coverage of Qwen2.5, using a carefully staged process: general knowledge first, then deep technical content like math and code, and finally long-context data.
Post-training adds reasoning skills, following a four-stage pipeline:
Long-CoT cold start teaches reasoning using filtered STEM/math/code tasks with verified answers and rejection sampling.
Reasoning RL improves long-chain reasoning using GRPO and entropy-controlled rollouts.
Thinking Mode Fusion unifies reasoning and fast-response modes into one model, with chat templates using /think and /no_think flags and thinking budget thresholds to manage reasoning depth.
General RL aligns models with user preferences across 20+ scenarios including instruction-following, tool use, format control, and RAG accuracy.

Image Credit: Qwen3 Technical Report
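To illustrate how the mode flags and the thinking budget from the Thinking Mode Fusion stage might be handled around the model, here is a small sketch. We assume the flags are written `/think` and `/no_think` and that the most recent flag in the conversation wins; both helper functions are hypothetical and do not reproduce Qwen3’s actual chat template.

```python
def resolve_mode(user_turns, default="think"):
    """Scan user turns in order; the most recent flag decides the mode."""
    mode = default
    for turn in user_turns:
        for token in turn.split():
            if token == "/think":
                mode = "think"
            elif token == "/no_think":
                mode = "no_think"
    return mode


def apply_thinking_budget(thinking_tokens, budget):
    """Cut the reasoning trace at the budget; report whether it was truncated."""
    if len(thinking_tokens) <= budget:
        return thinking_tokens, False
    return thinking_tokens[:budget], True


turns = ["Prove this inequality step by step /think",
         "Thanks. Now just give me the capital of France /no_think"]
print(resolve_mode(turns))

trace = ["step"] * 10  # pretend the model produced 10 reasoning tokens
kept, truncated = apply_thinking_budget(trace, budget=4)
print(len(kept), truncated)
```

A larger budget buys more reasoning at the cost of latency, which is exactly the speed-versus-quality dial described earlier.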
The team also uses a popular technique, strong-to-weak distillation, in which models like Qwen3-235B are used to improve smaller ones, which works better than training the small models with RL directly. Well, what about the performance of the best model in the Qwen3 family?
Results of Qwen3-235B
This giant shows leading results across 23 benchmarks, and it’s better to look at them from the perspective of thinking and non-thinking modes.
In thinking mode, it outperforms open DeepSeek-R1, the baseline of reasoning models, on 17 out of 23 tasks, especially in math, agent-based tasks, and coding. Moreover, in multi-step reasoning, Qwen3-235B narrows the gap between open and closed models, like OpenAI-o1, Grok-3-Beta (Think), and Gemini2.5-Pro.

Image Credit: Qwen3 Technical Report
In non-thinking mode, Qwen3-235B remains competitive even without explicit reasoning. It beats top open models like DeepSeek-V3 and LLaMA-4-Maverick, and even surpasses the closed GPT-4o in 18 out of 23 benchmarks.

Image Credit: Qwen3 Technical Report
Qwen3 demonstrates one of the best results on ThinkFollow, 98.9%, thanks to the fourth stage of post-training.
All models in the Qwen3 series are open-source under the Apache 2.0 license.
Recently, the Qwen Team presented another powerful giant, this time in the agentic coding area. So meet Qwen3-Coder.
What is Qwen3-Coder?
We’re in an age when just “coding” is not what researchers and practitioners are seeking. “Agentic coding” is now the focus for the AI industry. Qwen3-Coder fully embraces this trend with agentic capabilities like thinking, planning, and acting across complex real-world software tasks.
Its flagship variant, Qwen3-Coder-480B-A35B-Instruct (that’s why we’ve called it a giant), is a 480B MoE model with 35B active parameters per token. It supports a massive 256K context window, and up to 1M tokens with extrapolation. This lets the model operate at repository scale, enabling reasoning across thousands of lines of code in one shot. Codebases, pull requests, documentation − it’s what Qwen3-Coder can actually work with.
Qwen3-Coder is stunning because it is perfectly structured for action. Its Code RL setup for training focuses on execution-based learning, where solutions are validated automatically - “hard to solve, easy to verify”. Qwen Team also proposes long-horizon RL, or Agent RL, that teaches the model multi-turn planning and tool use. Thanks to a system that runs 20,000+ environments in parallel, Qwen3-Coder can handle everything from planning features to fixing bugs and writing tests.
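The “hard to solve, easy to verify” idea boils down to a reward that runs the candidate code against test cases. Below is a deliberately simplified sketch: the function names and test cases are hypothetical, and it calls `exec` directly, which is unsafe for untrusted code. A real training system would run candidates in isolated sandboxes.

```python
def execution_reward(candidate_source, func_name, test_cases):
    """Fraction of test cases a generated function passes (0.0 if it won't load)."""
    namespace = {}
    try:
        # WARNING: exec on untrusted code is unsafe; sandbox this in practice.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(test_cases)


TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"

print(execution_reward(GOOD, "add", TESTS))
print(execution_reward(BAD, "add", TESTS))
```

No human label is needed: writing the code is the hard part, while checking it is a cheap, automatic pass/fail signal that scales to thousands of parallel environments.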
All these strategies make Qwen3-Coder’s performance rival even Claude Sonnet 4 (one of the most popular coding models), especially on complex tasks like SWE-Bench Verified.
Leading in agentic coding, browser-based tasks, tool use − this is what defines the successful coding agent now. Plus, don’t forget about openness. Qwen3-Coder has all these advantages. But is it so unbeatable?
GLM-4.5 by Z.ai: Architecture and Agentic Capabilities
On July 28, Z.ai, formerly Zhipu, launched the GLM-4.5 model, which is a milestone for its creators in three important respects:
It’s their first open model with a MoE architecture.
It’s their first model with agentic capabilities built right into the architecture.
It unifies these agentic capabilities with reasoning and coding, to be a universal player for all your needs.
Even OpenAI acknowledges that Z.ai can build competitive models that rival the top ones.
With GLM-4.5, the researchers aim to integrate more general intelligent capabilities without losing existing ones.
As Zhang Peng, CEO of Z.ai, said in a press release: “The first‑principles approach to measuring AGI is to integrate more general intelligent capabilities without losing existing ones. GLM‑4.5 is our first complete realization of this concept.”
Zhang Peng has been obsessed with AGI for many years. Here is how they are building it and what they actually demonstrated with the newest GLM-4.5.
How does GLM-4.5 work?
As we’ve already mentioned, GLM-4.5 uses a MoE architecture. But while Kimi K2 goes wider, GLM-4.5 goes deeper with more layers, which improves reasoning, speed and efficiency. It also uses Grouped-Query Attention (GQA) with partial RoPE and increases the number of attention heads to 96 for a hidden size of 5120 – about 2.5 times more heads than typical configurations. This boosts performance on tasks like MMLU and BBH, even though it does not lower the training loss.
To optimize the workflow, it uses:
Muon optimizer that trains faster with bigger batches.
QK-Norm which keeps attention calculations stable.
MTP (Multi-Token Prediction) Layer that helps with faster speculative decoding during inference.
After pre-training on 15T tokens of general data and 7T tokens of code and reasoning data, followed by fine-tuning on medium-sized domain-specific datasets, GLM-4.5 is refined using a purpose-built reinforcement learning (RL) infrastructure called slime to strengthen its agentic capabilities.
slime’s hybrid architecture supports both synchronous training (useful for general reasoning) and asynchronous training, which decouples data collection from model updates. This is especially useful because in agentic tasks external tools or APIs slow down data generation. With rollout engines and training engines operating on separate GPU hardware, slime maintains high GPU utilization and training throughput. It also accelerates rollouts with mixed precision, implementing FP8 for fast, low-memory inference during data collection and BF16 for stable model training.

Image Credits: slime, “GLM-4.5: Reasoning, Coding, and Agentic Abilities” blog post
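The decoupling of data collection from model updates can be illustrated with a tiny asyncio sketch. This is not slime’s API: the worker and trainer functions, the queue size, and the simulated latencies are all our own assumptions. It only shows the shape of the idea, in which slow rollouts never block the trainer from consuming whatever data is ready.

```python
import asyncio


async def rollout_worker(worker_id, queue, num_episodes, tool_latency):
    """Generate trajectories; the sleep stands in for slow tool or API calls."""
    for episode in range(num_episodes):
        await asyncio.sleep(tool_latency)
        await queue.put((worker_id, episode))


async def trainer(queue, total, batch_size):
    """Consume trajectories as they arrive and group them into training batches."""
    batches, batch = [], []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) == batch_size:
            batches.append(batch)  # a real trainer would run an update here
            batch = []
    if batch:
        batches.append(batch)
    return batches


async def main():
    queue = asyncio.Queue(maxsize=8)  # buffer between rollout and training sides
    workers = [
        asyncio.create_task(rollout_worker(i, queue, 4, 0.01 * (i + 1)))
        for i in range(3)  # three workers with different simulated latencies
    ]
    batches = await trainer(queue, total=12, batch_size=4)
    await asyncio.gather(*workers)
    return batches


batches = asyncio.run(main())
print(len(batches), "batches;", "first batch:", batches[0])
```

Fast workers fill the queue while slow ones are still waiting on their tools, so batches mix trajectories from different workers instead of waiting for the slowest one.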
Then GLM-4.5 undergoes a post-training phase that combines supervised fine-tuning on reasoning, agentic and general scenarios with RL. Agentic training is focused on information-seeking and software engineering.
Though the training targets specific domains, the skills gained generalize widely and are distilled into a single GLM-4.5 model.

Image Credits: “GLM-4.5: Reasoning, Coding, and Agentic Abilities” blog post
Z.ai offers 2 model variants of its flagship model:
GLM-4.5 with 355 billion total and 32 billion active parameters, and
GLM-4.5-Air with 106 billion total and 12 billion active parameters.
They both come with a feature used in today’s top reasoning models − they can switch between a thinking mode for complex reasoning and tool use, and a non-thinking mode for instant responses.
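Since every MoE model in this article is described by a total and an active parameter count, a quick calculation puts them side by side. The figures are the ones quoted in this article (in billions); the script itself is just arithmetic.

```python
# (total parameters, active parameters per token), in billions, as quoted in this article
MODELS = {
    "Kimi K2": (1040, 32),
    "Qwen3-235B": (235, 22),
    "Qwen3-Coder-480B-A35B": (480, 35),
    "GLM-4.5": (355, 32),
    "GLM-4.5-Air": (106, 12),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters used for each token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name:<24}{total_b:>6}B total {active_b:>4}B active {share:7.1%}")

sparsest = min(MODELS, key=lambda name: active_fraction(*MODELS[name]))
print("lowest active share:", sparsest)
```

All five models activate only a small fraction of their weights per token, with Kimi K2 being the sparsest of the group by this measure.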
GLM-4.5 supports 128k context and native function calling. So what can it actually do?
GLM-4.5 Capabilities: Tool Use, Coding & Agentic Tasks
Across 12 benchmarks covering agentic, reasoning, and coding tasks, GLM-4.5 ranks #3 overall and GLM-4.5-Air ranks #6. More precisely, GLM-4.5:
Matches or exceeds Claude Sonnet in real-world agent use, and is ahead of Kimi K2, Qwen3, DeepSeek-R1, GPT-4.1, and Claude Opus in tool usage and web reasoning, with:
BFCL v3 (Function Calling): 77.8%, which is higher than Kimi K2 (71.1%), DeepSeek-R1 (63.8%), and Qwen3 (71.9%).
BrowseComp (Web Use): 26.4% (vs. 18.8% for Claude Opus and 28.3% for o4-mini-high).
Notably, its tool use success rate is 90.6%, compared to 89.5% for Claude Sonnet and, among its Chinese competitors, 86.2% for Kimi K2 and 77.1% for Qwen3-Coder. It also demonstrates high effectiveness.

Image Credits: “GLM-4.5: Reasoning, Coding, and Agentic Abilities” blog post
As for advanced reasoning and math-heavy tasks, GLM-4.5 is strong but slightly behind Qwen3 and DeepSeek-R1.

Image Credits: “GLM-4.5: Reasoning, Coding, and Agentic Abilities” blog post
In agentic coding tasks, GLM-4.5 wins 53.9% of head-to-head comparisons against Kimi K2 and dominates Qwen3-Coder with an 80.8% success rate.
Just as important, GLM-4.5 excels at full-stack development (frontend + backend), complex artifact generation (games, simulations, HTML/SVG/Python), and slide and poster creation using web and image tools. It is available on Z.ai, through the API, and for local deployment via HuggingFace, ModelScope, and SGLang, and (this is so convenient) it can be integrated into coding agent frameworks like Claude Code.
Overall, the image below will help us build a final comparison of the four models that we’re looking at today.

Image Credits: “GLM-4.5: Reasoning, Coding, and Agentic Abilities” blog post
Kimi K2 vs Qwen3 vs DeepSeek-R1 vs GLM-4.5: Which Is Better?
Today, we have explored four innovative, powerful Chinese models: Kimi K2, DeepSeek-R1, Qwen3 plus Qwen3-Coder, and GLM-4.5. Here is a summary of their strengths and how they compete with each other.

And where does each of them excel most? Depends on what you need from the model:
If you want a well-rounded, strong open base with agentic plus long-context strength, pick Kimi K2.
Choose DeepSeek-R1 if reasoning accuracy is all you care about, and agentic capabilities are not your priority.
Qwen3 is the best if you need control, multilingualism, and switching between thinking and non-thinking modes.
Use Qwen3-Coder for repo-scale coding with agentic behavior. It is really powerful!
And finally, if you want the most tool-savvy, agent-native model out there today, choose GLM-4.5.
This is a cool month with cool releases, and we’re curious to see what everyone can build on these powerful open technologies.
Sources and further reading
Qwen3-Coder: Agentic Coding in the World (blog post)
Chinese Progress at the Front by OpenAI Global Affairs
Open Source AI is the Path Forward by Mark Zuckerberg (2024)
Personal Superintelligence by Mark Zuckerberg (2025)