Quick answer: What is the Nemotron Coalition, and what is Nemotron 3?

The Nemotron Coalition is NVIDIA’s attempt to build open frontier AI models with a network of partner labs and product companies, including Mistral AI, Cursor, LangChain, Perplexity, Reflection AI, Sarvam AI, Black Forest Labs, and Thinking Machines Lab. Nemotron 3 is the technical foundation of that effort: an open-weight model family designed for agentic workloads, built with a hybrid Transformer + Mamba architecture, Mixture-of-Experts routing, multi-token prediction, and NVIDIA’s NVFP4 training stack. The bigger story is not just one model release, but NVIDIA trying to make open AI development happen on its compute, tooling, and ecosystem rails.

What this article explains:

  • Why NVIDIA is open-sourcing Nemotron 3 and the broader Nemotron development stack

  • How Nemotron 3 works under the hood: Mamba, MoE, LatentMoE, multi-token prediction, and NVFP4

  • What the Nemotron Coalition actually is, who contributes what, and where the power sits

  • What this means for open frontier models, sovereign AI, and NVIDIA’s long-term role in the AI ecosystem

For years, the AI race has looked like a set of parallel sprints – each lab running fast, but mostly alone. But what if some influential AI labs, with their skilled developers and researchers, chose not to compete but to team up and build frontier models together?

It may sound unbelievable, but this just happened with the Nemotron Coalition. NVIDIA created a global collaboration of leading AI companies to develop the Nemotron family of models, and it timed that announcement to coincide with open-sourcing Nemotron 3.

We hope everyone knows these names: Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam AI and Thinking Machines Lab. Together, they are pooling what usually stays locked inside: data, evaluation systems, research insights, and even compute. The goal is to create shared high-end foundation models that are stronger than anything any one of them could build alone, and then specialize them further. It is a common starting point, a kind of public infrastructure for AI, where progress compounds across the ecosystem. Anyone can take that foundation, adapt it, and build on top of it.

This topic is super interesting, because we’re witnessing something new not only on the tech stack side, but also in terms of developer collaboration. Today, we’re diving into what NVIDIA has actually built, how Nemotron 3 works, the real power dynamics behind the collaboration, and what all of this means for open AI’s future – things that you could have missed behind all other news from NVIDIA GTC.

Nemotron is not a model. It is our entire approach towards supporting an open ecosystem for artificial intelligence.

Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA

In today’s episode:

  • Why does NVIDIA build Nemotron 3 and make it open-source?

  • Nemotron 3 Under the Hood

    • Hybrid architecture: Transformer + Mamba

    • Mixture-of-Experts (MoE) and LatentMoE

    • Multi-token prediction

    • NVFP4 Precision: Acceleration is intelligence

    • Training stack

    • Summarizing design principles

  • The Nemotron Coalition: Who Builds What – and Who Holds the Power

  • What This Reveals

  • Conclusion

  • Sources and further reading

Why does NVIDIA build Nemotron 3 and make it open-source?

Nemotron 3 represents a very different approach from the closed-lab paradigm that has defined the last two years. First, NVIDIA is open-sourcing not just the model, but the entire development process: training data, including large-scale synthetic reasoning datasets; training recipes, including the pretraining and reinforcement learning setup; post-training pipelines; and tooling such as NeMo, NeMo RL, and NeMo Gym. Second, it serves as the starting point for joint projects with companies in the coalition, each contributing in its own domain. We’ll get into that later in the article, so stick with us.

Why is NVIDIA rewriting the rules of the game? There’s a clear rational incentive behind this.

Nemotron's first job is to make it possible for NVIDIA to continue to exist as a company.

Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA, in the Interconnects AI interview

NVIDIA builds accelerated computing – fast, specialized hardware. But to know what to accelerate, it needs to deeply understand AI workloads from the inside. You can't just survey Meta or Google and ask "what should we build next?" because that information is too expensive to derive and too closely held by competitors. So NVIDIA trains its own frontier models to answer critical hardware design questions, such as: What precision levels actually matter? How does a particular architecture influence chip design? What happens during training? Without building Nemotron, NVIDIA would be flying blind when designing next-generation hardware.

Another reason is that investing in open models is a compound bet on the expansion of the entire AI market. It is a long-term bet. Today, NVIDIA strongly emphasizes that it works with almost everyone: hyperscalers, tiny AI startups, legacy enterprise companies, countries, and governments worldwide. So every time AI scales in any direction, NVIDIA benefits.

But NVIDIA also benefits not only from the deployment of AI, but from the process of building it. And that is where it has found a distinctive role in the open ecosystem. Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA, highlighted one of the most underreported facts in AI during his GTC talk:

It's usually less than a third of the compute that goes towards building AI that is actually building the model itself. About two thirds – or more, maybe three quarters – of the compute is spent on experiments and synthetic data generation, and things around the process of building a model.

Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA

This is why NVIDIA is releasing everything: recipes, datasets, ablation studies, RL rollouts, not just weights. The part of the process that most organizations keep secret is exactly where NVIDIA believes it can contribute most to the open ecosystem (and make it bigger).

Now let’s break down how Nemotron 3 is built to understand what makes its architecture and tech stack unique and worth a closer look.

Nemotron 3 Under the Hood

The latest Nemotron 3 is less about pushing raw model intelligence and more about resolving a systems-level bottleneck that has emerged with agentic AI. Multi-agent pipelines bring the problem of latency and of optimizing sustained reasoning over long, evolving contexts. In these systems, context scales nonlinearly. Each agent interaction reintroduces prior state, tool outputs, and intermediate reasoning, leading to sequences that are often an order of magnitude larger than standard conversational inputs. This inevitably raises cost, slows inference, and introduces instability, where agents may drift from their original objective. At the same time, every step in such pipelines requires reasoning, creating a “thinking tax”: large dense models are repeatedly invoked for small subtasks even though those subtasks don’t always require full model capacity. So on one side, we have the cost of maintaining a long, growing context, and on the other – the cost of repeated reasoning at every step.

Nemotron 3 can be seen as an architectural response to these two pressures. Technically, it introduces several design decisions that perfectly complement each other.

Hybrid architecture: Transformer + Mamba

Maybe the most interesting part is the hybridization of sequence modeling paradigms. The developers needed to add stronger long-context capabilities to the beloved transformers, which typically come with quadratic scaling penalties, and teach the model to sustain a 1M-token context window.

So at the sequence level, Nemotron 3 replaces most transformer attention with Mamba-2 state space layers. Unlike attention, which requires storing and attending over a growing KV cache (O(n) memory, O(n²) compute in training), Mamba maintains a fixed-size hidden state and performs sequence updates via linear recurrences. This shifts long-context processing from memory-bound attention operations to streaming state updates, which scale linearly with sequence length and are much more cache-efficient. Attention is still retained in a few layers to preserve global token interaction capacity, but it is no longer the dominant cost driver. The roles become well divided:

  • Transformer layers handle high-level reasoning and token interactions.

  • Mamba layers manage memory and sequence propagation across very long contexts.

Image Credit: NVIDIA Nemotron 3 original paper

This kind of design allows the model to scale to a context of ~1M tokens.
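To make the memory argument concrete, here is a minimal sketch, not NVIDIA's implementation. It streams the same tokens through a toy attention layer that appends to a KV cache and through a toy diagonal linear recurrence that keeps a fixed-size state. The function names, the state size, and the decay constant are illustrative assumptions; real Mamba-2 layers use input-dependent parameters and fused GPU kernels.

```python
# Toy comparison: memory held by an attention layer vs. a state-space layer
# while streaming tokens. All sizes and constants are illustrative assumptions.

def attention_stream(tokens):
    """Append every token to a KV cache; memory grows with sequence length."""
    kv_cache = []
    for x in tokens:
        kv_cache.append((x, x))  # (key, value) pair kept for every past token
    return len(kv_cache)         # number of cached entries grows as O(n)

def ssm_stream(tokens, state_size=4, decay=0.9):
    """Diagonal linear recurrence h_t = decay * h_{t-1} + x_t with a fixed-size state."""
    h = [0.0] * state_size
    for x in tokens:
        h = [decay * h_i + x for h_i in h]  # constant work and memory per token
    return len(h), h

tokens = [1.0] * 1000
cache_entries = attention_stream(tokens)
state_entries, final_state = ssm_stream(tokens)
print(cache_entries, state_entries)
```

After 1,000 tokens the cache holds 1,000 entries while the recurrent state still holds four numbers, which is the property that lets the Mamba layers carry very long contexts cheaply.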

Mixture-of-Experts (MoE) and LatentMoE

On top of this, the model uses a Mixture-of-Experts (MoE) design to decouple parameter count from compute. Each token is routed to a small subset of experts (top-K routing), so the active parameter count remains low while total model capacity grows.

For example, Nemotron 3 Super is a large model with 120B parameters, and with MoE it operates as a sparse system with only 12B active parameters per token. So you get both benefits: the representational richness of a large system and an inference cost closer to that of a much smaller model.
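As a rough illustration of top-K routing, the sketch below picks the K highest-scoring experts for one token, normalizes their router scores with a softmax, and runs only those experts. The scalar "experts", the logits, and the helper names are hypothetical; production routers work on batched tensors and add load-balancing terms that are omitted here.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest are skipped."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Eight toy experts, each a scalar function; only two run per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]
routes = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
```

Here two of eight experts do any work for the token, in the same spirit as the 12B-of-120B active-parameter ratio quoted above.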

The next innovative step is LatentMoE, which makes sparse MoE more efficient by doing routing in a smaller compressed space. Instead of sending full-size token representations (dimension d) to experts, the model first compresses them to a smaller dimension (ℓ < d). All expert routing and computation happens in this smaller space. This reduces:

  • memory cost because of smaller weight matrices,

  • communication cost since less data moves between experts.

Image Credit: NVIDIA Nemotron 3 original paper

This technical decision enables the use of four experts at a compute cost closer to a single expert.

LatentMoE is a strategic move on multiple fronts because the saved capacity is then used to increase the number of experts and activate more experts per token, so the model gets more expressive power, also meaning better accuracy, without increasing overall cost.
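A back-of-the-envelope count shows where the "four experts for the price of one" intuition can come from. The sketch assumes each expert is a two-matrix feed-forward block and that the latent dimension is a quarter of the model dimension; both the structure and the sizes are illustrative assumptions, not figures from the Nemotron 3 paper.

```python
def expert_params(in_dim, hidden_dim):
    """Weights in a two-matrix feed-forward expert: up-projection plus down-projection."""
    return 2 * in_dim * hidden_dim

d, latent, hidden = 4096, 1024, 8192   # hypothetical sizes; latent = d / 4

standard = expert_params(d, hidden)         # expert working on full-size tokens
compressed = expert_params(latent, hidden)  # expert working in the latent space
shared_proj = 2 * d * latent                # one down- and one up-projection shared by all experts

# How many latent-space experts fit in one standard expert's weight budget.
print(standard // compressed)  # prints 4
```

Under these assumptions the per-expert cost scales with ℓ/d, and the shared compression and expansion projections are a one-time overhead that is paid once rather than once per expert.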

Multi-token prediction

Inference optimization is a separate bottleneck and another efficiency layer comes from multi-token prediction (MTP). It shifts the generation process from strictly autoregressive (one token per forward pass) toward predicting short token spans in parallel. So once the weights are loaded into the GPU, you can use them more efficiently: instead of 1 prediction, you do 2–4 predictions in the same pass.

This improves the training signal, because the model learns to plan ahead. It also reduces the number of decoding steps and enables faster generation through speculative decoding, especially in long-form outputs. Nemotron 3 Super, for example, gets about 3× faster inference with fewer forward passes thanks to MTP. But the speedup holds only when those extra predicted tokens are correct.
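The dependence on correct guesses is easiest to see in the acceptance step of greedy speculative decoding, sketched below. The draft tokens stand in for what the extra prediction heads propose, and `verify_next` is a hypothetical stand-in for the main model's next-token choice. This is a simplified illustration, not the decoding loop NVIDIA ships.

```python
def accept_draft(draft_tokens, verify_next, context):
    """Keep the longest prefix of draft tokens that the verifier agrees with.

    verify_next(context) returns the token the main model would emit next.
    In a real system all draft positions are checked in one batched forward
    pass; this loop calls the verifier per position only for readability.
    """
    accepted = []
    for tok in draft_tokens:
        expected = verify_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            break
        accepted.append(tok)
    return accepted

# Toy "model": the next token is always the previous token plus one.
def verify_next(ctx):
    return ctx[-1] + 1

all_correct = accept_draft([4, 5, 6], verify_next, [1, 2, 3])
one_wrong = accept_draft([4, 9, 6], verify_next, [1, 2, 3])
print(all_correct, one_wrong)
```

When every draft token matches, three tokens are produced in one verification step. When the second guess is wrong, only two survive and the rest of the draft is discarded, so the gain shrinks as draft accuracy drops.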

NVFP4 Precision: Acceleration is intelligence

Nemotron models run on NVIDIA Blackwell GPUs using NVFP4 precision (4-bit floating point), which is used for both inference and training. NVIDIA reports that it is about 4× faster than FP8 on Hopper GPUs and uses much less memory, with no accuracy drop and a loss gap of under 1% versus BF16.

The precise figure is NVFP4 = 4.75 bits rather than a pure 4 bits, and those extra three-quarters of a bit matter for precision.

All major tensors (weights, activations, gradients) are quantized to this 4-bit floating point with hierarchical scaling (micro-block + global scaling). But to maintain stability, sensitive layers (attention projections, Mamba outputs) remain in higher precision, and stochastic rounding and input transforms are used to stabilize training.
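The idea of micro-block scaling can be sketched in a few lines. The example below snaps values to the magnitudes a 4-bit E2M1 float can represent, using one scale per block of 16 values. It is a simplified illustration under stated assumptions: scales are kept as ordinary Python floats, so the second-level global scale, the low-precision storage of the block scales, and stochastic rounding are all omitted, and the function names are hypothetical.

```python
# Positive magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(mag if v >= 0 else -mag)
    return codes, scale

def dequantize_block(codes, scale):
    """Map grid codes back to real values using the block's scale."""
    return [c * scale for c in codes]

def quantize_tensor(values, block_size=16):
    """Quantize in micro-blocks; each block carries its own scale (round trip shown)."""
    out = []
    for start in range(0, len(values), block_size):
        codes, scale = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(codes, scale))
    return out

# One block of small values followed by one block of large values.
values = [0.01 * i for i in range(16)] + [10.0 * i for i in range(16)]
restored = quantize_tensor(values)
```

Because each block has its own scale, the small values in the first block keep usable resolution even though the second block contains values a thousand times larger. With a single scale for the whole tensor they would all round to zero.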

And here comes the most interesting, even historic, part. As Bryan Catanzaro claims:

No one outside NVIDIA has ever pre-trained a model at this scale using four-bit math.

Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA

The developers found that the publicly released Nemotron pre-training dataset can achieve up to 4× faster convergence than standard open web datasets. This reflects NVIDIA’s core idea behind Nemotron’s design: “Acceleration is intelligence,” which follows the company’s philosophy, mentioned by Jensen Huang at his GTC keynote:

Energy is intelligence. Any watt that is wasted is going to make our AI dumber.

Jensen Huang

Training stack

It is also important to say a couple of words about the overall training stack. Nemotron 3 models are trained using multi-environment reinforcement learning, where different tasks (coding, math, tool use, long-context reasoning) are optimized simultaneously. This provides more stable behavior and better generalization across agent-like tasks.

NVIDIA also separates post-training into distinct stages – broad RLVR, dedicated software-engineering RL, and RLHF – because long-horizon agentic tasks are much slower and structurally different from standard reasoning workloads. The whole setup is also built around large-scale asynchronous RL infrastructure, which makes training across many environments and thousands of GPUs practical in the first place.
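One way to picture "optimized simultaneously" is a batch that mixes tasks from several environments, each scored by its own programmatic check. The sketch below is purely illustrative: the environments, prompts, reward functions, and stub policy are hypothetical, and it says nothing about how NeMo RL or NeMo Gym actually structure this.

```python
import random

# Hypothetical environments: each pairs a prompt with a programmatic reward check.
ENVIRONMENTS = {
    "math": {"prompt": "2 + 3 = ?",
             "reward": lambda a: float(a.strip() == "5")},
    "code": {"prompt": "Python literal for an empty list?",
             "reward": lambda a: float(a.strip() == "[]")},
    "tool_use": {"prompt": "Which tool fetches a URL: http_get or sqrt?",
                 "reward": lambda a: float(a.strip() == "http_get")},
}

def sample_mixed_batch(batch_size, rng):
    """Cycle through environments, then shuffle, so one batch covers every task type."""
    names = list(ENVIRONMENTS)
    batch = [names[i % len(names)] for i in range(batch_size)]
    rng.shuffle(batch)
    return batch

def collect_rollouts(policy, batch):
    """Run the policy on each sampled task and score it with that task's verifier."""
    rollouts = []
    for name in batch:
        env = ENVIRONMENTS[name]
        answer = policy(env["prompt"])
        rollouts.append({"env": name, "answer": answer,
                         "reward": env["reward"](answer)})
    return rollouts

def stub_policy(prompt):
    """Stand-in for the model: it only knows the math answer."""
    return "5" if "2 + 3" in prompt else "unknown"

rng = random.Random(0)
rollouts = collect_rollouts(stub_policy, sample_mixed_batch(12, rng))
mean_reward = sum(r["reward"] for r in rollouts) / len(rollouts)
```

A single update computed from such a batch gets feedback from every task type at once, which is the intuition behind training on many environments together instead of one after another.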

Summarizing design principles

If we sum up, we can see the detailed picture of where Nemotron’s speed and efficiency gains come from:

  • MoE → fewer active parameters; LatentMoE → cheaper experts, so you can use more of them

  • Mamba layers → efficient long-context processing

  • Multi-token prediction → more tokens predicted at once and fewer decoding steps

  • NVFP4 → faster hardware execution and accelerated pre-training

Combine these four aspects, and you get up to 5× higher throughput and up to 2× higher accuracy than the previous Nemotron version.

Image Credit: Nemotron 3 Super Technical Report

The models also support reasoning budget control – you can explicitly limit how many tokens the model spends “thinking,” regulating accuracy–speed trade-off. 

Across the members of Nemotron 3 family, the differences are mostly about where they sit in the cost–capability tradeoff:

  • Nemotron Nano (30B with 3B active parameters), released in December 2025, is optimized for throughput and low-cost tasks.

  • Nemotron Super (120B with 12B active parameters), released on March 11, 2026, is best suited for multi-agent coordination.

  • Nemotron Ultra (~500B) is built for deep reasoning workloads (coming soon).

Image Credit: Bryan Catanzaro talk at GTC

  • Nemotron Omni is a multimodal model, also coming soon.

  • Nemotron Voice Chat enables full duplex speech – a genuinely new paradigm for AI voice interaction. It was released in beta at NVIDIA GTC.

Nemotron 3 Super is the latest released model from the Nemotron 3 family, with open weights and full training transparency. It can be fine-tuned via NVIDIA NeMo and deployed across cloud, on-prem, or hybrid setups, packaged as an NVIDIA NIM microservice and supported by a broad ecosystem of cloud, inference, and enterprise platforms.

It is optimized for agent pipelines and, for example, it can act as:

  • A software agent, loading an entire repo into context, debugging and writing code end-to-end;

  • A research agent that reads thousands of pages and better maintains reasoning consistency;

  • A tool-using agent, delivering improved function calling and handling large tool libraries.

To be more concrete, here’s the scale of what NVIDIA has released alongside the Nemotron models:

  • 25 trillion pretraining tokens (including synthetic and curated data)

  • 40 million post-training samples

  • 37 reinforcement learning datasets

  • 21 RL environments

  • 1.2 million RL rollouts

All the Nemotron 3 models share the following design principles:

  • Faster models are smarter models. More speed means more pretraining tokens seen, more RL rounds, more reasoning cycles at deployment.

  • Design to keep production systems fully loaded. You need to maximize tokens per watt across a full data center, not just peak latency.

  • Design for a specific accelerated system. Every Nemotron model begins by identifying its exact deployment configuration first.

Image Credit: Bryan Catanzaro talk at GTC

The Nemotron Coalition: Who Builds What – and Who Holds the Power

Nemotron is not just about a frontier open model, but also about the distribution of power among labs working on a single project and merging high-level expertise from different areas.

Why is this happening now? Because frontier open models are becoming too expensive, too specialized, and too dependent on surrounding infrastructure for every company to build alone. NVIDIA proposes pooling the costly foundation work, then letting others build their own differentiated layers on top. That is consistent with the Nemotron strategy the company has been laying out publicly: open models, datasets, and tools for agentic AI, all framed around openness, specialization, and sovereign deployment. At GTC, NVIDIA made that strategy impossible to miss, putting open frontier models at the center of the conversation, including in Jensen Huang’s panel with major open-model players.

What makes this moment especially interesting is that NVIDIA is no longer just describing that strategy. It is now operationalizing it through a new collaborative model.

According to Bryan Catanzaro, the coalition is organized around self-contained projects, each with defined partners, goals, licensing, and contributions across data, compute, and expertise. The first project is pretraining a base model with Mistral. That is what will become the Nemotron 4 base. Later projects, including post-training, will involve more partners, each shaping the model for its own applications and needs.

So who is doing what (most likely)?

  • NVIDIA is supplying the compute, the umbrella, the branding, the political story, and probably the real gravitational pull. It provides compute and co-leads training.

  • Mistral AI looks like the actual flagship model-building partner.

  • Black Forest Labs is the provider of multimodal capabilities (images, video, visual intelligence).

  • LangChain brings expertise in agent systems (tool use, long-horizon reasoning, orchestration).

  • Cursor is responsible for evaluation datasets + real-world performance requirements. It is about making sure the model works well for developers in practice.

  • Perplexity focuses on making models useful at scale with applied AI systems + user-facing performance insights.

  • Sarvam AI contributes to the collaboration with multilingual, culturally adapted AI systems (voice-first, local languages, regional use cases).

  • Reflection AI is positioned around RL-driven post-training and reasoning, and Thinking Machines Lab around data, evaluation, and research collaboration (and fantastic blog writing, including their research on modular manifolds and geometry-aware optimization). In this coalition, though, both function as early capability bets and credibility signals rather than already proven pillars.

Here we see the distribution of forces: NVIDIA and Mistral build the engine, while others improve it from all sides. Maybe that will be reshuffled later.

One more interesting moment is this: is NVIDIA “rescuing” Reflection or Thinking Machines? Maybe not financially. Maybe not operationally. But it is giving them something very valuable: borrowed stature, immediate relevance, and a place in the narrative before the scoreboard is settled. In return, NVIDIA gets their aura. Mira Murati’s name and Reflection’s pedigree help make this feel like the center of gravity for open frontier AI, even if the real work is initially happening elsewhere.

What the Nemotron Coalition Reveals About Open AI

Firstly, this distribution of power reflects a fairly neutral observation that is increasingly supported by what we’re seeing in AI today: frontier model development is becoming modular. One company trains, another evaluates, another builds the harness, another supplies domain data, another handles multimodality, another localizes.

But secondly, NVIDIA wants to orchestrate that modular stack. They do not want OpenAI, Anthropic, Google, and maybe Meta to be the only story. This is NVIDIA assembling an alternate power bloc in public.

It’s fair to say that NVIDIA is turning “open AI” into a managed industrial zone, where everyone gets to keep their logo, give a quote about openness, and then plug into NVIDIA’s compute, roadmap, and distribution machine. They want “openness” without losing control. That is the key tension here. The release is full of words like transparency, collaboration, and sovereignty. But nothing in it suggests a democratic governance structure. It sounds open in output and centralized in coordination.

For those who listened to the Open Models session moderated by Jensen Huang at GTC, the picture was clear: we are all in Jensen’s living room, and it’s his rules we are playing by.

Nemotron Coalition: Ecosystem Strategy, Not Just a Model

Datasets, techniques, recipes, research findings, training infrastructure, synthetic data pipelines – all of that is Nemotron. The model is only the visible surface.

Bryan Catanzaro’s GTC talk makes clear that Nemotron is NVIDIA’s vertically integrated bet on the full AI value chain, from transistors to training tokens. The FP4 pretraining milestone, the coalition structure with Mistral, Black Forest Labs, Cursor, and others, the reported 4x dataset efficiency gain, and the LatentMoE design all point to something more ambitious than another open model release. NVIDIA is trying to build a shared foundation for high-quality open-weight AI, and, as CUDA already proved, it is comfortable playing a 20-year game.

Still, the coalition announcement is strategically powerful, not yet product-level astonishing. It becomes truly big news only if the resulting Nemotron 4 base model is strong enough to serve as a real open foundation for developers, and strong enough to seriously challenge the closed labs. Until then, this is a sophisticated piece of ecosystem architecture.

And one more thing: NVIDIA and its partners badly need better naming. Because if everyone in this coalition starts saying they are building “open AI,” that phrase will become even more confusing than it already is.

Sources and further reading

  • NVIDIA Nemotron 3: Efficient and Open Intelligence | paper

  • Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning | paper

  • NVIDIA Debuts Nemotron 3 Family of Open Models | Nvidia blog post

  • New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI | Nvidia blog post

  • NVIDIA Open Models (including Nemotron 3) | Models

  • NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | Nvidia blog post

  • Why NVIDIA builds their own open models | Nemotron w/ Bryan Catanzaro | Interconnects AI interview

