Finally, we’re getting to the discussion of Thinking Machines Lab, an AI startup founded by Mira Murati, ex-CTO of OpenAI. It’s the team that has impressed everyone with their hands-on research and clever tools like Tinker, their API for fine-tuning models. Their studies on LoRA and nondeterminism in LLMs lift the curtain on questions that have puzzled many developers, researchers, and users. They also raised about $2 billion at a valuation of $12 billion, even though they had not yet launched a full commercial product. But their blog is excellent ;) Today, we’ll focus on what we believe truly deserves special attention and a closer look – modular manifolds.
Modular manifolds step into the area of geometry-aware optimization for neural networks. They treat each network layer as a geometric module and transform the entire neural network into a coordinated geometric system, showing how geometry, sensitivity to updates, and learning dynamics interact across layers.
Thinking Machines gathered the foundations of earlier research – including its team’s own studies and those from the broader community – to show that this approach can make optimization more stable and consistent across the model.
To give you the full picture of this emerging direction in optimization, we'll walk through key concepts such as weights, gradients, norms, modular norms, manifolds, modular manifolds, and modular duality to understand how it all works at both the single-layer and full-system levels. This is something that can breathe new life into current optimizers. And after all, it’s just beautiful.
In today’s episode, we will cover:
What are weights, activations, and gradients and their normalization?
The idea of manifold optimization
What is Manifold Muon?
Modular Manifolds: Network-wide optimization
Advantages of modular manifolds
Not without limitations
Conclusion
Sources and further reading
What are weights, activations, and gradients and their normalization?
~Our audience is very mixed – from researchers with deep knowledge to those who are just starting out – so we’re offering something for every level. If you already know the basics below, feel free to skip this part. If you’d like a quick refresh, read on~
When training neural networks, we need this process to be stable, predictable and under control. This means keeping all the numbers inside the networks’ internal processes – the weights, activations, and gradients – from getting too big or too small. Let’s see what these terms mean:
Weights – numerical parameters that define the strength of connections between neurons in a neural network, or, simply, the matrices that connect layers. They influence how much one neuron’s output affects the next one. In an image model, weights identify which patterns (edges, colors, shapes) are important; in a language model – which words or concepts are related and how strongly. During training, the model adjusts the weights to reduce its errors.
Activations – while weights determine how strongly inputs are connected, activations represent how much each neuron “fires” after processing those inputs. Each neuron takes some inputs, multiplies them by weights, adds a bias, and then passes the result through an activation function (like ReLU, sigmoid, or tanh), which determines if and how strongly the neuron passes a signal to the next layer. The number that comes out of this step is the activation. Activation functions introduce non-linearity, allowing the model to learn complex patterns.
Gradients – vectors that tell the model how to adjust its parameters (weights) to reduce its error, measured by a loss function. Gradients are computed by backpropagation, which applies the chain rule backward through the layers; an optimizer such as gradient descent then uses them to update the weights.
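To make the three terms concrete, here is a minimal NumPy sketch of one tiny layer. The layer sizes, the input, the target, the squared-error loss, and the learning rate are all illustrative assumptions, not something taken from the Thinking Machines post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights and bias of a tiny layer: 3 inputs -> 2 neurons (sizes are arbitrary).
W = rng.normal(size=(2, 3))
b = np.zeros(2)
x = np.array([1.0, -2.0, 0.5])      # one input example
target = np.array([1.0, 0.0])       # made-up target for illustration

def forward(W, b, x):
    z = W @ x + b                   # weighted sum plus bias
    a = np.maximum(z, 0.0)          # ReLU turns z into the activation
    return z, a

def loss_fn(a, target):
    return 0.5 * np.sum((a - target) ** 2)

# Backward pass: the chain rule gives the gradient of the loss w.r.t. W and b.
z, a = forward(W, b, x)
dz = (a - target) * (z > 0)         # ReLU passes gradient only where z > 0
grad_W = np.outer(dz, x)
grad_b = dz

# Update step: move the weights a small distance against the gradient.
lr = 0.1
W_new = W - lr * grad_W
b_new = b - lr * grad_b

loss_before = loss_fn(a, target)
loss_after = loss_fn(forward(W_new, b_new, x)[1], target)
```

With this small learning rate, `loss_after` is never larger than `loss_before`, which is the whole point of following the gradient.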
To keep values from drifting to extreme scales, researchers usually use normalization – rescaling values so they stay within a reasonable range (for example, keeping activations around a mean of 0 and a standard deviation of 1). It is commonly applied to activations and gradients, but less commonly to weight matrices.
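As a quick illustration, the "mean 0, standard deviation 1" rescaling can be sketched in a few lines of NumPy. The batch shape and the made-up data are assumptions chosen only to show the effect; real normalization layers add learned scale and shift parameters on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 64 activation vectors with an awkward scale and offset (made-up data).
activations = 50.0 + 10.0 * rng.normal(size=(64, 8))

def standardize(a, eps=1e-5):
    """Rescale each feature to mean 0 and standard deviation 1 over the batch."""
    mean = a.mean(axis=0, keepdims=True)
    std = a.std(axis=0, keepdims=True)
    return (a - mean) / (std + eps)   # eps guards against division by zero

normalized = standardize(activations)
```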
So Thinking Machines Lab introduced an interesting idea to keep weight matrices under control: constraining them to certain structured, meaningful spaces or “manifolds” during training. This approach blends geometry and optimization, leading to more stable training, better numerical properties, and a new way to co-design optimizers with the spaces they operate in.
Let’s take a closer look at this new approach to optimization. Get ready for some geometry :)
The idea of manifold optimization
First, we need to clarify the main terms. A manifold is a curved space or surface that looks flat if you zoom in close enough. In the real world, the best example is the surface of our planet Earth. In math and AI, manifolds describe spaces where data or model parameters “live” under specific constraints. For example:
The surface of a sphere is a manifold where every point has the same distance from the center.
The Stiefel manifold is the space of matrices whose columns are all orthogonal and have unit length, often used to keep weight matrices structured and numerically stable.

Image Credit: Computing the Riemannian Logarithm on the Stiefel Manifold: Metrics, Methods, and Performance
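Both examples can be checked numerically. The sketch below builds one point on each manifold and tests the defining constraint; the helper names `is_on_sphere` and `is_on_stiefel` are hypothetical, and the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A point on the unit sphere: any nonzero vector divided by its length.
v = rng.normal(size=5)
on_sphere = v / np.linalg.norm(v)

# A point on the Stiefel manifold: the Q factor of a QR decomposition
# of a tall matrix has orthonormal columns.
A = rng.normal(size=(6, 3))
Q, _ = np.linalg.qr(A)              # Q has shape (6, 3)

def is_on_sphere(p, tol=1e-8):
    return abs(np.linalg.norm(p) - 1.0) < tol

def is_on_stiefel(M, tol=1e-8):
    # Columns are orthonormal exactly when M^T M is the identity.
    return bool(np.allclose(M.T @ M, np.eye(M.shape[1]), atol=tol))
```

Here `is_on_stiefel(Q)` holds while the raw random matrix `A` fails the check, which is why unconstrained training does not stay on the manifold by itself.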
So how do manifolds help with optimization?
Manifold optimization constrains model parameters to remain on a specific geometric surface (a manifold) rather than moving freely in flat Euclidean space. Instead of taking standard gradient steps and then projecting the parameters back, the optimizer moves within the manifold’s tangent space – a small, locally flat region that touches the manifold at the current point. It’s like a local playing field for taking tiny, careful steps. This ensures that each update stays consistent with the manifold’s geometry.

Image Credit: Jeremy Bernstein, "Modular Manifolds", Thinking Machines Lab: Connectionism, Sep 2025
After each step, the parameters are brought back onto the manifold through a retraction – a small correction that restores the constraint, for example, keeping unit norm or orthogonality. This approach lets the learning rate correspond closely to the actual step length in the manifold’s geometry – something that projection-based methods can’t guarantee.

Image Credit: Jeremy Bernstein, "Modular Manifolds", Thinking Machines Lab: Connectionism, Sep 2025
In general, the process of manifold optimization looks like this:
Find the best direction to move along the surface (inside the tangent space).
Take a small step, according to the chosen distance measure – often Euclidean, which gives round geometry, but others like the ℓ₁ norm (Manhattan), which gives diamond-shaped geometry, are also possible.
Retract: slide gently back onto the manifold’s surface.
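The three steps above can be sketched for the simplest case: a unit vector on a hypersphere with Euclidean distance. This is a minimal illustration under stated assumptions (a toy loss, a fixed learning rate, and a hypothetical function name `sphere_step`), not the reference implementation from the post.

```python
import numpy as np

def sphere_step(w, grad, lr):
    """One descent step for a unit vector w constrained to the hypersphere."""
    # 1. Direction: remove the radial part of the gradient so it lies in the tangent space.
    tangent = grad - np.dot(w, grad) * w
    norm = np.linalg.norm(tangent)
    if norm == 0.0:
        return w
    # 2. Step: move a Euclidean distance of exactly lr along the tangent direction.
    stepped = w - lr * tangent / norm
    # 3. Retract: rescale back to unit length.
    return stepped / np.linalg.norm(stepped)

# Toy problem (an assumption for illustration): minimize loss(w) = -w . target on the sphere.
target = np.array([0.0, 0.0, 1.0])
w = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    grad = -target                  # gradient of -w . target with respect to w
    w = sphere_step(w, grad, lr=0.05)
```

Because every step has length `lr` in the tangent space, the learning rate directly controls how far the point moves. With a fixed `lr`, the iterate ends up hovering within roughly one step of `target` instead of landing on it exactly.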
The choice of manifold and distance metric defines the optimizer’s behavior. For instance:
Using a Euclidean space gives standard gradient descent.
Using a hypersphere yields hyperspherical descent.
Constraining a matrix by its spectral norm gives Muon.
Using the Stiefel manifold with a spectral norm constraint gives Manifold Muon – an extension that maintains orthogonality and stability in large-scale neural network training. And it’s a good example for looking a little deeper into manifold optimizers.
Manifold Muon Optimizer: How It Works
To understand Manifold Muon, let’s start with what happens inside a neural network layer. A weight matrix 𝑊 takes an input vector 𝑥 and transforms it into an output 𝑦, so 𝑦=𝑊𝑥. Ideally, this transformation shouldn’t stretch or shrink the input too much, and small updates to 𝑊 shouldn’t cause big, unstable changes in 𝑦.
To understand how a matrix behaves like this, we use Singular Value Decomposition (SVD). SVD breaks any matrix 𝑀 into three parts: 𝑀 = 𝑈 Σ 𝑉⊤. Here, 𝑈 and 𝑉 are matrices with orthonormal columns (their columns are at right angles and have unit length), and Σ is a diagonal matrix containing singular values.

Image Credit: Jeremy Bernstein, "Modular Manifolds", Thinking Machines Lab: Connectionism, Sep 2025
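The decomposition is easy to verify with NumPy's `np.linalg.svd`. The random 5×3 matrix is an arbitrary stand-in for a weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))

# Reduced SVD: U is 5x3, S holds the 3 singular values (largest first), Vt is V transposed.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# The three factors multiply back to M.
reconstructed = U @ np.diag(S) @ Vt

# A unit vector is never stretched by more than the largest singular value
# or by less than the smallest one.
x = rng.normal(size=3)
x /= np.linalg.norm(x)
stretch = np.linalg.norm(M @ x)
```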
Singular values describe how much the matrix stretches or shrinks vectors along different directions. If all singular values are equal to 1, the matrix preserves the length of every input vector it transforms – it neither amplifies nor dampens the signal. This is exactly what we want from a stable weight matrix. So how does Manifold Muon help here?
Manifold Muon builds on the Muon optimizer, which already limits the size of weight updates using the spectral norm (the largest singular value of a matrix). In addition to this, Manifold Muon adds one more layer of structure: it constrains the weight updates to stay within the tangent space of the Stiefel manifold and retracts the weights back onto it, so the columns of matrix 𝑊 remain orthogonal with unit length.
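To show what "staying in the tangent space of the Stiefel manifold" means, here is a simplified sketch: project a gradient onto the tangent space, take a step, and retract. It is not the Manifold Muon algorithm itself, which also has to respect the spectral-norm budget and is more involved. The function names, step size, and random matrices are assumptions.

```python
import numpy as np

def project_to_tangent(W, G):
    """Project G onto the tangent space of the Stiefel manifold at W (where W^T W = I)."""
    WtG = W.T @ G
    return G - W @ (WtG + WtG.T) / 2.0

def retract(X):
    """Map X to the nearest matrix with orthonormal columns (its polar factor U V^T)."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.normal(size=(6, 3)))   # start on the manifold
G = rng.normal(size=(6, 3))                    # stand-in for a gradient

A = project_to_tangent(W, G)                   # satisfies A^T W + W^T A = 0
W_new = retract(W - 0.1 * A)
singular_values = np.linalg.svd(W_new, compute_uv=False)
```

After the retraction, every entry of `singular_values` is 1 up to rounding, so the updated matrix still has orthonormal columns.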
In experiments, Manifold Muon achieved higher training and test accuracy than AdamW on a small CIFAR-10 benchmark and kept singular values close to 1. Keeping these values near 1 means the model passes information cleanly between layers without excessive stretching or collapsing.

Image Credit: Jeremy Bernstein, "Modular Manifolds", Thinking Machines Lab: Connectionism, Sep 2025
(P.S. AdamW is an optimizer that adjusts each weight in a neural network with its own adaptive learning rate – like the Adam optimizer – while also applying a separate weight decay term to keep the weights from growing too large over time.)
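For readers who prefer code to words, one AdamW update can be sketched as follows. The hyperparameter defaults and the example numbers are illustrative assumptions; real training code would use a library implementation.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for parameters w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    adaptive = m_hat / (np.sqrt(v_hat) + eps)    # per-weight adaptive step
    w = w - lr * (adaptive + weight_decay * w)   # decay is kept separate from the gradient
    return w, m, v

w = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.5, -0.1, 0.0])
w1, m, v = adamw_step(w, grad, m, v, t=1)
```

On the first step the adaptive term is close to the sign of each gradient entry, so weights with very different gradient magnitudes move by almost the same amount.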
That was an essential theoretical part – now we’re moving to the main focus of the Thinking Machines Lab’s study →
Modular Manifolds: Network-wide optimization
What was explained above applies to individual layers of a model. However, real neural networks are built by stacking many layers together, each transforming its input and passing it forward. So if every layer follows its own geometric rule, how do those rules combine when we connect layers together?
This can be described by what is called the theory of modular manifolds.
Modular manifolds give us a way to extend geometric reasoning and manifold optimization from single layers to entire networks, showing how to budget learning rates across layers in a principled, consistent way.
Here is the core idea of how it works.
Each layer (or “module”) in a network can be described by three ingredients:
A forward function – how the layer maps input to output, for example, 𝑦 = 𝑊𝑥 for a linear layer.
A manifold constraint – the geometric surface its weights must stay on, like the Stiefel manifold, where columns remain orthogonal.
A norm – the way we measure the size of weight changes, for instance, the spectral norm tracks how much the matrix can stretch a vector.
This setup lets us reason about how sensitive the layer’s output is to changes in its weights – something we can measure through a Lipschitz constant (the maximum ratio between how much the output moves and how much the quantity being changed moves, measured in the chosen norm). Taken with respect to the input, it shows how strongly a layer can amplify or distort small input changes; taken with respect to the weights, it bounds how far a weight update can move the output. A layer that is 1-Lipschitz with respect to its weights (w) turns small weight changes into proportionally small output changes, making it predictable and stable.
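For a linear layer 𝑦 = 𝑊𝑥, this sensitivity bound can be checked directly: the output cannot move by more than the spectral norm of the weight update times the length of the input. The shapes and the size of the update below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
x = rng.normal(size=6)
delta_W = 0.01 * rng.normal(size=(4, 6))        # a small weight update

def spectral_norm(M):
    return np.linalg.norm(M, ord=2)             # largest singular value

# How far the output actually moved because of the weight update.
output_change = np.linalg.norm((W + delta_W) @ x - W @ x)

# Upper bound: spectral norm of the update times the length of the input.
bound = spectral_norm(delta_W) * np.linalg.norm(x)
```

This is why the spectral norm is a natural way to measure updates to a linear layer: capping it caps the worst-case change in the output.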
When we connect layers, modular manifolds define how their geometric and optimization rules combine:
The new module’s forward function is just one layer’s output feeding into the next. If two layers have forward functions 𝑓₁ and 𝑓₂, then their composition is:
𝑓₃((w₁, w₂), x) = 𝑓₂(w₂, 𝑓₁(w₁, x)).
A combined manifold constraint is like stacking geometric surfaces together:
M₃ = M₁ × M₂.
This new constraint is the Cartesian product, meaning it combines two manifolds by pairing every point of one with every point of the other, so in modular manifolds, it represents all possible joint configurations of two layers (like how a line and a circle form a cylinder).
And finally, the new norm combines the norms from both layers, scaled by coefficients that act like learning rate budgets across the network:
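∥(w₁, w₂)∥₃ = max(s₁∥w₁∥₁, s₂∥w₂∥₂),
where s₁ and s₂ are scaling coefficients, and ∥·∥₁, ∥·∥₂, and ∥·∥₃ denote the norms of the first module, the second module, and the combined module (not the ℓ₁ or ℓ₂ norms).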
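The three composition rules translate almost line by line into code. The sketch below is a toy illustration: the `Module` container, the `compose` function, and the `linear_module` helper are hypothetical names, and the scaling coefficients are picked arbitrarily.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Module:
    forward: Callable      # (weights, x) -> output
    on_manifold: Callable  # weights -> bool, the manifold constraint
    norm: Callable         # weights -> float, the size of weights or of an update

def compose(m1, m2, s1=1.0, s2=1.0):
    """Combine two modules; s1 and s2 are the scaling coefficients."""
    return Module(
        # Forward: the first module's output feeds the second module.
        forward=lambda w, x: m2.forward(w[1], m1.forward(w[0], x)),
        # Cartesian product: both constraints must hold at once.
        on_manifold=lambda w: bool(m1.on_manifold(w[0]) and m2.on_manifold(w[1])),
        # Max of the scaled per-module norms.
        norm=lambda w: max(s1 * m1.norm(w[0]), s2 * m2.norm(w[1])),
    )

def linear_module():
    # Hypothetical helper: a linear layer on the Stiefel manifold with the spectral norm.
    return Module(
        forward=lambda W, x: W @ x,
        on_manifold=lambda W: np.allclose(W.T @ W, np.eye(W.shape[1]), atol=1e-8),
        norm=lambda W: np.linalg.norm(W, ord=2),
    )

rng = np.random.default_rng(0)
W1, _ = np.linalg.qr(rng.normal(size=(5, 3)))   # maps R^3 -> R^5
W2, _ = np.linalg.qr(rng.normal(size=(7, 5)))   # maps R^5 -> R^7
two_layers = compose(linear_module(), linear_module(), s1=1.0, s2=2.0)

x = rng.normal(size=3)
y = two_layers.forward((W1, W2), x)
```

Since `compose` returns another `Module`, the result can be composed again, which is how the rules extend from two layers to a whole network.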
In practice, this means each layer still uses its own manifold optimizer, but its learning rate is adjusted depending on how sensitive that layer is in the context of the whole model.
Modular manifolds help us design optimizers that understand how layers interact, ensuring that updates in one part of the model don’t accidentally destabilize another.
The modular norm part deserves a little bit more attention, as it’s responsible for regulating the sensitivity of the weights. It’s a way to make training scale smoothly and stably. The modular norm provides a consistent way to measure and normalize weight updates across an entire architecture, combining the natural norms of individual layers into one global norm that reflects the structure of the whole network. It’s like giving each layer its own learning budget through mass parameters that control the relative learning rate of submodules.
The modular norm is defined recursively from the network’s architecture, with each layer (module) contributing its part, and it captures how sensitive the network’s output is to weight changes at each layer. It ensures that when manifolds are combined across layers, their sensitivities and learning rates remain properly balanced.
In practice, the modular norm is used to normalize weight updates inside any base optimizer (like Adam or SGD), yielding normed optimizers that are architecture-aware. This normalization makes the learning rate transferable across model sizes, so you can scale up a network’s width or depth without retuning learning rate schedules or adding extra correction factors.
This idea of modular normalization naturally leads to the next concept – modular dualization – which extends the same recursive framework to the gradient side of training. It defines how gradients should be mapped back into the weight space in a geometry-aware way.
Why is it called dualization? It comes from the notion that gradients are dual vectors – they belong to the dual space of the parameters, not the same space as the weights themselves.
Before subtracting a gradient from a weight, we should first map it through a duality map, which converts it from the dual space back into the “primal” space of the parameters. So modular dualization builds a recursive, geometry-aware duality map that ensures each gradient step happens in the correct metric space for the model’s architecture. By respecting this rule, modular dualization corrects the mismatch between how gradients are computed and how they should act in curved, multi-layered parameter spaces. Practically, this geometric correction improves both speed and scalability.
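For a single layer, a duality map answers one question: among all updates of unit size in the chosen norm, which one lines up best with the gradient? The sketch below shows the answer for three common norms. The function names are hypothetical, and the recursive, network-wide construction of modular dualization is not reproduced here.

```python
import numpy as np

def dualize_euclidean(g):
    """Unit-Euclidean-norm direction that best aligns with g."""
    return g / np.linalg.norm(g)

def dualize_max_abs(g):
    """Direction with every entry in [-1, 1] that best aligns with g (sign descent)."""
    return np.sign(g)

def dualize_spectral(G):
    """Unit-spectral-norm matrix that best aligns with G: set all singular values to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))          # stand-in for the gradient of one weight matrix
update = dualize_spectral(G)

# The alignment <G, update> equals the sum of G's singular values (the nuclear norm).
alignment = np.sum(G * update)
nuclear_norm = np.linalg.svd(G, compute_uv=False).sum()

lr = 0.1
W = rng.normal(size=(4, 3))
W_next = W - lr * update             # the step is taken along the dualized gradient
```

The spectral-norm case is the one the article associates with Muon. The Euclidean case recovers normalized gradient descent, which shows how the choice of norm changes the optimizer.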
Together, these three pieces – modular manifolds, modular norms, and modular dualization – form one unified theoretical and practical framework for geometry-aware optimization in neural networks, each responsible for its own part:
Modular manifolds – describe how geometry ties layers of the neural network together.
Modular norm – provides a consistent way to measure and normalize weight updates across this global manifold, guiding the optimizer how to move through that geometry safely and efficiently.
Modular dualization – applies the same modular framework to gradient updates, converting gradients to weight updates consistent with the modular norm and manifold geometry at every layer.
They turn neural networks into a coordinated geometric system, giving us a structured way to think about how geometry, sensitivity, and learning dynamics interact across layers.
Putting all the aspects together, let’s summarize the pros and cons of the modular manifolds concept.
Advantages of modular manifolds
Geometric consistency across layers: Modular manifolds keep each layer’s optimization aligned with the network’s overall geometry, ensuring stable weight dynamics across depth.
Built-in normalization and stability: Constraining weights on manifolds keeps them well-scaled, which helps guard against exploding or vanishing gradients.
Structured learning rate budgeting thanks to modular norm.
Implicit regularization that limits the degrees of freedom of weight updates, keeping optimization trajectories smooth and improving generalization.
The framework integrates with optimizers like Muon or AdamW, adding manifold-aware updates without changing the training loop.
And finally, modular manifolds offer a unified view of optimization, geometry, and model architecture design, where layers are treated as modular building blocks with shared mathematical rules.
As it is quite a fresh field, there are some drawbacks you should keep in mind.
Not without limitations
Manifold operations introduce extra matrix computations (like SVDs or matrix sign functions), which increases training cost.
It’s an open research area, and the approach hasn’t yet been fully proven on large-scale networks.
Picking the right manifold for each module may depend on architecture semantics.
Constrained weight spaces can interact unpredictably with mixed-precision or quantized training. Low-precision numbers can introduce rounding error, leading to less stable training.
Modular manifolds’ convergence properties, layer interactions, and behavior with stochastic gradients are still being explored.
Conclusion
The modular manifolds optimization concept is still in its early stages, but this approach has the potential to make deep learning systems not only faster and more reliable, but also more interpretable, turning the art of optimization into something closer to engineering.
By combining geometry, optimization, and modular design (the modular approach is actually one of the most interesting and promising directions in AI and machine learning right now), modular manifolds hint at a future where neural network training becomes as structured and principled as the architectures themselves.
Thinking Machines Lab encourages everyone to explore and contribute to the evolution of the multi-faceted field of modular manifolds. And it’s exciting when there is plenty of room to make real progress in how we optimize neural network models. Maybe your studies will lead to the next step in this direction?
Sources and further reading
Other Thinking Machines Lab studies and developments
Tinker (a training API)
Resources from Turing Post

