AI 101: What are Modular Manifolds?
How Thinking Machines Lab is redefining neural network optimization through geometry-awareness
Finally, we’re getting to the discussion of Thinking Machines Lab, an AI startup founded by Mira Murati, ex-CTO of OpenAI. It’s the team that has impressed everyone with their hands-on research and clever tools like Tinker, their API for fine-tuning models. Their studies on LoRA and nondeterminism in LLMs lift the curtain on questions that have puzzled many developers, researchers, and users. They also raised about $2 billion at a valuation of $12 billion, even though they had not yet launched a full commercial product. But their blog is excellent ;) Today, we’ll focus on what we believe truly deserves special attention and a closer look – modular manifolds.
Modular manifolds step into the area of geometry-aware optimization for neural networks. They treat each network layer as a geometric module and turn the entire neural network into a coordinated geometric system, showing how geometry, sensitivity to updates, and learning dynamics interact across layers.
Thinking Machines gathered the foundations of earlier research – including the team’s own studies and those from the broader community – to show that this approach can make optimization more stable and consistent across the model.
To give you the full picture of this emerging direction in optimization, we'll walk through key concepts such as weights, gradients, norms, modular norms, manifolds, modular manifolds, and modular duality to understand how it all works at both the single-layer and full-system levels. This is something that can breathe new life into current optimizers. And after all, it’s just beautiful.
In today’s episode, we will cover:
What are weights, activations, and gradients and their normalization?
The idea of manifold optimization
What is Manifold Muon?
Modular Manifolds: Network-wide optimization
Advantages of modular manifolds
Not without limitations
Conclusion
Sources and further reading
What are weights, activations, and gradients and their normalization?
~Our audience is very mixed – from researchers with deep knowledge to those who are just starting out – so we’re offering something for every level. If you already know the basics below, feel free to skip this part. If you’d like a quick refresh, read on~
When training neural networks, we need the process to be stable, predictable, and under control. This means keeping all the numbers inside the network’s internal computations – the weights, activations, and gradients – from getting too big or too small. Let’s see what these terms mean:
Weights – numerical parameters that define the strength of connections between neurons in a neural network, or, simply, the matrices that connect layers. They influence how much one neuron’s output affects the next one. In an image model, weights identify which patterns (edges, colors, shapes) are important; in a language model – which words or concepts are related and how strongly. During training, the model adjusts the weights to reduce its errors.
Activations – while weights determine how strongly inputs are connected, activations represent how much each neuron “fires” after processing those inputs. Each neuron takes some inputs, multiplies them by weights, adds a bias, and passes the result through an activation function (like ReLU, sigmoid, or tanh), which determines if and how strongly the neuron passes a signal to the next layer. The number that comes out of this step is the activation. Activation functions introduce non-linearity, allowing the model to learn complex patterns.
Gradients – vectors that tell the model how to adjust its parameters (weights) to reduce its error, measured by a loss function. Each gradient points in the direction that increases the loss fastest, so the model nudges the weights in the opposite direction. Computing these gradients layer by layer is called backpropagation; using them to update the weights is gradient descent.
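To make these three terms a bit more tangible, here is a tiny PyTorch sketch – our own illustration, not code from the Thinking Machines post – showing one linear layer with a ReLU activation, a toy loss, and a single gradient-descent update:

```python
import torch

# A single linear layer followed by a ReLU activation: y = relu(x @ W + b)
x = torch.randn(4, 3)                      # batch of 4 inputs, 3 features each
W = torch.randn(3, 2, requires_grad=True)  # weights: connect 3 inputs to 2 neurons
b = torch.zeros(2, requires_grad=True)     # biases

activations = torch.relu(x @ W + b)        # how strongly each neuron "fires"
loss = (activations - 1.0).pow(2).mean()   # toy loss: distance from a target of 1

loss.backward()                            # backpropagation computes the gradients
print(W.grad.shape)                        # gradient of the loss w.r.t. the weights: (3, 2)

# A plain gradient-descent step nudges the weights against the gradient to reduce the loss
with torch.no_grad():
    W -= 0.1 * W.grad
    b -= 0.1 * b.grad
```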
To keep these values at a sensible scale, researchers usually use normalization – rescaling them so they stay within a reasonable range (for example, keeping activations around a mean of 0 and a standard deviation of 1). It is commonly applied to activations and gradients, but less commonly to weight matrices.
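Here is roughly what normalizing activations to mean 0 and standard deviation 1 looks like in code – essentially the core of layer normalization, shown as an illustrative sketch:

```python
import torch

activations = torch.randn(4, 8) * 5 + 3           # pretend these drifted to a large scale
mean = activations.mean(dim=-1, keepdim=True)
std = activations.std(dim=-1, keepdim=True)
normalized = (activations - mean) / (std + 1e-5)  # roughly mean 0, std 1 per example

# This is the computation at the heart of layer normalization
# (torch.nn.LayerNorm(8) adds a learned scale and shift on top).
```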
So Thinking Machines Lab introduced an interesting idea to keep weight matrices under control: constraining them to certain structured, meaningful spaces or “manifolds” during training. This approach blends geometry and optimization, leading to more stable training, better numerical properties, and a new way to co-design optimizers with the spaces they operate in.
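Before we get to the geometry, here is a crude sketch of what “constraining a weight matrix during training” can mean in practice. This is a generic take-a-step-then-project pattern with a simple unit-norm constraint, purely for intuition – not Thinking Machines’ actual algorithm:

```python
import torch

W = torch.randn(3, 2, requires_grad=True)
with torch.no_grad():
    W /= W.norm()                     # start on the constraint set: ||W||_F = 1

for step in range(100):
    x = torch.randn(4, 3)
    loss = torch.relu(x @ W).mean()   # some toy loss
    loss.backward()
    with torch.no_grad():
        W -= 0.1 * W.grad             # ordinary gradient step (may leave the set)
        W /= W.norm()                 # project back onto the unit-norm constraint
    W.grad.zero_()
```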
Let’s take a closer look at how this new approach to optimization works. Get ready for some geometry :)
The idea of manifold optimization
Firstly, we need to clarify the main terms. A manifold is a curved space or surface that looks flat if you zoom in close enough. In the real world, the best example is the surface of our planet Earth. In math and AI, manifolds describe spaces where data or model parameters “live” under specific constraints. For example:
The surface of a sphere is a manifold where every point has the same distance from the center.
The Stiefel manifold is the space of matrices whose columns are all orthogonal and have unit length, often used to keep weight matrices structured and numerically stable.

Image Credit: Computing the Riemannian Logarithm on the Stiefel Manifold: Metrics, Methods, and Performance
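To make the Stiefel manifold concrete, here is a small sketch (again our own illustration, not from the original post): one standard way to send an arbitrary matrix to the nearest matrix with orthonormal columns is through its SVD.

```python
import torch

def project_to_stiefel(W: torch.Tensor) -> torch.Tensor:
    # Polar/SVD projection: replacing the singular values with 1s gives the
    # closest matrix (in Frobenius norm) whose columns are orthonormal.
    U, _, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ Vh

W = torch.randn(128, 64)              # a tall weight matrix (rows >= columns)
W_stiefel = project_to_stiefel(W)

# Columns are now orthogonal with unit length: W^T W ≈ I
print(torch.allclose(W_stiefel.T @ W_stiefel, torch.eye(64), atol=1e-4))
```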
So how do manifolds help with optimization?