
🎙️What AI Is Missing for Real Reasoning?

Axiom Math’s Carina Hong on how to build an AI mathematician

Is math the ultimate test for AI reasoning? Or is next-token prediction fundamentally incapable of discovering new truths and generating conjectures? Carina Hong, co-founder and CEO of Axiom Math, argues that to build true reasoning capabilities, we need to move beyond "chatty" models to systems that can verify their own work using formal logic.

They also just published a very interesting research paper, “Transformers know more than they can tell: Learning the Collatz sequence”. This work shows that a small transformer can actually learn the hidden control logic of a hard Collatz-style algorithm, not just fit patterns. It’s the first clean demonstration that a tiny transformer can infer and internalize the control flow of a genuinely hard mathematical process: not arithmetic tricks, not memorized mappings, but the loop structure itself. And when the model errs, its mistakes are structured and explainable – they expose the learned algorithm rather than random hallucinations. Collatz thus becomes a microscope, a rare mechanistic window into how transformers acquire real algorithmic reasoning. Fascinating.
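For readers unfamiliar with it, the Collatz map underlying the paper can be sketched in a few lines of Python (a minimal illustration of the sequence itself, not of the paper's experimental setup):

```python
def collatz_step(n: int) -> int:
    """One application of the Collatz map: halve if even, 3n+1 if odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def collatz_sequence(n: int) -> list[int]:
    """Iterate the map until reaching 1 (conjectured, but unproven, for all n >= 1)."""
    seq = [n]
    while n != 1:
        n = collatz_step(n)
        seq.append(n)
    return seq

print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

The conjecture – still open – is that every starting value eventually reaches 1; the paper's interest is in whether a transformer can internalize this looping control flow from examples.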

Subscribe to our YouTube channel, or listen to the interview on Spotify / Apple

In this episode of Inference, we get into:

  • Why current LLMs are like secretaries (good at retrieval) but bad at de novo mathematics

  • The three pillars of an AI Mathematician

  • How AlphaGeometry proved that symbolic logic and neural networks must merge

  • The difference between AGI and Superintelligence

  • Why "Theory Building" is harder to benchmark than the International Math Olympiad (IMO)

  • The scarcity of formal math data (Lean) compared to Python code

We also discuss the bottlenecks: the "chicken and egg" problem of auto-formalization, why Axiom bets on specific superintelligence over general models, and how AI will serve as the algorithmic pillar for the future of hard science.

Carina has a very clear vision and it was a pleasure talking to her. Watch the interview

This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.

We prepared a transcript (edited by Genspark Super Agent) for your convenience. But as always – please watch the full video, subscribe, like and leave your feedback. It helps us grow on YouTube and bring you more insights ⬇️

Ksenia Se: Hello everyone and welcome to Inference by Turing Post. I'm very excited to be joined today by Carina Hong. She's co-founder and CEO of Axiom Math, a fascinating new company that builds a full stack for machine-checkable mathematical reasoning by turning math into a programmable, verifiable language and using it to push towards self-improving AI, which is super hard. Welcome, Carina.

Carina Hong: Great to be here.

Why Can't AI Add?

Ksenia: You know what's jarring? I'll start with a simple question. Everyone is talking about AGI, and then I'm talking to my model and it says: "This meal is 400 calories, this meal is 350, the snack is 200 calories, and dinner is 350. Overall it's 800 calories, you're totally fine." And I'm just staring at it. Why is math so hard?

Carina: That's a great question. Current models are trained by next token prediction. That isn't exactly what's going on when you're doing mathematical reasoning. If you're a human mathematician, you don't give an answer just because other people say it's correct, right?

Currently, machines are also trained on human preferences. It's not exactly how human reasoning works. You can have very similar concepts and a totally different problem. So just looking at training data relating to that concept isn't going to help you.

We do see machines being good at some parts of math – for example, being almost like the secretary for a mathematician. Looking up citations, looking up what has shown up in an ancient paper or even in a German mathematician's notebook. That sort of retrieval and searching in the training data is something machines are good at. But they're really not great at coming up with de novo mathematical knowledge.

When it comes to arithmetic, like the examples you give, it's quite interesting that it gets that wrong or gets some basic linear algebra proof wrong. We also think it's reasonable to say that when the problem becomes very complex – when it's a multi-step reasoning problem, when the tree gets very deep – models are going to struggle even more.

Three Pillars of an AI Mathematician

Ksenia: But you are trying to build an AI mathematician that will be basically independent from a human, right? What are the steps that you need to make? What are the main bottlenecks? I know it's a lot, but give me an overview.

Carina: We believe there are some pillars that are important to the AI mathematician.

First, we want a prover system that can give mathematical proofs when it's asked a question and can do that very well. This is almost like the focus of some other formal math research work – to have a system that can come up with proofs. That's different from an informal model. Large language models give you just the numerical answer.

When we talk about math as a verifiable domain, what we want to see is not just the numerical answer – it's verifiable because a machine can grade that 957 is 957. What that misses is the intermediate steps. We want all the intermediate steps, the reasoning process, to be verifiable as well. That's the prover system, number one.

Number two, we want the model to have a great knowledge base. Know what's already in the knowledge base and what is not. Be able to introduce new nodes to the knowledge base as the prover generates more correct proofs, turning conjectures into theorems, and even bringing in new definitions and concepts that are useful for coming up with new problems in the knowledge base. This knowledge base could have a graph structure, a citation-based knowledge graph.

Number three, we really want to have a conjecture system – a model that can propose interesting new math questions. Then this conjecture model is going to give the challenge to the prover model, and the two of them can just talk. That forms a self-improving loop. The conjecturer can look at what the prover gets right and wrong and come up with an entirely new curriculum of problems to help the prover model hill-climb.
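The conjecturer-prover loop described above can be sketched schematically in Python. Every class and rule here is a deliberately toy placeholder, not Axiom's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    theorems: dict = field(default_factory=dict)  # statement -> proof

    def add(self, statement, proof):
        self.theorems[statement] = proof

class ToyProver:
    """A stand-in prover that can only 'prove' statements of the form n + 0 == n."""
    def attempt(self, statement):
        n, claim = statement
        return "add_zero" if n + 0 == claim else None

class ToyConjecturer:
    """A stand-in conjecturer that proposes trivially checkable arithmetic claims."""
    def __init__(self):
        self.n = 0
    def propose(self):
        self.n += 1
        return [(self.n, self.n)]  # conjecture: n + 0 == n

def self_improving_loop(conjecturer, prover, kb, rounds):
    for _ in range(rounds):
        for statement in conjecturer.propose():
            proof = prover.attempt(statement)
            if proof is not None:  # verified -> the conjecture becomes a theorem
                kb.add(statement, proof)
    return kb

kb = self_improving_loop(ToyConjecturer(), ToyProver(), KnowledgeBase(), rounds=3)
print(len(kb.theorems))  # 3
```

In the real system the interesting part, as Carina notes, is the feedback: the conjecturer would inspect what the prover got right and wrong and adjust its curriculum accordingly, which this sketch omits.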

So these are the three parts. And there's also auto-formalization, or the ability to turn natural language math into Lean, the programming language for proofs. This is sort of interweaving all these different parts.
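For a flavor of what "verifiable" means here: in Lean, a statement and its proof are code that the proof checker either accepts or rejects, with no human grading involved. A toy example (standard Lean 4, using `Nat.add_comm` from the core library):

```lean
-- Natural-language claim: "addition of natural numbers is commutative."
-- The Lean kernel checks every step of this proof mechanically.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Auto-formalization is the problem of producing statements like this from English automatically; auto-informalization is the reverse direction.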

For example, one might have a really good knowledge base of informal mathematics. We need auto-formalization to turn that into a Lean knowledge base. One may have a good conjecturer that's able to give some high-level intuitions. Also, auto-informalization – turning Lean back into English – will be needed for the conjecturer.

If a conjecture is stated in informal language and the prover proves it in formal language, auto-informalizing the result back into English adds to existing human mathematics. Human mathematicians might find inspiration in that.

I would say one final thing on the AI mathematician point: while we believe this model is going to have incredible capability, having the model be able to collaborate with human mathematicians is also a really important part.

We think that what humans find difficult and easy, in understanding and mathematical capability, is sometimes different from what a machine, an AI mathematician, may find difficult and easy. So there's definitely a lot of collaboration space. Because if they're good at exactly the same things and bad at exactly the same things, then you might argue: when the AI mathematician becomes really superintelligent, why would human mathematicians need to exist?

The answer is: they're good at very different things. So therefore, the collaboration and the synergy becomes incredibly interesting.

The Conjecture Problem

Ksenia: That's incredibly interesting, because I want to go back to this conjecture model and how you're even going to create it. Math is so endless – how is it going to narrow down to conjecture? And also it's somehow related, I think, to intuition that a human mathematician can have.

So I think it's a very important topic of conjecture creation, and also what role intuition plays here and how – or even can – you build it into a model?

Carina: My eureka moment – "Wow, this is so amazing that we are probably functioning in a space where we can create new knowledge" – was the AlphaGeometry paper.

In AlphaGeometry, Google DeepMind had all the Euclidean geometry figures like triangles, circles, lines, intersection points turned into symbolic language. They turn that into vector-based language and then in fact generate de novo geometry problems by functioning in the vector-based language. So they generate code snippets. And what they do is turn these into geometry representations after they're done generating these new problems.

So almost like extremely creative geometry problems are generated synthetically. That's an incredibly powerful idea.

We've also seen this in the line of mathematical discovery work by my colleagues François Charton and Alberto Bietti. They are researchers who were previously at Meta's Fundamental AI Research lab and are now at Axiom. They're able to use machine learning to generate very interesting constructions and examples – not conjectures, but mathematical objects.

By observing the underlying pattern of a family of them, the human mathematicians they work with were actually able to come up with de novo conjectures. So using these machine-generated objects to guide human intuition – that's what they do in discovery.

Now, the question of intuition is very big, right? It's very broad. One important part is the ability to generate these examples and constructions. The other part is synthesis – the human would synthesize them and formulate interesting conjectures.

Or you could say that the model in the formal language space can generate synthetically new problems and statements by taking in a couple of seed statements. This is what people have been trying, for example, in STP, Self-Taught-Theorem-Prover work from Stanford. They're looking at seed statements from a dataset called Lean Workbook, a high school level math dataset.

Then they try to generate conjectures that are fully in Lean. And then they have the prover model actually try to give reward to the conjecturer model. But obviously there are certain additional designs they need, because one can easily imagine a case where the prover-conjecturer system doesn't get anywhere. Well, is it the prover that's too bad, or is it the conjecturer that's too bad?

This sort of difficulty – troubleshooting or disentanglement – is something that a lot of researchers are still working on. I do think that by combining formal language, which is a different abstraction level, with informal, you have more space for conjecturing.

The Return of Symbolic AI?

Ksenia: Lean is a symbolic language, right? Is this the return of symbolic machine learning?

Carina: That's a great question. Back in the 1980s, when people were quite excited about formal verification, we didn't have LLMs yet, right? Now we do.

We have large language models being good informal reasoners about general high-level intuitions. They're able to give rough sketches – the paper draft sketch in a proof is still good and sound. And then you have amazing code generation capability on the programming language side. And then you have Lean.

So I think three things coming together makes now really good timing to build an AI mathematician using a hybrid method of formal verification and also informal reasoners. We're not taking the approach of AlphaProof, a purely formal system. We're also not obviously a large language model company like OpenAI. We're taking the strength of these two worlds and two abstractions and putting them together.

Specialist vs. General Models

Ksenia: That's very interesting. I also see a trend, because you launched basically within days of Periodic Labs, and both companies just blew my mind because of the problems you're trying to solve. So do you see it as a trend – specialist scientific models? Or will giant multi-modal models eventually swallow all these domains? How do you think about it?

Carina: Yeah, I think that's a very interesting question. I know Liam (Periodic Labs’ co-founder) well – we're friends – and I'm really excited about what they're doing at Periodic as well.

Periodic and Axiom build in a way on a similar philosophy: that fundamental science and fundamental mathematical discoveries are going to have transformative power in the world.

In the past, fundamental science was almost overlooked because each fundamental breakthrough might take about 100 years to trickle through to the real-world economy and have implications in the commercial world. Now AI is compressing the timeline of everything. And there's an additional reason: a lot of the ideas in the past stayed in that narrow field and didn't diffuse into other adjacent fields.

If I'm a mathematician in the past, right? If I'm Hardy, I will work with Littlewood and I will stay at Cambridge University and never go see the world. I will not talk to that physicist or definitely not talk to that molecular biologist.

So the top minds in mathematics and the ability to make fundamental theoretical breakthroughs were constrained within academia's ivory tower. They don't even necessarily collaborate with other human mathematicians. Definitely they're not going to collaborate with applied scientists or even AI scientists.

We think of AI for math as also the algorithmic pillar of AI for science. We make the radical breakthroughs, and then real-world testing – which is something I think Periodic Labs and other companies are doing incredible work on – they're going to put the theory into empirical environments and gather real-world testing results. And I think that cycle is going to be incredibly exciting.

Beyond IMO: Benchmarking Research Mathematics

Ksenia: The things you're going to build – how are they going to be evaluated and benchmarked? Because everyone is excited about IMO, but I don't think it applies to what you're building, right?

Carina: So I strongly recommend everyone to read a blog by Thomas Wolf called "AI Einstein."

Ksenia: It is a great post!

Carina: Fascinating blog. The idea is that Thomas is obviously great – a straight-A student; he talks about this in the blog. And he realized that research and discovery sometimes run on a much longer cycle than the exams or competitions you have in school. And they require different skills.

Like, besides character – grit and resilience – it requires different skills as well. The ability to connect different fields, to think broadly in addition to deeply. The ability to analyze close misses, right? All the things that I thought might work and didn't – how can I perturb that idea a little bit and get something else that's useful? Rather than in an academic environment, where sometimes you're not given the opportunity to take another shot.

I am actually, fascinatingly, the other kind as characterized in the "AI Einstein" blog. I was horrible. If I went to the IMO, I would not be a top-tier student. I mean, I did win some Olympiads, like the AMC or the AIME level, but not IMO level. On the other hand, I was a reasonable research mathematician. I think the skills required are different.

I would say that when it comes to building an AI research mathematician, capabilities such as theory building, such as library learning – which is introducing new concepts that will be helpful to construct the following lemmas or theorems – become incredibly important.

Theory building, as I mentioned, is like long-context reasoning – treating a problem not as an isolated thing, but rather as a node in a big graph. These kinds of abilities become very useful. And I think you're right – we are lacking benchmarks. We are totally lacking benchmarks.

We know that at the IMO level, the high school math level, benchmarks are pretty abundant. We have miniF2F, you can probably have the model take the IMO exam, you can scrape some other countries' math olympiad problems.

But when it comes to PhD qualifying exams, first of all, the cost of building those benchmarks increases dramatically. Each IMO problem's proven solution costs something like 200-300 bucks, and then a PhD qualifying exam problem is going to be like a thousand, right? The cost roughly quadruples.

And the question of taste becomes relevant. How do you construct the right curriculum of benchmarks? Benchmarking and especially measuring serious mathematics ability is something that Axiom definitely aspires to do and to be a community leader in this space. We want to be able to have AI mathematicians that can do research in addition to winning bright high school kids' competitions.

What Can You Do Right Now?

Ksenia: I always try to find something practical for my readers from every person I talk to. What can it be from Axiom Math? I'm not sure, because it feels very theoretical right now. But what can you offer that people can do related to math?

Carina: Yeah, so I think two baskets. One for individual audiences.

If you are new to Lean and haven't heard about it, there's this very good beginner game called the Natural Number Game – it's pretty much like Duolingo, but for Lean. Try and play with these exercises and learn.

If you have some familiarity with Lean, I think pick your favorite result in math and just build it out. Mathlib currently has wonderful coverage in domains such as algebra and analysis, right? But in other domains, such as geometry, topology, and partial differential equations, it's not there yet. There's not that much yet.

So on the way to formalizing your favorite result – say it turns out to be something in low-dimensional topology – you might end up building dependencies that other people will find incredibly helpful.

That's for the math audience – we are passionate about it and curious what you would do.

On the enterprise side, I think that the technology of large language model times formal verification – the times as in the intersection – is going to be incredibly promising in the coming years.

We know that a lot of companies serving large enterprises, especially enterprises in heavily regulated industries, do need formal verification to pass regulatory compliance checks. That's actually why a lot of customers – like Amazon's AWS, which has its own neurosymbolic reasoning work – are interested.

I also found that there are a lot of unexpected applications of the technology of formal math proving and auto-formalization applied in these other areas. We at Axiom are very much excited to chat with folks who think there might be a synergy.

I think formal verification for both software and hardware are going to be incredibly strategic areas to work on. If you have, for example, legacy code that you want to replace or upgrade, and you want to make sure they still cover all the critical business areas – there's no edge case that's important and dismissed – let's definitely talk about what that looks like.

If you think there are certain hardware circuit property testing or hardware design questions, let's definitely talk as well. I think these are on the formal verification side.

On the mathematical discovery side, it doesn't involve Lean, but it's mostly the practice of generating interesting constructions, examples, mathematical objects that are incredibly relevant to some very deep research open problems, and also deeply relevant to the real world – like graphs, right?

If you have a particular kind of graph theory problem in mind – logistics, routing – let's definitely talk as well. So I think there are a lot of things to be defined. That is not opening new use cases and markets – that is scaling the power of LLM-assisted formal reasoning in these existing areas, scaling performance in these existing markets to win. And then there will definitely be new use cases, but that's kind of like a second phase.

Bottlenecks

Ksenia: What are the main bottlenecks and roadblocks on your way to solving all this? First, what are they, and then – how do you solve them?

Carina: There are a few bottlenecks. One is data scarcity. There are more than one trillion tokens of Python code, but probably only about 10 million tokens of Lean code. That's a 100,000x data gap. So I think that's one problem we're actively navigating.
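The figures Carina quotes are round, order-of-magnitude numbers, but the arithmetic of the gap is easy to check:

```python
# Rough, order-of-magnitude figures from the conversation, not exact counts.
python_tokens = 1_000_000_000_000  # ~1 trillion tokens of public Python code
lean_tokens = 10_000_000           # ~10 million tokens of Lean code
print(python_tokens // lean_tokens)  # 100000 -> a 100,000x data gap
```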

As a corollary, because models haven't seen much Lean, the ability to auto-formalize – "translate," though translating is not quite the right word, because it's a technically stubborn problem – the process of converting natural language math into Lean is going to be difficult simply because models haven't seen much of it. So you have a cold-start problem, almost a chicken-and-egg problem.

And we definitely do not want to just pay Lean experts to manually code all the proofs into Lean. That does not scale that well. That's one bottleneck we are facing.

At Axiom, we have three pillars we believe are important to the vision: AI, programming languages, and math. So therefore, on the talent side, when you talk about folks who are working in other parts of computer science, we have very good applied AI people and we also have very good programming languages, compiler people. We also have great mathematicians – both informal mathematicians who work with English and mathematicians who are the pioneers of building out Mathlib, the largest math library in Lean. So that's the crowd we have to climb this pretty difficult, challenging mountain.

Besides data scarcity, I think there are also other bottlenecks. For example, the problem of conjecturing and intuitions is very, very difficult. It requires very good experiment design. And in a startup environment, we definitely want to accelerate all these different directions with the constrained resources we have. So it's all very exciting, all very ambitious and even creative sometimes.

Superintelligence, Not AGI

Ksenia: I wanted to get back to AGI. What is your thinking about AGI and superintelligence? And do you think formal reasoning is essential for AGI?

Carina: I like the word superintelligence better, always better than AGI, because it's hard to –

Ksenia: I don't know really the difference. They are so similar in a sense.

Carina: I think superintelligence can be defined as domain-specific. So I can be a superintelligent AI mathematician, right? It's very hard for this superintelligent AI mathematician to fold laundry the best. I think it's not quite general in a way.

So I think the thing with AGI is I don't quite know what it means. I think what it means, the difference could be: if you think about the world of problems as a plate, and in the middle you have the easy problems like one plus one equals two, print "Hello World," right? And then on the edge, you have many different points.

Maybe the north point is cure cancer. And the west point is solve the Riemann hypothesis. The south point can be write a novel that's going to win the Nobel Prize in Literature. And then you can have the east one as – I don't know – the ultimate question of physical intelligence.

So the AGI concept seems to be that from the center, you enlarge the annulus slowly. And you're betting on a system that is really good at writing the Shakespeare equivalent work that can also solve the Riemann hypothesis and cure cancer. That, to me, feels like AGI. The plate just keeps being enlarged to the edge.

For me, superintelligence is: hey, I pick the point in the middle, and then I pick the goal, and then I just go there. I just go there in one line. And that's what Axiom is doing. We are very much bullish on building a superintelligent AI mathematician. But I'm not bullish about my model being good at poetry. In fact, my model might not be very good at poetry. So I think that might be the difference.

Ksenia: That's kind of related to the question about Periodic Labs and Axiom Math and the bet on specific models or general models. So you definitely bet on specific models.

Carina: Yes, I bet on the specific model. I also bet on AI for math being the algorithmic pillar, the reasoning platform of AI for science. And that's going to be – I believe that both Axiom and Periodic, these two companies are working on incredibly important things. And that's going to be the future. I think that's going to be the next frontier in AI: AI for math, AI for science.

Books That Shaped a Mathematician

Ksenia: My last question is usually about books. What book has formed you or influenced you recently? It can be either your childhood or something recent.

Carina: I read perhaps too many math books for better or for worse…

Let me answer this for math books and non-math books, because this is going to be fun.

For math books: algebraic geometry textbooks. Some people use Hartshorne, some people use Ravi Vakil's The Rising Sea. I use The Rising Sea just because I'm not smart enough to understand Hartshorne. The Rising Sea has a lot more illustrative examples.

It's a book that formed me. It helps me understand what Grothendieck means by theory building. Once you introduce the concept of schemes, you're able to understand things in a completely different perspective. And algebraic geometry is, I think, a perfect example of theory building rather than problem-solving type of mathematics.

On the other hand, there are books such as Davenport on analytic number theory. It introduces a lot of tricks to help you problem-solve, to help you set bounds. There are Cauchy-Schwarz kind of arguments, inequality-type arguments to help you calculate. When you see an inequality in analytic number theory, it's almost like a reflex in your brain. In that way, it has turned math into a more finite search space, like Go and chess, rather than an infinitely exploratory one.
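For reference, the classical Cauchy-Schwarz inequality she alludes to, stated for real sequences:

```latex
\left(\sum_{i=1}^{n} a_i b_i\right)^{2}
\;\le\;
\left(\sum_{i=1}^{n} a_i^{2}\right)\left(\sum_{i=1}^{n} b_i^{2}\right)
```

Bounding a sum of products by the two separate sums of squares is exactly the kind of reflexive estimation move she describes.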

So I think these two books each teach me a very different lesson.

And when I was a kid, I loved Proofs from THE BOOK. It was wonderful. The idea that God has a book – this is Erdős's quote – that God has a book of the best mathematics, and there are proofs from the book that we can now read. It feels almost like a very spiritual experience for me.

Mathematicians are almost like truth seekers, messengers. They climb ladders to pick the sweetest apples and then spread it to humanity. These theoretical truths bolster different things, right? Plant a really great number theory breakthrough, and then you have implications in cryptography.

Hardy used to apologize for it in A Mathematician's Apology – a sarcastic apology that, well, my math is just totally useless. Turned out to be quite useful in the war. That's just fascinating to me.

On the non-math side, I love actually a lot of humanities and social science things. I did law school for two years before I took the leap.

Ksenia: That's why you like Thomas Wolf.

Carina: I love reading about law and economics. I think Andrei Shleifer's work is really wonderful, and having him as my professor in the Law and Economics Seminar at Stanford Law was a great experience. Learning from those who write the textbooks and mathematically modeling the effect of certain laws and policies on deterrence versus retribution – that's just fascinating to me.

On literature, there's this book about a generation of Chinese writers and artists: Old Tales of Dayaobao. They all lived in the same Hutong complex in Beijing. And it's almost like the French salon in Renaissance times. These people communicate ideas and bridge between different fields. The one who's an architect can say something that somehow inspires the one who makes beautiful artworks.

Craftsmanship flows through this conversation, and how they were almost collectively having pain and suffering during the Cultural Revolution. I think that was another book that shaped my personality in that I always aspire for art, for artistic expressions and creations. And I think that math and intuition are closer to art than to science.

Practical Tools for Mathematical Discovery

Ksenia: I love these recommendations. I came up with the question about practical stuff. As a human mathematician, can you give an idea how to use AI so everyone can do that? Math-related.

Carina: Oh yes! There are code bases for mathematical discoveries.

The way to interact with them is not exactly how you prompt a large language model, but instead to work with people who are developers of these toolkits to really ponder very hard open math questions.

And these two code bases I want to highlight are called Pattern Boost and End-to-End. They are the work of my colleague François, who has been collaborating with many mathematicians and theoretical physicists on using Pattern Boost to generate interesting examples – that's the generative method. And End-to-End is the translative method – it translates problems to solutions.

The other one can generate a family of interesting mathematical objects. And then the purpose is for this to be very interactive. So if I'm a mathematician and I see a lot of graphs being generated, I might have beliefs about the problem that I didn't know before.

So I would encourage folks to just try to get in the mix. There are AI for Math conferences throughout the year, from Oberwolfach to – I think the next one is going to be JMM. And after JMM, there's Aarhus, that Axiom is actually co-organizing.

So I think get in the mix, talk to people who are doing frontier work in AI for math discovery, and see what problems that previously felt not tractable are suddenly perhaps within the grasp.

Do leave a comment
