
FOD#125: What is nanochat? (it's bigger than you think)

how to understand machine intelligence through a tiny model

This Week in Turing Post:

  • Wednesday / AI 101 series: From BF16 to FP16 (Defeating the Training-Inference Mismatch via FP16)

  • Friday / An AI Unicorn You’ve Probably Never Heard Of

Must see: Andrew Ng and Emil Eifrem open the day with a fireside chat on the future of AI. After that, choose from 140+ community sessions, packed with real-world use cases in GraphRAG, context engineering, and knowledge graphs.

Discover how developers everywhere are building the next wave of graph-powered AI agents. Secure your free spot today! 

Our news digest is always free. Click on the partner’s link to support us or Upgrade to receive our deep dives in full, directly into your inbox →

Editorial: While there are a lot of things happening in the world of AI, here we are with our monthly overview of Karpathy’s posts. And this time, we’ll look at everything through the lens of the tiny model he just created – nanochat (it already has over 35k stars on GitHub).

Why is it so important? Because it’s deeply connected to education. And if there’s a way to change the world for the better, it starts with education. As you might know, Andrej Karpathy left all the big labs to focus on Eureka Labs, something he describes as “a new kind of school that is AI native.” In a recent interview, though, he sounded unsure – as if even he doesn’t yet fully know what Eureka Labs will become.

And that’s exactly why nanochat is so interesting. He calls it a ramp, but it’s actually a lab of its own – a miniature system where anyone can experiment alongside him or independently.

In other words: Eureka is still a vision, full of uncertainty, but nanochat is real and working – a small workshop where that vision learns to walk. And we can walk along.

So what is nanochat, and how can you use it?

Karpathy introduced nanochat at the beginning of the month, then spent weeks teaching, tuning, and talking about it – a miniature language model that can be trained for as little as $100 (~4 hours on an 8×H100 node) and behaves like a small, curious creature. He described it as a “kindergarten child”: cheerful, error-prone, sometimes absurd, always revealing. The project became his way of thinking through the entire ecosystem of large models, only at a scale that fits on a desk – a lens on everything, through this baby AI.

nanochat is both an object and a method. It represents a full learning loop – pretraining, supervised fine-tuning, and reinforcement learning – compressed into a form that anyone can inspect. For Karpathy, it’s an antidote to abstraction. The trillion-parameter world of frontier labs is hard to grasp – a micro-model you can retrain overnight restores visibility. Through it, he asks: what actually happens when a model begins to learn?
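To make the "learning loop" concrete: at its core, pretraining is next-token prediction. This is a deliberately toy sketch (not nanochat's actual code, which trains a Transformer) – a bigram character model that "learns" by counting, the simplest possible version of that loop:

```python
from collections import Counter, defaultdict

# Toy illustration, not Karpathy's code: "pretraining" at its core is
# next-token prediction. A bigram model learns character statistics by
# counting pairs -- the simplest conceivable version of that loop.
corpus = "nanochat is a tiny model. nanochat is a teaching aid."

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # count how often `nxt` follows `prev`

def predict(prev: str) -> str:
    """Greedy next-character prediction from the learned counts."""
    return counts[prev].most_common(1)[0][0]

# After "training", the model has absorbed patterns from its data:
print(predict("n"))  # the most likely character after 'n' in this corpus
```

Swap counting for gradient descent and characters for BPE tokens, and you have the pretraining stage nanochat makes inspectable end to end.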

Image Credit: Karpathy’s X

Learning Through Synthetic Worlds

He spent the middle of the month adding identity to nanochat. Using synthetic conversations, he taught it who it is: a small model named nanochat d32, built by Andrej, aware of its limitations, occasionally royal enough to call him “King.” Later, he gave it a new skill – counting letters in words – through a small synthetic dataset called SpellingBee. Each exercise was a study in how data variety and task framing shape capability.

The point was not to make the model clever, but to show how easily personality, knowledge, and reasoning can be layered into weights. Nanochat became a teaching aid for anyone curious about how identity and behavior emerge from data.
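The exact data format of Karpathy's SpellingBee set isn't reproduced here, but the idea is easy to sketch: programmatically generate question–answer pairs whose answers are computed, not written by hand. A hypothetical generator might look like this:

```python
import random

# Hypothetical sketch of a SpellingBee-style synthetic dataset; the real
# format in the nanochat repo may differ. Each example is a chat turn
# that teaches the model to count letters in a word.
def make_example(word: str, rng: random.Random) -> dict:
    letter = rng.choice(word)  # pick a letter the word actually contains
    count = word.count(letter)  # the ground-truth answer is computed
    return {
        "user": f'How many times does the letter "{letter}" appear in "{word}"?',
        "assistant": f'The letter "{letter}" appears {count} time(s) in "{word}".',
    }

rng = random.Random(0)
dataset = [make_example(w, rng) for w in ["strawberry", "nanochat", "kindergarten"]]
for ex in dataset:
    print(ex["user"])
    print(ex["assistant"])
```

Because the labels are computed, the dataset is correct by construction – the interesting design questions, as Karpathy's posts show, are about phrasing variety and task framing, not labeling.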

Image Credit: Karpathy’s X

New Ways of Thinking About Thought

Even when Karpathy wrote about other topics – diffusion, pixels, attention – it all returned to the same lens.

He compared autoregressive text models to diffusion: one writes token by token, the other rewrites a full canvas of tokens until the noise settles. The second feels closer to a thinking process – reflective, bidirectional, revising itself as it goes. He wondered what it would mean to train nanochat this way.
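The contrast can be caricatured in a few lines. This toy sketch (no real model involved) shows the two decoding shapes: one commits to tokens strictly left to right, the other keeps a full canvas and fills it in over several parallel passes:

```python
# Toy contrast, illustrative only -- no actual model here. Autoregressive
# decoding commits to tokens left to right and never revisits them; a
# diffusion-style decoder keeps a whole canvas and refines it in passes.
target = list("thinking")

def autoregressive(target):
    out = []
    for tok in target:  # one token per step, never revised
        out.append(tok)
    return "".join(out)

def diffusion_style(target, steps=3):
    canvas = ["_"] * len(target)  # start from a fully "noised" canvas
    for step in range(steps):
        # each pass updates a strided slice of positions in parallel,
        # touching the whole sequence rather than only its right edge
        for i in range(step, len(target), steps):
            canvas[i] = target[i]
    return "".join(canvas)

print(autoregressive(target))   # built strictly left to right
print(diffusion_style(target))  # refined over the whole canvas at once
```

Both arrive at the same string, but the second gets there by revising everywhere at once – the reflective, bidirectional quality Karpathy finds closer to a thinking process.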

He also questioned the entire text pipeline. Why keep feeding models discrete tokens when the input could be pixels? Rendering text as images could compress information and preserve structure, typography, color, and emotion. He dreams of a model that reads the world visually – a system free from tokenization, Unicode, and all their inherited mess.

Even brand-new methods land almost immediately as branches of nanochat (note: stay tuned – we will be covering the BF16-FP16 paper on Wednesday).

Image Credit: Zichen Liu’s X

The Shape of the Signal

Through Karpathy’s play with nanochat, we see education shrinking to something smaller, more hands-on, and more visible. It’s open-sourced and open for everyone. Large systems will keep expanding, but real understanding now comes from microcosms – models light enough to experiment with and clear enough to learn from. It becomes increasingly valuable to spend just a few hours exploring it – to see, up close, how we are currently building intelligence.

How to try: If you want to try it yourself → launch a single 8×H100 node (Karpathy uses Lambda Labs) and run bash speedrun.sh. The script trains the ~$100 tier end-to-end in about four hours; afterwards, serve a simple chat UI with python -m scripts.chat_web to talk to your model. Or train for about 33 hours for ~$800. There’s also a CPU/Mac (MPS) route via dev/runcpu.sh for tiny, slow runs that are still useful for seeing the code paths. After training, open the generated report.md to review CORE, MMLU, ARC, and GSM8K scores, then try the “Infusing identity” and “SpellingBee” guides to add self-description and a small skill.

More on nanochat GitHub: https://github.com/karpathy/nanochat 

Attention Span: A look into the future – Three modalities of Quantum Computing, the missing NVQLink, and how it all works. Watch it here

Curated Collections – PO

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

News from The Usual Suspects ©

  • A Symphony of Coders: Cursor 2.0 and Composer Arrive
    Cursor 2.0 lands with a sharp new interface and its own frontier coding model, Composer – 4× faster than peers and trained for multi-step logic across sprawling codebases. Add in parallel agents, auto-testing, and a UI shift from files to outcomes, and it’s clear: Cursor is aiming to be more than an IDE.

  • Emergent introspective awareness in LLMs
    Researchers from Anthropic used concept injection to test whether LLMs can introspect on their internal states. Claude Opus 4.1 and 4 detected injected concepts with 20% success at optimal layers, distinguished internal "thoughts" from text inputs, and identified whether outputs were intentional. Models also modulated internal states when prompted (e.g. "think about X"). Introspective ability varied by model and context, suggesting emerging, unreliable, and mechanistically diverse self-awareness.

  • OpenAI: Aardvark Digs Deep for Defenders
    OpenAI unveils Aardvark, an autonomous GPT-5–powered security researcher now in private beta. Designed to patrol codebases like a tireless digital sleuth, Aardvark identifies, validates, and patches vulnerabilities at scale. It’s already uncovered CVEs in open-source projects and proved its chops internally. Think of it as a teammate who never sleeps and always spots the bug.

Interesting Index:

Remote Labor Index: Measuring AI Automation of Remote Work by the Center for AI Safety and Scale AI – an interesting benchmark of 240 real-world freelance projects across 23 categories. These projects represent over 6,000 hours of work worth $140,000. Evaluated via rigorous human judgment, top AI agents like Manus and Sonnet 4.5 achieved automation rates below 2.5%, showing that economically valuable remote labor remains largely unautomated. Agents failed due to file corruption (17.6%), incompleteness (35.7%), and poor quality (45.6%) →read the paper

Models to pay attention to

  • 🌟 MiniMax M2 & Agent – optimize an open LLM for tool use and coding with fast, low-cost inference, deep search, and built-in Shell, Python, and Browser toolchains for agent workflows →read the paper

  • gpt-oss-safeguard – apply developer-defined safety policies at inference using chain-of-thought, output decisions with rationales, and match or beat larger baselines while shipping as open weights under Apache 2.0 →read the paper

  • Kimi Linear – replace full attention with a hybrid linear attention stack (KDA + MLA) to surpass full attention under equal training, cut KV cache by up to 75%, and deliver 6× decode throughput at 1M context →read the paper

  • MASPRM: Multi-Agent System Process Reward Model – estimate per-agent, per-action progress from MCTS rollouts and guide beam search or MCTS at inference to focus compute on promising branches for more reliable multi-agent reasoning →read the paper

  • Ouro: Looped Language Models – pretrain reasoning with latent iterative computation and learned depth allocation to scale latent reasoning and let 1–3B models match far larger CoT-tuned baselines →read the paper

  • Emu3.5 – learn a native multimodal world model that predicts interleaved vision–language next states, then accelerate inference with Discrete Diffusion Adaptation for fast, consistent generation and editing →read the paper

  • Tongyi DeepResearch – specialize a 30.5B agentic LLM (3.3B active per token) for long-horizon deep research via agentic mid-training and post-training, achieving state-of-the-art on web research benchmarks →read the paper

  • Ming-Flash-Omni – unify vision, speech, and language in a sparse MoE (100B total, 6.1B active) to advance contextual ASR, text-to-image, identity-preserving editing, and generative segmentation in one model →read the paper

  • LongCat-Video – generate minutes-long 720p 30fps videos with a DiT backbone using coarse-to-fine temporal–spatial generation, block-sparse attention, and multi-reward RLHF across T2V, I2V, and continuation →read the paper

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟

Reinforcement Learning for Reasoning & Agents

  • 🌟🌟 Supervised Reinforcement Learning (SRL): From Expert Trajectories to Step-wise Reasoning (by Google) – reformulate problem solving as action sequences with step-wise rewards from expert traces to teach small models before refining with RLVR →read the paper

Image Credit: SRL

  • 🌟🌟 SPICE: Self-Play In Corpus Environments Improves Reasoning (by Meta) – co-evolve a Challenger and Reasoner grounded in a document corpus to generate curricula and sustain self-improvement across domains →read the paper

Image Credit: SPICE

  • 🌟 Reasoning-Aware GRPO using Process Mining (by Pusan University) – augment GRPO with a conformance reward over reasoning steps via process mining to align policies with teacher procedures and boost multi-step reasoning →read the paper

  • Repurposing Synthetic Data for Fine-grained Search Agent Supervision – assign dense entity-aware rewards (E-GRPO) to learn from near-miss trajectories and induce fewer, more effective tool calls →read the paper

  • Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning – train a critic with staged rewards to improve discriminability and helpfulness, yielding stronger feedback-driven refinement →read the paper

  • Multi-Agent Evolve: LLM Self-Improve through Co-evolution – co-evolve Proposer–Solver–Judge agents with RL to generate, solve, and evaluate tasks, improving general reasoning without heavy human supervision →read the paper

Agent Organization, Planning & Markets

  • 🌟🌟 Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets (by Microsoft) – simulate two-sided markets of assistant and service agents to study welfare, bias, manipulation, and search dynamics at scale →read the paper

Image Credit: Magentic Marketplace

  • 🌟 The Era of Agentic Organization: Learning to Organize with Language Models (by Microsoft) – orchestrate asynchronous thinking with an organizer–workers protocol and optimize the structure with RL to cut latency and raise accuracy →read the paper

  • ReCode: Unify Plan and Action for Universal Granularity Control – represent plans as recursive code with placeholder functions that expand into actions, enabling fluid shifts in decision granularity →read the paper

Inference, Decoding & Serving Systems

  • Batch Speculative Decoding Done Right – solve ragged-tensor synchronization to guarantee output equivalence, then group by length (EXSPEC) for higher throughput in production serving →read the paper

  • The End of Manual Decoding: Towards Truly End-to-End Language Models – predict token-level temperature and top-p to learn decoding policies in-model, enabling instruction-steerable, end-to-end generation →read the paper

Architecture & Efficiency (Attention, Routing, Long Context)

  • Knocking-Heads Attention – enable cross-head interactions via a shared diagonally-initialized projection so heads “knock” before attention, stabilizing training and improving downstream performance →read the paper

  • Sparser Block-Sparse Attention via Token Permutation – permute tokens to concentrate dependencies within blocks, increasing block-level sparsity and accelerating long-context prefilling with custom permuted-FlashAttention →read the paper

  • Parallel Loop Transformer for Efficient Test-Time Computation Scaling – parallelize looped computation across tokens (CLP) and reuse first-loop KV with gated sliding-window attention to keep latency and memory near baseline →read the paper

  • Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance – guide MoE routing in DiTs with conditional and prototypical assignments plus routing contrastive loss to specialize experts and lift ImageNet results →read the paper

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
