FOD#128: Universe of Incredible Models

An insanely saturated week with models you actually need to know about

This Week in Turing Post:

  • Wednesday / AI 101 series: What is Continual Learning? (Including Nested Learning)

  • Friday / Interview: Carina Hong at Axiom Math

Last week’s most popular deep dives:

Our news digest is always free. Click on the partner’s link to support us or Upgrade to receive our deep dives in full, directly into your inbox →

Attention Span: if last year was the year of prompt engineering, this year is the year of context engineering and context management. Because it’s a real pain in the neck. Why and what to do about it? At the AI Engineer Code Summit I cornered a few builders from Replit, Cursor, HumanLayer, Sourcegraph, MiniMax, Block, and Google DeepMind and asked two simple questions: Why is the context window such a nightmare, and how do you work around it? Watch it here

Editorial: Fight, Models, Fight

If this were the only week of model releases, it would already cover two or three months of news flow. Yet somehow we got it all in seven days. Open source, closed source, some gravity, some antigravity. While human supermodels were strutting the stage last week, big labs were shipping their supermodels as if also competing for Miss Universe. Coincidence? I don’t think so. Supermodel wars are here in both realms.

Let’s break down what arrived, what each model is actually good for, and how to pick the right one for the job.

Starting with open source, because open models are crushing it for our benefit.

Olmo 3: Open source gets a full training storyboard

AI2 did something the big labs still refuse to do. They shipped not only the weights, but the full flow: data, code, checkpoints, and evaluation pipeline. Every stage, from pretraining to RL to post-training checkpoints, is public. Bravo and thank you.

Why it matters:
– You can see how reasoning emerges stage by stage
– You can fork at any point: mid-training, RL, post-training
– 32B Think variant is competitive with Qwen-scale open models on math, code, and reasoning

Olmo 3 is less a model drop and more an invitation: “Here is how to build your own RL-trained thinker without guessing.” →read the paper
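
Because every intermediate checkpoint is public, “forking” can be as simple as loading a specific revision and continuing training from there. A minimal sketch, assuming the release follows earlier Olmo conventions on Hugging Face – the repo id and revision name below are hypothetical:

```python
# Hypothetical names: check the Olmo 3 collection on Hugging Face for the real ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/Olmo-3-32B-Think"   # hypothetical repo id
ckpt = "mid-training-step100000"    # hypothetical checkpoint revision

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, revision=ckpt)

# From here you can continue pretraining, run your own RL stage,
# or post-train on domain data instead of starting from scratch.
```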

Today’s Release: Fara-7B: On-device web supermodel

While everyone else was flexing at the leaderboard level, Microsoft quietly shipped Fara-7B: a 7B open-weight computer-use model that drives the browser like a person.

How it works (a rough sketch of the loop follows the list):
– It sees screenshots, not DOM trees
– It predicts clicks, scrolls, keypresses, and macro actions like web_search or visit_url
– It learns from 145k synthetic trajectories produced by a multi-agent Magentic-One system, distilled into one compact model
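
Concretely, that loop is easy to picture. Here is a hypothetical sketch of a screenshot-native agent loop – the browser and model interfaces are illustrative stand-ins, not Microsoft’s actual API:

```python
# Hypothetical perception-action loop for a screenshot-native CUA like Fara-7B.
# `browser` and `model` are illustrative stand-ins, not Microsoft's API.

def run_task(model, browser, goal: str, max_steps: int = 40):
    history = []
    for _ in range(max_steps):
        shot = browser.screenshot()                  # pixels in, no DOM tree
        action = model.predict(goal, shot, history)  # e.g. {"type": "click", "x": 412, "y": 88}
        if action["type"] == "click":
            browser.click(action["x"], action["y"])
        elif action["type"] == "type":
            browser.type_text(action["text"])
        elif action["type"] == "visit_url":          # macro action
            browser.goto(action["url"])
        elif action["type"] == "done":               # task finished
            return action.get("answer")
        history.append(action)
```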

Benchmarks:
– 73.5 percent on WebVoyager, ahead of other 7B-class computer-use agents (CUAs)
– Strong results on DeepShop and Online-Mind2Web
– 38.4 percent on WebTailBench, which covers “boring real life” tasks like booking tickets, applying for jobs, or comparing prices

Computer-use agents are moving from “cloud experiment” to “thing that quietly runs on your laptop and books your dentist appointment.” →read their blog

P1: Open physics supermodel

P1 is the first open model that can hang with the very top tier on real physics Olympiads. It reaches gold-medal performance on IPhO 2025 and was trained with reinforcement learning on verifiable physics problems, not just nicer prompts.
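
“Verifiable” is doing a lot of work in that sentence: the reward comes from checking the model’s final answer against ground truth, not from a learned preference model. A minimal sketch of what such a reward can look like – the naive answer parsing here is illustrative, not from the paper:

```python
# Hedged sketch of a "verifiable reward" for physics RL: score 1 if the final
# numeric answer matches the reference within a relative tolerance, else 0.
def reward(model_answer: str, target: float, rel_tol: float = 1e-2) -> float:
    try:
        value = float(model_answer.strip().split()[-1])  # naive: last token is the number
    except ValueError:
        return 0.0
    return 1.0 if abs(value - target) <= rel_tol * abs(target) else 0.0
```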

What it is good for:
– Olympiad-level physics, math, and code, with detailed step reasoning
– Scientific assistants that must be right for concrete, numeric questions
– Agentic systems where you want a reasoning core that can be audited

This is the first time an open model seriously claims “I can do hard physics” and you do not have to squint. →read the paper

Nemotron Elastic: Many-in-one reasoning family

NVIDIA took the Matryoshka idea and applied it to hybrid Mamba–Transformer reasoning models. Train a 12B once, get 9B and 6B “nested” models out of the same weights, no extra training runs.

Why it is interesting:
– Training cost drops by an order of magnitude compared to training each size separately
– Deployment cost is constant in memory: you store the big one and slice
– All sizes keep strong performance on math and code benchmarks

In practice, this gives you a single reasoning family you can fit into everything from a laptop to a cluster →read the paper
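
For intuition, here is the Matryoshka mechanic at its crudest: the smaller models live inside the big one’s weight tensors, so extracting one is a slice rather than a training run. A toy sketch – Nemotron Elastic learns which widths and depths to keep, so treat this naive prefix slice as illustration only:

```python
# Toy version of "train once, slice many": carve a narrower Linear layer
# out of a larger one's weights. The real method uses learned elastic masks.
import torch

def slice_linear(big: torch.nn.Linear, out_dim: int, in_dim: int) -> torch.nn.Linear:
    small = torch.nn.Linear(in_dim, out_dim, bias=big.bias is not None)
    small.weight.data = big.weight.data[:out_dim, :in_dim].clone()
    if big.bias is not None:
        small.bias.data = big.bias.data[:out_dim].clone()
    return small

big_proj = torch.nn.Linear(4096, 4096)           # stand-in for a 12B projection
small_proj = slice_linear(big_proj, 2048, 2048)  # its "nested" smaller sibling
```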

MiMo-Embodied: One model for robots and cars

Xiaomi’s MiMo-Embodied is the first open model to hit state-of-the-art on both embodied-robotics and autonomous-driving benchmarks.

What it can do:
– Affordance prediction and task planning for robots
– Perception, prediction, and planning for driving scenes
– Spatial reasoning across 3D environments, video, and language

The significance: the field is slowly moving away from one-model-per-vertical toward shared “embodied backbones.” Super interesting. →read the paper

Now the big labs

Today’s Release: Claude Opus 4.5: Long-horizon workhorse

Opus 4.5 is tuned for real workloads: tools, long contexts, and spreadsheets.

It adds:
– Better SWE-bench performance with fewer tokens
– Improved multi-agent orchestration inside the Claude Code ecosystem
– Stronger resistance to prompt injection

If you are building agents that live inside Office-style environments and run for hours, this is a serious contender.


GPT-5.1-Codex-Max: Long-horizon coding brain

Codex-Max is OpenAI’s new agentic coding model built for huge, multi-stage software tasks. Its standout trick is compaction: it can operate across multiple context windows as one coherent chain, letting it work over millions of tokens without losing the plot.
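
OpenAI has not published the mechanics of compaction, but the behavior described maps onto a familiar pattern: when the transcript nears the window limit, fold the oldest steps into a summary and continue in the freed space. A minimal sketch of that pattern – count_tokens and summarize are stand-ins, not OpenAI APIs:

```python
# Hedged sketch of context compaction: compress the oldest half of the
# transcript into a note whenever the token budget is exceeded.
def compact(history: list[str], token_budget: int, count_tokens, summarize) -> list[str]:
    while len(history) > 2 and sum(count_tokens(m) for m in history) > token_budget:
        half = len(history) // 2
        note = summarize("\n".join(history[:half]))  # stand-in for a model call
        history = ["[compacted context] " + note] + history[half:]
    return history
```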

Where it fits:
– Full-repo agents that track intent across long workflows
– Large refactors that used to break model coherence
– Engineering tasks that blend coding, research, testing, and documentation

It was released before Claude Opus 4.5 but didn’t get nearly as much attention. →read their blog

Google stack: Gemini 3, Antigravity, Nano Banana Pro

Gemini 3 is the new flagship, and it’s a very impressive model. Pro is already strong, but Deep Think mode is the real story: higher GPQA and ARC-AGI-2 scores, better long chains, better tool use.

Use it when:
– You want a serious “study partner” for dense material
– You need agents that must carry context over long workflows
– You want robust multimodal reasoning across text, images, and video

Nano Banana Pro is the image model upgrade that finally treats text inside images as a first-class citizen. It is indeed very good, and if you read me you know how much trouble I had with Nano Banana before. Posters, UI mocks, packaging, coherent explainers with actually correct language – it holds up.

Use it for: infographics, storyboards, brand-consistent marketing visuals, and any “explain this concept visually” work. →read their blog

Google Antigravity is another very interesting launch. It’s not a model but an IDE reimagined for agents. It has:

– An editor view for devs
– A manager view for orchestrating parallel agents across browser, terminal, and editor
– Artifacts like plans, screenshots, and walkthroughs for verification

The standout feature is the artifacts. They matter because they give you a full, inspectable trail of what an agent actually did: plans, screenshots, step traces, browser sessions, terminal output, the whole flight recorder. This solves the classic problem of agentic coding: you get a diff, but no story.

If Antigravity delivers what it promises, Cursor and the like might feel it. →read their blog

But the team immediately ran into some limitations and is working around the clock to fix them.

Grok 4.1: Personality with a brain

Grok 4.1 pushes two dials at once: emotional intelligence and reasoning stability. Better EQ-Bench, better creative-writing scores, lower hallucination rate for info-seeking prompts.

Where it fits:
– Consumer-facing chat experiences
– Creative and social applications where flat tone is a bug
– Assistants that need a distinct voice without losing rigor

I’m not an active user of Grok 4.1. →read more about it on their blog

Curated Collections – Spatial Intelligence

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟.

Highlight

  • 🌟🌟 Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models (by Stanford) – this paper flips the usual “small models just reason worse” story. They show that when you shrink a multimodal model, what really collapses is perception – the ability to reliably extract visual details – more than the language side. Then they fix it with EXTRACT+THINK: a two-stage pipeline where a tiny VLM is trained to explicitly dump instruction-relevant visual facts, and a slightly bigger LLM reasons over that, hitting strong benchmarks with surprisingly little data and few parameters. Quite interesting →read the paper

Multimodal, Spatial Intelligence & 3D / Vision

  • 🌟 WorldGen: From Text to Traversable and Interactive 3D Worlds (by Meta) – convert text prompts into coherent, traversable 3D environments by combining LLM layout reasoning, procedural generation, and diffusion-based 3D synthesis inside standard game engines →read the paper

  • RoMa v2: Harder Better Faster Denser Feature Matching – redesign dense feature matching with a new architecture, loss, and curated training regime to solve harder correspondence problems faster and more robustly →read the paper

  • 🌟 Mixture of States: Routing Token-Level Dynamics for Multimodal Generation (by KAUST and Meta) – fuse modalities in diffusion models via a token-wise router that sparsely mixes hidden states over time, matching or beating much larger models for text-to-image and editing →read the paper

  • Scaling Spatial Intelligence with Multimodal Foundation Models – scale a family of multimodal models (SenseNova-SI) on 8M spatially curated samples to push SOTA on multiple spatial benchmarks while preserving general multimodal performance →read the paper

  • Back to Basics: Let Denoising Generative Models Denoise – return diffusion to predicting clean images directly with large-patch pixel Transformers (JiT), leveraging manifold structure to get strong ImageNet results without tokenizers or extra losses →read the paper

Agentic Science, Autonomous Research & Self-Evolving Agents

  • OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists – embed citation graphs, collaboration protocols, and an open evaluation arena into an AI-science stack so human and AI researchers can co-evolve on shared infrastructure →read the paper

  • 🌟 What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity (Meta) – correlate and experimentally manipulate ideation diversity in AI research agents, showing more diverse idea sets reliably improve MLE-bench performance →read the paper

  • Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning – co-evolve a curriculum agent and executor agent (with tools) from a base LLM so they bootstrap ever-harder tasks and significantly improve reasoning without external data →read the paper

  • Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning – formalize RL for LLM agents via an extended MDP view and provide a modular framework that makes RL-based agent training easier to adapt across environments →read the paper

Model Scaling, Training Efficiency & Reasoning Computation

  • 🌟 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance (Meta) – soup category-specific expert models with non-uniform weights (SoCE), using benchmark structure to build a single averaged model that improves robustness and hits SOTA on function-calling and more; a toy sketch of the souping arithmetic follows this list →read the paper

  • 🌟 Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning (by Moonshot AI and Tsinghua) – accelerate synchronous RL rollout by leveraging shared generation patterns among identical prompts, dynamically splitting rollouts, scheduling by context similarity, and performing adaptive grouped speculative decoding, yielding 74–97% throughput gains and 75–93% long-tail latency reduction →read the paper

  • Virtual Width Networks – decouple representational “width” from backbone width to expand embedding dimension with almost constant compute, accelerating optimization and revealing log-linear virtual-width scaling trends →read the paper

  • Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models – run recurrent refinement only on predicted-hard tokens via a neural decider and LoRA-shifted objective, boosting reasoning while avoiding overthinking and keeping most tokens single-pass →read the paper
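
For intuition on Souper-Model above: the “simple arithmetic” is a weighted average over expert checkpoints. A toy sketch of that souping step – real SoCE also selects the weights per benchmark category, which this omits:

```python
# Toy non-uniform model soup: a weighted average of expert state dicts.
import torch

def soup(state_dicts: list[dict], weights: list[float]) -> dict:
    total = sum(weights)
    return {
        key: sum(w * sd[key] for sd, w in zip(state_dicts, weights)) / total
        for key in state_dicts[0]
    }

# e.g. soup([math_expert, code_expert, chat_expert], [0.5, 0.3, 0.2])
```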

Evaluation, Calibration & Text Geometry

  • Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story – link intrinsic dimension to interpretable textual properties, showing scientific writing is representationally “low-ID” while fiction and opinion are higher-ID due to humanized signals →read the paper

  • Mitigating Label Length Bias in Large Language Models – calibrate predictions at the whole-label level with normalized contextual calibration (NCC) to fix multi-token label length bias, improving F1 and confidence reliability in classification and MCQ settings →read the paper

Safety, Red-Teaming & Adversarial Creativity

  • Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs – evolve code-level jailbreak algorithms with a multi-agent, self-correcting framework (EvoSynth) that invents new attack methods and achieves a very high attack success rate on robust models →read the paper

Specialized Tasks & Applications

  • Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework – leverage decoder-only LLMs and pooled vision embeddings (ViXML) to scale XMC with text+image, outperforming prior SOTA and showing small encoders plus images can beat big text-only decoders →read the paper

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
