FOD#128: Universe of Incredible Models
An insanely saturated week with models you actually need to know about
This Week in Turing Post:
Wednesday / AI 101 series: What is Continual Learning? (Including Nested Learning)
Friday / Interview: Carina Hong at Axiom Math
Last week’s most popular deep dives:
AIE CODE Summit Recap: State of AI Coding: Context, Trust, and Subagents with Dex Horthy, Michele Catasta, Steve Yegge, Beyang Liu, Katelyn Lesse, Kath Korevec, Nik Pash, Eno Reyes, Max Kanat-Alexander, Kevin Hou and others
Our news digest is always free. Click on the partner’s link to support us or Upgrade to receive our deep dives in full, directly into your inbox →
Attention Span: if last year was the year of prompt engineering, this year is the year of context engineering and context management. Because it’s a real pain in the neck. Why and what to do about it? At the AI Engineer Code Summit I cornered a few builders from Replit, Cursor, HumanLayer, Sourcegraph, MiniMax, Block, and Google DeepMind and asked two simple questions: Why is the context window such a nightmare, and how do you work around it? Watch it here →
Editorial: Fight, Models, Fight
If this were the only week of model releases, it would already cover 2-3 months of news flow. Yet somehow we got it all in seven days. Open source, closed source, some gravity, some antigravity. While human supermodels were strutting the stage last week, big labs were shipping their supermodels as if also competing for Miss Universe. Coincidence? I don’t think so. Supermodel wars are here in both realms.
Let’s break down what arrived, what each model is actually good for, and how to pick the right one for the job.
Let’s start with open source, because those teams are crushing it for our benefit.
Olmo 3: Open source gets a full training storyboard
AI2 did something the big labs still refuse to do. They shipped not only the weights but the full flow: data, code, checkpoints, and the evaluation pipeline. Every stage, from pretraining to RL to post-training checkpoints, is public. Bravo and thank you.
Why it matters:
– You can see how reasoning emerges stage by stage
– You can fork at any point: mid-training, RL, post-training
– 32B Think variant is competitive with Qwen-scale open models on math, code, and reasoning
Olmo 3 is less a model drop and more an invitation: “Here is how to build your own RL-trained thinker without guessing.” →read the paper
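If you want to try forking, here is a minimal sketch of the starting point, assuming the standard Hugging Face transformers API; the repo id and revision name below are hypothetical placeholders, so check AI2’s release pages for the real ones.

```python
# Minimal sketch: load an Olmo 3 checkpoint to continue training from any stage.
# REPO and REVISION are hypothetical placeholders, not confirmed release names.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "allenai/Olmo-3-7B"            # placeholder repo id
REVISION = "midtraining-checkpoint"   # placeholder branch pointing at a mid-training stage

tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(REPO, revision=REVISION)

# From here you can keep pretraining, run your own RL, or post-train on your data.
```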
Today’s Release: Fara-7B: On-device web supermodel
While everyone else was flexing at the leaderboard level, Microsoft quietly shipped Fara-7B: a 7B computer-use open-weight model that drives the browser like a person.
How it works:
– It sees screenshots, not DOM trees
– It predicts clicks, scrolls, keypresses and macro actions like web_search or visit_url
– It learns from 145k synthetic trajectories produced by a multi-agent Magentic-One system, distilled into one compact model
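A minimal sketch of that observe-and-act loop, using a hypothetical action schema and browser wrapper rather than Fara-7B’s actual API:

```python
# Minimal sketch of a screenshot-in, action-out computer-use loop.
# `model` and `browser` are hypothetical wrappers, not Fara-7B's real interface.
def run_episode(model, browser, task: str, max_steps: int = 30):
    history = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()                   # pixels, not the DOM tree
        action = model.predict(task, screenshot, history)   # e.g. {"type": "click", "x": 412, "y": 88}
        history.append(action)
        if action["type"] == "click":
            browser.click(action["x"], action["y"])
        elif action["type"] == "type":
            browser.type_text(action["text"])
        elif action["type"] == "scroll":
            browser.scroll(action["dy"])
        elif action["type"] == "visit_url":                 # macro action: jump straight to a page
            browser.navigate(action["url"])
        elif action["type"] == "web_search":                # macro action: one step instead of many clicks
            browser.navigate("https://www.bing.com/search?q=" + action["query"])
        elif action["type"] == "done":
            break
    return history
```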
Benchmarks:
– 73.5 percent on WebVoyager, ahead of other 7B-class computer-use agents (CUAs)
– Strong results on DeepShop and Online-Mind2Web
– 38.4 percent on WebTailBench, which covers “boring real life” tasks like booking tickets, applying for jobs, or comparing prices
Computer-use agents are moving from “cloud experiment” to “thing that quietly runs on your laptop and books your dentist appointment.” →read their blog
P1: Open physics supermodel
P1 is the first open model that can hang with the very top tier on real physics Olympiads. It delivers gold-medal performance on IPhO 2025 and was trained with reinforcement learning on verifiable physics problems, not just nicer prompts.
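To make “verifiable” concrete, here is a minimal sketch of the kind of reward such training can use, assuming the problem reduces to a final numeric answer; P1’s actual verifier is certainly richer than this.

```python
# Minimal sketch of a verifiable reward for physics RL: 1.0 if the model's final
# number matches the reference within tolerance, else 0.0. Illustrative only.
import math
import re

def extract_final_number(solution: str) -> float | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", solution)
    return float(numbers[-1]) if numbers else None

def reward(solution: str, reference: float, rel_tol: float = 1e-2) -> float:
    pred = extract_final_number(solution)
    if pred is None:
        return 0.0
    return 1.0 if math.isclose(pred, reference, rel_tol=rel_tol) else 0.0
```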
What it is good for:
– Olympiad-level physics, math, and code, with detailed step reasoning
– Scientific assistants that must be right for concrete, numeric questions
– Agentic systems where you want a reasoning core that can be audited
This is the first time an open model seriously claims “I can do hard physics” and you do not have to squint. →read the paper
Nemotron Elastic: Many-in-one reasoning family
NVIDIA took the Matryoshka idea and applied it to hybrid Mamba–Transformer reasoning models. Train a 12B once, get 9B and 6B “nested” models out of the same weights, no extra training runs.
Why it is interesting:
– Training cost drops by an order of magnitude compared to training each size separately
– Deployment cost is constant in memory: you store the big one and slice
– All sizes keep strong performance on math and code benchmarks
In practice, this gives you a single reasoning family you can fit into everything from a laptop to a cluster →read the paper
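To make the nesting idea concrete, here is a toy sketch of slicing a smaller model out of a bigger one’s weights. The real system learns which layers, heads, and state dimensions to keep; this version just truncates a linear layer.

```python
# Toy sketch of Matryoshka-style weight slicing: the smaller model is literally
# a sub-block of the larger model's weight matrices. Illustrative, not NVIDIA's code.
import torch.nn as nn

def slice_linear(layer: nn.Linear, keep_out: int, keep_in: int) -> nn.Linear:
    small = nn.Linear(keep_in, keep_out, bias=layer.bias is not None)
    small.weight.data = layer.weight.data[:keep_out, :keep_in].clone()
    if layer.bias is not None:
        small.bias.data = layer.bias.data[:keep_out].clone()
    return small

big = nn.Linear(4096, 4096)                              # stand-in for a 12B-scale layer
medium = slice_linear(big, keep_out=3072, keep_in=3072)  # "nested" smaller slice, no retraining
```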
MiMo-Embodied: One model for robots and cars
Xiaomi’s MiMo-Embodied is the first open model that performs at state-of-the-art on both embodied robotics benchmarks and autonomous driving ones.
What it can do:
– Affordance prediction and task planning for robots
– Perception, prediction, and planning for driving scenes
– Spatial reasoning across 3D environments, video, and language
The significance: the field is slowly moving away from one-model-per-vertical toward shared “embodied backbones.” Super interesting. →read the paper
Now the big labs
Today’s Release: Claude Opus 4.5: Long-horizon workhorse
Opus 4.5 is tuned for real workloads: tools, long contexts, and spreadsheets.
It adds:
– Better SWE-bench performance with fewer tokens
– Improved multi-agent orchestration inside the Claude Code ecosystem
– Stronger resistance to prompt injection
If you are building agents that live inside Office-style environments and run for hours, this is a serious contender.
GPT-5.1-Codex-Max: Long-horizon coding brain
Codex-Max is OpenAI’s new agentic coding model built for huge, multi-stage software tasks. Its standout trick is compaction: it can operate across multiple context windows as one coherent chain, letting it work over millions of tokens without losing the plot.
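A minimal sketch of what compaction can look like in practice, assuming a hypothetical chat client rather than OpenAI’s actual implementation:

```python
# Minimal sketch of context compaction: when the transcript nears the window limit,
# summarize older turns and keep working with the summary plus the recent tail.
# `client` is a hypothetical wrapper with count_tokens() and complete() methods.
def compact_if_needed(client, messages, max_tokens=180_000, keep_recent=10):
    if client.count_tokens(messages) < max_tokens:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = client.complete(
        "Summarize the work so far: decisions made, open TODOs, relevant file paths.\n\n"
        + "\n".join(m["content"] for m in older)
    )
    return [{"role": "system", "content": "Compacted history:\n" + summary}] + recent
```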
Where it fits:
– Full-repo agents that track intent across long workflows
– Large refactors that used to break model coherence
– Engineering tasks that blend coding, research, testing, and documentation
It was released before Claude Opus 4.5 but didn’t get nearly as much attention. →read their blog
Google stack: Gemini 3, Antigravity, Nano Banana Pro
Gemini 3 is the new flagship, and it’s a very impressive model. Pro is already strong, but Deep Think mode is the real story: higher GPQA and ARC-AGI-2 scores, better long chains, better tool use.
Use it when:
– You want a serious “study partner” for dense material
– You need agents that must carry context over long workflows
– You want robust multimodal reasoning across text, images, and video
Nano Banana Pro is the image model upgrade that finally treats text inside images as a first-class citizen. It is indeed very good, and if you read me you know how much trouble I had with Nano Banana before. Posters, UI mocks, packaging, coherent explainers with actually correct language – it holds up.
Use it for: infographics, storyboards, brand-consistent marketing visuals, and any “explain this concept visually” work. →read their blog
Google Antigravity is another very interesting launch. It’s not a model but an IDE reimagined for agents. It has:
– An editor view for devs
– A manager view for orchestrating parallel agents across browser, terminal, and editor
– Artifacts like plans, screenshots, and walkthroughs for verification
The standout feature is the artifacts. They matter because they give you a full, inspectable trail of what an agent actually did: plans, screenshots, step traces, browser sessions, terminal output, the whole flight recorder. This solves the classic problem of agentic coding: you get a diff, but no story.
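In data-structure terms, an artifact trail might look something like this hypothetical sketch (not Antigravity’s actual format):

```python
# Hypothetical shape of an agent's artifact trail: plan, per-step evidence, and a
# human-readable walkthrough. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str                        # e.g. "edit src/app.py", "run pytest", "open browser tab"
    output: str                        # terminal output, diff, or page title
    screenshot_path: str | None = None

@dataclass
class ArtifactTrail:
    plan: list[str]                                  # the agent's stated plan, step by step
    steps: list[Step] = field(default_factory=list)
    walkthrough: str = ""                            # the story behind the final diff
```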
If Antigravity delivers what it promises, Cursor and the like might feel it. →read their blog
But Google immediately ran into some limitations and is working around the clock to fix them.
Grok 4.1: Personality with a brain
Grok 4.1 pushes two dials at once: emotional intelligence and reasoning stability. Better EQ-Bench, better creative-writing scores, lower hallucination rate for info-seeking prompts.
Where it fits:
– Consumer-facing chat experiences
– Creative and social applications where flat tone is a bug
– Assistants that need a distinct voice without losing rigor
I’m not an active user of Grok 4.1. →read more about it on their blog
Curated Collections – Spatial Intelligence
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟.
Highlight
🌟🌟 Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models (by Stanford) – this paper flips the usual “small models just reason worse” story. The authors show that when you shrink a multimodal model, what really collapses is perception – the ability to reliably extract visual details – more than the language side. Then they fix it with EXTRACT+THINK: a two-stage pipeline where a tiny VLM is trained to explicitly dump instruction-relevant visual facts and a slightly bigger LLM reasons over them, hitting strong benchmarks with surprisingly few parameters and little data. Quite interesting →read the paper
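A minimal sketch of that two-stage split, with hypothetical wrappers around a small VLM and a larger text-only LLM:

```python
# Minimal sketch of EXTRACT+THINK: stage 1 extracts instruction-relevant visual
# facts with a small VLM, stage 2 reasons over them with a text-only model.
# `vlm` and `llm` are hypothetical generate() wrappers, not the paper's code.
def extract(vlm, image, instruction: str) -> str:
    prompt = f"List only the visual facts needed to answer: {instruction}"
    return vlm.generate(image=image, prompt=prompt)

def think(llm, facts: str, instruction: str) -> str:
    prompt = f"Facts:\n{facts}\n\nQuestion: {instruction}\nReason step by step, then answer."
    return llm.generate(prompt=prompt)

def extract_then_think(vlm, llm, image, instruction: str) -> str:
    return think(llm, extract(vlm, image, instruction), instruction)
```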
Multimodal, Spatial Intelligence & 3D / Vision
🌟 WorldGen: From Text to Traversable and Interactive 3D Worlds (by Meta) – convert text prompts into coherent, traversable 3D environments by combining LLM layout reasoning, procedural generation, and diffusion-based 3D synthesis inside standard game engines →read the paper
RoMa v2: Harder Better Faster Denser Feature Matching – redesign dense feature matching with a new architecture, loss, and curated training regime to solve harder correspondence problems faster and more robustly →read the paper
🌟 Mixture of States: Routing Token-Level Dynamics for Multimodal Generation (by KAUST and Meta) – fuse modalities in diffusion models via a token-wise router that sparsely mixes hidden states over time, matching or beating much larger models for text-to-image and editing →read the paper
Scaling Spatial Intelligence with Multimodal Foundation Models – scale a family of multimodal models (SenseNova-SI) on 8M spatially curated samples to push SOTA on multiple spatial benchmarks while preserving general multimodal performance →read the paper
Back to Basics: Let Denoising Generative Models Denoise – return diffusion to predicting clean images directly with large-patch pixel Transformers (JiT), leveraging manifold structure to get strong ImageNet results without tokenizers or extra losses →read the paper
Agentic Science, Autonomous Research & Self-Evolving Agents
OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists – embed citation graphs, collaboration protocols, and an open evaluation arena into an AI-science stack so human and AI researchers can co-evolve on shared infrastructure →read the paper
🌟 What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity (Meta) – correlate and experimentally manipulate ideation diversity in AI research agents, showing more diverse idea sets reliably improve MLE-bench performance →read the paper
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning – co-evolve a curriculum agent and executor agent (with tools) from a base LLM so they bootstrap ever-harder tasks and significantly improve reasoning without external data →read the paper
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning – formalize RL for LLM agents via an extended MDP view and provide a modular framework that makes RL-based agent training easier to adapt across environments →read the paper
Model Scaling, Training Efficiency & Reasoning Computation
🌟 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance (Meta) – soup category-specific expert models with non-uniform weights (SoCE), using benchmark structure to build a single averaged model that improves robustness and hits SOTA on function-calling and more →read the paper
🌟 Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning (by Moonshot AI and Tsinghua) – accelerate synchronous RL rollout by leveraging shared generation patterns among identical prompts, dynamically splitting rollouts, scheduling by context similarity, and performing adaptive grouped speculative decoding, yielding 74–97% throughput gains and 75–93% long-tail latency reduction →read the paper
Virtual Width Networks – decouple representational “width” from backbone width to expand embedding dimension with almost constant compute, accelerating optimization and revealing log-linear virtual-width scaling trends →read the paper
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models – run recurrent refinement only on predicted-hard tokens via a neural decider and LoRA-shifted objective, boosting reasoning while avoiding overthinking and keeping most tokens single-pass →read the paper
Evaluation, Calibration & Text Geometry
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story – link intrinsic dimension to interpretable textual properties, showing scientific writing is representationally “low-ID” while fiction and opinion are higher-ID due to humanized signals →read the paper
Mitigating Label Length Bias in Large Language Models – calibrate predictions at the whole-label level with normalized contextual calibration (NCC) to fix multi-token label length bias, improving F1 and confidence reliability in classification and MCQ settings →read the paper
Safety, Red-Teaming & Adversarial Creativity
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs – evolve code-level jailbreak algorithms with a multi-agent, self-correcting framework (EvoSynth) that invents new attack methods and achieves a very high attack success rate (ASR) on robust models →read the paper
Specialized Tasks & Applications
Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework – leverage decoder-only LLMs and pooled vision embeddings (ViXML) to scale XMC with text+image, outperforming prior SOTA and showing small encoders plus images can beat big text-only decoders →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How did you like it?
