FOD#130: Where Is AI Heading in 2026?

Were you right a year ago? Let's see! Compare the predictions and make new ones. Plus, we share a very important shift in research

This Week in Turing Post:

  • Wednesday / AI 101 series: The state of RL

  • Friday / AI Interview: Ben Goodger on why OpenAI needs its own browser

If you like Turing Post, consider becoming a paid subscriber or sharing this digest with a friend. It helps us keep Monday digests free →

Were we and you right? Revisiting our early-2025 predictions and making new ones

Making predictions, especially about the future, is famously tricky yet remains a favorite year-end tradition. But, as Antoine de Saint-Exupéry said: "Your task is not to foresee the future but to enable it." 

So let’s enable it! This is our yearly tradition. What do you want AI in 2026 to be? What do you think it will be the year of?

Send us your thoughts – [email protected] – to be featured in the special Predictions Edition of Turing Post at the end of 2025!

→ OR SIMPLY REPLY TO THIS EMAIL WITH YOUR PREDICTIONS ←

And now let’s see if any of us were right 12 months ago:

Last December, we made a bold bet: 2025 would be the "Year of Inference-Time Search." We predicted a massive industry pivot from models that talk fast to models that think slow. Looking back, that prediction defined the entire year.

We also said: “We think that Google will start to dominate the scene.” December proved it correct: OpenAI declared an internal “Code Red” as Gemini 3 threatened their lead. We even canceled our Pro subscription because there is literally no reason to keep paying $200 when Google (and other models) are now so good.

Here is how our 2025 scorecard looks in general.

The Big Win: The "Thinking" Shift. François Chollet nailed it. Inference-time search now drives capabilities. The leaderboard has shifted away from parameter counts. Today, the best reasoning chains win. We are finally seeing "System 2" thinking in silicon. However, Chollet’s hope for a solved ARC-AGI benchmark proved too optimistic. We made massive progress. General intelligence remains an unsolved puzzle.
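The core of inference-time search is simple: instead of trusting one fast answer, sample several reasoning chains and let a verifier pick the winner. Here is a minimal best-of-N sketch; `generate_candidates` and `verifier_score` are hypothetical stand-ins for a real model and a real verifier:

```python
def generate_candidates(prompt, n):
    # Stand-in for sampling n reasoning chains from a model (hypothetical).
    return [f"{prompt} -> chain {i}" for i in range(n)]

def verifier_score(chain):
    # Toy verifier: just returns the chain index. A real verifier would
    # check the final answer or score the reasoning steps.
    return int(chain.rsplit(" ", 1)[-1])

def best_of_n(prompt, n=8):
    # Spend more compute at inference: sample n chains, keep the best one.
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=verifier_score)
```

The "think slow" trade is exactly this: n times the inference cost in exchange for answers that survive a verifier.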

The Reality Check: Agents Stalled. John K. Thompson correctly identified the macro timeline: AGI is nowhere near ready. His prediction of "millions of active agents," however, missed the mark. 2025 proved that building an agent is easy. Making it reliable is excruciatingly hard. We remain in the pilot phase rather than the deployment phase.

The Sleeper Hit: Efficiency. While the media chased trillion-parameter giants, real progress often came from the bottom up. Ronen Eldan, Will Schenk, and Maxime Labonne pointed early to the rise of compact, task-specialized models. And 2025 proved them directionally right: some of the most practical tools this year were small, efficient models that ran cheaply, handled math surprisingly well, and outperformed far larger systems in specific workflows. Examples include rStar-Math beating larger LLMs on reasoning tasks, Phi-3 Mini matching older frontier models on-device, and Qwen2.5-Coder outperforming bigger models in developer environments.

The Interface: The Death of Typing. swyx predicted that voice would become the default. He was directionally right. In 2025, we still type a lot, but conversation with a model became the new normal.

The Verdict: The "Big vs. Small" debate turned out to be a false dichotomy. 2025 proved the necessity of both. Massive foundation models provided the broad reasoning substrate. Nimble, inference-heavy search solved the specific, hard problems. The industry figured out how to make them work together instead of declaring a single winner. We didn't abandon scale. We learned that intelligence requires time as much as it requires data. Perhaps most significantly, this was the year we finally got used to AI.

What’s next? What will 2026 surprise us with? Send us your thoughts – [email protected] – to be featured in the special Predictions Edition of Turing Post at the end of 2025!

→ OR SIMPLY REPLY TO THIS EMAIL WITH YOUR PREDICTIONS ←

Topic 2: When everyone flew to NeurIPS, I went to Art Basel Miami to see how AI is doing in the wild (art). Why it’s good when machines hallucinate, and how much a robodog with Elon Musk’s head costs → check it out here

We are also watching/reading:

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

Survey highlight – Deep Research: A Systematic Survey

Researchers from various institutions present a survey on Deep Research (DR). The paper reviews optimization methods – prompting, supervised fine-tuning, and agentic RL – and outlines evaluation criteria and unresolved challenges to guide future DR systems →read the paper. If Deep Research is of interest, you might also like the paper “How Far Are We from Genuinely Useful Deep Research Agents?” (→read it here)

Research this week – the center of gravity is shifting

I want to make a few observations about the last week in the research world. If you felt a sudden drop in pure LLM papers, you’re not imagining things. The frontier is in a temporary holding pattern. Or maybe not so temporary – speaking of predictions – it might be that we’re finally seeing a significant shift in research effort: from LLMs being the star of the show to LLMs becoming the engine under the hood. The spotlight is moving up the stack toward world models, agents, multimodal systems, simulation loops, and the efficiency work that turns frontier models into everyday tools.

And underneath all that, you can sense another shift brewing. The field keeps bumping into the edges of the transformer recipe. Long-context tricks feel like hacks, inference costs refuse to cooperate, and most reasoning gains now come from scaffolding rather than architecture. It’s the kind of pressure that usually precedes a break. We see early hints in video models, optimization-time reasoning, memory modules, and agent papers that stretch beyond text prediction. These are silhouettes of a new blueprint for intelligence, not the blueprint itself. But it might be coming very soon. (As always, 🌟 indicates papers we recommend paying attention to.)

Research highlight

Models and General Architectures

  • 🌟 DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models – Presents an open frontier model family that combines sparse attention, scaled RL, and large-scale agentic task synthesis to rival proprietary systems on reasoning and tool-use benchmarks →read the paper

  • 🌟 Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction – Builds a full-stack ecosystem for training agent models across diverse, hierarchical environments, tying together complexity, diversity, and real-world fidelity into one scalable training loop →read the paper 

  • TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models – Constructs a truly unified visual latent space where the same representation supports understanding and generation for images and video, showing that shared encoders can boost both sides when trained jointly →read the paper

  • Vision Bridge Transformer at Scale – Instantiates large Brownian bridge models as a transformer that directly links input and output latents, enabling efficient, conditional image and video translation without the usual noise-to-data detour →read the paper

  • TV2TV: A Unified Framework for Interleaved Language and Video Generation – Couples a language tower and a video tower so the model can “think in text” between visual segments, interleaving tokens and frames to achieve better controllability and long-horizon video reasoning →read the paper

  • 🌟 SIMA 2: A Generalist Embodied Agent for Virtual Worlds – Extends a foundation model into an embodied agent that can follow goals, converse, and learn new skills across many 3D environments, exploring what a generalist “game-world worker” looks like in practice →read the paper

  • WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning – Designs an agent with episodic, semantic, and visual memories that can adaptively retrieve across multiple temporal scales, pushing long-video QA beyond text-only summaries →read the paper
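Several of the models above lean on sparse attention to keep frontier-scale inference affordable. The basic idea, sketched here with NumPy, is to score all keys but softmax over only the top-k; this is a generic illustration of the concept, not DeepSeek-V3.2’s actual mechanism:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Toy top-k sparse attention for a single query vector q.

    Scores every key, but the softmax and the value mix use only the k
    highest-scoring keys, shrinking the effective attention footprint.
    """
    scores = K @ q / np.sqrt(q.shape[0])         # similarity to each key
    idx = np.argpartition(scores, -k)[-k:]       # indices of the k best keys
    w = np.exp(scores[idx] - scores[idx].max())  # numerically stable softmax
    w /= w.sum()
    return w @ V[idx]                            # weighted mix of k values
```

With long contexts, k stays small while the sequence length grows, which is where the savings come from.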

Agentic RL, Alignment & Deep Research Systems

  • Stabilizing Reinforcement Learning with LLMs: Formulation and Practices – Formalizes when token-level surrogate rewards faithfully approximate sequence-level objectives, explaining why techniques like importance sampling, clipping, and routing replay actually stabilize LLM RL in practice →read the paper

  • 🌟 Guided Self-Evolving LLMs with Minimal Human Supervision – Introduces a challenger–solver loop where small amounts of grounded human data guide large-scale synthetic evolution, showing how to push math and reasoning skills without the usual catastrophic drift →read the paper

  • ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning – Turns reward models into agents that call tools, crop images, and fetch pages to justify their scores, improving multimodal alignment by forcing the reward to actually look at evidence →read the paper

  • 🌟 retrainZero: Reinforcement Active Pretraining – Extends RL from narrow post-training into the pretraining stage, letting “reasoners” actively choose spans from raw corpora to predict and breaking the assumption that reasoning requires pre-verified labels →read the paper

  • Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach – Treats test-time scaling for embodied VLAs as an anti-exploration problem, selecting action chunks with pseudo-count estimates to stabilize behavior without expensive RL updates →read the paper

  • 🌟 Rectifying LLM Thought from Lens of Optimization – Reinterprets chain-of-thought as an optimization trajectory and defines process-level rewards that favor stable, efficient reasoning, improving RLVR pipelines while reducing overthinking and flailing →read the paper

  • 🌟 On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral – Diagnoses why GRPO-style search-integrated RL collapses by tracking likelihood drift over time, then proposes a lightweight regularizer that preserves likelihood and rescues tool-augmented training runs →read the paper

  • SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment – Uses stable rank of hidden states as an internal, supervision-free signal for quality, then plugs it into GRPO to align models with their own representational geometry instead of external reward models →read the paper

  • SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling – Decomposes math problems into sub-tasks and routes test-time compute selectively, spending “System 2” effort only on hard bits to improve accuracy while cutting total tokens →read the paper
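The stable-rank idea from SR-GRPO is easy to compute in isolation: it is the squared Frobenius norm of a hidden-state matrix divided by its largest squared singular value. A minimal sketch (how the paper actually wires this into GRPO will differ):

```python
import numpy as np

def stable_rank(H):
    """Stable rank of a matrix H: ||H||_F^2 / sigma_max(H)^2.

    A supervision-free geometric signal: values near 1 mean the hidden
    states collapse onto one direction; higher values mean the
    representation spreads energy across more directions.
    """
    fro_sq = np.sum(H ** 2)                # squared Frobenius norm
    sigma_max = np.linalg.norm(H, ord=2)   # largest singular value
    return fro_sq / sigma_max ** 2
```

The appeal as a reward is that it needs no external judge: it reads quality off the model’s own representational geometry.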

Multimodal Generation, Vision & Inference-time Control

  • 🌟 Glance: Accelerating Diffusion Models with 1 Sample – Splits denoising into semantic and cleanup phases with separate LoRA experts, showing that a base model plus tiny adapters can deliver big speedups with almost no retraining or quality loss →read the paper

  • UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers – Analyzes why diffusion transformers fall apart at ultra-high resolutions and introduces frequency corrections plus adaptive attention sharpening to reach up to 6K×6K generation without repeating patterns →read the paper

  • 🌟 Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation – Treats the prompt itself as something to scale at inference, iteratively revising text based on verifier feedback across samples to push alignment beyond just more steps or more seeds →read the paper
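Treating the prompt as the thing to scale at inference time boils down to a generate–verify–revise loop. Here is a toy sketch under assumed `generate` and `verify` callables (the paper’s concrete revision strategy will differ):

```python
def refine_prompt(prompt, generate, verify, rounds=3):
    """Toy inference-time prompt-scaling loop: sample, score with a
    verifier, fold the verifier's feedback back into the prompt, and
    keep the best-scoring sample across rounds."""
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        sample = generate(prompt)
        score, feedback = verify(prompt, sample)
        if score > best_score:
            best, best_score = sample, score
        # Revise the prompt itself rather than just resampling more seeds.
        prompt = f"{prompt}\n[verifier feedback: {feedback}]"
    return best, best_score
```

The contrast with plain best-of-N is that compute goes into improving the text of the prompt, not just drawing more samples from the same one.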

Efficiency, Scaling, Quantization & Developer Infrastructure

  • 🌟 SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs – Uses a distilled model to predict which attention heads and tokens actually matter, pruning retrieval and overlapping KV-cache movement with compute to push long-context throughput on limited hardware →read the paper

  • WUSH: Near-Optimal Adaptive Transforms for LLM Quantization – Derives data-aware linear transforms that minimize quantization error for joint weight–activation blocks, turning a Hadamard-like backbone into a provably near-optimal, still-efficient transform →read the paper

  • 🌟 CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning – Lets an RL loop auto-design HGEMM kernels guided purely by runtime rewards, systematically beating heavily hand-optimized baselines like cuBLAS in both offline and server settings →read the paper

  • 🌟 PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing – Embeds a multi-agent assistant directly inside LaTeX editors, orchestrating tools for search, review, patching, and diffing so that academic writing workflows become continuous, in-editor agent collaborations →read the paper

That’s all for today. Thank you for reading! It’s shorter today due to Columbus Day. Hope you can also have some time off. Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
