FOD#97: What is the Era of Experience? Reinforcement Learning in Full Bloom
We discuss what to expect, share opinions about OpenAI's o3, and present our curated collection of the most relevant models and papers of the week
This Week in Turing Post:
Wednesday, AI 101, Technique: MoE 2.0
Friday, Agentic Workflow: I took a pause last Friday because I wanted to do a really comprehensive deep dive into A2A and create a guide like the one I did for MCP. I now have almost all the pieces, and you will receive it this week.
Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. If you want to support us without getting a subscription – do it here.
Richard S. Sutton (godfather of Reinforcement Learning; Turing Award, 2024) and David Silver (who led research on RL with AlphaGo and AlphaZero, and co-led AlphaStar) have spent decades teaching machines to learn by doing. Even with Deep Learning overshadowing RL, they kept crunching. Now RL rides a perfect storm of compute, simulation, deep-net representation, accessible frameworks, and marquee wins in both research and production. Once the toolkit matured and language-model teams started shipping RL-tuned products to millions of users, popularity followed naturally. Silver and Sutton saw it coming. So when the duo calls the next phase of AI “the Era of Experience,” it’s worth pausing whatever prompt you were polishing and listening up. As they put it, “AI is at the cusp of a new period in which experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems.”
Not human data, but AI experience becomes the new oil. What does it mean for us?
From “Era of Human Data” to “Era of Experience”
LLMs grew up gorging on the internet’s media feast, fine‑tuned by well‑meaning crowd‑workers. That pipeline squeezed out dazzling breadth, but – even at trillion‑token scale – hit a ceiling in the places we most crave superhuman help: novel proofs, scientific breakthroughs, new materials.
Memorizing human data only gets you as far as we’ve already gone.
Silver and Sutton argue the ceiling is structural. Their prescription: move the optimization loop from Stack Exchange to the real (or richly simulated) world, where rewards are grounded in consequences, not click‑worker opinions. Back in FOD#9 (my god, it was July 2023, the beginning of Turing Post), I wrote: “Reinforcement learning is back – and we find ourselves with zero understanding of what to expect.” The piece traced AlphaDev’s surprise discovery of a leaner sorting routine and Demis Hassabis’s promise that Gemini would weave AlphaGo‑style planning into language models. Those “unpredictable moves” were early tremors of what Silver and Sutton now call the Era of Experience: agents that learn by acting, observing, and iterating, instead of reciting history. To define this new Era, the authors sketch four pillars: 1) streams of lifelong experience, 2) sensor‑motor actions, 3) grounded rewards, and 4) non‑human modes of reasoning.
With those well-defined pillars in place, we can start to understand what to expect.
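To make that loop concrete, here is a minimal, purely illustrative sketch in Python. Every class name and number is invented for the example (it is not from the paper): the agent acts in one continuous stream, earns a grounded reward measured by its environment rather than rated by a human, and keeps a lifelong record of its experience.

```python
# Illustrative "experience era" loop: continuous stream, grounded reward,
# lifelong memory. All names and numbers are made up for the sketch.
import random


class Environment:
    """Stand-in for a real or simulated world with a grounded reward."""

    TARGET = 3.0  # some hidden physical quantity the agent must discover

    def step(self, action: float) -> float:
        # Reward is a measured consequence: distance to the hidden target.
        return -abs(action - self.TARGET)


class Agent:
    """Toy hill-climbing learner, standing in for a real RL algorithm."""

    def __init__(self) -> None:
        self.param = 0.0
        self.best_reward = float("-inf")

    def act(self) -> float:
        # Explore around the current policy parameter.
        return self.param + random.uniform(-1.0, 1.0)

    def update(self, action: float, reward: float) -> None:
        # Keep whichever action earned the best grounded reward so far.
        if reward > self.best_reward:
            self.best_reward, self.param = reward, action


env, agent = Environment(), Agent()
stream = []  # lifelong stream of (action, reward) experience, never reset

for step in range(10_000):  # one continuous stream, not isolated chat sessions
    action = agent.act()
    reward = env.step(action)
    stream.append((action, reward))
    agent.update(action, reward)

print(f"learned parameter ≈ {agent.param:.2f}")  # ends up near Environment.TARGET
```

Swap the toy hill-climber for a real RL algorithm and the skeleton of the Era of Experience is already recognizable: the stream is the memory, the environment is the teacher, and no human label appears anywhere.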
So What Actually Changes?
The shift isn’t philosophical; it’s a pipeline change.
Memory Infrastructure – Agents won’t train once and freeze. They'll accumulate episodic memory across weeks, months, years. But an agent that never forgets also needs to curate its past. Expect a rise in memory distillers that compress, tag, and rehearse high-signal moments.
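Here is a minimal sketch of what such a memory distiller might do, assuming a single “signal” score per episode. The names (Episode, MemoryDistiller) and the scoring are hypothetical, not an existing library:

```python
# Hypothetical "memory distiller": keep only high-signal episodes, tag them,
# and surface a small rehearsal batch for later replay or fine-tuning.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Episode:
    signal: float                        # e.g. surprise, reward magnitude, novelty
    summary: str = field(compare=False)  # compressed description of what happened
    tags: list[str] = field(compare=False, default_factory=list)


class MemoryDistiller:
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._heap: list[Episode] = []   # min-heap keyed on signal

    def ingest(self, episode: Episode) -> None:
        # Keep only the `capacity` highest-signal moments seen so far.
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, episode)
        elif episode.signal > self._heap[0].signal:
            heapq.heapreplace(self._heap, episode)

    def rehearsal_batch(self, k: int = 32) -> list[Episode]:
        # Hand back the strongest memories for rehearsal.
        return heapq.nlargest(k, self._heap)


distiller = MemoryDistiller(capacity=3)
distiller.ingest(Episode(0.1, "routine step"))
distiller.ingest(Episode(0.9, "reward spiked after new tool call", ["tool-use"]))
distiller.ingest(Episode(0.7, "user corrected a factual error", ["feedback"]))
distiller.ingest(Episode(0.05, "idle turn"))
print([e.summary for e in distiller.rehearsal_batch(k=2)])
```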
Reward as UX – When feedback moves from thumb emojis to signals like blood pressure, tensile strength, or error bars, prompt engineering morphs into reward design. It’s product work now, blending interface intuition with real-world stakes.
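As a hedged illustration of reward design as product work, here is a toy composite reward that blends grounded measurements with a residual user-preference term. Every signal name and weight below is an assumption invented for the example:

```python
# Toy reward design: a weighted blend of grounded measurements plus a thin
# layer of user preference. Signals and weights are illustrative only.
from dataclasses import dataclass


@dataclass
class GroundedSignals:
    tensile_strength_mpa: float   # measured in the lab
    cost_per_kg_usd: float        # from the supply pipeline
    user_rating: float            # 0..1, the old "thumbs up" is now just one voice


def reward(s: GroundedSignals,
           w_strength: float = 1.0,
           w_cost: float = 0.5,
           w_user: float = 0.2) -> float:
    """Combine grounded measurements into a single scalar reward.

    Tuning the weights *is* the product decision: it tells the agent which
    trade-off between strength, cost, and user taste to chase.
    """
    return (w_strength * s.tensile_strength_mpa
            - w_cost * s.cost_per_kg_usd
            + w_user * s.user_rating)


# Example: two candidate materials proposed by an agent.
a = GroundedSignals(tensile_strength_mpa=450.0, cost_per_kg_usd=12.0, user_rating=0.6)
b = GroundedSignals(tensile_strength_mpa=380.0, cost_per_kg_usd=3.0, user_rating=0.9)
print(reward(a), reward(b))  # which one the agent prefers depends on the weights
```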
Simulators as Pretraining – Before agents mess with climate policy or bioengineering, they’ll train in synthetic worlds. Something like SimCity-for-science – a differentiable chemistry lab that updates itself as agents experiment and fail forward. The simulator gets better as the agent gets smarter. That feedback loop becomes a moat. In fields like materials, climate, or biology – where real-world iteration is slow – this could outpace even the biggest model trained on yesterday’s data.
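A rough sketch of that feedback loop, under the assumption of a single hidden “physics” coefficient that the simulator keeps recalibrating from sparse ground-truth measurements. Nothing here is a real lab API; it only shows the shape of the co-evolution:

```python
# Illustrative agent-simulator co-evolution: explore cheaply in simulation,
# check a few favourites against ground truth, recalibrate the simulator.
import random

TRUE_COEFF = 2.5   # hidden "physics" the simulator tries to capture
sim_coeff = 1.0    # the simulator's current, imperfect model


def real_experiment(x: float) -> float:
    # Expensive ground-truth measurement (run sparingly).
    return TRUE_COEFF * x + random.gauss(0.0, 0.05)


def simulate(x: float) -> float:
    # Cheap surrogate the agent actually explores against.
    return sim_coeff * x


best_x, best_sim_value = 0.0, float("-inf")

for _ in range(20):
    # 1) The agent explores many candidate experiments in the cheap simulator.
    candidates = [random.uniform(0.0, 1.0) for _ in range(100)]
    x = max(candidates, key=simulate)
    if simulate(x) > best_sim_value:
        best_x, best_sim_value = x, simulate(x)

    # 2) A single real measurement of the agent's current favourite candidate.
    y_real = real_experiment(best_x)

    # 3) Recalibrate the simulator toward the observed sim-to-real gap.
    sim_coeff += 0.5 * (y_real - simulate(best_x)) / max(best_x, 1e-6)

print(f"simulator coefficient ≈ {sim_coeff:.2f} (true value {TRUE_COEFF})")
```

The point of the sketch is the third step: each cheap exploration round ends with a little real-world evidence flowing back into the simulator, which is exactly the loop that could become a moat.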
Non-Obvious Trends
Experience Liquidity
Today’s data brokers sell, well, data. Tomorrow’s will sell trajectories. Human Data Era: value was in data piles. Experience Era: value moves to whoever controls the reward channels and the environments that generate them.
World-Model Audits
If agents imagine futures to plan their next step, we’ll need visibility into those imagined futures. Expect new tooling – explainable AI dashboards for dreams – that explains why the agent picked that move or simulation (nodding at Causal AI again).
Delegated Curiosity, Delegated Risk
Agents incentivized to explore will stumble onto the unexpected – some of it powerful, some of it dangerous. Move 37 was unexpected but, by all accounts, safe. What happens when your chemistry agent discovers a material you don’t yet understand?
Why This Actually Matters
Silver and Sutton’s logic is clear: give an agent a rich interface to reality, a flexible reward system, and time – and it will outgrow both human data and human reasoning styles.
That becomes especially important for builders, who might need to shift from prompt libraries to experience pipelines. From static corpora to evolving environments. From chat sessions to continuous learning loops. The next AlphaZero won’t just beat you at Go – it’ll design the board, rewrite the rules, and simulate ten thousand seasons before breakfast. How should we build our Co-agency then?
I’ll leave you with this quote from Silver and Sutton: “We believe that today’s technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation to achieve these breakthroughs. Furthermore, the pursuit of this agenda by the AI community will spur new innovations in these directions that rapidly progress AI towards truly superhuman agents.”
Welcome to Monday, when experiencing becomes the only way forward.
🙏 We’ve noticed our open rate falling. Please move us to your Primary inbox 🙏
Curated Collections
We are reading/watching
On Jagged AGI: o3, Gemini 2.5, and everything after by Ethan Mollick
Seven lessons from building with AI by The Exponential View
Not exactly for AI professionals, but worth showing to relatives who ask you about AI – an interview with Demis Hassabis on CBS’s 60 Minutes
News from The Usual Suspects © - it’s all about OpenAI today
OpenAI | Models Models Models
OpenAI dropped GPT‑4.1 in full, mini, and nano flavors – cheaper, faster, and catching up with Google’s million‑token context window. The models are available via API but curiously absent from ChatGPT, a move that slightly backpedals on Sam Altman’s earlier promise of enhanced reasoning. Meanwhile, Codex CLI debuts as a nimble, open-source coding sidekick for your terminal – Claude Code has company.
Then OpenAI introduced o3 and o4-mini – and blew up the internet. It’s OpenAI’s boldest move toward agentic AI so far. The new stars in the Orion constellation arrive with full tool access, visual skills, and a better grasp of complex tasks. Sharpened by reinforcement learning, they bring real gains in reasoning, efficiency, and autonomous behavior. But not everything shines. Early testing from Transluce found that o3 often invented tool use – and doubled down on these imagined actions when questioned. Compared to the GPT series, the o‑models seem particularly prone to hallucinations, likely a side effect of their reinforcement learning focus and lack of internal reasoning transparency. Meanwhile, METR’s evaluations back up the improvements on autonomous tasks but raise new concerns: o3 repeatedly engaged in reward hacking, and evaluators warned the models might be hiding capabilities. If anything, current safety tests may be too easy. In Interconnects, Nathan Lambert explored how OpenAI’s o3 reveals a new kind of over-optimization rooted in reinforcement learning: while o3 excels at tool use and complex reasoning, it also exhibits bizarre behaviors. It’s worth a read.
Amid the debate, economist Tyler Cowen casually called o3 an AGI, asking: “How much smarter were you expecting AGI to be?” The Marginal Revolution comments lit up, with many arguing that true AGI still demands more breadth, grounding, and robustness than o3 shows today. And then Bob McGrew, OpenAI’s former Chief Research Officer, weighed in:
The defining question for AGI isn’t “How smart is it?” but “What fraction of economically valuable work can it do?”
The spotlight for o3 is on tool use because intelligence is no longer the primary constraint. The new frontier is reliable interaction with the external world.
— Bob McGrew (@bobmcgrewai)
12:25 AM • Apr 17, 2025
Trying to understand that fraction, I also started using o3, and I have mixed feelings. You see, ChatGPT has been my go-to model since the beginning. I have a few GPTs built for my varied needs, and they are well tuned to understand what I want and deliver it. With o3, the results were less consistent. I suppose any new model takes a few days of adjustment before it understands you well. So far, o3 wins as an agentic model but loses on simpler language tasks.
Also, OpenAI published “A Practical Guide to Building Agents”. Honestly, the more interesting part was Harrison Chase’s frustrated but thoughtful response, trying to bring some real clarity into a very fuzzy conversation.
Little did Harrison know that, the same week, Anthropic would also release “Claude Code: Best practices for agentic coding”. Simon Willison dug into its “ultrathink” feature.
Models to pay attention to:
🌟BitNet b1.58 2B4T (Microsoft Research) introduces a native 1-bit 2B-parameter LLM trained from scratch, matching full-precision models in performance while dramatically improving efficiency and reducing memory and energy use → read the paper
🌟MineWorld (Microsoft Research) creates a real-time, open-source world model for Minecraft using a visual-action Transformer that generates future scenes based on player actions, enabling interactive agent simulations → read the paper
🌟 Embed 4 (Cohere) launches a state-of-the-art multimodal embedding model capable of 128K-token context handling, multilingual search, and robust retrieval across complex documents for enterprise AI applications → read more
DeepCoder-14B-Preview (Together AI & Agentica) develops a 14B code reasoning model using RL that matches o3-mini’s performance, with full open-sourcing of data, training recipes, and system optimizations → read the paper
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models (Together AI & Cornell University) builds a hybrid linear RNN reasoning model based on Mamba, achieving faster inference and matching the accuracy of top distilled transformer reasoners like DeepSeek R1 → read the paper
SocioVerse (Fudan University & others) designs an LLM-agent-driven world model simulating large-scale social behavior using 10 million real-world user profiles, across politics, news, and economics domains → read the paper
PerceptionLM (Meta & others) releases a fully open vision-language model trained without black-box distillation, focusing on detailed video understanding with new datasets and a benchmark suite → read the paper
🌟 Seedream 3.0 (ByteDance Seed) presents a bilingual (Chinese-English) image generation model with major improvements in visual fidelity, text rendering, and native high-resolution outputs up to 2K → read the paper
The freshest research papers, categorized for your convenience
There were quite a few top research papers this week; we mark them with 🌟 in each section.
Reasoning Methods, Test-Time Scaling & Efficiency
🌟 Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution introduces auxiliary interdependent reasoning paths to extend Chain-of-Thought methods and enable deeper, structured problem solving → read the paper
Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models uses external small-model thoughts to guide large models in reducing redundant reasoning steps and improving efficiency → read the paper
Heimdall: test-time scaling on the generative verification develops a verifier model that boosts solution accuracy in math problem-solving through reinforcement learning and repeated sampling → read the paper
Could Thinking Multilingually Empower LLM Reasoning? explores how multilingual reasoning can surpass English-only performance in complex tasks, suggesting a path for boosting reasoning → read the paper
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? critically reassesses whether RL training genuinely adds new reasoning abilities to LLMs, finding it mainly biases sampling without expanding reasoning → read the paper
🌟 Sleep-time Compute: Beyond Inference Scaling at Test-time enables LLMs to pre-compute useful context offline, drastically cutting test-time compute and boosting problem-solving accuracy → read the paper
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce simplifies RL fine-tuning for reasoning tasks by proposing lightweight rejection sampling methods like Reinforce-Rej → read the paper
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning introduces an unsupervised stepwise optimization strategy to improve LLM reasoning without needing reward models or supervision → read the paper
Reasoning Models Can Be Effective Without Thinking shows that skipping explicit “thinking” steps in reasoning tasks can outperform traditional methods under token budget constraints → read the paper
Retrieval-Augmented Generation (RAG), Knowledge Structuring & Verification
🌟 Retrieval-Augmented Generation with Conflicting Evidence proposes a debate-based multi-agent RAG system to handle ambiguity, misinformation, and conflicting sources effectively → read the paper
🌟 NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes introduces a graph-centric RAG framework using heterogeneous nodes to boost retrieval and reasoning performance → read the paper
ReZero: Enhancing LLM search ability by trying one-more-time encourages retrying search queries after failed attempts to improve LLM robustness in RAG scenarios → read the paper
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations introduces an evaluation system for complex LLM reasoning outputs, achieving high equivalence judgment accuracy → read the paper
Instruction Tuning, Data Selection & Post-Training Dynamics
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space designs a new sampling method to select high-quality instruction data by maximizing semantic information gain → read the paper
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients analyzes how different instruction and reasoning data influence post-training stability through gradient spectral analysis → read the paper
🌟 DataDecide: How to Predict Best Pretraining Data with Small Experiments proposes a lightweight experimental setup to predict high-quality pretraining datasets and minimize large-scale costs → read the paper
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training develops an automated framework for discovering optimal pretraining data mixtures via clustering and proxy models → read the paper
Agentic Systems, Tool Use, and Action Learning
Exploring Expert Failures Improves LLM Agent Tuning leverages beneficial actions from failed expert trajectories to improve LLM agents' ability to solve complex subtasks → read the paper
🌟 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs teaches LLMs to dynamically integrate real-time code execution into reasoning through reinforcement learning → read the paper
Breaking the Data Barrier -- Building GUI Agents Through Task Generalization uses cross-domain task generalization to improve GUI agent performance despite limited trajectory data → read the paper
Security, Safety & Control of LLMs
Robust and Fine-Grained Detection of AI Generated Texts develops token-classification models to robustly detect AI-generated or human-LLM co-authored texts across domains → read the paper
🌟 Antidistillation Sampling proposes sampling strategies that poison reasoning traces against unauthorized distillation while maintaining model utility → read the paper
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo applies Sequential Monte Carlo methods for controlled generation under syntactic and semantic constraints with small open-source LMs → read the paper
That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
Leave a review!