FOD#103: The Paper You Missed, and maybe the one we’ll be quoting a year from now
we discuss ATLAS – Google’s latest memory architecture that may reshape how models learn from recent experience
This Week in Turing Post:
Wednesday, AI 101/Concept: let’s discuss Meta Learning
Friday, we continue our Agentic Workflow series with one fascinating development
Our news digest is always free. Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, Datadog, plus AI labs such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what’s going on with AI →
Every other week, it seems, we hear about another "Transformer-killer" or a new "efficient architecture" promising to solve the long-context puzzle that still trips up even our biggest models. Most fade quickly, but occasionally a paper lands that feels different – not just another tweak but a rethink. Google Research's “Atlas: Learning to Optimally Memorize the Context at Test Time,” which quietly appeared on arXiv last week (and got a modest 18 upvotes on huggingface/papers), might just be one of those. It wouldn't be the first time Google research went overlooked at the start.
As we all know: Transformers are amazing, but their attention mechanism, which looks at every token pair, gets quadratically expensive. This means feeding them a whole book or a long transcript is either eye-wateringly slow or just plain impossible. Sure, we've seen a lot of contenders – RetNet, RWKV, Mamba, and even Google's own Titans – trying to chip away at this with clever recurrence or state-space tricks. They're faster, often much faster, but when it comes to truly understanding and recalling information from super-long sequences, many still hit a wall.
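To make that quadratic cost concrete, here is a toy NumPy sketch of vanilla single-head attention (our illustration, not code from any of these papers). The function name and shapes are ours; the point is the (n, n) score matrix that every Transformer layer materializes:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention. The (n, n) score matrix is the bottleneck:
    memory and compute grow with n**2 in the sequence length."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, n) -- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)
# the intermediate score matrix already holds n*n = ~1M floats at n=1024;
# at n=1M tokens it would hold 10^12 -- hence the hunt for alternatives
```

Recurrent and state-space contenders replace that n-by-n matrix with a fixed-size state, which is exactly where the "recall over super-long sequences" trade-off comes from.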
So, what's Atlas bringing to the table? Let's get a bit nerdy. The core idea, as I see it, is to treat the model's memory not as a passive bucket you just stuff information into, but as an active, optimizable component, especially during inference. Instead of the memory just reacting to the last token it saw, Atlas proposes a way for the memory to look back at a window of recent tokens and intelligently decide what's important to keep and how.
This is where their Omega rule comes in. What is it? → Most recurrent models update their memory based only on the current input. The Omega rule says, "Hold on, let's look at the last 'c' tokens (say, the last 50) and optimize the memory state based on all of them together." This allows the model to learn to memorize "context" rather than just isolated facts (like updating your understanding based on recent, broader experience, not just the last thing that happened). The paper shows this approach making a real dent, particularly on benchmarks like BABILong, where they're reporting strong performance on sequences stretching out to a staggering 10 million tokens (which is truly epic).
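As a hypothetical sketch of that idea (not the paper's exact formulation – we assume a simple linear associative memory M where M @ key should reconstruct value, and the function name, window size c, and learning rate are all ours), the windowed update looks like a gradient step on a loss over the last c tokens jointly, rather than on the latest token alone:

```python
import numpy as np

def omega_update(M, keys, values, c=50, lr=0.1):
    """Illustrative Omega-rule-style step: optimize memory M against a
    reconstruction loss over the last c (key, value) pairs together,
    instead of reacting only to the newest pair."""
    ks = keys[-c:]               # (w, d) sliding window of recent keys
    vs = values[-c:]             # (w, d) matching recent values
    err = ks @ M.T - vs          # residuals across the whole window
    grad = err.T @ ks / len(ks)  # gradient of mean 0.5*||M k - v||^2
    return M - lr * grad

d = 16
M = np.zeros((d, d))
keys, values = np.random.randn(200, d), np.random.randn(200, d)
for t in range(1, 201):          # stream tokens; each step re-optimizes
    M = omega_update(M, keys[:t], values[:t], c=50)
```

A purely "online" RNN update would be the c=1 special case; widening the window is what lets the memory weigh recent tokens against each other before committing.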
But there’s more to Atlas than one core idea – it introduces a collection of complementary tools:
Enhancing Memory Capacity
Atlas increases memory capacity by applying polynomial and exponential feature mappings to keys and queries. This enables the model to represent more distinct associations without expanding the size of the memory itself. The approach builds on ideas from prior work, including PolySketchFormer and extensions of Hopfield networks.
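A minimal sketch of why feature mappings buy capacity (this is our toy degree-2 polynomial lift, not the paper's exact mapping): lifting a d-dimensional key into a higher-dimensional feature space gives a linear memory many more directions in which to store distinct associations.

```python
import numpy as np

def poly_features(x):
    """Toy degree-2 polynomial feature map: constant + linear terms +
    all pairwise products. A linear memory over these features can
    separate far more key patterns than one over the raw d dims."""
    pairwise = np.outer(x, x)[np.triu_indices(len(x))]  # upper triangle
    return np.concatenate([np.ones(1), x, pairwise])

d = 8
k = np.random.randn(d)
phi = poly_features(k)
print(d, "->", phi.shape[0])  # 8 -> 45 (1 + 8 + 36 pairwise products)
```

The memory matrix itself stays the same shape per feature dimension; it is the richer key/query representation that packs in more distinguishable associations.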
Adjusting Memory More Effectively
Instead of using basic gradient descent for memory updates, Atlas incorporates the Muon optimizer, which leverages second-order information. This helps the memory module update more effectively and avoid common optimization pitfalls. The idea is similar to using more advanced optimizers during training, but here it’s applied to the memory mechanism during inference.
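For intuition, here is a sketch of the orthogonalization trick behind Muon (our illustration, not Atlas's code): replace a raw gradient matrix with an approximately orthogonal one via the cubic Newton–Schulz iteration, so the update pushes with balanced magnitude across all directions instead of being dominated by a few large singular values.

```python
import numpy as np

def orthogonalized_update(G, steps=10):
    """Muon-style step sketch: normalize G, then iterate
    X <- 1.5*X - 0.5*X @ X.T @ X, which drives the singular
    values of X toward 1 (an approximately orthogonal matrix)."""
    X = G / (np.linalg.norm(G) + 1e-8)  # keep spectral norm < sqrt(3)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.randn(16, 16)          # pretend gradient of the memory module
U = orthogonalized_update(G)
# singular values of U are pushed toward 1, so no single direction
# dominates the memory update the way it can with raw gradient descent
```

Plain gradient descent would apply G directly; the iteration above is one cheap, matrix-multiply-only way to inject the second-order-like conditioning the paper leans on.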
Revisiting Transformer Design
The paper also presents DeepTransformers and Dot (Deep Omega Transformers), which extend the Transformer architecture by replacing fixed attention with a learnable memory module governed by the Omega rule. These models are framed as generalizations of the original Transformer, suggesting that attention can be viewed as one instance of a broader memory-based formulation.
This is a dense paper, pulling together threads from associative memory theory, optimization, and architectural design. It follows a lineage of work that's been questioning the "online" nature of RNN updates – papers like Titans (also from Google) and TTT (Test-Time Training) have explored similar themes of dynamic, inference-time adaptation.
Now, is Atlas the final word? Probably not. Its mechanisms are undoubtedly more complex than just throwing more layers at a Transformer. The real acid test will be how well these ideas scale, how easy they are for others to implement and build upon, and whether the performance gains hold up across a wider array of tasks. Many applications still live comfortably within shorter context windows.
But the direction Atlas charts feels significant. Instead of relying solely on scaling up existing paradigms, it emphasizes building smarter systems – models that actively learn how to manage and optimize their internal memory based on the task at hand, especially when faced with vast amounts of information. If the initial promise holds, this focus on dynamic, context-aware memory optimization could be a crucial step towards AI that can not only process but truly comprehend and utilize the massive datasets that define our modern world. The era of simply scaling attention may be evolving, giving way to a more nuanced exploration of how AI learns, remembers, and reasons. Atlas, with its deep dive into optimally memorizing context, looks like a compelling chapter in that ongoing story.
Welcome to Monday. Do you follow the Omega rule in your life?
Curated Collections
The following collection is a great addition to our recent article What Is MCP, and Why Is Everyone – Suddenly! – Talking About It? →
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
We are reading/watching
From Typewriters to Transformers: AI is Just the Next Tools Abstraction by Steven Sinofsky
Trends – Artificial Intelligence by legendary Mary Meeker (it’s worth browsing through – but prepare yourself for 340 pages of (mostly) graphs)
Watch this, it’s fun :)
News from The Usual Suspects ©
Hugging Face Grows Arms and Legs
Hugging Face has introduced two open-source humanoid robots – HopeJR (starts at $3,000), a full-size bipedal bot with 66 degrees of freedom, and Reachy Mini (starts at $300), a desktop companion with conversational skills. Built after the acquisition of Pollen Robotics, these bots aim to democratize robotics by staying affordable and transparent. Very exciting to see this development.
Anthropic Open-Sources Circuit Tracing Tools
In collaboration with Decode Research, they released an open-source library to generate attribution graphs that reveal internal reasoning paths of LLMs. These graphs support models like Gemma-2-2b and Llama-3.2-1b, and are viewable via Neuronpedia’s interactive frontend. The tools allow users to modify feature values, test hypotheses, and trace circuits. The release aims to accelerate interpretability research and understanding of model behaviors across multilingual and reasoning tasks →read more in their blog or try it out
Also – in just five months, Anthropic has tripled its annualized revenue – from $1B to $3B – thanks to surging business demand for AI, especially in code generation. Wow.
Telegram & xAI: Chat Happens
Deal made in heaven: two controversial characters – Pavel Durov and Elon Musk – signed a deal to embed Grok – xAI’s signature chatbot – directly into Telegram’s app. Telegram also gets 50% of all Grok subscription revenue sold through the platform. The partnership caused TON to surge 18.5%, though it mysteriously jumped hours before the announcement. Market clairvoyance, anyone? Meanwhile, Telegram is also raising $1.5B in bonds, with BlackRock and Citadel backing the play.
Meta and Anduril Suit Up
Meta and defense-tech firm Anduril have joined forces to bring mixed reality to the battlefield. It’s less a gaming headset now, more like combat-ready technomancer gear. Their XR-powered integration will enhance soldiers’ perception and link directly into Anduril’s Lattice AI system. Privately funded and built from commercial tech, the partnership claims it could save the Pentagon billions – and make Palmer Luckey’s warfighter dreams a vivid, augmented reality. Can we please also have wars solely in the Metaverse?
Mistral Gives Agents Their Marching Orders
Mistral AI has launched its Agents API – a toolkit turning AI from passive scribe into active problem-solver. With built-in connectors for code execution, web search, and more, plus orchestration features and persistent memory, it’s built for serious enterprise-grade workflows. From coding and finance to nutrition planning, Mistral is quietly building an ecosystem of autonomous assistants that don't just talk – they do.
It's interesting how the major LLM API vendors are converging on the following features:
- Code execution: Python in a sandbox
- Web search - like Anthropic, Mistral seem to use Brave
- Document library aka hosted RAG
- Image generation (FLUX for Mistral)
- Model Context Protocol
— Simon Willison (@simonw)
2:58 PM • May 27, 2025
Models and datasets to pay attention to:
Adaptive reasoning model ARM (Fudan University and The Ohio State University) is a model that adaptively chooses among four reasoning formats (Direct Answer, Short CoT, Code, Long CoT) based on task difficulty. Trained with Ada-GRPO, a GRPO variant preventing format collapse, ARM maintains accuracy while reducing token usage by ~30% on average and up to 70%. It achieves ~2× training speedup and supports Adaptive, Instruction-Guided, and Consensus-Guided reasoning modes.
TabSTAR (Technion-IIT) is a tabular foundation model using unfrozen text encoders and target-aware tokens to produce semantically aligned, task-specific embeddings. TabSTAR supports classification and regression without dataset-specific parameters and excels in textual tabular datasets. Evaluated on 50 datasets, it outperforms GBDTs and other TFMs in classification, achieving up to 0.874 normalized AUROC. Pretraining across 350 datasets shows scaling laws, with performance increasing with more data. TabSTAR trains on a single A40 GPU in under 48 hours.
rStar-Coder (Microsoft Research Asia) is a dataset of 418K competitive-level programming problems and 580K long-reasoning solutions, all verified via diverse test cases. They curate 37.7K expert problems, synthesize 380K new ones, and use a three-step test input generation method plus mutual verification for output correctness, achieving 96.8% labeling accuracy. Models trained on rStar-Coder surpass QWQ-32B, with a 7B model scoring 57.3% on LiveCodeBench and 16.15% on USACO 2025, outperforming much larger models.
Featured Eval
🌟 REWARDBENCH 2 (Allen Institute for AI (Ai2), University of Washington, and Cohere) is a benchmark of 1,865 unseen human prompts across six domains (Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties) designed to assess reward models (RMs) via best-of-4 accuracy. REWARDBENCH 2 reduces top RM scores by ~20 points vs. its predecessor, improving difficulty and correlation with downstream performance. Evaluations of 113 RMs show strong correlation (r=0.87) with best-of-N sampling, but PPO performance depends on alignment between RM and policy model training. Congrats to our subscribers from Ai2 and Cohere! →read the paper
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟
Reasoning and Inference Optimization
WebDancer builds end-to-end agentic information-seeking agents using a structured training pipeline of browsing, sampling, fine-tuning, and RL to solve research tasks efficiently →read the paper
Let's Predict Sentence by Sentence lifts LLMs to sentence-level reasoning by predicting semantic and contextual sentence embeddings, reducing inference cost while preserving performance →read the paper
REARANK improves reranking in IR by explicitly reasoning with listwise feedback using reinforcement learning, matching GPT-4 on complex tasks →read the paper
Universal Reasoner adds a plug-and-play reasoning module to frozen LLMs, enabling composable reasoning without retraining or compromising core capabilities →read the paper
PATS adapts LLM reasoning strategy dynamically at the step level based on task difficulty, optimizing both speed and accuracy →read the paper
R2R routes only divergent reasoning tokens to large models while letting small models handle the rest, achieving speedups with minimal accuracy trade-offs →read the paper
Table-R1 applies inference-time scaling to table reasoning using distillation and RLVR, matching GPT-4-level performance with a 7B model →read the paper
MetaMind simulates human Theory of Mind via a multi-agent LLM system, achieving human-level performance in social reasoning tasks →read the paper
Multi-Domain Explainability of Preferences explains LLM preferences across domains using concept-based vectors and a white-box regression model, improving interpretability and alignment →read the paper
Training Strategies and Reinforcement Learning
Deciphering Trajectory-Aided LLM Reasoning interprets reasoning as meta-learning, showing how LLMs adapt to new tasks via inner-loop optimization mechanisms →read the paper
Advancing Multimodal Reasoning via RL with Cold Start combines supervised fine-tuning and RL to boost multimodal reasoning, outperforming both standalone methods on math/visual benchmarks →read the paper
Surrogate Signals from Format and Length uses format and response length as proxy rewards to train LLMs for math without needing correct answers, improving efficiency →read the paper
Reinforcing General Reasoning without Verifiers trains LLMs using RL without external verifiers, showing strong performance on real-world reasoning benchmarks using reward-free methods →read the paper
Learning to Reason without External Rewards trains LLMs using self-certainty as an internal feedback loop, removing the need for labeled data or reward models →read the paper
The Climb Carves Wisdom Deeper Than the Summit demonstrates LLM robustness to noisy rewards in RL, showing how key reasoning phrases can substitute for ground-truth verification →read the paper
SynLogic synthesizes logical reasoning tasks at scale for RL training, improving general reasoning across math, logic, and coding →read the paper
The Entropy Mechanism of RL for Reasoning LLMs analyzes entropy collapse in RL for LLMs and proposes techniques to preserve exploration during training, boosting performance →read the paper
Unsupervised Post-Training for MLLMs via GRPO improves multimodal reasoning without supervision by using majority voting as self-reward in a scalable RL loop →read the paper
Enigmata provides a synthetic puzzle benchmark and RL training suite to boost logical reasoning in LLMs using verifiable puzzle tasks →read the paper
Efficiency, Compression, and Acceleration
QwenLong-CPRS compresses long-context inputs using dynamic optimization and outperforms GPT-4o and Claude3 on long-context benchmarks →read the paper
Fast-dLLM adds KV cache and confidence-aware parallel decoding to Diffusion LLMs, achieving 27.6× speedups with minimal quality loss →read the paper
SageAttention2++ accelerates attention via FP8 matrix operations while preserving accuracy, outperforming FlashAttention with 3.9× speedups →read the paper
Shifting AI Efficiency from Model-Centric to Data-Centric Compression advocates token compression over parameter reduction as the future of AI efficiency, laying a new research agenda →read the paper
Adaptation and Fine-Tuning Techniques
GraLoRA introduces a granular low-rank adaptation structure that prevents overfitting in PEFT and improves performance over standard LoRA →read the paper
How Does Alignment Enhance LLMs' Multilingual Capabilities? analyzes neuron-level changes in multilingual LLMs after alignment, offering insights into spontaneous cross-lingual adaptation →read the paper
Applications and Systems
Paper2Poster automates academic poster creation using a multi-agent system that distills papers into structured layouts with minimal cost →read the paper
ZeroGUI trains GUI agents online with zero human input by using VLMs to generate tasks and assess outcomes, enabling self-sufficient GUI automation →read the paper
Discrete Markov Bridge learns discrete data representations with a novel matrix/score learning combo, outperforming baselines on text and image benchmarks →read the paper
Are Reasoning Models More Prone to Hallucination? explores how different training methods affect hallucination in LLMs and links model uncertainty to factuality errors →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
Leave a review!