Continual Learning in LLMs: Why AI Models Need Sleep

Today’s editorial: continual learning in LLMs, why AI models may need offline consolidation, and what “sleep” means for AI memory, agents, and catastrophic forgetting.

→ Continual Learning Is Back, and It’s About to Put Models to Sleep

By coincidence, last week was all about models and their precious sleep. On May 25, a paper from Carnegie Mellon and the University of Maryland asked: Do Language Models Need Sleep? On June 2, a paper from Google-affiliated researchers answered almost directly: Language Models Need Sleep. We can read this funny timing as a signal: continual learning is back at the center of AI research, now under a different set of pressures.

Continual learning in AI is not a new problem. In classical machine learning, it usually meant training a model on a sequence of tasks without destroying what it had already learned. A model learns task B, then suddenly becomes worse at task A. This is catastrophic forgetting, and the field spent years trying to reduce it through replay, freezing, regularization, routing, and other methods.
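To make one of those classical methods concrete, here is a minimal sketch of replay: keep a small, fixed-size memory of earlier examples and mix some of them into every batch for the new task. The class and function names (`ReplayBuffer`, `mixed_batch`) and the 50% replay fraction are illustrative assumptions, not taken from any specific paper, and the reservoir-sampling fill rule is just one common way to keep the memory representative.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen,
            # so every example seen so far is equally likely to be stored.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples, buffer, replay_fraction=0.5):
    """Combine fresh examples with replayed older ones (hypothetical helper)."""
    n_replay = int(len(new_examples) * replay_fraction)
    return list(new_examples) + buffer.sample(n_replay)


# While training on task A, store what the model sees.
buffer = ReplayBuffer(capacity=100)
for i in range(1_000):
    buffer.add(("task_A", i))

# Later, each task-B batch carries a few task-A examples along with it.
batch = mixed_batch([("task_B", i) for i in range(8)], buffer)
print(len(batch))  # 12: eight new examples plus four replayed ones
```

The trade-off is visible even in the toy version: the buffer costs memory, and the replay fraction decides how much of every update is spent protecting old behavior rather than learning the new task.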

LLMs changed the shape of the problem. Today, the question is broader: how can AI systems stay current, specialize to domains and users, learn from experience, and improve after deployment without breaking what they already know? Brutally hard.

A 2026 survey, Continual Learning in Large Language Models, gives a good map of the current field. It divides LLM continual learning into continual pre-training, continual fine-tuning, and continual alignment. In other words, a model may need to absorb new general knowledge, adapt to a specific domain or task, or adjust its behavior without losing the alignment that made it useful. The survey concludes that current methods work in limited settings, but we still do not have smooth learning across tasks and time.

So where does sleep come in?

Of course, models do not literally need sleep. What they need is an offline phase for consolidation. Constant live updating is risky, while doing nothing leaves models stale. There needs to be a phase between seeing something and changing from it. This is what the sleep metaphor is trying to capture: offline processing, when the model is not simply answering the next prompt, but organizing recent experience before deciding what should persist.

The CMU/Maryland paper looks at this from the inference side. Long context is expensive because the KV cache grows as the model attends to more tokens. Some hybrid architectures compress older context into fast weights, but the paper shows that compression alone is not enough. If the model has to reason about information it can no longer directly attend to, it needs more computation before that context is cleared. Their proposed sleep phase gives the model offline recurrent passes over recent context, and the biggest gains appear on tasks that require deeper reasoning. That is the important part: memory is not only storage, it is processing.
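A quick back-of-the-envelope calculation shows why the KV cache becomes the bottleneck. For a standard decoder-only transformer, the cache stores one key and one value vector per layer, per KV head, per token, so it grows linearly with context length. The configuration below is a hypothetical one chosen for round numbers, not the model used in the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size for a standard decoder-only transformer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * n_layers * n_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128, 16-bit values.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (8_192, 131_072):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
# Prints:
#    8192 tokens -> 1.0 GiB
#  131072 tokens -> 16.0 GiB
```

Under these assumptions every token costs 128 KiB of cache, per sequence. That is the pressure behind compressing or clearing older context, and the reason the question of what to compute before clearing it matters.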

The Google-affiliated paper moves closer to continual learning. It starts from a simple limitation: LLMs can adapt inside a context window, but that knowledge usually disappears when the session ends. Its Sleep paradigm proposes two steps. First, “Knowledge Seeding” consolidates short-term knowledge into more stable parameters. Then “Dreaming” uses model-generated synthetic data to rehearse what was recently learned. Biological terms aside, what it means is that durable learning should be separated from live interaction.

This separation may be the useful architecture for continual learning. Without it, the choices are too crude. Either the model stays mostly static and relies on retrieval, or it updates too directly and risks drift. Sleep gives researchers a third frame: the system interacts, collects experience, processes it offline, and only then decides what should remain temporary, what should become memory, and what is allowed to affect future behavior.
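As a toy illustration of that third frame, the sketch below separates an online phase that only records experience from an offline `sleep()` phase that decides what to do with it. Everything here is an assumption made for illustration: the `Experience` and `Agent` classes, the `min_occurrences` threshold, and the promotion rule are hypothetical and do not come from either paper, where consolidation involves model weights rather than a list of strings.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    text: str
    occurrences: int = 1          # how often this was observed while "awake"
    user_confirmed: bool = False  # whether the user explicitly confirmed it


@dataclass
class Agent:
    session_buffer: list = field(default_factory=list)    # raw, temporary
    long_term_memory: list = field(default_factory=list)  # survives sessions

    def interact(self, experience):
        # Online phase: only record. Durable state is never changed here.
        self.session_buffer.append(experience)

    def sleep(self, min_occurrences=3):
        # Offline phase: sort experience into persist, keep temporary, discard.
        kept_temporary = []
        for exp in self.session_buffer:
            if exp.user_confirmed or exp.occurrences >= min_occurrences:
                self.long_term_memory.append(exp.text)  # becomes memory
            elif exp.occurrences > 1:
                kept_temporary.append(exp)              # wait for more evidence
            # anything else is dropped
        self.session_buffer = kept_temporary


agent = Agent()
agent.interact(Experience("user prefers metric units", user_confirmed=True))
agent.interact(Experience("retry the API call after a timeout", occurrences=4))
agent.interact(Experience("tool X sometimes returns stale data", occurrences=2))
agent.interact(Experience("one-off typo in a prompt"))
agent.sleep()

print(agent.long_term_memory)
# ['user prefers metric units', 'retry the API call after a timeout']
print([exp.text for exp in agent.session_buffer])
# ['tool X sometimes returns stale data']
```

The point of the structure is the boundary: nothing the agent sees during interaction changes its durable state directly, and the rule inside `sleep()` can be made as strict or as sophisticated as needed without touching the live path.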

This is especially important for agents, because their experience is richer than a document stream. It includes tool calls, failed attempts, user corrections, environmental feedback, and repeated workflows. Recent agent-learning work points in the same direction. A roadmap on lifelong learning for LLM agents frames the problem through perception, memory, and action. Another June 2026 paper, Rethinking Continual Experience Internalization for Self-Evolving LLM Agents, shows why this is still fragile: repeated learning cycles can collapse instead of compound when experience is internalized poorly.

I also want to mention Dreaming, OpenAI’s June 4 memory update for ChatGPT. The system synthesizes user memory in the background to improve freshness, continuity, and relevance across conversations. This is system-side memory, not proof that parametric continual learning is solved. Still, it shows the same pressure appearing in production: memory cannot remain a static list of notes forever.

What we see is that the field needs to move beyond the idea of continuous updating. What feels new this week is the search for a controlled phase between experience and change. Sleep becomes interesting as a boundary: a moment when the system can decide what deserves to persist, what should stay temporary, and what should be discarded. We anticipate a few breakthroughs in continual learning coming this year.

If any of those thoughts resonate with you – share them across your social networks. Let’s keep the conversation going.

Twitter Library

Reasoning RL in 2026: GRPO, DPO, RLVR, Agentic PO & Beyond

GRPO, DPO, RLVR, DAPO, GSPO, ARPO, VPO – 2026 reasoning RL methods in one place. A reference guide for training reasoning models with RL.

Turing Post • Alyona Vert.

We are reading / watching

When AI builds itself by Anthropic
Paving the way for agents in biology by Anthropic

News from the usual suspects ™

Axiom pushed formal verification beyond pure math into economics. It announced EconLib, a Lean-based library for economic theory, starting with a formalization of Robert Aumann’s “agreeing to disagree” theorem. AxiomProver didn’t just verify the proof; it surfaced an implicit assumption in the underlying logic, then also proved the Monderer-Samet p-belief version. The project aims to become a Mathlib-style foundation for game theory, Nash equilibria, auction theory, information economics, and prediction-market logic.
Sakana AI made recursive self-improvement its explicit research agenda. It launched the Sakana AI RSI Lab in Tokyo, a dedicated group focused on using AI to redesign the AI development process itself. The lab brings together Sakana’s recent line of work on AI-generated optimization algorithms, self-rewriting agents, program evolution, self-learning reinforcement agents, adversarial coevolution, and The AI Scientist.
OpenAI pushed Codex beyond software engineering with role-specific plugins, Sites, and annotations for analysts, marketers, designers, sales teams, investors, and bankers. It also upgraded GPT-Rosalind for life sciences workflows and began rolling out Dreaming, a more scalable memory system for ChatGPT.
Anthropic published a cyber-threat analysis showing how AI-enabled attackers are moving deeper into the attack chain and exposing gaps in existing security frameworks like MITRE ATT&CK.
NVIDIA turned South Korea into the week’s AI infrastructure stage. It announced deals with SK Hynix, SK Telecom, Naver, Doosan, LG, and Hyundai around memory supply, AI factories, robotics, data centers, autonomous mobility, and AI-powered manufacturing. Separately, Naver said it will build gigawatt-scale AI factories using NVIDIA technology, while LG is working with NVIDIA on humanoid robots and future data centers.
Meta entered the enterprise-agent race with Meta Business Agent, expanding AI agents across WhatsApp, Messenger, and Instagram for customer support, sales, bookings, and business operations. But the week also exposed friction: its Muse Spark API was reportedly delayed, and Meta removed face-recognition code from its smart-glasses companion app after WIRED scrutiny.
Apple finally gave WWDC an AI answer: Siri AI, a more conversational, contextual, systemwide assistant designed to work across apps while relying on on-device processing and Private Cloud Compute where possible. Reports also point to Google’s Gemini as part of the new Siri architecture.
Washington moved frontier model release closer to national-security process. The White House signed an AI cybersecurity and frontier-model order asking leading AI developers to voluntarily submit covered models for government cybersecurity review before release, then followed with a national-security AI push focused on faster adoption, updated autonomous-weapons guidance, and multi-vendor AI use inside government.

Research highlight

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

Researchers from Harvard, MIT, 2077AI, and Kempner Institute built an LLM agent “economy” where agents bid in auctions, pay each other, gain wealth from rewards, mutate if successful, and go bankrupt if ineffective. Starting with weak agents, it improved MATH from 15.9% to 57.0%, finance from 45.0% to 60.0%, science best-run accuracy from 5.0% to 20.0%, accelerator EDP from 80.2 to 39.3, and Cloudcast cost from 930 to 657.

(also a good read from 1995, Toward a Model of Mind as a Laissez-Faire Economy of Idiots by Eric B. Baum)

Open-sourced Models

Xiaomi MiMo + TileRT: reportedly pushes a 1-trillion-parameter model past 1,000 tokens per second on commodity GPUs. If the result is real and reproducible, it changes the economics of what can be deployed without cloud dependency. Worth watching for replication.
Gemma 4 12B (Google DeepMind): runs on a laptop. The 12B-parameter model brings Google's Gemma family to a size that can run locally on consumer hardware. For agent workflows that need to run on-device rather than in the cloud, this is a meaningful step toward making capable models more accessible.

Research

Trends we see looking at every paper related to AI and ML published last week:

personalization instead of one-model-fits-all
agents instead of chatbots
world models instead of pure language scaling
evaluation becoming training
automated research
memory and self-improvement
reasoning efficiency

Agent reliability, memory, and self-improvement

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories – Localizes failures inside long agent trajectories, making agent debugging much more actionable.
🌟 Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses – Makes search agents more controllable by moving state outside the model.
🌟 Rethinking Continual Experience Internalization for Self-Evolving LLM Agents – Reframes how agents should absorb experience over time.

Search, retrieval, and long-context reasoning

🌟 GrepSeek: Training Search Agents for Direct Corpus Interaction – Trains agents to investigate corpora directly instead of outsourcing thinking to retrieval.
Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback – Adds feedback loops to retrieval, making search agents less blind and more self-correcting.
OCC-RAG: Optimal Cognitive Core for Faithful Question Answering – Pushes RAG toward faithful reasoning rather than prettier retrieval wrappers.

World models, physical AI, and embodied reasoning

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning – Connects world models with language models for reasoning that needs both simulation and abstraction.
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration? – Studies active exploration as a real capability for embodied agents.
🌟 WALL-WM: Carving World Action Modeling at the Event Joints – Models action around event boundaries, a useful step toward structured world understanding.

Model adaptation, efficiency, and scalable personalization

🌟 On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters – Reopens personalization at serious scale through parameter-efficient adaptation.
🌟 Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution – Generates adapters for code models as software changes.
🌟 KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks – Reduces reasoning degradation from KV-cache quantization.
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation – Compresses reasoning traces so distillation becomes cheaper and more practical.

RL, distillation, and reward design

Trust Region On-Policy Distillation – Stabilizes behavior transfer during on-policy distillation.
🌟 Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning – Studies reward hacking in rubric-based RL, which matters because rubrics are becoming agent-training duct tape.
🌟 Self-Distilled Policy Gradient – Uses self-distillation to make policy optimization more stable.

Automation, research agents, and agent security

🌟 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents – Brings RL into live multi-turn web interaction for visual agents.
HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems – Evolves the agent harness and policy together instead of treating infrastructure as fixed.
🌟 MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery – Points toward agents that search for new ML algorithms rather than only tune existing ones.

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

FAQ

What is continual learning in LLMs?

Continual learning in LLMs means updating or adapting a model over time without destroying earlier capabilities, alignment, or useful knowledge.

Why do AI models “need sleep”?

They do not literally need sleep. The point is that learning may need an offline consolidation phase, where recent context or experience is processed before anything becomes durable memory or model behavior.

What is catastrophic forgetting?

Catastrophic forgetting happens when a model learns something new but loses performance on what it previously knew.

FOD#155: Continual Learning in LLMs: Why AI Models Need Sleep