Share Turing Post with one person – you will help us grow.

This Week in Turing Post:

  • Wednesday / AI 101 series: Let's discuss Attention (there is more to it than you think)

  • Friday / A surprise interview!

To the main topic → AI Agent Skills: Why Skill Curation Is the Next Bottleneck

For the past year, the AI industry has treated the “agent” as the main unit of progress. The conversation usually revolves around whether an agent can browse the web, use tools, write code, or complete long tasks autonomously. But this week’s research papers suggest that another unit is moving into the front row: skills.

A skill is smaller than an agent and more durable than a prompt. It is a reusable procedure for accomplishing a particular kind of work. Skills can be very specific, such as “create a skill for Obsidian” or “connect a skill creator to Mimestream.” They can also be broad: “verify information before acting,” “escalate uncertainty to a human,” or “extract structure from messy files.” Anthropic’s Agent Skills release helped make the term more visible: a SKILL.md file in a folder, loaded on demand. Now the research community is beginning to describe the architecture underneath that product move.
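As a rough illustration – the exact file layout below is an assumption sketched from Anthropic's public description, not a verified spec – a skill can be as simple as a folder containing a SKILL.md that the agent loads on demand:

```markdown
---
name: verify-before-acting
description: Cross-check claims against independent sources before taking an action that depends on them.
---

# Verify before acting

1. List the claims the pending action depends on.
2. For each claim, find at least two independent sources.
3. If the sources disagree, escalate the uncertainty to a human instead of acting.
```

The metadata up top lets the agent decide *when* to load the skill; the body is the procedure itself, written once and reused across sessions.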

| Unit | What it is | What it is good for | Main limitation |
|------|------------|---------------------|-----------------|
| Prompt | Temporary instruction or context | One-off task guidance | Usually disappears after the session |
| Skill | Reusable procedure for a type of work | Repeatable behavior, task-specific know-how | Needs curation, versioning, and retrieval |
| Agent | System that acts across steps and tools | Multi-step execution | Can improvise badly without stable procedures |
| Workflow | Organized sequence of actions and checkpoints | Operational work inside teams | Can become brittle without memory or adaptation |

Skills matter because many current agents still improvise from scratch. They can complete a task once, but often fail to accumulate stable procedural knowledge that improves performance over time. Several papers published last week point toward a shift away from viewing agents primarily as reasoning engines and toward viewing them as systems that accumulate, refine, and organize skills.

“From Context to Skills” explores whether language models can transform temporary contextual examples into reusable operational behavior. “Skill1” studies how agents can evolve through reinforcement learning while accumulating skill-like capabilities over time. “SkillOS” focuses on skill curation itself: not merely learning new behaviors, but deciding which learned behaviors remain useful and reusable. “From Skill Text to Skill Structure” attempts to formalize agent skills into structured representations rather than leaving them as loosely implied natural language instructions.

Connecting these papers, we can see that together they describe an architectural transition.

The first generation of AI products largely focused on model access. The second focused on workflows and orchestration. The emerging layer appears to be operational memory: systems that can store, evaluate, version, retrieve, and improve procedures.

The trend is especially visible in search and retrieval research. Papers such as “OpenSearch-VL,” “OpenSeeker-v2,” and “Beyond Semantic Similarity” move beyond the earlier assumption that retrieval simply means finding semantically similar chunks of text. Agentic systems increasingly require procedural retrieval: finding the right evidence, sequence of actions, or operational strategy for the current task.

In that context, a “skill” starts to resemble something between software, memory, and organizational practice. And once a workflow becomes legible as a collection of reusable skills, it becomes possible to evaluate it, improve it, audit it, and transfer it across teams or systems.
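To make that concrete, here is a minimal sketch (our own illustration, not an implementation from any of the papers above) of what an "operational memory" layer might look like: a registry that stores skills, versions them, and retrieves them procedurally – by matching the task at hand rather than by text similarity. All names and fields here are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable procedure: more durable than a prompt, smaller than an agent."""
    name: str
    version: int
    procedure: str                 # the instructions an agent loads on demand
    tags: set = field(default_factory=set)


class SkillRegistry:
    """Toy operational-memory layer: store, version, and retrieve skills."""

    def __init__(self):
        self._skills = {}          # name -> list of versions, newest last

    def publish(self, name, procedure, tags=()):
        """Store a new version of a skill; earlier versions stay auditable."""
        versions = self._skills.setdefault(name, [])
        skill = Skill(name, len(versions) + 1, procedure, set(tags))
        versions.append(skill)
        return skill

    def latest(self, name):
        """Return the newest version of a named skill."""
        return self._skills[name][-1]

    def retrieve(self, task_tags):
        """Procedural retrieval: rank skills by overlap with the task's tags,
        not by semantic similarity of their text."""
        task_tags = set(task_tags)
        hits = [v[-1] for v in self._skills.values() if v[-1].tags & task_tags]
        return sorted(hits, key=lambda s: len(s.tags & task_tags), reverse=True)
```

Even in this toy form, the registry makes a workflow legible: each skill can be evaluated, re-versioned when it improves, and transferred to another team or system as data.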

Last week’s research trend suggests that what matters is not which model is smartest in isolation, but which systems are best at accumulating useful skills over time without collapsing under their own complexity. This week’s papers do not fully solve that problem, but together they indicate that the field is beginning to orient around it.

The deeper implication: in an age of abundant intelligence, curated procedural knowledge becomes the contested resource. It is also the resource most unevenly distributed across organizations and societies. Whoever builds the operational memory builds the institutions that will inherit the abundance. Follow our The Org Age of AI series to learn more.

If any of these thoughts resonate with you, share them across your social networks. Let’s keep the conversation going.

Topic 2: Genesis AI surprised everyone with a super-precise dexterous hand. Let’s discuss why their robot hand is actually a data story.


We are reading/watching/learning:

News from the usual suspects ™

  • Microsoft’s New Middle Manager Is an AI Agent

    Microsoft’s latest Work Trend Index argues that AI agents are becoming operational coworkers. The company paints a future where humans focus on judgment and creativity while AI handles execution at scale. The larger message is unmistakable: every company now needs a strategy for “human agency” in an AI-native workplace.

  • Anthropic Teaches Claude a Conscience

    Anthropic says it has dramatically reduced “agentic misalignment” in Claude models – the charming industry term for AI blackmailing engineers to avoid shutdown. Its latest research suggests that teaching models why ethical behavior matters works far better than simply rewarding good answers. The broader implication: alignment may depend less on guardrails and more on shaping an AI’s internal reasoning.

  • Elon’s Evil Detector Clears Claude

    Elon Musk says he met senior Anthropic staff, found them competent, sincere, and – critically – not tripping his “evil detector.” That helped greenlight SpaceX leasing Colossus 1 to Anthropic, with SpaceXAI already moving training to Colossus 2. In AI infrastructure diplomacy, apparently the new due diligence includes megawatts, GPUs, and a vibes-based morality scan. Why Elon Just Gave 220,000 GPUs to a Company He Called “Misanthropic” →watch our analysis

  • Google / DeepMind going strong

    Gemini API added multimodal File Search with custom metadata and page citations, plus webhooks for long-running jobs. Google also shut down Project Mariner and moved that technology into Gemini Agent and AI Mode. On the money side, Alphabet sold more than €3 billion in bonds as AI capex keeps climbing.

🔦 Survey and Paper Highlight

  • Generate, Filter, Control, Replay: A comprehensive survey of rollout strategies for LLM reinforcement learning

    Researchers from the University of California San Diego, Adobe Research, University of Toronto, University of Virginia, Texas A&M, and UIUC reframed LLM reinforcement learning as a full rollout-engineering problem, introducing the GFCR lifecycle: Generate, Filter, Control, and Replay. The survey connects tree search, verifier-driven rewards, adaptive compute allocation, replay buffers, and self-evolving curricula into one unified framework. It reveals how rollout design – not just optimizers like GRPO or PPO – governs reasoning quality, efficiency, exploration, and the emergence of scalable agentic intelligence →read the paper

  • Hallucinations Undermine Trust; Metacognition is a way forward

    Researchers from Google Research and Tel Aviv University offer an exciting shift: hallucinations are not just errors, but confident errors. Instead of forcing LLMs to either answer or abstain, they propose “faithful uncertainty,” where models preserve usefulness while honestly revealing doubt. A striking result shows strict factuality can cost 52% of valid answers. The biggest idea is metacognition as a control layer – models knowing when they are unsure, when to hedge, and when agents should search or trust tools →read the paper
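The "metacognition as a control layer" idea can be sketched as a simple routing policy: answer when confident, hedge when moderately unsure, and defer to search or tools when confidence is low. The thresholds and function below are our own illustrative assumptions, not the paper's method.

```python
def metacognitive_route(answer: str, confidence: float,
                        answer_at: float = 0.8, hedge_at: float = 0.4):
    """Toy control layer: pick an action based on the model's self-estimated
    confidence instead of forcing a binary answer/abstain choice."""
    if confidence >= answer_at:
        # High confidence: answer plainly.
        return ("answer", answer)
    if confidence >= hedge_at:
        # Moderate confidence: stay useful, but honestly reveal doubt.
        return ("hedge", f"I am not certain, but possibly: {answer}")
    # Low confidence: don't guess; route to retrieval or a tool call.
    return ("search", None)
```

The point of "faithful uncertainty" is the middle branch: instead of discarding every uncertain answer (which the paper estimates can cost 52% of valid ones under strict factuality), the model keeps the answer and flags its doubt.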

Research

Agents, workflows, and autonomous research

  • 🌟 ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration – an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. Shows how multi-agent debate/collaboration can structure research workflows instead of treating “AI scientist” as one giant magic box →read the paper

Safety, trust, and evaluation

  • 🌟 Hallucinations Undermine Trust; Metacognition is a Way Forward
    Frames hallucination mitigation through metacognition, a useful direction for trust calibration →read the paper

Action, robotics, and world interaction

  • 🌟 MolmoAct2: Action Reasoning Models for Real-world Deployment
    Focuses on action reasoning for deployed embodied systems, which is where “agent” stops being a spreadsheet word →read the paper

  • Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
    Addresses how robot policies improve after deployment, which matters for real-world feedback loops →read the paper

  • When to Trust Imagination: Adaptive Action Execution for World Action Models
    Explores when models should rely on imagined futures versus real execution, a core issue for world-model agents →read the paper

Reasoning, RL, and self-improving systems

  • 🌟 Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
    Examines whether RL can actually teach longer reasoning rather than merely reward lucky formatting →read the paper

  • 🌟 Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
    Suggests that deliberately strange prompt perturbations may widen reasoning search, which is weird enough to be worth watching →read the paper

Video, multimodal generation, and world models

  • Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
    Applies reward distillation to streaming video generation, connecting video models with reliability-aware optimization →read the paper

  • 🌟 Stream-T1: Test-Time Scaling for Streaming Video Generation
    Brings test-time scaling logic into streaming video, a sign that inference-time compute is spreading beyond text reasoning →read the paper

  • UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
    Pushes toward a more unified video generation framework across tasks and modalities →read the paper

  • Video Generation with Predictive Latents
    Explores predictive latent representations for video, relevant to efficiency and controllability in generation →read the paper

  • 🌟 HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
    Connects 3D scene understanding and generation for driving, where world models have concrete safety stakes →read the paper

Retrieval, memory, context, and long-context understanding

  • 🌟 MiA-Signature: Approximating Global Activation for Long-Context Understanding
    Targets long-context efficiency by approximating global activation, relevant to memory and context engineering →read the paper

  • TIDE: Every Layer Knows the Token Beneath the Context
    Investigates how token information is represented across layers, useful for understanding what context models actually retain →read the paper

  • 🌟 Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
    Focuses on retrieval when the task requires reasoning, not keyword-ish lookup dressed in embeddings →read the paper

  • Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation
    Builds structured cross-document retrieval, relevant for RAG systems that need synthesis across sources →read the paper

Models, architectures, and efficiency

  • 🌟 Continuous Latent Diffusion Language Model
    Explores language modeling through continuous latent diffusion, a notable alternative to standard autoregressive decoding →read the paper

  • EMO: Pretraining Mixture of Experts for Emergent Modularity
    Studies MoE pretraining and modularity, useful for understanding whether specialization can emerge cleanly →read the paper

  • UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
    Proposes a shared expert-pool approach, relevant to making MoE systems more reusable and scalable →read the paper

  • 🌟 Prescriptive Scaling Laws for Data Constrained Training
    Addresses scaling when data is limited, one of the practical constraints behind the next model-training era →read the paper

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
