Today’s editorial: what people still misunderstand about tools for agents, why NVIDIA BioNeMo is a toolkit rather than “extra model knowledge,” and why agentic science depends on tools like MCP, workflows, permissions, and human review.

What People Still Don’t Understand About AI Agents and Tools

During a Q&A with Kimberly Powell at the BIO AI Summit, where NVIDIA announced that it had open-sourced its BioNeMo Agent Toolkit, I heard a couple of questions from journalists who write about AI that made me see red. How much do you need to misunderstand the whole thing to ask something like that?

Then I calmed myself down and remembered: there are no dumb questions. There are signals.

And this one was a very useful signal. It showed how many people, including people who write about AI for a living, still don’t understand what agents are, what tools are, and what happens when you connect the two.

So let’s bring clarity to the world. Small mission, no pressure.

The question was basically this: if Claude refuses to help someone create a bioweapon, doesn’t a scientific toolkit now give it the deep knowledge that could help with that?

Bioweapons are a legitimate safety topic. Nobody should wave that away. But the question revealed a layer mistake. It assumed that NVIDIA had given models new dangerous scientific knowledge.

And this is a fundamental misunderstanding of what a toolkit is.

What is BioNeMo?

BioNeMo Agent Toolkit is NVIDIA’s collection of scientific models, tools, and workflows that AI agents can call for life-sciences tasks. Clear enough – except it’s not.

BioNeMo is much closer to giving a scientist access to laboratory equipment than teaching them biology. Or even simpler: it is like giving someone a bicycle repair kit. A screwdriver screws and unscrews screws (try to say it out loud). A wrench tightens bolts. A pump puts air into the tire. A patch covers a puncture. Can you make a bomb from this bicycle with a screwdriver or a patch? That’s not impossible! But it requires much more than that.

A toolkit gives you specific tools for specific jobs.

That is the simplest way to understand what NVIDIA announced. BioNeMo Agent Toolkit packages life-sciences tools and models into agent-callable skills: protein folding, molecular docking, generative chemistry, genomics analysis, protein design, biomarker discovery, and related workflows.

And that means exactly that: AI agents can now call scientific instruments.

There was another question that kinda surprised me: how was it possible that OpenAI and Anthropic – both presented as makers of models that can use the toolkit – agreed to collaborate?

Well, they didn’t, and they don’t need to. That’s quite obvious once you know – again – what a toolkit is: the tools sit behind an interface, and any agent that can issue a tool call can use them, whichever company built the model.
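To make that concrete, here is a minimal Python sketch of what “agent-callable” means. Everything in it is hypothetical: the `TOOLS` registry, the `dock_molecule` stub, and the shape of the tool-call dictionary are my own illustrations, not BioNeMo’s actual API. The only point is that a tool sits behind a plain interface, so any model that can emit a structured request can use it.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> plain Python function.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function under a name that an agent can refer to."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("dock_molecule")
def dock_molecule(ligand: str, receptor: str) -> dict:
    # Stub: a real tool would run a docking model or service here.
    return {"ligand": ligand, "receptor": receptor, "status": "stub result"}


def handle_tool_call(call: dict) -> dict:
    """Dispatch a structured request. The handler never asks which model sent it."""
    name = call["name"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**call["arguments"])


# The same request could come from any vendor's model.
request = {"name": "dock_molecule",
           "arguments": {"ligand": "L1", "receptor": "R1"}}
print(handle_tool_call(request))
```

Two models from rival vendors could emit the identical request and get the identical result. Neither has to know the other exists.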

What makes this an agent toolkit, not just a pile of models?

Because the agent can chain the tools and models.

Example: design a protein binder

A scientist says: “Find a possible binder for this target protein.”

The agent does not magically “become a biologist.” It follows a workflow:

  1. Use the RFdiffusion model to design possible binder backbones.

  2. Use the ProteinMPNN model to propose amino-acid sequences for those backbones.

  3. Use the Boltz-2 or OpenFold3 model to predict whether the binder and target actually fold together.

  4. Rank candidates using confidence and interface metrics.

  5. Return the best candidates to the scientist with caveats.

The BioNeMo repo even includes a generative protein binder workflow that combines RFdiffusion, ProteinMPNN, and Boltz-2 / OpenFold3 for exactly this kind of sequence: backbones → sequences → co-fold → filter. That is the kind of skill set an agent needs.
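The backbones → sequences → co-fold → filter chain can be sketched in a few lines of Python. This is a toy under loud assumptions: the three model functions are stubs with made-up scores standing in for RFdiffusion, ProteinMPNN, and Boltz-2 / OpenFold3, and names such as `run_binder_workflow` and `min_confidence` are mine, not the repo’s.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Candidate:
    index: int
    backbone: str
    sequence: str = ""
    confidence: float = 0.0


def design_backbones(target: str, n: int) -> List[Candidate]:
    """Stub for step 1 (backbone design). A real agent would call a model here."""
    return [Candidate(index=i, backbone=f"{target}-backbone-{i}") for i in range(n)]


def propose_sequence(c: Candidate) -> Candidate:
    """Stub for step 2 (sequence design): fills in a placeholder sequence."""
    return replace(c, sequence="ACDE"[c.index % 4] * 8)


def predict_cofold(target: str, c: Candidate) -> Candidate:
    """Stub for step 3 (co-folding): assigns a made-up, deterministic confidence."""
    return replace(c, confidence=(50 + 10 * (c.index % 4)) / 100)


def run_binder_workflow(target: str, n: int = 5,
                        min_confidence: float = 0.65,
                        top_k: int = 2) -> List[Candidate]:
    """Chain the steps: backbones -> sequences -> co-fold -> filter and rank."""
    backbones = design_backbones(target, n)
    sequenced = [propose_sequence(c) for c in backbones]
    scored = [predict_cofold(target, c) for c in sequenced]
    kept = [c for c in scored if c.confidence >= min_confidence]
    ranked = sorted(kept, key=lambda c: c.confidence, reverse=True)
    return ranked[:top_k]  # step 5: hand the shortlist back to the scientist


if __name__ == "__main__":
    for c in run_binder_workflow("TARGET"):
        print(c.index, c.sequence, c.confidence)
```

Notice that the “agent” part is just orchestration. The science lives inside the tools, and the judgment stays with the scientist who reads the shortlist.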

Why is it so important to understand?

As we plunge headfirst into this new agentic era, a basic understanding of it becomes crucial, because this distinction changes how we view the future of AI: the focus shifts from what AI knows to what AI can do.

If we go back to biotech and BioNeMo, the difference is this: a chatbot can explain molecular docking if you casually ask about it. An agent with the right tool can run a docking workflow. That is less about learning and more about building, and by that point you already need to know what you are doing. A chatbot can describe protein design. An agent with the right workflow can generate a protein backbone, propose an amino-acid sequence, predict whether it might fold, inspect the output, and bring the result back to a human scientist.

Scientist is the key word here.

And this is where many people still get lost. They keep looking at the model as if the model is the whole story. It’s not.

Which model is smarter? It’s a good question but not the most interesting. The better question now is: where does the model act?

Inside a codebase? Inside GitHub? Inside a lab? Inside a medical scanner? Inside a drug-discovery workflow? Inside a system that repeatedly measures, tests, adjusts, and improves? Once AI starts acting, the valuable layer is the loop around the model.

And science is full of loops.

Drug discovery, protein design, genomics, biomarker discovery, clinical research, medical imaging, literature review, protocol generation: these are not fields where progress usually arrives as one clean eureka moment. Much of the work is iteration. That is why having agents and tools for agents is so exciting. They can help the scientist move through the loop faster. I’m stuck in my own loop of repeating that.

There was a fascinating point in the BioNeMo discussion: early dreams of AI for science imagined that models could simply consume all scientific knowledge, connect all the dots, and discoveries would just pop out. But that is not really how it worked. The real progress came when systems entered the loop: look at the literature, propose an experiment, analyze the data, then use that result to propose the next experiment.
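Here is that loop as a toy Python sketch: propose, get sign-off, measure, and use the result to pick the next experiment. The assumptions are all mine: `run_experiment` is a made-up function with a known peak standing in for a real measurement, and `approve` is a hypothetical callback standing in for human review.

```python
from typing import Callable, List, Tuple


def run_experiment(x: float) -> float:
    """Hypothetical measurement: the response peaks at x = 3."""
    return -(x - 3.0) ** 2


def science_loop(start: float, step: float, rounds: int,
                 approve: Callable[[float], bool]) -> Tuple[float, List[Tuple[float, float]]]:
    best_x, best_y = start, run_experiment(start)
    history = [(best_x, best_y)]
    for _ in range(rounds):
        # Propose the next experiments based on the best result so far.
        proposals = [best_x - step, best_x + step]
        # Review gate: only approved proposals are ever run.
        proposals = [p for p in proposals if approve(p)]
        if not proposals:
            break
        results = [(p, run_experiment(p)) for p in proposals]
        history.extend(results)
        x, y = max(results, key=lambda r: r[1])
        if y <= best_y:
            break  # no improvement: stop and hand back to the scientist
        best_x, best_y = x, y
    return best_x, history


best, history = science_loop(0.0, 1.0, 10, approve=lambda p: 0 <= p <= 5)
print(best, len(history))
```

Nothing in the loop is clever. What matters is who writes `approve` and who reads `history`.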

It’s happening right now, though most people still don’t realize it. That’s one of my biggest revelations from the BIO AI Summit: we are at the beginning of agentic science.

And yes, it should make us excited.

Because a lot of scientists spend far too little time doing science; much of their day goes to engineering and operational work. If agents can compress some of that work, they give scientists something precious back: more time in the creative scientific space. More time looking at data. More time asking the next question. More time noticing that something unexpected happened and following it. What does it give the rest of us? A chance to cure diseases much faster. That feels like more than a new productivity tool.

This is why BioNeMo is not only a biotech story. It is part of the bigger shift I keep coming back to: the model is no longer the whole story. The loop around the model is becoming the interesting layer.

I talked about exactly this in my solo segment for this week’s episode of O’Reilly’s This Week in AI: who owns the loop? In coding, in cybersecurity, in medicine, in science, the same question keeps appearing. Where does the model act? Who controls the tools? Who owns the feedback? Who decides when the loop is safe enough to run?

If any of those thoughts resonate with you – share them across your social networks. Let’s keep the conversation going.

News from the usual suspects ™

  • Midjourney announced Midjourney Medical, a full-body underwater ultrasound scanner concept that aims to make internal imaging faster, more repeatable, and more consumer-friendly. It’s also stunningly beautiful.

  • OpenAI’s news:

    • Their LifeSciBench tests models on real life-science work with 750 expert-authored tasks, 1,062 artifacts, 173 scientist contributors, and more than 19,000 rubric criteria.

    • OpenAI and Molecule.one demonstrated a near-autonomous AI chemist that improved Chan-Lam coupling, a drug-chemistry reaction for forming carbon-nitrogen bonds.

  • NVIDIA’s news:

    • NVIDIA open-sourced the BioNeMo Agent Toolkit, discussed above.

    • They also introduced new AI-for-science software at ISC, including DAQIRI and ALCHEMI NIM microservices for scientific pipelines from astronomy to chemistry and materials simulation.

Survey highlight

World Action Models: A Survey by National University of Singapore

This survey conceptualizes World Action Models (WAMs) as embodied predictive-action systems that integrate future forecasting directly into the action path. WAMs unify Vision-Language-Action policies with predictive world models. The authors structure the field into three primary design philosophies: Render-and-Decode, Latent-Only, and Video-Generation-Free. Finally, the taxonomy evaluates critical trade-offs balancing representational richness against compute, memory, latency, and physical plausibility. The field is moving toward generating less of the future while preserving control requirements.

Models

  • Z.ai released GLM-5.2, a major Chinese open-weight coding model with a 1M-token context window and MIT license, putting open Chinese models back into the U.S. AI anxiety machine.

  • Sakana AI released Fugu and Fugu Ultra, an orchestration-model family that routes tasks across a swappable pool of models behind one API.

  • DreamX-World 1.0 – builds toward interactive general-purpose world models instead of passive video generators →read the paper

  • Qwen-RobotWorld Technical Report – connects embodied world modeling with language-conditioned video generation for robotics →read the paper

  • PAIWorld – proposes a 3D-consistent world foundation model for robotic manipulation →read the paper

  • BioMatrix – expands biological foundation models across sequences, structures, and language →read the paper

  • VibeThinker-3B – pushes verifiable reasoning into small language models, which matters because reasoning cannot only live in giant expensive systems →read the paper

  • Sumi – develops an open diffusion language model from scratch, useful as diffusion LMs become less fringe and more serious →read the paper

Research

Trends we see looking at every paper related to AI and ML published last week:

  • World models are becoming agent infrastructure, not video generators.

  • Agents are learning to manage their own memory and improve themselves.

  • Reasoning is shifting toward reflection, verification, and active perception.

  • Robotics and foundation models are rapidly converging.

  • Researchers are experimenting beyond standard transformer architectures.

  • Open models are entering a new governance and capability-control phase.

World models, robotics, and physical agents

  • Looped World Models – introduces iterative latent refinement as a new scaling axis for long-horizon world simulation →read the paper

  • Kairos: A Native World Model Stack for Physical AI – frames world modeling as a full stack for physical agents that learn, maintain, and act →read the paper

  • Current World Models Lack a Persistent State Core – identifies persistent internal state as the missing piece in current world-model systems →read the paper

  • 🌟 ImageWAM – questions whether world action models really need video generation or whether image editing can capture enough dynamics →read the paper

  • Geometric Action Model for Robot Policy Learning – grounds robot policy learning in geometric structure instead of plain imitation →read the paper

  • Foresight – detects long-horizon robot manipulation failures using action-conditioned world-model latents →read the paper

  • ENPIRE – demonstrates real-world robot policy self-improvement through agentic learning loops →read the paper

  • PoLAR – factorizes latent actions to make robot policies more transferable and controllable →read the paper

Agent systems, memory, and self-improvement

  • DataClaw0 – turns raw multimodal streams into task-ready data through agentic data tailoring →read the paper

  • OpenRath – gives agent systems a session-centered runtime state for replay, branching, memory, and tool evidence →read the paper

  • Self-Compacting Language Model Agents – lets agents compress their own context before long-running memory becomes soup →read the paper

  • 🌟 EvoEmbedding – makes retrieval representations evolve with long-context memory and agent use →read the paper

  • Connect the Dots – trains long-lifecycle agents to learn across tasks through reinforcement learning →read the paper

  • OPD-Evolver – evolves agents through on-policy distillation instead of static post-training →read the paper

  • CalVerT – adds calibrated verifier telemetry so agents can decide when to act, retrieve, stop, or distrust themselves →read the paper

  • 🌟 Training Open Models for Agentic Phone Use – builds open agents for real phone interfaces, not just toy GUI tasks →read the paper

  • FAPO – automates prompt optimization across multi-step LLM pipelines →read the paper

Reasoning, perception, and learning loops

  • 🌟 Learning from Your Own Mistakes – constructs micro-reflective trajectories for self-distillation and reasoning improvement →read the paper

  • 🌟 Learning from the Self-future – teaches diffusion language models from their own future trajectories →read the paper

  • Zone of Proximal Policy Optimization – moves supervision from gradients into prompts through teacher-guided optimization →read the paper

  • Native Active Perception as Reasoning for Omni-Modal Understanding – treats perception as an active reasoning process, not just input processing →read the paper

  • S-Agent – shows how spatial tool use can elicit stronger spatial reasoning →read the paper

Architecture and efficient scaling

  • Variable-Width Transformers – explores adaptive-width computation as a scaling path beyond fixed transformer blocks →read the paper

  • Grouped Query Experts – applies mixture-of-experts routing inside grouped-query attention →read the paper

  • HydraHead – studies head-level specialization and hybrid attention, useful for understanding where transformer efficiency actually comes from →read the paper

  • Tapered Language Models – questions uniform transformer width and explores models whose width changes across depth →read the paper

  • Deeper is Not Always Better – reduces alignment tax through confident layer decoding instead of always using the full model depth →read the paper

Safety, release strategy, and conceptual direction

  • Toward Open Weight Models Without Risks – proposes separating public and private capabilities inside open-weight model releases →read the paper

  • Causal Discovery in the Era of Agents – argues agents can assist causal workflows, but only if causal claims are constrained instead of hallucinated →read the paper

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
