Today’s editorial: what people still misunderstand about tools for agents, why NVIDIA BioNeMo is a toolkit rather than “extra model knowledge,” and why agentic science depends on tools like MCP, workflows, permissions, and human review.
What People Still Don’t Understand About AI Agents and Tools
During a Q&A with Kimberly Powell at the BIO AI Summit, where NVIDIA announced that it had open-sourced its BioNeMo Agent Toolkit, I heard a couple of questions from journalists who write about AI that made me see red. How much do you need to misunderstand the whole thing to ask something like that?
Then I calmed myself down and remembered: there are no dumb questions. There are signals.
And this one was a very useful signal. It showed how many people, including people who write about AI for a living, still don’t understand what agents are, what tools are, and what happens when you connect the two.
So let’s bring clarity to the world. Small mission, no pressure.
The question was basically this: if Claude refuses to help someone create a bioweapon, doesn’t a scientific toolkit now give it deep knowledge that can help with that?
Bioweapons are a legitimate safety topic. Nobody should wave that away. But the question revealed a layer mistake. It assumed that NVIDIA had given models new dangerous scientific knowledge.
And this is a fundamental misunderstanding of what a toolkit is.
What is BioNeMo? BioNeMo Agent Toolkit is NVIDIA’s collection of scientific models, tools, and workflows that AI agents can call for life-sciences tasks. Clear enough – except it’s not.
BioNeMo is much closer to giving a scientist access to laboratory equipment than teaching them biology. Or even simpler: it is like giving someone a bicycle repair kit. A screwdriver screws and unscrews screws (try to say it out loud). A wrench tightens bolts. A pump puts air into the tire. A patch covers a puncture. Can you make a bomb from this bicycle with a screwdriver or a patch? That’s not impossible! But it requires much more than that.
A toolkit gives you specific tools for specific jobs.
That is the simplest way to understand what NVIDIA announced. BioNeMo Agent Toolkit packages life-sciences tools and models into agent-callable skills: protein folding, molecular docking, generative chemistry, genomics analysis, protein design, biomarker discovery, and related workflows.
And that means exactly that: AI agents can now call scientific instruments.
There was another question that kinda surprised me: how it was possible that OpenAI and Anthropic/Claude – both presented as models that the toolkit can use – agreed to collaborate.
Well, they didn’t, and they don’t need to. A toolkit sits outside the model, so whichever model drives the agent can call the same tools. That’s quite obvious when you know – again – what a toolkit is.
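To make the "toolkit, not knowledge" point concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Tool` class, the `fold_protein` stub, the `REGISTRY`, and the request shape are illustrative names I made up, not BioNeMo's actual API. It only shows the idea that a tool is a described function living outside the model, so any model able to emit a tool name and arguments can call it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """A hypothetical agent-callable tool: a name, a description, a function."""
    name: str
    description: str
    run: Callable[[dict], dict]


def fold_protein(args: dict) -> dict:
    # Stub standing in for a real structure-prediction service.
    # A real tool would call a model server; this one returns a fake result.
    sequence = args["sequence"]
    return {"sequence": sequence, "length": len(sequence), "status": "stub-prediction"}


# The registry is the "toolkit": tools live here, not inside any model.
REGISTRY: Dict[str, Tool] = {
    "fold_protein": Tool(
        "fold_protein", "Predict a structure for a sequence (stub).", fold_protein
    ),
}


def call_tool(request: dict) -> dict:
    """Dispatch a tool request. The request could come from any model vendor."""
    tool = REGISTRY.get(request["tool"])
    if tool is None:
        return {"error": f"unknown tool: {request['tool']}"}
    return tool.run(request["arguments"])


# Two imaginary models emit the same request shape.
# Neither needs to know the other exists.
request_from_model_a = {"tool": "fold_protein", "arguments": {"sequence": "MKTAYIAK"}}
request_from_model_b = {"tool": "fold_protein", "arguments": {"sequence": "MKTAYIAK"}}
result_a = call_tool(request_from_model_a)
result_b = call_tool(request_from_model_b)
print(result_a)
```

The design choice worth noticing: nothing in `call_tool` depends on which model produced the request. That is the whole reason no vendor-to-vendor agreement is required.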
What makes this an agent toolkit, not just a pile of models?
Because the agent can chain the tools and models.
Example: design a protein binder
A scientist says: “Find a possible binder for this target protein.”
The agent does not magically “become a biologist.” It follows a workflow:
Use the RFdiffusion model to design possible binder backbones.
Use the ProteinMPNN model to propose amino-acid sequences for those backbones.
Use the Boltz-2 or OpenFold3 model to predict whether the binder and target actually fold together.
Rank candidates using confidence and interface metrics.
Return the best candidates to the scientist with caveats.
The BioNeMo repo even includes a generative protein binder workflow that combines RFdiffusion, ProteinMPNN, and Boltz-2 / OpenFold3 for this kind of sequence: backbones → sequences → co-fold → filter. Like a set of skills an agent might require.
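The chain backbones → sequences → co-fold → filter can be sketched in a few lines of Python. This is a toy under loud assumptions: every function below is a stub I invented to stand in for the role a real model plays, the scores are made-up deterministic numbers, and none of the names come from the BioNeMo repo. The point is only the shape of the chain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    backbone_id: str
    sequence: str
    confidence: float  # stand-in for a co-folding confidence score in [0, 1]


def design_backbones(target: str, n: int) -> List[str]:
    # Stub for the backbone-design step (the role RFdiffusion plays).
    return [f"{target}-backbone-{i}" for i in range(n)]


def propose_sequence(backbone_id: str) -> str:
    # Stub for the sequence-design step (the role ProteinMPNN plays).
    # Builds a fake, deterministic sequence from the backbone id.
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return "".join(
        alphabet[(ord(c) + i) % len(alphabet)] for i, c in enumerate(backbone_id)
    )


def cofold_confidence(target: str, sequence: str) -> float:
    # Stub for the co-folding step (the role Boltz-2 / OpenFold3 plays).
    # A made-up deterministic score, not a real metric.
    return (sum(ord(c) for c in target + sequence) % 100) / 100.0


def run_binder_workflow(
    target: str, n_backbones: int = 4, min_confidence: float = 0.3
) -> List[Candidate]:
    candidates = []
    for backbone_id in design_backbones(target, n_backbones):  # backbones
        sequence = propose_sequence(backbone_id)               # sequences
        score = cofold_confidence(target, sequence)            # co-fold
        candidates.append(Candidate(backbone_id, sequence, score))
    kept = [c for c in candidates if c.confidence >= min_confidence]  # filter
    return sorted(kept, key=lambda c: c.confidence, reverse=True)     # rank


ranked = run_binder_workflow("TARGET1")
for c in ranked:
    # The last step in the workflow is a human: results go back with caveats.
    print(f"{c.backbone_id}  confidence={c.confidence:.2f}  (needs expert review)")
```

Swap any stub for a real model call and the orchestration stays the same, which is roughly what "agent-callable skills" buys you.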
Why is it so important to understand?
As we plunge headfirst into this new agentic era, a basic understanding of it becomes crucially important. Because understanding this distinction changes how we view the future of AI – shifting the focus from what AI knows to what AI can do.
If we go back to biotech and BioNeMo, the difference is this: a chatbot can explain molecular docking if you casually ask about it. An agent with the right tool can run a docking workflow. It’s less about learning and more about building, and by that point you need to know some things yourself. A chatbot can describe protein design. An agent with the right workflow can generate a protein backbone, propose an amino-acid sequence, predict whether it might fold, inspect the output, and bring the result back to a human scientist.
Scientist is the key word here.
And this is where many people still get lost. They keep looking at the model as if the model is the whole story. It’s not.
Which model is smarter? It’s a good question but not the most interesting. The better question now is: where does the model act?
Inside a codebase? Inside GitHub? Inside a lab? Inside a medical scanner? Inside a drug-discovery workflow? Inside a system that repeatedly measures, tests, adjusts, and improves? Once AI starts acting, the valuable layer is the loop around the model.
And science is full of loops.
Drug discovery, protein design, genomics, biomarker discovery, clinical research, medical imaging, literature review, protocol generation: these are not fields where progress usually arrives as one clean eureka moment. Much of the work is iteration. That is why having agents and tools for agents is so exciting. They can help the scientist move through the loop faster. I’m on my own loop to keep repeating that.
There was a fascinating point in the BioNeMo discussion: early dreams of AI for science imagined that models could simply consume all scientific knowledge, connect all the dots, and discoveries would just pop out. But that is not really how it worked. The real progress came when systems entered the loop: look at the literature, propose an experiment, analyze the data, then use that result to propose the next experiment.
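For readers who think in code, here is a deliberately tiny Python sketch of that loop: propose, run, analyze, propose again. It is an illustration under stated assumptions, not anything from the BioNeMo discussion. The `run_experiment` function is a made-up stand-in for a real assay (its yield peaks at an arbitrary setting of 37.0), and `propose_next` and `discovery_loop` are hypothetical names.

```python
from typing import List, Tuple


def run_experiment(temperature: float) -> float:
    # Stand-in for a real assay or simulation.
    # Made-up response curve: yield peaks at 37.0.
    return 100.0 - (temperature - 37.0) ** 2


def propose_next(best_temperature: float, step: float) -> List[float]:
    # "Propose an experiment": try one setting on each side of the current best.
    return [best_temperature - step, best_temperature + step]


def discovery_loop(
    start: float, step: float = 4.0, rounds: int = 10
) -> Tuple[float, float, List[Tuple[float, float]]]:
    best_t, best_yield = start, run_experiment(start)
    history = [(best_t, best_yield)]
    for _ in range(rounds):
        improved = False
        for t in propose_next(best_t, step):
            y = run_experiment(t)       # run the experiment
            history.append((t, y))      # keep the evidence
            if y > best_yield:          # analyze the result
                best_t, best_yield, improved = t, y, True
        if not improved:
            step /= 2                   # nothing better nearby: look closer
    return best_t, best_yield, history


best_t, best_yield, history = discovery_loop(start=25.0)
print(f"best setting: {best_t}, yield: {best_yield}, experiments run: {len(history)}")
```

No single step here is clever. The value comes from each result feeding the next proposal, which is the part the "consume all knowledge and discoveries pop out" dream left out.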
It’s happening right now, though most people still don’t realize it. That’s one of my biggest revelations from the BIO AI Summit: we are at the beginning of agentic science.
And yes, it should make us excited.
Because a lot of scientists spend far too little time doing science. Much of their time goes to engineering and operational work instead. If agents can compress some of that work, they give scientists something precious back: more time in the creative scientific space. More time looking at data. More time asking the next question. More time noticing that something unexpected happened and following it. What does it give the rest of us? The ability to cure diseases much faster. That feels even better than a new productivity tool.
This is why BioNeMo is not only a biotech story. It is part of the bigger shift I keep coming back to: the model is no longer the whole story. The loop around the model is becoming the interesting layer.
I talked about exactly this in my solo segment for O’Reilly’s This Week in AI this week: who owns the loop? In coding, in cybersecurity, in medicine, in science, the same question keeps appearing. Where does the model act? Who controls the tools? Who owns the feedback? Who decides when the loop is safe enough to run? Check it out→
If any of those thoughts resonate with you – share them across your social networks. Let’s keep the conversation going.
We are reading / watching
Hotter Than a Hot Tub: The 45°C Breakthrough to Cool AI’s Biggest Machines by NVIDIA is a very interesting article about liquid cooling, how it can enable zero water consumption, and how waste heat from AI infrastructure could potentially be shared with residential buildings.
How Chinese Researchers Plan to Build Self-Improving AI by Recode China AI
News from the usual suspects ™
Midjourney announced Midjourney Medical, a full-body underwater ultrasound scanner concept that aims to make internal imaging faster, more repeatable, and more consumer-friendly. It’s also stunningly beautiful.
SpaceX’s news:
They agreed to buy Anysphere, the company behind Cursor, for $60B in stock, turning the coding-agent race into a fight over the developer workflow itself.
SpaceX also disclosed a compute-capacity deal with Reflection AI worth up to $6.3B, making its AI-infrastructure ambitions look much bigger than Cursor alone. In March, we interviewed Reflection’s co-founder. Worth taking a look.
Anthropic’s news:
They launched Claude Tag in Slack, letting Claude be summoned in group conversations to read context, break down tasks, and follow workplace threads like an AI teammate. Sad to see Andrej Karpathy as Anthropic’s new promo platform.
Micron signed a strategic infrastructure agreement with Anthropic to supply memory and storage products, adding memory supply to the list of bottlenecks in frontier AI scaling.
OpenAI’s news:
Their LifeSciBench tests models on real life-science work with 750 expert-authored tasks, 1,062 artifacts, 173 scientist contributors, and more than 19,000 rubric criteria.
OpenAI and Molecule.one demonstrated a near-autonomous AI chemist that improved Chan-Lam coupling, a drug-chemistry reaction for forming carbon-nitrogen bonds.
NVIDIA’s news:
NVIDIA introduced the BioNeMo Agent Toolkit, which we discussed above.
They also introduced new AI-for-science software at ISC, including DAQIRI and ALCHEMI NIM microservices for scientific pipelines from astronomy to chemistry and materials simulation.
Survey highlight
World Action Models: A Survey by National University of Singapore

This survey conceptualizes World Action Models (WAMs) as embodied predictive-action systems that integrate future forecasting directly into the action path. WAMs unify Vision-Language-Action policies with predictive world models. The authors structure the field into three primary design philosophies: Render-and-Decode, Latent-Only, and Video-Generation-Free. Finally, the taxonomy evaluates critical trade-offs balancing representational richness against compute, memory, latency, and physical plausibility. The field is moving toward generating less of the future while preserving control requirements.
Models
Z.ai released GLM-5.2, a major Chinese open-weight coding model with a 1M-token context window and MIT license, putting open Chinese models back into the U.S. AI anxiety machine.
Sakana AI released Fugu and Fugu Ultra, an orchestration-model family that routes tasks across a swappable pool of models behind one API.
DreamX-World 1.0 – builds toward interactive general-purpose world models instead of passive video generators →read the paper
Qwen-RobotWorld Technical Report – connects embodied world modeling with language-conditioned video generation for robotics →read the paper
PAIWorld – proposes a 3D-consistent world foundation model for robotic manipulation →read the paper
BioMatrix – expands biological foundation models across sequences, structures, and language →read the paper
VibeThinker-3B – pushes verifiable reasoning into small language models, which matters because reasoning cannot only live in giant expensive systems →read the paper
Sumi – develops an open diffusion language model from scratch, useful as diffusion LMs become less fringe and more serious →read the paper
Research
Trends we see looking at every paper related to AI and ML published last week:
World models are becoming agent infrastructure, not video generators.
Agents are learning to manage their own memory and improve themselves.
Reasoning is shifting toward reflection, verification, and active perception.
Robotics and foundation models are rapidly converging.
Researchers are experimenting beyond standard transformer architectures.
Open models are entering a new governance and capability-control phase.
World models, robotics, and physical agents
Looped World Models – introduces iterative latent refinement as a new scaling axis for long-horizon world simulation →read the paper
Kairos: A Native World Model Stack for Physical AI – frames world modeling as a full stack for physical agents that learn, maintain, and act →read the paper
Current World Models Lack a Persistent State Core – identifies persistent internal state as the missing piece in current world-model systems →read the paper
🌟 ImageWAM – questions whether world action models really need video generation or whether image editing can capture enough dynamics →read the paper
Geometric Action Model for Robot Policy Learning – grounds robot policy learning in geometric structure instead of plain imitation →read the paper
Foresight – detects long-horizon robot manipulation failures using action-conditioned world-model latents →read the paper
ENPIRE – demonstrates real-world robot policy self-improvement through agentic learning loops →read the paper
PoLAR – factorizes latent actions to make robot policies more transferable and controllable →read the paper
Agent systems, memory, and self-improvement
DataClaw0 – turns raw multimodal streams into task-ready data through agentic data tailoring →read the paper
OpenRath – gives agent systems a session-centered runtime state for replay, branching, memory, and tool evidence →read the paper
Self-Compacting Language Model Agents – lets agents compress their own context before long-running memory becomes soup →read the paper
🌟 EvoEmbedding – makes retrieval representations evolve with long-context memory and agent use →read the paper
Connect the Dots – trains long-lifecycle agents to learn across tasks through reinforcement learning →read the paper
OPD-Evolver – evolves agents through on-policy distillation instead of static post-training →read the paper
CalVerT – adds calibrated verifier telemetry so agents can decide when to act, retrieve, stop, or distrust themselves →read the paper
🌟 Training Open Models for Agentic Phone Use – builds open agents for real phone interfaces, not just toy GUI tasks →read the paper
FAPO – automates prompt optimization across multi-step LLM pipelines →read the paper
Reasoning, perception, and learning loops
🌟 Learning from Your Own Mistakes – constructs micro-reflective trajectories for self-distillation and reasoning improvement →read the paper
🌟 Learning from the Self-future – teaches diffusion language models from their own future trajectories →read the paper
Zone of Proximal Policy Optimization – moves supervision from gradients into prompts through teacher-guided optimization →read the paper
Native Active Perception as Reasoning for Omni-Modal Understanding – treats perception as an active reasoning process, not just input processing →read the paper
S-Agent – shows how spatial tool use can elicit stronger spatial reasoning →read the paper
Architecture and efficient scaling
Variable-Width Transformers – explores adaptive-width computation as a scaling path beyond fixed transformer blocks →read the paper
Grouped Query Experts – applies mixture-of-experts routing inside grouped-query attention →read the paper
HydraHead – studies head-level specialization and hybrid attention, useful for understanding where transformer efficiency actually comes from →read the paper
Tapered Language Models – questions uniform transformer width and explores models whose width changes across depth →read the paper
Deeper is Not Always Better – reduces alignment tax through confident layer decoding instead of always using the full model depth →read the paper
Safety, release strategy, and conceptual direction
Toward Open Weight Models Without Risks – proposes separating public and private capabilities inside open-weight model releases →read the paper
Causal Discovery in the Era of Agents – argues agents can assist causal workflows, but only if causal claims are constrained instead of hallucinated →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.