FOD#135: What It Means When AI Labs Step Into Healthcare
plus robots from CES
This Week in Turing Post:
Wednesday / AI 101 series: Web World Models
Friday / We will start a New Series!
🤝 From our partners: Vault-Free Privileged Access for Modern Engineering Teams
As AI and cloud infrastructure scale, managing privileged access with static credentials and vaults becomes both a bottleneck and a risk. Teleport replaces rotated credentials and vaulted secrets with real Zero Trust, issuing short-lived, cryptographic certificates at runtime for every human, machine, and AI agent.
Our news digest is always free. Click on the partner’s link above to support us, or Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, a16z, etc., plus AI labs and institutions such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what’s going on with AI →
Last week at CES: Robots! More Robots! And Jensen Huang says they will have human-level capabilities THIS year. We went to see if robots were aware of that. Watch the video :)
Also last week: Why OpenAI and Anthropic Chose Healthcare at the Same Time
Right after the holidays, both OpenAI and Anthropic announced healthcare-focused initiatives within days of each other. For the first time, I don’t think of it as a competition. What I like about it is the signal it sends: healthcare has crossed a threshold where staying out is no longer the cautious choice.
For several years, healthcare was treated as a deferred domain by leading AI labs. Understandably so: the sector is heavily regulated, operationally fragmented, and unforgiving of confident mistakes. Earlier generations of models were difficult to bound, difficult to audit, and prone to failure modes that could not be cleanly isolated from their successes. In low-stakes domains, that was acceptable. In healthcare – not at all.
The decision by both labs to move now implies a shared conclusion that something fundamental has changed. The models are certainly more capable now, but more importantly – they are more governable.
Healthcare is therefore better understood as a systems test rather than a market opportunity. This is a hugely important step in AI adoption.
Another point worth mentioning: doctors should not be worried. What AI is being applied to is coordination. It’s an old problem in healthcare that no one is structurally positioned to assemble full context under time pressure: information is distributed across multiple systems, and signals from medications, labs, imaging, wearables, genetics, and prior history are rarely considered together when decisions are made – and patients are left to play detective, piecing it all together on their own. In this framing, LLMs are not making medical judgments. They mainly help bring existing information together so it can be reviewed more easily.
Both labs appear to believe this coordination role is now stable enough to turn into a product.
Where the two labs differ is in how they approach this coordination role.
OpenAI is extending its general assistant into healthcare, treating health data as another high-value context that can sit alongside documents, calendars, and enterprise tools, with additional privacy and access controls layered on top. The underlying assumption is that a single, familiar interface can serve patients, clinicians, and administrative workflows, as long as the boundaries around data use are clearly defined.
Anthropic is taking a narrower approach. Its healthcare effort is oriented less toward a patient-facing assistant and more toward embedding Claude inside existing institutional workflows. The emphasis is on predictable behavior, limited scope, and alignment with how healthcare organizations already operate. Rather than broad continuity across use cases, the focus is on fitting cleanly into specific professional contexts.
These choices of focus reflect different theories of how trust is built in regulated systems. One assumes trust emerges from continuity and widespread use, the other from constraint and institutional alignment. It is not yet clear which approach will prove more durable, and it is possible that both will coexist in different parts of the system. What matters is that both labs are now willing to test their models in an environment where responsibility cannot remain abstract. I’m very excited about this new development.
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
Twitter Library
We are reading
On the Slow Death of Scaling by Sara Hooker
a16z: The Power Brokers by Packy McCormick
Zhipu AI and MiniMax Just Went Public, But They're Not China's OpenAI by Recode China
Inside MiniMax: Testing if AGI is Possible Without Infinite VC Money
News from the usual suspects
Gmail Gets Gemini-fied
Gmail is stepping into 2026 with Gemini AI at the helm. Google’s flagship inbox now offers AI Overviews to summarize email threads, answer natural language queries, and filter clutter with the upcoming “AI Inbox.” Help Me Write and Suggested Replies get smarter, while proofreading goes premium. It’s no longer just email – it’s your AI-powered executive assistant.
Apple + Google: The Gemini Marriage
Apple has picked Google’s Gemini to power the long-delayed AI upgrade to Siri, marking a rare alliance between rivals. The multiyear partnership puts Gemini models at the core of Apple’s upcoming “Foundation Models,” keeping compute mostly on-device and in Apple’s private cloud. Apple remains mum on the $1B/year price tag, but this move signals Cupertino is finally showing up to the AI arms race – fashionably late, of course.
Musk's Macrohard Moment
xAI, Elon Musk’s AI venture, torched $7.8 billion in just nine months, chasing its dream of powering humanoid robots like Optimus. Despite swelling quarterly losses, revenue doubled to $107 million, and a $20B cash injection (featuring Nvidia) suggests the spending spree is far from over. "Macrohard" may be a pun on Microsoft – but the burn rate is no joke.
🔦 Research highlight

Researchers from MIT CSAIL present Recursive Language Models (RLMs), a novel inference-time architecture enabling LLMs to process arbitrarily long prompts – scaling beyond 10 million tokens, over 100× typical context windows. Instead of consuming the prompt directly, RLMs offload it into a Python REPL as a variable (context), allowing the LLM to symbolically interact with the prompt via code. The model can read, transform, and decompose the context and recursively call sub-LLMs through a built-in llm_query() function. This enables dynamic task decomposition, selective context access, and unbounded reasoning. RLMs require no retraining and work with existing models (GPT-5, Qwen3-Coder), achieving up to 2× higher accuracy than base LLMs and long-context agents on benchmarks like BrowseComp+, OOLONG, and OOLONG-Pairs, while keeping inference cost comparable or lower. Ablation studies confirm the critical role of both the REPL environment and recursive sub-calls in solving complex, information-dense tasks.
This is a significant step forward because RLMs break the fundamental context window barrier of LLMs – enabling scalable, symbolic, and recursive reasoning over massive inputs without retraining or architectural changes →read the paper
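To make the mechanism concrete, here is a minimal Python sketch of the idea, assuming a generic llm() client. The chunking loop and function names are illustrative, not the paper’s actual implementation – in the paper, the model writes and runs this kind of code itself inside the REPL rather than following a fixed loop.
```python
# Minimal sketch of the Recursive Language Model (RLM) idea: hold the huge prompt
# as a Python variable, slice it in code, query sub-LLMs per slice, then aggregate.
# All names here are illustrative placeholders, not the authors' API.

def llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion model client."""
    raise NotImplementedError("plug in a real model client here")

def llm_query(snippet: str, question: str) -> str:
    """Recursive sub-call: ask a (sub-)LLM about one manageable slice of the context."""
    return llm(f"Context:\n{snippet}\n\nQuestion: {question}\nAnswer concisely.")

def rlm_answer(context: str, question: str, chunk_size: int = 50_000) -> str:
    """Answer a question about a context far larger than any single context window."""
    # Slice the context symbolically instead of feeding it to the model directly.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    # Each chunk is handled by an independent sub-call.
    partials = [llm_query(chunk, question) for chunk in chunks]
    # A final call aggregates the partial answers into one response.
    return llm(
        "Combine these partial answers into one final answer.\n"
        f"Question: {question}\n" + "\n".join(f"- {p}" for p in partials)
    )
```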
Models
Liquid: LFM2.5 – The Next Generation of On-Device AI
Release an open-weight 1.2B-class model family optimized for edge agents by extending pretraining to 28T tokens, scaling post-training with multi-stage reinforcement learning, and shipping text, Japanese, vision-language, and native audio variants with day-zero runtime support across common inference stacks and NPUs →read the paper
MiMo-V2-Flash Technical Report
Deliver fast, strong reasoning and agentic performance by combining a large MoE backbone with hybrid attention, multi-token prediction, and multi-teacher on-policy distillation to push decoding speed and parameter efficiency →read the paper
K-EXAONE Technical Report
Provide a multilingual MoE foundation model with long-context support that targets balanced reasoning, agentic, and industrial capabilities across multiple major languages →read the paper
LTX-2: Efficient Joint Audio-Visual Foundation Model
Generate temporally synchronized video and audio in a single unified model by coupling asymmetric modality-specific transformers through cross-attention for efficient, controllable audiovisual synthesis →read the paper
Research this week
(🌟 indicates papers we recommend paying close attention to)
World models, environments, and embodied learning
Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models
Unify how AI augments digital twins across modeling, mirroring, intervention, and autonomous management stages →read the paper
🌟 WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks (Microsoft)
Provide a large-scale, non-stationary web environment with rubric-based rewards to train and evaluate visual web agents →read the paper
Scaling Behavior Cloning Improves Causal Reasoning
Show that scaling data and depth in behavior cloning improves causal policies in real-time video game agents →read the paper
Evolving Programmatic Skill Networks
Grow a compositional network of executable skills that reflect, refactor, and stabilize over time in open-ended environments →read the paper
Agents, tools, and orchestration
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Route across models and tools using training-free priors and reinforcement learning to exploit heterogeneity in complex reasoning tasks →read the paper
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Interleave multimodal chain-of-thought reasoning with autonomous tool invocation to solve open-ended, real-world problems →read the paper
RelayLLM: Efficient Reasoning via Collaborative Decoding
Coordinate small and large models at the token level so lightweight models request help only when needed to cut inference cost →read the paper
🌟 Over-Searching in Search-Augmented Large Language Models (Apple)
Diagnose when retrieval harms efficiency and truthfulness and propose metrics and mitigations for search overuse →read the paper
Can We Predict Before Executing Machine Learning Agents?
Replace costly execution with predictive reasoning by internalizing execution priors and using a predict-then-verify loop →read the paper
GenCtrl: A Formal Controllability Toolkit for Generative Models
Formalize controllability as a control problem and estimate controllable sets to expose the limits of human influence over generation →read the paper
Agent memory, long-horizon reasoning, and experience compression
SimpleMem: Efficient Lifelong Memory for LLM Agents
Compress interaction histories into high-density semantic memory units, consolidate them asynchronously into abstractions, and retrieve them adaptively to reduce token cost while preserving long-term performance →read the paper
MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents
Represent memories across semantic, temporal, causal, and entity graphs and retrieve them via policy-guided traversal to enable interpretable, query-aligned long-horizon reasoning →read the paper
Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning
Organize experiences into an event graph with explicit logical relations to support structured navigation over memory instead of shallow similarity search →read the paper
Distilling Feedback into Memory-as-a-Tool
Amortize inference-time critique by storing feedback as retrievable guidelines that agents can reuse as a tool to reduce reasoning cost →read the paper
Agent evaluation, verification, and confidence
Agent-as-a-Judge
Evolve evaluation from single-pass model judging to agentic judges with planning, tools, collaboration, and memory to enable verifiable multi-step assessment →read the paper
Agentic Rubrics as Contextual Verifiers for SWE Agents
Generate repository-specific rubric checklists via agent interaction to verify code patches without executing tests while remaining grounded and interpretable →read the paper
Confidence Estimation for LLMs in Multi-turn Interactions
Measure and improve confidence calibration across turns by formalizing monotonicity and per-turn reliability as context accumulates →read the paper
Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Evaluate belief robustness by probing consistency across contextual neighborhoods rather than relying on point-wise self-consistency →read the paper
Reasoning dynamics, structure, and control
DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs
Reformulate chain-of-thought generation as an iterative denoising process to enable retrospective correction of reasoning steps →read the paper
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Analyze long reasoning traces as structured interaction patterns and guide the synthesis of stable reasoning trajectories →read the paper
Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy
Decompose large counting tasks into reliable subproblems and trace how intermediate counts are represented and aggregated inside the model →read the paper
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Probe how latent reasoning forms across languages and show that internal reasoning dynamics largely follow an English-centered pathway →read the paper
Parallel Latent Reasoning for Sequential Recommendation
Scale reasoning width by exploring multiple latent reasoning trajectories in parallel to improve generalization under real-time constraints →read the paper
Training efficiency, data efficiency, and optimization
SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
Push lightweight supervised fine-tuning to state-of-the-art SWE performance through curated datasets, curriculum design, and verifier-based test-time scaling →read the paper
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
Demonstrate that a single, carefully engineered training sample can unlock broad reasoning gains across domains via reinforcement learning →read the paper
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Suppress destructive gradients on confident-but-conflicting tokens by gating updates with entropy to reduce catastrophic forgetting during fine-tuning →read the paper
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Replace fixed norm equilibria with learnable scaling factors to adapt weight magnitudes to data and improve downstream performance →read the paper
🌟 GDPO: Group reward-Decoupled Normalization Policy Optimization (Nvidia)
Decouple reward normalization in multi-reward reinforcement learning to preserve signal resolution and improve training stability →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How did you like it?