FOD#135: What It Means When AI Labs Step Into Healthcare
plus robots from CES
This Week in Turing Post:
Wednesday / AI 101 series: Web World Models
Friday / We will start a New Series!
🤝 From our partners: Vault-Free Privileged Access for Modern Engineering Teams
As AI and cloud infrastructure scale, managing privileged access with static credentials and vaults becomes both a bottleneck and a risk. Teleport replaces rotated credentials and vaulted secrets with real Zero Trust, issuing short-lived, cryptographic certificates at runtime for every human, machine, and AI agent.
Our news digest is always free. Click on the partner’s link above to support us, or Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, a16z, etc., plus AI labs and institutions such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what’s going on with AI →
Last week at CES: Robots! More Robots! And Jensen Huang says they will have human-level capabilities THIS year. We went to see if robots were aware of that. Watch the video :)
Also last week: Why OpenAI and Anthropic Chose Healthcare at the Same Time
Right after the holidays, both OpenAI and Anthropic announced healthcare-focused initiatives within days of each other. For the first time, I don’t think of it as a competition. What I like about it is the signal it sends: healthcare has crossed a threshold where staying out is no longer the cautious choice.
For several years, healthcare was treated as a deferred domain by leading AI labs. Understandably so: the sector is heavily regulated, operationally fragmented, and unforgiving of confident mistakes. Earlier generations of models were difficult to bound, difficult to audit, and prone to failure modes that could not be cleanly isolated from their successes. In low-stakes domains, that was acceptable. In healthcare – not at all.
The decision by both labs to move now implies a shared conclusion that something fundamental has changed. The models are certainly more capable now, but more importantly – they are more governable.
Healthcare is therefore better understood as a systems test rather than a market opportunity. This is a hugely important step in AI adoption.
Another point worth mentioning: doctors should not be worried. What AI is being applied to is coordination. It’s an old problem in healthcare that no one is structurally positioned to assemble full context under time pressure: information is distributed across multiple systems, and signals from medications, labs, imaging, wearables, genetics, and prior history are rarely considered together when decisions are made – and patients are left to play detective, piecing it all together on their own. In this framing, LLMs are not making medical judgments. They mainly help bring existing information together so it can be reviewed more easily.
Both labs appear to believe this coordination role is now stable enough to turn into a product.
Where the two labs differ is in how they approach this coordination role.
OpenAI is extending its general assistant into healthcare, treating health data as another high-value context that can sit alongside documents, calendars, and enterprise tools, with additional privacy and access controls layered on top. The underlying assumption is that a single, familiar interface can serve patients, clinicians, and administrative workflows, as long as the boundaries around data use are clearly defined.
Anthropic is taking a narrower approach. Its healthcare effort is oriented less toward a patient-facing assistant and more toward embedding Claude inside existing institutional workflows. The emphasis is on predictable behavior, limited scope, and alignment with how healthcare organizations already operate. Rather than broad continuity across use cases, the focus is on fitting cleanly into specific professional contexts.
These choices of focus reflect different theories of how trust is built in regulated systems. One assumes trust emerges from continuity and widespread use, the other from constraint and institutional alignment. It is not yet clear which approach will prove more durable, and it is possible that both will coexist in different parts of the system. What matters is that both labs are now willing to test their models in an environment where responsibility cannot remain abstract. I’m very excited about this new development.
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
Twitter Library
We are reading
On the Slow Death of Scaling by Sara Hooker
a16z: The Power Brokers by Packy McCormick
Zhipu AI and MiniMax Just Went Public, But They're Not China's OpenAI by Recode China
Inside MiniMax: Testing if AGI is Possible Without Infinite VC Money
News from the usual suspects
Gmail Gets Gemini-fied
Gmail is stepping into 2026 with Gemini AI at the helm. Google’s flagship inbox now offers AI Overviews to summarize email threads, answer natural language queries, and filter clutter with the upcoming “AI Inbox.” Help Me Write and Suggested Replies get smarter, while proofreading goes premium. It’s no longer just email – it’s your AI-powered executive assistant.
Apple + Google: The Gemini Marriage
Apple has picked Google’s Gemini to power the long-delayed AI upgrade to Siri, marking a rare alliance between rivals. The multiyear partnership puts Gemini models at the core of Apple’s upcoming “Foundation Models,” keeping compute mostly on-device and in Apple’s private cloud. Apple remains mum on the $1B/year price tag, but this move signals Cupertino is finally showing up to the AI arms race – fashionably late, of course.
Musk's Macrohard Moment
xAI, Elon Musk’s AI venture, torched $7.8 billion in just nine months, chasing its dream of powering humanoid robots like Optimus. Despite swelling quarterly losses, revenue doubled to $107 million, and a $20B cash injection (featuring Nvidia) suggests the spending spree is far from over. "Macrohard" may be a pun on Microsoft – but the burn rate is no joke.
🔦 Research highlight

Researchers from MIT CSAIL present Recursive Language Models (RLMs), a novel inference-time architecture enabling LLMs to process arbitrarily long prompts – scaling beyond 10 million tokens, over 100× typical context windows. Instead of consuming the prompt directly, RLMs offload it into a Python REPL as a variable (context), allowing the LLM to symbolically interact with the prompt via code. The model can read, transform, and decompose the context and recursively call sub-LLMs through a built-in llm_query() function. This enables dynamic task decomposition, selective context access, and unbounded reasoning. RLMs require no retraining and work with existing models (GPT-5, Qwen3-Coder), achieving up to 2× higher accuracy than base LLMs and long-context agents on benchmarks like BrowseComp+, OOLONG, and OOLONG-Pairs, while keeping inference cost comparable or lower. Ablation studies confirm the critical role of both the REPL environment and recursive sub-calls in solving complex, information-dense tasks.
This is a significant step forward because RLMs break the fundamental context window barrier of LLMs – enabling scalable, symbolic, and recursive reasoning over massive inputs without retraining or architectural changes →read the paper
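To make the mechanism concrete, here is a minimal Python sketch of the idea, assuming a generic llm() client. The chunking loop and function names are illustrative, not the paper’s actual implementation – in the paper, the model writes and runs this kind of code itself inside the REPL rather than following a fixed loop.
```python
# Minimal sketch of the Recursive Language Model (RLM) idea: hold the huge prompt
# as a Python variable, slice it in code, query sub-LLMs per slice, then aggregate.
# All names here are illustrative placeholders, not the authors' API.

def llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion model client."""
    raise NotImplementedError("plug in a real model client here")

def llm_query(snippet: str, question: str) -> str:
    """Recursive sub-call: ask a (sub-)LLM about one manageable slice of the context."""
    return llm(f"Context:\n{snippet}\n\nQuestion: {question}\nAnswer concisely.")

def rlm_answer(context: str, question: str, chunk_size: int = 50_000) -> str:
    """Answer a question about a context far larger than any single context window."""
    # Slice the context symbolically instead of feeding it to the model directly.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    # Each chunk is handled by an independent sub-call.
    partials = [llm_query(chunk, question) for chunk in chunks]
    # A final call aggregates the partial answers into one response.
    return llm(
        "Combine these partial answers into one final answer.\n"
        f"Question: {question}\n" + "\n".join(f"- {p}" for p in partials)
    )
```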
Models
Liquid: LFM2.5 – The Next Generation of On-Device AI
Release an open-weight 1.2B-class model family optimized for edge agents by extending pretraining to 28T tokens, scaling post-training with multi-stage reinforcement learning, and shipping text, Japanese, vision-language, and native audio variants with day-zero runtime support across common inference stacks and NPUs →read the paper
MiMo-V2-Flash Technical Report
Deliver fast, strong reasoning and agentic performance by combining a large MoE backbone with hybrid attention, multi-token prediction, and multi-teacher on-policy distillation to push decoding speed and parameter efficiency →read the paper
K-EXAONE Technical Report
Provide a multilingual MoE foundation model with long-context support that targets balanced reasoning, agentic, and industrial capabilities across multiple major languages →read the paper
LTX-2: Efficient Joint Audio-Visual Foundation Model
Generate temporally synchronized video and audio in a single unified model by coupling asymmetric modality-specific transformers through cross-attention for efficient, controllable audiovisual synthesis →read the paper
Research this week
(🌟 indicates papers we recommend paying close attention to)
World models, environments, and embodied learning
Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models
Unify how AI augments digital twins across modeling, mirroring, intervention, and autonomous management stages →read the paper
🌟 WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks (Microsoft)
Provide a large-scale, non-stationary web environment with rubric-based rewards to train and evaluate visual web agents →read the paper
Scaling Behavior Cloning Improves Causal Reasoning
Show that scaling data and depth in behavior cloning improves causal policies in real-time video game agents →read the paper
Evolving Programmatic Skill Networks
Grow a compositional network of executable skills that reflect, refactor, and stabilize over time in open-ended environments →read the paper
Agents, tools, and orchestration
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Route across models and tools using training-free priors and reinforcement learning to exploit heterogeneity in complex reasoning tasks →read the paper
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Interleave multimodal chain-of-thought reasoning with autonomous tool invocation to solve open-ended, real-world problems →read the paper
RelayLLM: Efficient Reasoning via Collaborative Decoding
Coordinate small and large models at the token level so lightweight models request help only when needed to cut inference cost →read the paper
🌟 Over-Searching in Search-Augmented Large Language Models (Apple)
Diagnose when retrieval harms efficiency and truthfulness and propose metrics and mitigations for search overuse →read the paper
Can We Predict Before Executing Machine Learning Agents?
Replace costly execution with predictive reasoning by internalizing execution priors and using a predict-then-verify loop →read the paper
GenCtrl: A Formal Controllability Toolkit for Generative Models
Formalize controllability as a control problem and estimate controllable sets to expose the limits of human influence over generation →read the paper
Agent memory, long-horizon reasoning, and experience compression
SimpleMem: Efficient Lifelong Memory for LLM Agents
Compress interaction histories into high-density semantic memory units, consolidate them asynchronously into abstractions, and retrieve them adaptively to reduce token cost while preserving long-term performance →read the paper
MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents
Represent memories across semantic, temporal, causal, and entity graphs and retrieve them via policy-guided traversal to enable interpretable, query-aligned long-horizon reasoning →read the paper
Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning
Organize experiences into an event graph with explicit logical relations to support structured navigation over memory instead of shallow similarity search →read the paper
Distilling Feedback into Memory-as-a-Tool
Amortize inference-time critique by storing feedback as retrievable guidelines that agents can reuse as a tool to reduce reasoning cost →read the paper
Agent evaluation, verification, and confidence
Agent-as-a-Judge
Evolve evaluation from single-pass model judging to agentic judges with planning, tools, collaboration, and memory to enable verifiable multi-step assessment →read the paper
Agentic Rubrics as Contextual Verifiers for SWE Agents
Generate repository-specific rubric checklists via agent interaction to verify code patches without executing tests while remaining grounded and interpretable →read the paper
Confidence Estimation for LLMs in Multi-turn Interactions
Measure and improve confidence calibration across turns by formalizing monotonicity and per-turn reliability as context accumulates →read the paper
Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Evaluate belief robustness by probing consistency across contextual neighborhoods rather than relying on point-wise self-consistency →read the paper
Reasoning dynamics, structure, and control
DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs
Reformulate chain-of-thought generation as an iterative denoising process to enable retrospective correction of reasoning steps →read the paper
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Analyze long reasoning traces as structured interaction patterns and guide the synthesis of stable reasoning trajectories →read the paper
Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy
Decompose large counting tasks into reliable subproblems and trace how intermediate counts are represented and aggregated inside the model →read the paper
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Probe how latent reasoning forms across languages and show that internal reasoning dynamics largely follow an English-centered pathway →read the paper
Parallel Latent Reasoning for Sequential Recommendation
Scale reasoning width by exploring multiple latent reasoning trajectories in parallel to improve generalization under real-time constraints →read the paper
Training efficiency, data efficiency, and optimization
SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
Push lightweight supervised fine-tuning to state-of-the-art SWE performance through curated datasets, curriculum design, and verifier-based test-time scaling →read the paper
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
Demonstrate that a single, carefully engineered training sample can unlock broad reasoning gains across domains via reinforcement learning →read the paper
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Suppress destructive gradients on confident-but-conflicting tokens by gating updates with entropy to reduce catastrophic forgetting during fine-tuning →read the paper
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Replace fixed norm equilibria with learnable scaling factors to adapt weight magnitudes to data and improve downstream performance →read the paper
🌟 GDPO: Group reward-Decoupled Normalization Policy Optimization (Nvidia)
Decouple reward normalization in multi-reward reinforcement learning to preserve signal resolution and improve training stability →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How did you like it?