FOD#115: What is Artificial General...Science?

plus the best curated list of important models, related research papers, and what to read

This Week in Turing Post:

  • Wednesday / AI 101 series: Critiques of World Models

  • Friday / AI Literacy / 2nd episode – “Teaching Machines to Think (and why it’s not actually thinking)”

Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus labs and institutions such as Ai2, MIT, Berkeley, and .gov agencies, and thousands of others to really understand what’s going on with AI →

This week, both the “Models to Pay Attention To” and “Research Papers” categories are especially interesting – you can find them at the end of this email.

Now, to the main topic: Suddenly, in just one week, we got closer to what may soon be called Artificial General Science. In other words: AI-powered autonomous scientific discovery.

For centuries, science was our tool to understand the world. Now we’re building tools that can do science themselves. It’s a bit like inventing the telescope and finding out it’s writing astronomy papers while you sleep.

The first major inflection point was AlphaFold. In 2020 DeepMind stunned the life sciences community by predicting protein structures with near-experimental accuracy. For decades this problem had resisted solution. AlphaFold showed that AI could generate answers to questions that had long been out of reach, reshaping the trajectory of biology.

The next milestone came in 2022 with AlphaTensor, where an RL-based agent discovered faster algorithms for matrix multiplication, improving on human-devised methods for the first time in half a century. This was algorithmic invention – a sign that AI could push into mathematics, not only applied domains. Phenomenal.
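
For a sense of what “faster algorithms for matrix multiplication” means here: Strassen’s 1969 scheme multiplies 2×2 blocks with 7 products instead of the naive 8, and AlphaTensor searched the space of exactly such decompositions, finding, for example, a 47-multiplication recipe for 4×4 matrices in modular arithmetic. A minimal sketch of the trick it generalizes:

```python
# Strassen's 2x2 scheme: 7 scalar multiplications instead of the naive 8.
# AlphaTensor searches for decompositions of this same kind, only larger.

def strassen_2x2(A, B):
    """Multiply 2x2 matrices (lists of lists) using 7 products."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]

# Sanity check against the naive 8-multiplication product:
A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
naive = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
assert strassen_2x2(A, B) == naive  # [[19, 22], [43, 50]]
```

Applied recursively to block matrices, saving that one multiplication per 2×2 step is what pushes the asymptotic exponent below 3 – AlphaTensor’s reinforcement learner hunts for bigger decompositions of the same tensor.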

By 2024, Sakana’s AI Scientist stitched these lessons together into a prototype of an autonomous research agent. It combined literature review, experiment design, execution, and paper drafting. The results were uneven, but the ambition was right: move from models that excel at single tasks to agents that navigate the entire scientific workflow.

Then, in early 2025, Google unveiled the AI co-scientist, built on Gemini 2.0. This multi-agent system mirrored the reasoning steps of the scientific method – generation, reflection, ranking, evolution, meta-review. It wasn’t limited to text generation: it produced hypotheses that were validated in laboratories. Repurposed drugs for leukemia were confirmed in vitro, epigenetic targets for liver fibrosis were tested in human organoids, and a mechanistic explanation of antimicrobial resistance aligned with fresh experimental findings. AI was beginning to propose ideas that human scientists could take straight into the lab.
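
Google describes this loop at the level of agent roles rather than code, but a toy skeleton makes the architecture concrete. Everything below – the llm() stub, the Hypothesis class, the prompts, the pool sizes – is our own hypothetical scaffolding, not the co-scientist’s actual interface:

```python
# Toy generation -> reflection -> ranking -> evolution -> meta-review loop,
# in the spirit of Google's AI co-scientist. All names and prompts here are
# hypothetical illustrations, not the real system's API.
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-tuned model."""
    raise NotImplementedError("wire up a model client here")

@dataclass
class Hypothesis:
    text: str
    critiques: list = field(default_factory=list)
    score: float = 0.0

def generate(goal, n=8):
    return [Hypothesis(llm(f"Propose a testable hypothesis for: {goal}"))
            for _ in range(n)]

def reflect(h):
    h.critiques.append(llm(f"Critique novelty and feasibility of: {h.text}"))
    return h

def rank(pool):
    # The paper uses tournament-style pairwise comparisons (Elo ranking);
    # a single scalar score keeps this sketch short.
    for h in pool:
        h.score = float(llm(f"Score 0-10 given {h.critiques}: {h.text}"))
    return sorted(pool, key=lambda h: h.score, reverse=True)

def evolve(elites):
    return [Hypothesis(llm(f"Refine or recombine: {[h.text for h in elites]}"))
            for _ in elites]

def co_scientist(goal, rounds=3):
    pool = generate(goal)
    for _ in range(rounds):
        pool = rank([reflect(h) for h in pool])
        pool = pool[:4] + evolve(pool[:4])  # keep elites, breed variants
    survivors = rank([reflect(h) for h in pool])[:4]
    meta = llm(f"Meta-review these hypotheses: {[h.text for h in survivors]}")
    return survivors, meta
```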

And then came August 2025 – its final two weeks may well be remembered as the ignition point of Artificial General Science.

Image Credit: Virtuous Machines: Towards AGS

A few heavyweight papers I’d like you to know about (find the links in the reading list below): Virtuous Machines (160 pages; Wehr et al.) laid out a philosophical roadmap for AGS. It is also “the first demonstration of an autonomously conducted, end-to-end online experiment with human participants”. The aiXiv project (60 pages; Zhang et al.) introduced a publishing ecosystem where AI scientists can submit, review, and refine research. From AI for Science to Agentic Science (74 pages; Wei et al.) gave the field a structured vocabulary and positioned agentic systems as the new paradigm.

We also got Intern-S1 from Shanghai AI Laboratory – a 241-billion-parameter model trained on trillions of scientific tokens that outperforms rivals on physics and chemistry tasks. And Microsoft Research contributed MindJourney, a test-time reasoning framework in which models reason by simulating 3D exploration. MindJourney showed an 8% leap on spatial-aptitude benchmarks and offered a path for scientific agents to “move” through environments they cannot directly sense – a capability crucial for robotics, geoscience, and lab automation.
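
MindJourney’s core loop is simple to outline: a video world model imagines candidate egocentric moves, the VLM scores how useful each imagined view is for the question at hand, and the most promising trajectories are kept and extended. The sketch below is our own paraphrase with hypothetical interfaces (WorldModel, vlm_score, ACTIONS), not Microsoft’s code:

```python
# Test-time spatial exploration in the spirit of MindJourney: imagine views
# with a world model, let the VLM keep the most helpful ones. All interfaces
# here are hypothetical sketches.

ACTIONS = ["move_forward", "turn_left", "turn_right"]

class WorldModel:
    def imagine(self, view, action):
        """Synthesize the egocentric view after taking `action` (stub)."""
        raise NotImplementedError

def vlm_score(question: str, views: list) -> float:
    """How confidently the VLM can answer `question` from `views` (stub)."""
    raise NotImplementedError

def spatial_beam_search(question, image, wm: WorldModel, depth=3, beam=2):
    beams = [([image], 0.0)]  # (imagined trajectory, score)
    for _ in range(depth):
        candidates = []
        for views, _ in beams:
            for action in ACTIONS:
                extended = views + [wm.imagine(views[-1], action)]
                candidates.append((extended, vlm_score(question, extended)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]  # best trajectory; append it to the VLM's prompt
```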

That convergence is astonishing. Within a week, we saw the capability (Intern-S1, MindJourney), the infrastructure (aiXiv), and the intellectual frame (Agentic Science, Virtuous Machines) appear together. Independent groups, different traditions, all arriving at the same horizon: science conducted by AI, autonomously.

This science acceleration is breathtaking. And this past week will be remembered as more than a cluster of publications. It was the moment discovery itself began to scale at machine tempo — and the scientific method gained a sleepless new collaborator.

From our partners: Clarifai’s Local Runners → A Secure API Bridge for Your Models

Every model iteration usually means another redeployment – and lost time.
With Clarifai’s Local Runners, you can share access to models running on your laptop securely with anyone on the Internet.
Think of it as ngrok for AI workloads: run models locally, test changes instantly, debug faster, and share securely without standing up new infrastructure or transferring gigabytes of weights back and forth.
Keep building locally. Share securely. Scale effortlessly. Start for free.

Our 3 WOWs and 1 Promise: Just to confirm our own consciousness, we must learn to read.

Reading List / papers from the editorial:

  • Virtuous Machines: Towards Artificial General Science by Gabrielle Wehr et al. (2025) →read the paper

  • aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists by Pengsong Zhang et al. (2025) →read the paper

  • From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery by Jiaqi Wei et al. (2025) →read the paper

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

Models to pay attention to:

  • Intern-S1: A scientific multimodal foundation model by Shanghai AI Lab (open-source)
    This is a 241B-parameter multimodal Mixture-of-Experts model with 28B active parameters, optimized for scientific reasoning. Trained on 5T tokens (2.5T of them scientific), it supports text, images, molecular structures, and time-series data. Intern-S1 introduces a dynamic tokenizer and a Mixture-of-Rewards RL framework, achieving 75.0 on MatBench, 83.4 on ChemBench, and 65.7 on MSEarthMCQ – outperforming both open- and closed-source models →read the paper

  • Nemotron Nano 2 by NVIDIA (open-source)
    Researchers from NVIDIA developed Nemotron-Nano-9B-v2, a 9B hybrid Mamba-Transformer LLM optimized for reasoning. It achieves 3–6× higher throughput than Qwen3-8B and matches or exceeds its accuracy across benchmarks like MATH (80.5), BFCLv3 (66.9), RULER-128k (82.2), and AIME24 (30.0). Built via FP8 pretraining on 20T tokens with 128k-context support, it’s compressed from a 12B model using Minitron pruning and distillation. It runs on a single 22GB A10G GPU →read the paper

  • Command A Reasoning: Enterprise-grade control for AI agents by Cohere
    This advanced enterprise-grade LLM is optimized for agentic workflows and deep reasoning. It supports 128k–256k context length, runs on a single H100/A100 GPU, and includes token budgeting to balance cost and performance. It outperforms gpt-oss-120b, DeepSeek-R1 0528, and Magistral Medium across BFCL-v3, Tau-bench, and multilingual tasks. Powering North, it excels in long-form research, safety evals, and adaptive latency-performance tradeoffs →read their blog

  • DeepSeek V3.1 release (open-source)
    This release features hybrid inference modes (“Think” and “Non-Think”) in a single model with 128k context support. The “Think” mode boosts tool use and multi-step reasoning, outperforming DeepSeek-R1-0528 on complex tasks. Pretrained on an additional 840B tokens for long-context extension, the model shows gains on SWE and Terminal-Bench. It supports the Anthropic API format, strict function calling (beta), and improved thinking efficiency, marking a major step toward agent-oriented LLM performance →read the release notes

  • Ovis2.5 by Alibaba (open-source)
    This multimodal LLM integrates a native-resolution vision transformer (NaViT) for fine-grained visual perception and a “thinking mode” for reflective reasoning. The 9B model achieves 78.3 on OpenCompass, outperforming all open-source models under 40B. It ranks highest on STEM, chart, OCR, grounding, and video tasks. Training involves a five-phase curriculum and GRPO-based reinforcement learning, with hybrid parallelism delivering a 3–4× speedup. A compact 2B version scores 73.9 →read the paper

  • DINOv3 by Meta AI
    This is a 7B-parameter self-supervised vision foundation model trained on 1.7B curated web images. DINOv3 uses a new "Gram anchoring" method to stabilize dense features over long training, enabling state-of-the-art performance in segmentation (ADE20k mIoU 63.0), detection (COCO mAP 66.1), and depth estimation (NYUv2 RMSE 0.309). It supports resolutions up to 4096² and distills into ViT and ConvNeXt models, achieving strong results on 3D tasks, remote sensing, and OCR with no fine-tuning (a rough sketch of the Gram-anchoring idea appears after this list) →read the paper

  • Matrix-Game 2.0 by Skywork AI (open-source)
    This is a real-time, auto-regressive, diffusion-based interactive world model that achieves 25 FPS video generation on a single H100 GPU. Trained on 1,200 hours of GTA5 and Unreal Engine data with action annotations, it uses a 1.8B-parameter DiT with frame-level keyboard/mouse control. Distilled via Self-Forcing into a few-step model, it outperforms Oasis and YUME in visual quality, temporal coherence, and controllability. Ablations confirm that cache tuning and denoising steps are critical for long-sequence fidelity →read the paper
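
One item from this list compact enough to show in code is DINOv3’s Gram anchoring (flagged above): during long training runs, the student’s patch-feature Gram matrix is pulled toward that of an earlier “Gram teacher” checkpoint, which preserves dense-feature quality without pinning the features themselves. A rough PyTorch sketch, assuming [batch, patches, dim] feature tensors; the real recipe’s weighting and schedules follow the paper:

```python
import torch
import torch.nn.functional as F

def gram_anchor_loss(student_feats: torch.Tensor,
                     teacher_feats: torch.Tensor) -> torch.Tensor:
    """Match patchwise similarity structure (Gram matrices), not features.

    A sketch of DINOv3-style Gram anchoring. Both inputs are
    [batch, num_patches, dim]; the teacher is a frozen earlier checkpoint.
    """
    s = F.normalize(student_feats, dim=-1)  # cosine-style similarities
    t = F.normalize(teacher_feats, dim=-1)
    gram_s = s @ s.transpose(1, 2)          # [batch, P, P] patch similarities
    gram_t = t @ t.transpose(1, 2)
    return ((gram_s - gram_t) ** 2).mean()

# Usage sketch: add to the main SSL objective with some weight w, e.g.
# loss = dino_loss + w * gram_anchor_loss(student_p, gram_teacher_p.detach())
```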

Interesting surveys

Image Credit: The original paper

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟

Agentic Systems and GUI Automation

  • 🌟 Mobile-Agent-v3 (by Alibaba)
    build foundational GUI agents with large-scale environments, modular skills, and scalable RL for cross-platform tasks → read the paper

  • 🌟 Prompt Orchestration Markup Language (by Microsoft)
    structure complex multi-modal prompts with a markup system that supports styling, integration, and collaboration → read the paper

Reinforcement Learning for Alignment and Agent Training

  • AgentFly
    adapt LLM agents with memory-based online RL instead of gradient updates, enabling continual learning without fine-tuning → read the paper

  • End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
    train medical RAG agents with reinforcement learning to improve retrieval, reasoning, and traceability for diagnosis → read the paper

  • Reinforcement Learning with Rubric Anchors
    extend verifiable RL with rubric-based rewards to handle open-ended, subjective tasks with stylistic control → read the paper

  • On-Policy RL Meets Off-Policy Experts
    harmonize supervised fine-tuning and RL with dynamic weighting to balance imitation and exploration → read the paper

  • 🌟 SSRL: Self-Search Reinforcement Learning (by Tsinghua University)
    train LLMs to use their own internal knowledge for search tasks, reducing reliance on external engines → read the paper

  • 🌟 Atom-Searcher (by Ant Group)
    guide agentic deep research with fine-grained “atomic thought” rewards for interpretable reasoning steps → read the paper

Reasoning, Interpretability, and Control

  • 🌟 MindJourney (by Microsoft)
    improve spatial reasoning by simulating egocentric 3D viewpoints for VLMs at test time → read the paper

  • 🌟 Deep Think with Confidence (by Meta AI, UCSD)
    filter low-quality reasoning traces using internal confidence signals to improve efficiency and accuracy → read the paper

  • 🌟 Controlling Multimodal LLMs via Reward-guided Decoding (by Mila, Université de Montréal, McGill University, Meta, CIFAR)
    steer MLLMs during inference by weighting precision–recall trade-offs with reward models → read the paper

  • X-Node: Self-Explanation is All We Need
    integrate per-node self-explanations into GNN predictions for interpretability in medical imaging → read the paper

Privacy, Safety, and Unlearning

  • Scalable Private Partition Selection via Adaptive Weighting
    scale differentially private partition selection to massive datasets for applications like private vocab extraction → read the paper

  • 🌟 CRISP: Persistent Concept Unlearning via Sparse Autoencoders (by Technion)
    remove harmful knowledge permanently by suppressing feature activations across layers → read the paper

  • 🌟 Unlearning Comparator (by Sungkyunkwan University)
    visualize and compare unlearning methods across accuracy, efficiency, and privacy dimensions → read the paper

Efficiency, Scaling, and Inference Optimization

  • TPLA: Tensor Parallel Latent Attention
    accelerate tensor-parallel inference with compressed KV caching while preserving accuracy → read the paper

  • 🌟 XQuant (by UC Berkeley)
    rematerialize keys and values on-the-fly with quantized activations, cutting LLM memory needs up to 12× (a toy sketch of the idea follows this list) → read the paper

  • 🌟 BeyondWeb (by DatologyAI)
    generate large-scale synthetic pretraining data to overcome the data wall in trillion-scale LLM training → read the paper
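
The XQuant entry above is worth unpacking: instead of caching K and V for every layer, you cache the (quantized) layer input X once and rematerialize K = XW_K and V = XW_V during decode, trading a small extra matmul for roughly half the cache footprint. A toy sketch with made-up per-tensor int8 quantization; the paper’s actual quantizers and variants are more involved:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Toy symmetric per-tensor int8 quantization (illustrative only)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-127, 127).to(torch.int8), scale

class XQuantCache:
    """Sketch of XQuant-style caching for one attention layer: store
    quantized inputs X and rematerialize K and V on the fly, so one cached
    tensor replaces two."""

    def __init__(self, w_k: torch.Tensor, w_v: torch.Tensor):
        self.w_k, self.w_v = w_k, w_v   # [dim, dim] projection weights
        self.steps = []                 # (int8 X, scale) per decode step

    def append(self, x: torch.Tensor):       # x: [dim] for the new token
        self.steps.append(quantize_int8(x))  # cache X, not K/V

    def keys_values(self):
        x = torch.stack([q.float() * s for q, s in self.steps])  # [T, dim]
        return x @ self.w_k, x @ self.w_v  # rematerialized K, V: [T, dim]
```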

Retrieval and Long-Context Reasoning

  • 🌟 Retrieval-augmented reasoning with lean language models (by The Alan Turing Institute, Oxford, Cambridge, Imperial College London)
    combine reasoning and retrieval in small, domain-specific models for secure and efficient deployment → read the paper

  • ComoRAG
    introduce memory-organized retrieval cycles to support stateful long-narrative reasoning → read the paper

Evaluation of Human-Centered Concepts

  • Leveraging Large Language Models for Predictive Analysis of Human Misery
    predict perceived misery scores from natural language with gamified evaluation setups → read the paper

  • Beyond Human Judgment
    evaluate moral value understanding in LLMs using a Bayesian framework that accounts for human disagreement → read the paper

That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

How was today's FOD?

Please give us some constructive feedback
