FOD#116: Benchmarking Season
plus the best curated list of important models, related research papers, and what to read
This Week in Turing Post:
Wednesday / AI 101 series: xQuant
Friday / Interview:
Our news digest is always free. Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus AI labs and institutions such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what’s going on with AI →
Now, to the main topic: The Benchmarking Season
The past week in AI was quieter on the model front. The most notable launch was Gemini 2.5 Flash Image (aka Nano Banana) — and credit where it’s due, the Gemini marketing team finally nailed the name. Microsoft AI also introduced its first in-house models, MAI-Voice-1 and MAI-1-preview: ultra-fast, natural speech, efficient training at scale, strong early benchmarks, and a clear signal of strategic independence from OpenAI. Beyond that, little new appeared at the model level.
What stood out instead was an abundance of benchmarks and evaluation systems.
It’s easy to underestimate these. Benchmarks may look like neutral scoreboards. They are not. Each one encodes a philosophy: what kind of labor matters, what counts as success, what can safely be ignored. A benchmark can elevate a field, as ImageNet did for vision. It can distort it, as SQuAD once did when models learned to guess answers without understanding. And it can collapse under its own weight, as GLUE did once saturated. Designing a good benchmark is as difficult – and as consequential – as designing the model itself.
The week of many rulers
Seven explicit benchmarks appeared in one week, with another half-dozen evaluations that function the same way. Together they illustrate the new directions.
Agentic work: MCP-Bench tests whether agents can use servers and tools across multi-step tasks. ReportBench evaluates research agents on survey writing – not trivia, but the labor of scholarship itself.
Domain specificity: CMPhysBench asks if models know condensed matter physics. AetherCode scores them on competitive programming. MovieCORE pushes into cognitive reasoning about film.
Reasoning across modalities: T2I-ReasonBench looks at reasoning in text-to-image generation. SEAM checks semantic equivalence across language and vision. SpotEdit stresses precision in visual editing.
Safety and adaptivity: Mind the Third Eye! measures privacy awareness in smartphone agents. InMind tests whether models can adapt to individual reasoning styles.
Harder frontiers: UQ shifts the field from memorized test sets to unsolved questions, where there are no easy shortcuts.
Scientific reasoning disentangled: SCIREAS (Demystifying Scientific Problem-Solving in LLMs) separates domain knowledge from reasoning ability, probing whether models can truly “think scientifically” rather than just recall facts.
This is a long way from leaderboards like MMLU or GSM8K. Instead of “who scores best on fixed questions,” the benchmarks now ask: can agents navigate workflows, respect privacy, master specialized fields, and show reasoning across modalities?
On the surface, these look like just benchmarks. In reality, they are competing claims about what counts as competence – and they set the frame for progress. The choice of rulers may prove as influential as the systems themselves. And this season, we’ll see more interesting benchmarks and evaluations emerge.
From our partners: ✨ Phoenix.new → The fastest way to build Elixir apps in-browser
Phoenix.new spins up real Elixir apps right in the browser – no setup, no yak-shaving. The agent has root access, runs real tests, interacts with the UI in a headless browser, and pushes to GitHub. You get live previews, a dev loop that just works, and one-click deploys to Fly. GitHub included. Local optional.
Our 3 WOWs and 1 Promise: Watch it! I share my honest opinion about using Tesla’s full self-driving beta after more than two years →
Reading List / papers from the editorial:
Microsoft AI’s MAI-Voice-1 and MAI-1-preview →read their blog
Gemini Nano Banana →read their blog
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers →read the paper
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks →read the paper
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics →read the paper
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions →read the paper
MovieCORE: COgnitive REasoning in Movies →read the paper
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation →read the paper
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models →read the paper
SpotEdit: Evaluating Visually-Guided Image Editing Methods →read the paper
Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents →read the paper
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles →read the paper
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning (SCIREAS) →read the paper
UQ: Assessing Language Models on Unsolved Questions →read the paper
Also reading:
Can Machines Think? by Alejandro Piad Morffis
Introducing the 2025 Intelligent Applications 40 by Madrona.com
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
Curated Collections – 11 Powerful Image Models
Models to pay attention to:
🚨 Apple just released FastVLM on Hugging Face - 0.5, 1.5 and 7B real-time VLMs with WebGPU support 🤯
> 85x faster and 3.4x smaller than comparable sized VLMs
> 7.9x faster TTFT for larger models
> designed to output fewer output tokens and reduce encoding time for high…
— Vaibhav (VB) Srivastav (@reach_vb)
4:49 PM • Aug 29, 2025
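If you want to poke at the checkpoints yourself, here is a minimal loading sketch. We are assuming the Hub repos expose a standard transformers remote-code interface; the repo id and preprocessing details below are assumptions on our part, so defer to the model card:

```python
# Hedged sketch: load a FastVLM checkpoint from the Hugging Face Hub.
# The repo id and interface are assumptions -- consult the model card for
# the exact preprocessing and generation call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "apple/FastVLM-0.5B"  # assumed repo id; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
).eval()
# Image preprocessing and the chat template are model-specific (remote code
# usually ships its own processor or helper), so follow the repo's example
# for the actual image + prompt forward pass.
```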
OLMoASR: A series of open speech recognition models
These six fully open ASR models (39M–1.5B parameters) were trained on curated datasets of up to 680K hours. Benchmarked on 21 unseen test sets, OLMoASR-medium.en achieves 12.8%/11.0% WER (short/long-form), matching Whisper-medium.en. The largest model cuts the WER gap with Whisper-large to 0.4% when trained on equal data. Built from a 3M-hour pool filtered to 1M hours, OLMoASR emphasizes reproducibility, rigorous data curation, and transparency →read their blog
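Quick refresher on what those WER numbers mean: word error rate is word-level edit distance divided by the length of the reference transcript. A minimal sketch using the jiwer package (the transcripts below are made up, purely illustrative):

```python
# Minimal sketch: computing word error rate (WER), the metric reported for
# OLMoASR. Requires `pip install jiwer`; the strings below are illustrative,
# not taken from any benchmark.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.1%}")  # 2 substitutions / 9 reference words ≈ 22.2%
```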
gpt-realtime and Realtime API updates for production voice agents
This speech-to-speech model achieves 82.8% accuracy on Big Bench Audio and 30.5% on MultiChallenge, surpassing previous versions. It supports image inputs, SIP phone calling, and remote MCP servers. Function calling accuracy improved to 66.5%. Two new voices, Marin and Cedar, enhance naturalness. Unlike traditional pipelines, it processes audio in one step, reducing latency. The API now offers EU data residency, reusable prompts, and 20% lower pricing than gpt-4o-realtime-preview →read their blog
InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency
This LLM-based multimodal model family features Cascade Reinforcement Learning (offline + online RL) to enhance reasoning, achieving a +16.0% gain on tasks like MMMU and MathVista. The Visual Resolution Router (ViR) dynamically adjusts visual token resolution, and Decoupled Vision-Language Deployment (DvD) balances GPU load. InternVL3.5-241B-A28B achieves 4.05× faster inference and state-of-the-art performance across general multimodal and agentic tasks among open-source models →read the paper
Hermes 4 technical report
It’s a hybrid reasoning LLM family built using 5M post-training samples (19B tokens), including 3.5M reasoning-heavy examples with sequences up to 16K tokens. They used DataForge for structured synthetic data generation and Atropos for rejection sampling across task-specific RL environments. Models (14B/70B/405B) achieved 81.9% on AIME’24 and 61.3% on LiveCodeBench, outperforming DeepSeek-R1 while reducing overlong outputs by 78%. All weights and evaluations are public →read the paper
USO: Unified style and subject-driven generation via disentangled and reward learning
This one uses a triplet dataset (content, style, stylized image) and trains via style-alignment and content-style disentanglement objectives. A Style Reward Learning (SRL) module further enhances generation quality. USO outperforms open-source models on USO-Bench, a benchmark jointly evaluating style similarity and subject fidelity, achieving state-of-the-art results in both style consistency and subject preservation →read the paper
rStar2-Agent: Agentic reasoning technical report
This is a 14B-parameter math reasoning model trained with agentic RL. It uses GRPO-RoC, an RL strategy that handles noisy code environments, and is trained efficiently using only 64 MI300X GPUs. In just 510 RL steps, it achieves 80.6% on AIME24 and 69.8% on AIME25, outperforming DeepSeek-R1 (671B). The model also generalizes to alignment, scientific reasoning, and agentic tool-use tasks →read the paper
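If GRPO-RoC sounds opaque, the group-relative core of GRPO is simple to sketch. Below is a minimal, illustrative NumPy version of the vanilla GRPO advantage computation; rStar2-Agent's resample-on-correct rollout filtering sits on top of this and is not shown:

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative only;
# rStar2-Agent's GRPO-RoC adds rollout filtering that is not shown here).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_prompts, group_size), one scalar reward per rollout.
    Returns advantages normalized within each prompt's group of rollouts."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled rollouts each, binary correctness rewards.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
# Rollouts better than their group's mean get a positive advantage and are
# reinforced; no separate learned critic is needed.
```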
VibeVoice technical report
This is a long-form speech synthesis model using next-token diffusion for continuous data generation. A novel tokenizer compresses speech data by 80× compared to Encodec without quality loss. VibeVoice can generate up to 90 minutes of speech involving four speakers in a 64K token window, delivering high-fidelity, multi-speaker dialogue synthesis that surpasses both open-source and proprietary systems in maintaining conversational coherence and naturalness →read the paper
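A quick back-of-the-envelope check (ours, not from the report) on what fitting 90 minutes into a 64K-token window implies:

```python
# Back-of-the-envelope: effective speech token rate implied by VibeVoice's
# reported figures (90 minutes of audio within a 64K-token context window).
context_tokens = 64_000
audio_seconds = 90 * 60
tokens_per_second = context_tokens / audio_seconds
print(f"~{tokens_per_second:.1f} tokens per second of audio")  # ≈ 11.9
# Typical neural codecs emit hundreds of tokens per second, so this far lower
# rate is what makes 90-minute generation fit in a single context window.
```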
Interesting surveys
Last week we discussed AGS (Artificial General Science) and a wave of papers related to that topic. Here’s another one worth paying attention to: A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟
Efficiency and Acceleration
🌟 Diffusion Language Models Know the Answer Before Decoding
accelerate diffusion language model inference by detecting early convergence and committing tokens before full refinement → read the paper
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
redesign memory-layer architectures to rival MoE efficiency with better long-context performance and lower memory access → read the paper
🌟 Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
optimize large-scale LLM serving with HeteroScale, a coordinated autoscaling framework that balances prefill and decode stages across heterogeneous GPUs, improving utilization by 26.6% and saving hundreds of thousands of GPU-hours daily → read the paper
Reasoning Supervision and Control
🌟 StepWiser: Stepwise Generative Judges for Wiser Reasoning
train generative reward models that “meta-reason” about intermediate steps, improving judgment accuracy and inference search → read the paper
🌟 ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
implement discrete reasoning modes (high, medium, low) to balance computation cost and performance → read the paper
Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation?
examine the faithfulness of chain-of-thought reasoning in soft-reasoning tasks, showing influence and reliability can diverge → read the paper
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency
model sequence generation as tree search to reduce RL training cost while preserving exploration → read the paper
Tool Use and Augmented Learning
🌟 Provable Benefits of In-Tool Learning for Large Language Models
prove that tool-augmented models scale factual recall beyond parameter limits, outperforming in-weight memorization → read the paper
🌟 Understanding Tool-Integrated Reasoning
provide the first theoretical proof of tool-augmented reasoning’s benefits and propose ASPO for better tool usage → read the paper
Evaluation and Judging
🌟 Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
critically assess the reliability, validity, and assumptions behind using LLMs as evaluators in NLP → read the paper
Interpretability and Cognitive Analysis
🌟 Unraveling the cognitive patterns of Large Language Models through module communities
analyze emergent module communities in LLMs via network methods inspired by biology, revealing distributed skill patterns → read the paper
Beyond Transcription: Mechanistic Interpretability in ASR
apply interpretability tools like logit lens and activation patching to speech recognition, uncovering hidden acoustic-semantic dynamics → read the paper
Code, Video, and Multimodal Systems
Efficient Code Embeddings from Code Generation Models
build compact autoregressive code embedding models for retrieval, Q&A, and cross-language similarity → read the paper
Autoregressive Universal Video Segmentation Model
unify prompted and unprompted video segmentation into one autoregressive architecture for streaming video → read the paper
🌟 Mixture of Contexts for Long Video Generation
introduce sparse attention routing for diffusion transformers to preserve consistency in long video synthesis → read the paper
Self-Rewarding Vision-Language Model via Reasoning Decomposition
strengthen visual reasoning in VLMs by decomposing perception and reasoning, rewarding self-contained perceptions → read the paper
🌟 Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
stabilize text-to-image reinforcement learning with pairwise preference rewards and a unified benchmark → read the paper
OmniHuman-1.5: Instilling an active mind in avatars via cognitive simulation
generate semantically expressive avatar animations by using LLM-structured conditions and a Multimodal DiT with Pseudo Last Frame for lip-sync, motion naturalness, and semantic alignment across single/multi-person and non-human scenes →read the paper
Scientific Discovery
Spacer: Towards Engineered Scientific Inspiration
generate creative, grounded scientific hypotheses by recombining keyword graphs and refining them into concepts → read the paper
Agent Training
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent
combine a generalist planner and specialist executor with decoupled RL for scientific computing GUIs → read the paper
AWorld: Orchestrating the Training Recipe for Agentic AI
scale reinforcement learning for agentic AI with distributed interaction environments, enabling faster experience generation → read the paper
UItron: Foundational GUI agent with advanced perception and planning
train a large-scale mobile/PC GUI agent with SFT + curriculum RL over 1M+ steps to improve perception, grounding, and task planning for Chinese apps →read the paper
Privacy, Safety, and Security of Agentic Systems
🌟 Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
expose vulnerabilities in Model Context Protocol (MCP) agents, showing how benign tasks can chain into adversarial attack sequences that bypass service isolation and compromise security → read the paper
That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How was today's FOD? Please give us some constructive feedback.