FOD#116: Benchmarking Season

plus the best curated list of important models, related research papers, and what to read

This Week in Turing Post:

  • Wednesday / AI 101 series: xQuant

  • Friday / Interview:

Our news digest is always free. Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus institutions such as Ai2, MIT, Berkeley, and government agencies – and thousands of others – to really understand what’s going on with AI →

Now, to the main topic: The Benchmarking Season

The past week in AI was quieter on the model front. The most notable launch was Gemini 2.5 Flash Image (aka Nano Banana) – and credit where it’s due, the Gemini marketing team finally nailed the name. Microsoft AI also introduced its first in-house models, MAI-Voice-1 and MAI-1-preview: ultra-fast, natural speech, efficient training at scale, strong early benchmarks, and a clear signal of strategic independence from OpenAI. Beyond that, little new appeared at model scale.

What stood out instead was an abundance of benchmarks and evaluation systems.

It’s easy to underestimate these. Benchmarks may look like neutral scoreboards. They are not. Each one encodes a philosophy: what kind of labor matters, what counts as success, what can safely be ignored. A benchmark can elevate a field, as ImageNet did for vision. It can distort one, as SQuAD did when models learned to guess answers without understanding. And it can collapse under its own weight, as GLUE did once it saturated. Designing a good benchmark is as difficult – and as consequential – as designing the model itself.

The week of many rulers

Seven explicit benchmarks appeared in one week, with another half-dozen evaluations that function the same way. Together they illustrate the new directions.

  • Agentic work: MCP-Bench tests whether agents can use servers and tools across multi-step tasks. ReportBench evaluates research agents on survey writing – not trivia, but the labor of scholarship itself.

  • Domain specificity: CMPhysBench asks if models know condensed matter physics. AetherCode scores them on competitive programming. MovieCORE pushes into cognitive reasoning about film.

  • Reasoning across modalities: T2I-ReasonBench looks at reasoning in text-to-image generation. SEAM checks semantic equivalence across language and vision. SpotEdit stresses precision in visual editing.

  • Safety and adaptivity: Mind the Third Eye! measures privacy awareness in smartphone agents. InMind tests whether models can adapt to individual reasoning styles.

  • Harder frontiers: UQ shifts the field from memorized test sets to unsolved questions, where there are no easy shortcuts.

  • Scientific reasoning disentangled: SCIREAS (Demystifying Scientific Problem-Solving in LLMs) separates domain knowledge from reasoning ability, probing whether models can truly “think scientifically” rather than just recall facts.

This is a long way from leaderboards like MMLU or GSM8K. Instead of “who scores best on fixed questions,” the benchmarks now ask: can agents navigate workflows, respect privacy, master specialized fields, and show reasoning across modalities?

On the surface, these look like just benchmarks. In reality, they are competing claims about what counts as competence – and they set the frame for progress. The choice of rulers may prove as influential as the systems themselves. And this season, we’ll see more interesting benchmarks and evaluations emerge.

From our partners: ✨Phoenix.new → The fastest way to build Elixir apps in-browser

Phoenix.new spins up real Elixir apps right in the browser – no setup, no yak-shaving. The agent has root access, runs real tests, interacts with the UI in a headless browser, and pushes to GitHub. You get live previews, a dev loop that just works, and one-click deploys to Fly. GitHub included. Local optional.

Our 3 WOWs and 1 Promise: Watch it! I share my honest opinion about using Tesla’s Full Self-Driving beta after more than two years

Reading List / papers from the editorial:

  • Microsoft AI’s MAI-Voice-1 and MAI-1-preview →read their blog

  • Gemini 2.5 Flash Image (Nano Banana) →read their blog

  • MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers →read the paper

  • ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks →read the paper

  • CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics →read the paper

  • AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions →read the paper

  • MovieCORE: COgnitive REasoning in Movies →read the paper

  • T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation →read the paper

  • SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models →read the paper

  • SpotEdit: Evaluating Visually-Guided Image Editing Methods →read the paper

  • Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents →read the paper

  • InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles →read the paper

  • Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning (SCIREAS) →read the paper

  • UQ: Assessing Language Models on Unsolved Questions →read the paper

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

Curated Collections – 11 Powerful Image Models

Models to pay attention to:

  • OLMoASR: A series of open speech recognition models
    These are six fully open ASR models (39M–1.5B parameters) trained on curated datasets of up to 680K hours. Benchmarked on 21 unseen test sets, OLMoASR-medium.en achieved 12.8%/11.0% WER (short/long-form), matching Whisper-medium.en (see the WER sketch after this list). The largest model cut the WER gap with Whisper-large to 0.4% when trained on equal data. Built from a 3M-hour pool filtered to 1M hours, OLMoASR emphasizes reproducibility, rigorous data curation, and transparency →read their blog

  • gpt-realtime and Realtime API updates for production voice agents
    This speech-to-speech model achieves 82.8% accuracy on Big Bench Audio and 30.5% on MultiChallenge, surpassing previous versions. It supports image inputs, SIP phone calling, and remote MCP servers. Function calling accuracy improved to 66.5%. Two new voices, Marin and Cedar, enhance naturalness. Unlike traditional pipelines, it processes audio in one step, reducing latency. The API now offers EU data residency, reusable prompts, and 20% lower pricing than gpt-4o-realtime-preview →read their blog

  • InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency
    This multimodal model family features Cascade Reinforcement Learning (offline + online RL) to enhance reasoning, achieving a +16.0% gain on tasks like MMMU and MathVista. The Visual Resolution Router (ViR) dynamically adjusts visual token resolution, and Decoupled Vision-Language Deployment (DvD) balances GPU load. InternVL3.5-241B-A28B achieves 4.05× faster inference and state-of-the-art performance across general multimodal and agentic tasks among open-source models →read the paper

  • Hermes 4 technical report
    It’s a hybrid reasoning LLM family built using 5M post-training samples (19B tokens), including 3.5M reasoning-heavy examples with sequences up to 16K tokens. They used DataForge for structured synthetic data generation and Atropos for rejection sampling across task-specific RL environments. Models (14B/70B/405B) achieved 81.9% on AIME’24 and 61.3% on LiveCodeBench, outperforming DeepSeek-R1 while reducing overlong outputs by 78%. All weights and evaluations are public →read the paper

  • USO: Unified style and subject-driven generation via disentangled and reward learning
    This one uses a triplet dataset (content, style, stylized image) and trains via style-alignment and content-style disentanglement objectives. A Style Reward Learning (SRL) module further enhances generation quality. USO outperforms open-source models on USO-Bench, a benchmark jointly evaluating style similarity and subject fidelity, achieving state-of-the-art results in both style consistency and subject preservation →read the paper

  • rStar2-Agent: Agentic reasoning technical report
    This is a 14B parameter math reasoning model trained with agentic RL. It uses GRPO-RoC, an RL strategy that handles noisy code environments, and is trained efficiently using only 64 MI300X GPUs. In just 510 RL steps, it achieves 80.6% on AIME24 and 69.8% on AIME25, outperforming DeepSeek-R1 (671B). The model also generalizes to alignment, scientific reasoning, and agentic tool-use tasks →read the paper

  • VibeVoice technical report
    This is a long-form speech synthesis model using next-token diffusion for continuous data generation. A novel tokenizer compresses speech data by 80× compared to Encodec without quality loss. VibeVoice can generate up to 90 minutes of speech involving four speakers in a 64K token window (a quick sanity check of that budget follows this list), delivering high-fidelity, multi-speaker dialogue synthesis that surpasses both open-source and proprietary systems in conversational coherence and naturalness →read the paper
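
To ground the OLMoASR numbers above: word error rate (WER) is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of the standard computation (plain Python; our illustration, not OLMoASR’s evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, with a rolling one-row DP table.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # deletion (reference word dropped)
                      d[j - 1] + 1,     # insertion (extra hypothesis word)
                      prev + (r != h))  # substitution, or free match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```

So OLMoASR’s reported 12.8% short-form WER means roughly one word in eight is substituted, inserted, or deleted relative to the reference transcript.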
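
And a back-of-the-envelope check on the VibeVoice claim (our arithmetic, not the report’s):

```python
# 90 minutes of multi-speaker speech inside a 64K-token context window.
window_tokens = 64_000
speech_seconds = 90 * 60
print(f"{window_tokens / speech_seconds:.1f} tokens/s")  # ≈ 11.9 tokens/s, text included
# Typical neural audio codecs emit hundreds of tokens per second of audio,
# which is why the ~80x more compressive tokenizer is the enabling piece here.
```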

Interesting surveys

Last week we discussed AGS (Artificial General Science) and a wave of papers related to that topic. Here’s another one worth paying attention to: A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟

Efficiency and Acceleration

  • 🌟 Diffusion Language Models Know the Answer Before Decoding
    accelerate diffusion language model inference by detecting early convergence and committing tokens before full refinement (toy sketch after this list) → read the paper

  • UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
    redesign memory-layer architectures to rival MoE efficiency with better long-context performance and lower memory access → read the paper

  • 🌟 Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
    optimize large-scale LLM serving with HeteroScale, a coordinated autoscaling framework that balances prefill and decode stages across heterogeneous GPUs, improving utilization by 26.6% and saving hundreds of thousands of GPU-hours daily → read the paper
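
The early-commit idea in the first paper above is easy to picture: across refinement steps, many positions settle on a prediction long before decoding finishes, so they can be frozen instead of paying for further updates. A toy sketch of that general mechanism (our illustration with a made-up stability rule, not the paper’s exact criterion):

```python
import numpy as np

def early_commit_decode(predict, length, steps=50, patience=3):
    """Iteratively refine a sequence, freezing positions whose argmax
    prediction has been stable for `patience` consecutive steps."""
    tokens = np.full(length, -1)          # -1 marks still-active positions
    last = np.full(length, -1)            # previous argmax per position
    stable = np.zeros(length, dtype=int)  # consecutive-agreement counter
    for _ in range(steps):
        active = tokens == -1
        if not active.any():
            break                          # everything committed early
        guess = predict(tokens)            # model's current argmax per position
        stable = np.where(guess == last, stable + 1, 0)
        last = guess.copy()
        commit = active & (stable >= patience)
        tokens[commit] = guess[commit]     # freeze converged positions
    tokens[tokens == -1] = last[tokens == -1]  # commit whatever remains
    return tokens
```

Here `predict` stands in for one denoising pass of the diffusion LM, and `patience` is an illustrative stand-in; the paper derives a principled commitment criterion.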

Reasoning Supervision and Control

  • 🌟 StepWiser: Stepwise Generative Judges for Wiser Reasoning
    train generative reward models that “meta-reason” about intermediate steps, improving judgment accuracy and inference search → read the paper

  • 🌟 ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
    implement discrete reasoning modes (high, medium, low) to balance computation cost and performance → read the paper

  • Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation?
    examine the faithfulness of chain-of-thought reasoning in soft-reasoning tasks, showing influence and reliability can diverge → read the paper

  • TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency
    model sequence generation as tree search to reduce RL training cost while preserving exploration → read the paper

Tool Use and Augmented Learning

  • 🌟 Provable Benefits of In-Tool Learning for Large Language Models
    prove that tool-augmented models scale factual recall beyond parameter limits, outperforming in-weight memorization → read the paper

  • 🌟 Understanding Tool-Integrated Reasoning
    provide the first theoretical proof of tool-augmented reasoning’s benefits and propose ASPO for better tool usage → read the paper

Evaluation and Judging

  • 🌟 Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
    critically assess the reliability, validity, and assumptions behind using LLMs as evaluators in NLP → read the paper

Interpretability and Cognitive Analysis

  • 🌟 Unraveling the cognitive patterns of Large Language Models through module communities
    analyze emergent module communities in LLMs via network methods inspired by biology, revealing distributed skill patterns → read the paper

  • Beyond Transcription: Mechanistic Interpretability in ASR
    apply interpretability tools like logit lens and activation patching to speech recognition, uncovering hidden acoustic-semantic dynamics → read the paper

Code, Video, and Multimodal Systems

  • Efficient Code Embeddings from Code Generation Models
    build compact autoregressive code embedding models for retrieval, Q&A, and cross-language similarity → read the paper

  • Autoregressive Universal Video Segmentation Model
    unify prompted and unprompted video segmentation into one autoregressive architecture for streaming video → read the paper

  • 🌟 Mixture of Contexts for Long Video Generation
    introduce sparse attention routing for diffusion transformers to preserve consistency in long video synthesis → read the paper

  • Self-Rewarding Vision-Language Model via Reasoning Decomposition
    strengthen visual reasoning in VLMs by decomposing perception and reasoning, rewarding self-contained perceptions → read the paper

  • 🌟 Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
    stabilize text-to-image reinforcement learning with pairwise preference rewards and a unified benchmark → read the paper

  • OmniHuman-1.5: Instilling an active mind in avatars via cognitive simulation
    generate semantically expressive avatar animations by using LLM-structured conditions and a Multimodal DiT with Pseudo Last Frame for lip-sync, motion naturalness, and semantic alignment across single/multi-person and non-human scenes → read the paper

Scientific Discovery

  • Spacer: Towards Engineered Scientific Inspiration
    generate creative, grounded scientific hypotheses by recombining keyword graphs and refining them into concepts → read the paper

Agent Training

  • CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent
    combine a generalist planner and specialist executor with decoupled RL for scientific computing GUIs → read the paper

  • AWorld: Orchestrating the Training Recipe for Agentic AI
    scale reinforcement learning for agentic AI with distributed interaction environments, enabling faster experience generation → read the paper

  • UItron: Foundational GUI agent with advanced perception and planning
    train a large-scale mobile/PC GUI agent with SFT + curriculum RL over 1M+ steps to improve perception, grounding, and task planning for Chinese apps → read the paper

Privacy, Safety, and Security of Agentic Systems

  • 🌟 Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
    expose vulnerabilities in Model Context Protocol (MCP) agents, showing how benign tasks can chain into adversarial attack sequences that bypass service isolation and compromise security → read the paper

That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

How was today's FOD?

Please give us some constructive feedback
