FOD#113: New AI Personality Tax

plus the best curated papers, what to read, world models, and news from the usual suspects

This Week in Turing Post:

  • Wednesday / AI 101 series: What’s new with Test-Time Compute

  • Friday / AI Literacy – the start of the series

Our news digest is always free. Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus AI labs and institutions such as Ai2, MIT, and Berkeley, government agencies, and thousands of others to really understand what’s going on with AI →

The majority of tech updates happened in the model world, so please check the “Models to pay attention to” category, as well as the substantial reading list we provide.

The central story of AI today is a paradox that explains nearly every confusing headline and user complaint. On one hand, the adoption of AI is agonizingly slow. On the other, it is astonishingly fast. Both are true, and the tech industry is only now waking up to the implications.

The slow lane is familiar territory. This is the world of behavior change. When a new feature requires us to learn new skills, alter our workflows, and think in different ways, we resist. This is the – let’s call it – "Behavioral-Delta Law" in action: the bigger the change a product asks of us, the more friction it creates. This is, as Arvind Narayanan writes, “a property of human behavior, not the technology in question, so we shouldn't expect AI to be any different.” Adoption here is measured in months and years.

But the GPT-5 launch exposed the fast lane, a phenomenon that is anything but normal. The visceral, widespread backlash to losing GPT-4o was not the grumbling of users losing a familiar tool. No one mourned the passing of Windows XP or the old Photoshop interface with such feeling. This was different. People were mourning the loss of a specific, predictable collaborator. It was about The Relationship!

This reveals the other side of AI adoption: the lightning-fast formation of relational attachment. While we are slow to change our habits for an AI, it seems we are incredibly quick to form habits with an AI that seamlessly fits our existing mental models. The "vibe," the conversational quirks, the predictable tone – these weren't bugs or happy accidents. They were the very features users had implicitly integrated into their cognitive workflows. This adoption is measured in days and weeks.

This is the industry's critical blind spot. Companies like OpenAI have been optimizing for capability velocity, racing to build better engines. They treated the GPT-4o-to-GPT-5 transition as a simple software upgrade, assuming "better" was simply a matter of benchmark scores.

They failed to understand that for their most engaged users, they weren't just upgrading a tool – they were replacing their thinking partner. They didn't account for the "personality tax" of forcing users to adapt to a new collaborator. The success of their other feature – the automatic model-switcher – proves the point. (Grok 4 immediately tried to recreate it, while also making itself as widely available as possible.)

By swapping the engine under the hood with zero behavioral delta, the switcher drove massive adoption precisely because it respected users' established habits.

The future of AI products depends on resolving this paradox. The winners will be those who understand that for a conversational AI, the personality is the user interface.

They will need to manage persona stability with the same rigor they manage server uptime. They will need to recognize that while users are slow to learn, they are fast to trust – and even faster to feel betrayed when that trust is broken by an unannounced change in the "partner" they've come to rely on.

We recommend: 📌 NVIDIA, Databricks, and SuperAnnotate → Building AI Agents You Can Trust

Join NVIDIA, Databricks, and SuperAnnotate to explore how leading teams build trustworthy AI agents through structured evaluation and domain expert feedback. We’ll dive into why evaluating agents is harder than traditional ML, share best practices for developing and scaling LLM-as-a-Judge systems, and show how to implement formalized domain expert feedback loops that improve performance and alignment over time.

Our 3 WOWs and 1 Promise: we discuss Kaggle Game Arena, the GPT-5 backlash, ElevenLabs Music, and Genie 2. Watch it here

Our Curated Collections – 6 great books about AI and ML

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

We are reading/watching

Models to pay attention to:

  • GPT-5
    Researchers from OpenAI released GPT-5, a unified system with fast and “thinking” modes routed automatically. It achieves 94.6% on AIME 2025, 74.9% on SWE-bench Verified, 88% on Aider Polyglot, 84.2% on MMMU, and 46.2% on HealthBench Hard, with GPT-5 Pro scoring 88.4% on GPQA. Improvements include ~45–80% lower hallucination rates, reduced sycophancy (<6%), safer completions, stronger multimodal reasoning, top-tier coding and writing, and new steerable personalities. It is available to all ChatGPT tiers, with extended reasoning reserved for Pro →read their blog

  • Claude Opus 4.1
    Researchers from Anthropic released Claude Opus 4.1, an upgrade over Opus 4 with improved coding, reasoning, and agentic search. It scores 74.5% on SWE-bench Verified (500 tasks) without extended thinking, and shows notable gains in multi-file refactoring, precision bug fixes, and large codebase edits. Extended thinking boosts performance on TAU-bench, GPQA Diamond, MMMLU, MMMU, and AIME, with expanded multi-turn agent trajectories. Claude Opus 4.1 is available via API, Claude Code, Amazon Bedrock, and Google Cloud Vertex AI at the same price as Opus 4 →read their blog

  • Qwen-Image technical report
    Researchers from the Qwen team present Qwen-Image, an image generation foundation model excelling in complex text rendering and precise image editing. Using a progressive curriculum, dual encoding (Qwen2.5-VL semantic + VAE reconstructive), and multi-task training, it achieves SOTA on DPG (88.32), GenEval-RL (0.91), OneIG-ZH (0.548), ChineseWord (58.30%), and LongText-Bench-ZH (0.946), with particular strength in Chinese and long-text rendering. It also ranks top in GEdit-CN (7.52) and ImgEdit overall (4.27), and delivers competitive novel view synthesis (PSNR 15.11) and depth estimation →read the paper

  • GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models
    Researchers from Zhipu AI & Tsinghua University developed GLM-4.5, a 355B-parameter MoE LLM (32B active) and its 106B version, GLM-4.5-Air, trained on 23T tokens with hybrid reasoning modes. GLM-4.5 ranks 3rd overall and 2nd in agentic tasks, scoring 70.1% on TAU-bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. It uses multi-stage pre-, mid-, and post-training with expert iteration and RL, excelling in reasoning, coding, multilingual translation, safety (89.9%), and agentic coding (90.6% tool-call success) →read the paper

  • R-Zero: Self-evolving reasoning LLM from zero data
    Researchers from Tencent AI Seattle Lab, Washington University in St. Louis, University of Maryland, and The University of Texas at Dallas present R-Zero, a co-evolutionary Challenger–Solver framework that trains reasoning LLMs without any external data. Using Group Relative Policy Optimization, the Challenger creates tasks at the Solver’s capability edge, and the Solver learns via pseudo-labeled, filtered data. Across Qwen3 and OctoThinker models, R-Zero boosts math scores by up to +6.49 and general reasoning by up to +3.81, with gains compounding over three iterations; a schematic sketch of the loop follows after this list →read the paper

  • Goedel-Prover-V2: Scaling formal theorem proving with scaffolded data synthesis and self-correction
    Researchers from Princeton, NVIDIA, Tsinghua University, Stanford University, Meta FAIR, Amazon, Shanghai Jiao Tong University, and Peking University present Goedel-Prover-V2, an open-source Lean theorem prover. Using verifier-guided self-correction, scaffolded data synthesis, and model averaging, the 8B model scores 84.6% pass@32 on MiniF2F (exceeding DeepSeek-Prover-V2-671B), while the 32B model reaches 88.1% (90.4% with self-correction) and solves 86 PutnamBench problems at pass@184 – surpassing prior SOTA at far smaller size and compute; see the second sketch after this list →read the paper

  • Seed Diffusion: A large-scale diffusion language model with high-speed inference
    Researchers from ByteDance Seed and Tsinghua University present Seed Diffusion Preview, a discrete-state diffusion code LLM achieving 2,146 tokens/s on H20 GPUs via parallel block-level generation, constrained-order training, and on-policy trajectory optimization. Using a two-stage curriculum (mask-based then edit-based corruption), it matches or surpasses similarly sized autoregressive models on HumanEval, MBPP, BigCodeBench, LiveCodeBench, and MBXP, and excels in code editing (54.3% CanItEdit). It outperforms the prior diffusion models Mercury and Gemini Diffusion on the speed–quality Pareto frontier →read the paper
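
To make R-Zero's co-evolution idea concrete, here is a minimal, runnable Python sketch of a Challenger–Solver loop. Everything in it – the stub model, the uncertainty-shaped reward, the 0.5–0.9 confidence filtering band – is our illustrative assumption, not the paper's implementation; the real system trains both roles as LLMs with GRPO.

```python
# A minimal sketch of an R-Zero-style Challenger–Solver co-evolution loop.
# StubModel, the reward shape, and the filtering band are illustrative
# assumptions; the paper trains both roles as LLMs with Group Relative
# Policy Optimization (GRPO) rather than these stand-ins.

import random
from collections import Counter

class StubModel:
    """Stand-in for an LLM policy so the loop runs end to end."""
    def generate_task(self):
        a, b = random.randint(2, 99), random.randint(2, 99)
        return f"{a} * {b}"
    def answer(self, task):
        a, b = map(int, task.split(" * "))
        noise = 0 if random.random() < 0.6 else random.randint(-3, 3)
        return a * b + noise
    def update(self, *args):
        pass  # in the real system: a GRPO policy-gradient step

def pseudo_label(answers):
    """Majority vote as pseudo-label; vote share as confidence."""
    (label, votes), = Counter(answers).most_common(1)
    return label, votes / len(answers)

def co_evolve(challenger, solver, iterations=3, n_tasks=32, n_samples=8):
    for _ in range(iterations):
        tasks = [challenger.generate_task() for _ in range(n_tasks)]
        challenger_rewards, train_set = [], []
        for task in tasks:
            answers = [solver.answer(task) for _ in range(n_samples)]
            label, conf = pseudo_label(answers)
            # Challenger is rewarded for tasks at the Solver's capability
            # edge: maximal uncertainty (conf near 0.5) pays the most.
            challenger_rewards.append(1.0 - 2.0 * abs(conf - 0.5))
            # Solver trains only on tasks whose majority label is reliable
            # enough to trust, yet not trivially easy.
            if 0.5 <= conf <= 0.9:
                train_set.append((task, label))
        challenger.update(tasks, challenger_rewards)
        solver.update(train_set)
        print(f"kept {len(train_set)}/{n_tasks} tasks for Solver training")

co_evolve(StubModel(), StubModel())
```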
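Similarly, Goedel-Prover-V2's verifier-guided self-correction boils down to a generate–check–revise loop against the Lean compiler. The sketch below is our schematic reading of that loop; `prover` and `lean_verify` are assumed interfaces for illustration, and the actual system is trained to revise from compiler feedback rather than merely prompted with it.

```python
# Schematic generate-check-revise loop in the spirit of verifier-guided
# self-correction. `prover` and `lean_verify` are assumed interfaces.

def prove_with_self_correction(prover, theorem, lean_verify, max_revisions=2):
    proof = prover.generate(theorem)               # first proof attempt
    for _ in range(max_revisions):
        ok, errors = lean_verify(theorem, proof)   # Lean compiler check
        if ok:
            return proof
        # Feed the verifier's error messages back so the model can repair
        # the failing step instead of resampling from scratch.
        proof = prover.revise(theorem, proof, errors)
    ok, _ = lean_verify(theorem, proof)
    return proof if ok else None                   # None: budget exhausted
```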

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟

Reinforcement Learning for LLM and Agent Capability Expansion

  • 🌟 Sotopia-RL: Reward Design for Social Intelligence (by University of Illinois Urbana-Champaign, University of California, Irvine, Ai2, Carnegie Mellon University, Stanford, MIT)
    train socially intelligent agents with utterance-level, multi-dimensional rewards to capture nuanced social behaviors → read the paper

  • Tool-integrated Reinforcement Learning for Repo Deep Search
    combine tool-use and RL to train LLMs for multi-step code issue localization and repair → read the paper

  • RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
    extend RLVR with hybrid-policy optimization to push beyond base model reasoning limits → read the paper

  • 🌟 Agent Lightning: Train ANY AI Agents with RL (by Microsoft)
    provide a general framework for applying RL to any AI agent architecture with minimal integration overhead → read the paper

  • SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
    enable GUI-based agents to self-learn new software via curriculum-guided exploration and iterative skill acquisition → read the paper

  • CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
    optimize ANNS algorithms via RL to balance retrieval accuracy and search speed → read the paper

  • 🌟 Exploitation Is All You Need... for Exploration (by Micah Rentschler, Jesse Roberts)
    demonstrate conditions under which pure exploitation objectives can yield emergent exploration in RL agents → read the paper

  • 🌟 Learning to Reason for Factuality (by Meta, University of Washington)
    design a reward for RL that balances factual precision, detail, and relevance to reduce hallucinations in reasoning models → read the paper

  • Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
    apply RL to open-weight LLMs for complex, stateful multi-turn software engineering tasks, boosting real-world SWE benchmarks → read the paper

  • Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
    improve instruction following without external supervision by using reasoning models’ own internal signals in self-supervised RL → read the paper

Evaluation, Verification, and Benchmarking

  • CompassVerifier
    build a robust, domain-general LLM verifier with a new benchmark for accurate answer checking and RL reward modeling → read the paper

  • Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
    introduce a multilingual, multimodal benchmark for fine-grained RAG system evaluation → read the paper

  • Are Today's LLMs Ready to Explain Well-Being Concepts?
    evaluate and improve LLMs’ ability to explain well-being concepts for diverse audiences via SFT and DPO → read the paper

  • VeriGUI
    create a long-horizon, verifiable GUI task dataset to evaluate complex computer-use agents → read the paper

Efficiency, Scaling, and Architecture Improvements

  • Trainable Dynamic Mask Sparse Attention
    introduce dynamic, content-aware sparse attention to improve long-context efficiency without losing fidelity → read the paper

  • 🌟 VeOmni (by ByteDance)
    scale omni-modal LLM training with a modular, model-centric distributed framework for efficient parallelism → read the paper

  • LeanK
    prune unimportant KV cache channels to reduce memory and speed up long-context decoding; a toy sketch of the idea follows after this list → read the paper
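
If you're curious what "pruning KV cache channels" looks like mechanically, below is a toy numpy sketch: score each key channel by a simple magnitude statistic, keep the top fraction, and compute attention scores over the surviving channels only. The magnitude heuristic and shapes are our illustrative assumptions; LeanK itself learns a static channel importance mask rather than using this criterion.

```python
# Toy illustration of channel-level KV-cache pruning. LeanK learns a static
# channel mask; the mean-magnitude score here is a stand-in heuristic that
# shows the mechanics, not the paper's learned criterion.

import numpy as np

def prune_key_channels(K, keep_ratio=0.5):
    """K: cached keys of shape (seq_len, head_dim).
    Keep the channels with the largest mean magnitude."""
    importance = np.abs(K).mean(axis=0)          # one score per channel
    k = max(1, int(K.shape[1] * keep_ratio))
    kept = np.sort(np.argsort(importance)[-k:])  # top-k channel indices
    return K[:, kept], kept

# Demo: compare approximate attention scores (pruned channels) to exact ones.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))                   # 128 cached tokens, d=64
q = rng.normal(size=64)                          # current query vector
K_small, kept = prune_key_channels(K, keep_ratio=0.25)
approx = K_small @ q[kept]                       # scores over kept channels
exact = K @ q                                    # full attention scores
print("kept", len(kept), "of 64 channels; correlation with exact scores:",
      round(float(np.corrcoef(approx, exact)[0, 1]), 3))
```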

Reasoning Process Understanding and Control

  • Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models
    review methods to reduce overlong reasoning chains in R1-style models while preserving accuracy → read the paper

  • 🌟 Is Chain-of-Thought Reasoning of LLMs a Mirage? (by Arizona State University)
    analyze CoT performance through a data distribution lens to reveal its fragility beyond training domains → read the paper

  • 🌟 Cognitive Loop via In-Situ Optimization (by Microsoft)
    enable self-adaptive, steerable reasoning for scientific discovery through uncertainty-aware cognitive loops → read the paper

  • 🌟 Sculptor (by Tsinghua University)
    equip LLMs with active context management tools to mitigate interference and improve reasoning robustness → read the paper

  • On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
    adjust SFT gradients for better generalization using a simple dynamic scaling method → read the paper

Retrieval and Information Grounding

  • SitEmb-v1.5
    improve long-document retrieval by conditioning short chunk embeddings on broader context → read the paper

  • AttnTrace
    trace LLM outputs back to influential context segments using efficient attention-based methods → read the paper

Applied Multimodal and GUI Agents

  • LaTCoder
    convert webpage designs to code with layout-preserving reasoning strategies → read the paper

  • 🌟 CoAct-1 (by University of Southern California, Salesforce, University of Washington)
    combine GUI actions with direct coding to create more efficient computer-use agents → read the paper

  • ChartCap
    build a large dataset and metric for hallucination-free dense chart captioning → read the paper

That’s all for today. Thank you for reading! Please share this newsletter with colleagues if it can help them understand AI better and stay ahead of the curve.

How was today's FOD?

Please give us some constructive feedback
