FOD#113: New AI Personality Tax

plus the best curated papers, what to read, world models, and news from the usual suspects

This Week in Turing Post:

  • Wednesday / AI 101 series: What’s new with Test-Time Compute

  • Friday / AI Literacy – the start of the series

Our news digest is always free. Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus AI labs and institutions such as Ai2, MIT, and Berkeley, government agencies, and thousands of others to really understand what’s going on with AI →

The majority of tech updates happened in the model world, so please check the “Models to pay attention to” category, as well as the substantial reading list we provide.

The central story of AI today is a paradox that explains nearly every confusing headline and user complaint. On one hand, the adoption of AI is agonizingly slow. On the other, it is astonishingly fast. Both are true, and the tech industry is only now waking up to the implications.

The slow lane is familiar territory. This is the world of behavior change. When a new feature requires us to learn new skills, alter our workflows, and think in different ways, we resist. This is the – let’s call it – "Behavioral-Delta Law" in action: the bigger the change a product asks of us, the more friction it creates. This is, as Arvind Narayanan writes, “a property of human behavior, not the technology in question, so we shouldn't expect AI to be any different.” Adoption here is measured in months and years.

But the GPT-5 launch exposed the fast lane, a phenomenon that is anything but normal. The visceral, widespread backlash to losing GPT-4o was not the grumbling of users losing a familiar tool. No one mourned the passing of Windows XP or the old Photoshop interface with such feeling. This was different. People were mourning the loss of a specific, predictable collaborator. It was about The Relationship!

This reveals the other side of AI adoption: the lightning-fast formation of relational attachment. While we are slow to change our habits for an AI, it seems we are incredibly quick to form habits with an AI that seamlessly fits our existing mental models. The "vibe," the conversational quirks, the predictable tone – these weren't bugs or happy accidents. They were the very features users had implicitly integrated into their cognitive workflows. This adoption is measured in days and weeks.

This is the industry's critical blind spot. Companies like OpenAI have been optimizing for capability velocity, racing to build better engines. They treated the GPT-4o-to-GPT-5 transition as a simple software upgrade, assuming "better" was simply a matter of benchmark scores.

They failed to understand that for their most engaged users, they weren't just upgrading a tool – they were replacing their thinking partner. They didn't account for the "personality tax" of forcing users to adapt to a new collaborator. The success of their other feature – the automatic model-switcher – proves the point. (Grok 4 immediately tried to recreate it, while also making itself as widely available as possible.)

By swapping the engine under the hood with zero behavioral delta, the switcher drove massive adoption precisely because it respected users' established habits.

The future of AI products depends on resolving this paradox. The winners will be those who understand that for a conversational AI, the personality is the user interface.

They will need to manage persona stability with the same rigor they manage server uptime. They will need to recognize that while users are slow to learn, they are fast to trust – and even faster to feel betrayed when that trust is broken by an unannounced change in the "partner" they've come to rely on.

We recommend: 📌 NVIDIA, Databricks, and SuperAnnotate → Building AI Agents You Can Trust

Join NVIDIA, Databricks, and SuperAnnotate to explore how leading teams build trustworthy AI agents through structured evaluation and domain expert feedback. We’ll dive into why evaluating agents is harder than traditional ML, share best practices for developing and scaling LLM-as-a-Judge systems, and show how to implement formalized domain expert feedback loops that improve performance and alignment over time.

Our 3 WOWs and 1 Promise: we discuss Kaggle Game Arena, the GPT-5 backlash, ElevenLabs Music, and Genie 2. Watch it here

Our Curated Collections – 6 great books about AI and ML

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗

We are reading/watching

Models to pay attention to:

  • GPT-5
    Researchers from OpenAI released GPT-5, a unified system with fast and “thinking” modes routed automatically. It achieves 94.6% on AIME 2025, 74.9% on SWE-bench Verified, 88% on Aider Polyglot, 84.2% on MMMU, and 46.2% on HealthBench Hard, with GPT-5 Pro scoring 88.4% on GPQA. Improvements include ~45–80% lower hallucination rates, reduced sycophancy (<6%), safer completions, stronger multimodal reasoning, top-tier coding and writing, and new steerable personalities. It is available to all ChatGPT tiers, with extended reasoning reserved for Pro →read their blog

  • Claude Opus 4.1
    Researchers from Anthropic released Claude Opus 4.1, an upgrade over Opus 4 with improved coding, reasoning, and agentic search. It scores 74.5% on SWE-bench Verified (500 tasks) without extended thinking, and shows notable gains in multi-file refactoring, precision bug fixes, and large codebase edits. Extended thinking boosts performance on TAU-bench, GPQA Diamond, MMMLU, MMMU, and AIME, with expanded multi-turn agent trajectories. Claude Opus 4.1 is available via API, Claude Code, Amazon Bedrock, and Google Cloud Vertex AI at the same price as Opus 4 →read their blog

  • Qwen-Image technical report
    Researchers from the Qwen team present Qwen-Image, an image generation foundation model excelling in complex text rendering and precise image editing. Using a progressive curriculum, dual encoding (Qwen2.5-VL semantic + VAE reconstructive), and multi-task training, it achieves SOTA on DPG (88.32), GenEval-RL (0.91), OneIG-ZH (0.548), ChineseWord (58.30%), and LongText-Bench-ZH (0.946), with particular strength in Chinese and long-text rendering. It also ranks top in GEdit-CN (7.52) and ImgEdit overall (4.27), and delivers competitive novel view synthesis (PSNR 15.11) and depth estimation →read the paper

  • GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models
    Researchers from Zhipu AI & Tsinghua University developed GLM-4.5, a 355B-parameter MoE LLM (32B active) and its 106B version, GLM-4.5-Air, trained on 23T tokens with hybrid reasoning modes. GLM-4.5 ranks 3rd overall and 2nd in agentic tasks, scoring 70.1% on TAU-bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. It uses multi-stage pre-, mid-, and post-training with expert iteration and RL, excelling in reasoning, coding, multilingual translation, safety (89.9%), and agentic coding (90.6% tool-call success) →read the paper

  • R-Zero: Self-evolving reasoning LLM from zero data
    Researchers from Tencent AI Seattle Lab, Washington University in St. Louis, University of Maryland, and The University of Texas at Dallas present R-Zero, a co-evolutionary Challenger–Solver framework that trains reasoning LLMs without any external data. Using Group Relative Policy Optimization, the Challenger creates tasks at the Solver’s capability edge, and the Solver learns via pseudo-labeled, filtered data. Across Qwen3 and OctoThinker models, R-Zero boosts math scores by up to +6.49 and general reasoning by up to +3.81, with gains compounding over three iterations; a schematic sketch of the loop follows after this list →read the paper

  • Goedel-Prover-V2: Scaling formal theorem proving with scaffolded data synthesis and self-correction
    Researchers from Princeton, NVIDIA, Tsinghua University, Stanford University, Meta FAIR, Amazon, Shanghai Jiao Tong University, and Peking University present Goedel-Prover-V2, an open-source Lean theorem prover. Using verifier-guided self-correction, scaffolded data synthesis, and model averaging, the 8B model scores 84.6% pass@32 on MiniF2F (exceeding DeepSeek-Prover-V2-671B), while the 32B model reaches 88.1% (90.4% with self-correction) and solves 86 PutnamBench problems at pass@184 – surpassing prior SOTA at far smaller size and compute; see the second sketch after this list →read the paper

  • Seed Diffusion: A large-scale diffusion language model with high-speed inference
    Researchers from ByteDance Seed and Tsinghua University present Seed Diffusion Preview, a discrete-state diffusion code LLM achieving 2,146 tokens/s on H20 GPUs via parallel block-level generation, constrained-order training, and on-policy trajectory optimization. Using a two-stage curriculum (mask-based then edit-based corruption), it matches or surpasses similarly sized autoregressive models on HumanEval, MBPP, BigCodeBench, LiveCodeBench, and MBXP, and excels in code editing (54.3% CanItEdit). It outperforms the prior diffusion models Mercury and Gemini Diffusion on the speed–quality Pareto frontier →read the paper
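
To make R-Zero's co-evolution idea concrete, here is a minimal, runnable Python sketch of a Challenger–Solver loop. Everything in it – the stub model, the uncertainty-shaped reward, the 0.5–0.9 confidence filtering band – is our illustrative assumption, not the paper's implementation; the real system trains both roles as LLMs with GRPO.

```python
# A minimal sketch of an R-Zero-style Challenger–Solver co-evolution loop.
# StubModel, the reward shape, and the filtering band are illustrative
# assumptions; the paper trains both roles as LLMs with Group Relative
# Policy Optimization (GRPO) rather than these stand-ins.

import random
from collections import Counter

class StubModel:
    """Stand-in for an LLM policy so the loop runs end to end."""
    def generate_task(self):
        a, b = random.randint(2, 99), random.randint(2, 99)
        return f"{a} * {b}"
    def answer(self, task):
        a, b = map(int, task.split(" * "))
        noise = 0 if random.random() < 0.6 else random.randint(-3, 3)
        return a * b + noise
    def update(self, *args):
        pass  # in the real system: a GRPO policy-gradient step

def pseudo_label(answers):
    """Majority vote as pseudo-label; vote share as confidence."""
    (label, votes), = Counter(answers).most_common(1)
    return label, votes / len(answers)

def co_evolve(challenger, solver, iterations=3, n_tasks=32, n_samples=8):
    for _ in range(iterations):
        tasks = [challenger.generate_task() for _ in range(n_tasks)]
        challenger_rewards, train_set = [], []
        for task in tasks:
            answers = [solver.answer(task) for _ in range(n_samples)]
            label, conf = pseudo_label(answers)
            # Challenger is rewarded for tasks at the Solver's capability
            # edge: maximal uncertainty (conf near 0.5) pays the most.
            challenger_rewards.append(1.0 - 2.0 * abs(conf - 0.5))
            # Solver trains only on tasks whose majority label is reliable
            # enough to trust, yet not trivially easy.
            if 0.5 <= conf <= 0.9:
                train_set.append((task, label))
        challenger.update(tasks, challenger_rewards)
        solver.update(train_set)
        print(f"kept {len(train_set)}/{n_tasks} tasks for Solver training")

co_evolve(StubModel(), StubModel())
```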
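Similarly, Goedel-Prover-V2's verifier-guided self-correction boils down to a generate–check–revise loop against the Lean compiler. The sketch below is our schematic reading of that loop; `prover` and `lean_verify` are assumed interfaces for illustration, and the actual system is trained to revise from compiler feedback rather than merely prompted with it.

```python
# Schematic generate-check-revise loop in the spirit of verifier-guided
# self-correction. `prover` and `lean_verify` are assumed interfaces.

def prove_with_self_correction(prover, theorem, lean_verify, max_revisions=2):
    proof = prover.generate(theorem)               # first proof attempt
    for _ in range(max_revisions):
        ok, errors = lean_verify(theorem, proof)   # Lean compiler check
        if ok:
            return proof
        # Feed the verifier's error messages back so the model can repair
        # the failing step instead of resampling from scratch.
        proof = prover.revise(theorem, proof, errors)
    ok, _ = lean_verify(theorem, proof)
    return proof if ok else None                   # None: budget exhausted
```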

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟

Reinforcement Learning for LLM and Agent Capability Expansion

  • 🌟 Sotopia-RL: Reward Design for Social Intelligence (by University of Illinois Urbana-Champaign, University of California, Irvine, Ai2, Carnegie Mellon University, Stanford, MIT)
    train socially intelligent agents with utterance-level, multi-dimensional rewards to capture nuanced social behaviors → read the paper

  • Tool-integrated Reinforcement Learning for Repo Deep Search
    combine tool-use and RL to train LLMs for multi-step code issue localization and repair → read the paper

  • RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
    extend RLVR with hybrid-policy optimization to push beyond base model reasoning limits → read the paper

  • 🌟 Agent Lightning: Train ANY AI Agents with RL (by Microsoft)
    provide a general framework for applying RL to any AI agent architecture with minimal integration overhead → read the paper

  • SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
    enable GUI-based agents to self-learn new software via curriculum-guided exploration and iterative skill acquisition → read the paper

  • CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
    optimize ANNS algorithms via RL to balance retrieval accuracy and search speed → read the paper

  • 🌟 Exploitation Is All You Need... for Exploration (by Micah Rentschler, Jesse Roberts)
    demonstrate conditions under which pure exploitation objectives can yield emergent exploration in RL agents → read the paper

  • 🌟 Learning to Reason for Factuality (by Meta, University of Washington)
    design a reward for RL that balances factual precision, detail, and relevance to reduce hallucinations in reasoning models → read the paper

  • Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
    apply RL to open-weight LLMs for complex, stateful multi-turn software engineering tasks, boosting real-world SWE benchmarks → read the paper

  • Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
    improve instruction following without external supervision by using reasoning models’ own internal signals in self-supervised RL → read the paper

Evaluation, Verification, and Benchmarking

  • CompassVerifier
    build a robust, domain-general LLM verifier with a new benchmark for accurate answer checking and RL reward modeling → read the paper

  • Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
    introduce a multilingual, multimodal benchmark for fine-grained RAG system evaluation → read the paper

  • Are Today's LLMs Ready to Explain Well-Being Concepts?
    evaluate and improve LLMs’ ability to explain well-being concepts for diverse audiences via SFT and DPO → read the paper

  • VeriGUI
    create a long-horizon, verifiable GUI task dataset to evaluate complex computer-use agents → read the paper

Efficiency, Scaling, and Architecture Improvements

  • Trainable Dynamic Mask Sparse Attention
    introduce dynamic, content-aware sparse attention to improve long-context efficiency without losing fidelity → read the paper

  • 🌟 VeOmni (by ByteDance)
    scale omni-modal LLM training with a modular, model-centric distributed framework for efficient parallelism → read the paper

  • LeanK
    prune unimportant KV cache channels to reduce memory and speed up long-context decoding; a toy sketch of the idea follows after this list → read the paper
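
If you're curious what "pruning KV cache channels" looks like mechanically, below is a toy numpy sketch: score each key channel by a simple magnitude statistic, keep the top fraction, and compute attention scores over the surviving channels only. The magnitude heuristic and shapes are our illustrative assumptions; LeanK itself learns a static channel importance mask rather than using this criterion.

```python
# Toy illustration of channel-level KV-cache pruning. LeanK learns a static
# channel mask; the mean-magnitude score here is a stand-in heuristic that
# shows the mechanics, not the paper's learned criterion.

import numpy as np

def prune_key_channels(K, keep_ratio=0.5):
    """K: cached keys of shape (seq_len, head_dim).
    Keep the channels with the largest mean magnitude."""
    importance = np.abs(K).mean(axis=0)          # one score per channel
    k = max(1, int(K.shape[1] * keep_ratio))
    kept = np.sort(np.argsort(importance)[-k:])  # top-k channel indices
    return K[:, kept], kept

# Demo: compare approximate attention scores (pruned channels) to exact ones.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))                   # 128 cached tokens, d=64
q = rng.normal(size=64)                          # current query vector
K_small, kept = prune_key_channels(K, keep_ratio=0.25)
approx = K_small @ q[kept]                       # scores over kept channels
exact = K @ q                                    # full attention scores
print("kept", len(kept), "of 64 channels; correlation with exact scores:",
      round(float(np.corrcoef(approx, exact)[0, 1]), 3))
```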

Reasoning Process Understanding and Control

  • Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models
    review methods to reduce overlong reasoning chains in R1-style models while preserving accuracy → read the paper

  • 🌟 Is Chain-of-Thought Reasoning of LLMs a Mirage? (by Arizona State University)
    analyze CoT performance through a data distribution lens to reveal its fragility beyond training domains → read the paper

  • 🌟 Cognitive Loop via In-Situ Optimization (by Microsoft)
    enable self-adaptive, steerable reasoning for scientific discovery through uncertainty-aware cognitive loops → read the paper

  • 🌟 Sculptor (by Tsinghua University)
    equip LLMs with active context management tools to mitigate interference and improve reasoning robustness → read the paper

  • On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
    adjust SFT gradients for better generalization using a simple dynamic scaling method → read the paper

Retrieval and Information Grounding

  • SitEmb-v1.5
    improve long-document retrieval by conditioning short chunk embeddings on broader context → read the paper

  • AttnTrace
    trace LLM outputs back to influential context segments using efficient attention-based methods → read the paper

Applied Multimodal and GUI Agents

  • LaTCoder
    convert webpage designs to code with layout-preserving reasoning strategies → read the paper

  • 🌟 CoAct-1 (by University of Southern California, Salesforce, University of Washington)
    combine GUI actions with direct coding to create more efficient computer-use agents → read the paper

  • ChartCap
    build a large dataset and metric for hallucination-free dense chart captioning → read the paper

That’s all for today. Thank you for reading! Please share this newsletter with colleagues if it can help them understand AI better and stay ahead of the curve.

How was today's FOD?

Please give us some constructive feedback
