FOD#118: OpenAI the same day -> Slop and Top
plus the best curated roundup of impactful news, important models, related research papers, and what to read
This Week in Turing Post:
Wednesday / AI 101 series: Guardian Models
Friday / Interview: Ulrik Stig Hansen, co-founder of Encord
Our news digest is always free. Click on the partner’s link to support us, or Upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus AI labs and institutions such as Ai2, MIT, Berkeley, and government agencies – and thousands of others – to really understand what’s going on with AI →
Don’t stop at the editorial – the research papers are phenomenal this week
Now, to the main topic: How People Use ChatGPT
This Monday, I had a different topic in mind – two, actually. I was debating whether to cover causal attention (causal AI is something I follow diligently) or the state of hallucinations (two great papers dropped last week). But suddenly OpenAI started their Monday publishing: a 63-page report with hard numbers about how people use ChatGPT.
First, I got excited at the prospect of real insights. I actually printed the report out so I could underline the most interesting parts with my pink pencil. Then I read it, highlighting whatever caught my attention. And then I used ChatGPT to clarify those points and check whether my questions held up.
The mystery is why the researchers behind this report didn’t reread it themselves – or at least run it through ChatGPT. It might have saved them a few embarrassing moments.
suggested prompt: i have a few doubts about this report, what inconsistencies/inaccuracies/faults can you spot?
There are indeed a bunch of inconsistencies scattered around – most of them not catastrophic, but each one nicks at the credibility. And then there’s a bigger flaw that undermines the whole report.
Let me demonstrate:
They repeat throughout the report, in different words, the following: “As of July 2025 about 70% of ChatGPT consumer queries were unrelated to work; while both work-related and non-work-related queries have been increasing, non-work queries have been increasing faster.”
But then there’s also a footnote: “Our sample includes the three consumer plans (Free, Plus, or Pro). OpenAI also offers a variety of other ChatGPT plans (Business fka. Teams, Enterprise, Education), which we do not include in our sample.”
If you look at it strictly as a consumer usage report – then yes, it makes sense they cut out Teams, Business, Enterprise, and Education accounts. Those are not “consumer plans,” they’re workplace products. So the paper isn’t wrong for excluding them. But then how can you make any conclusions about work vs non-work usage?! It’s like writing a report on how people eat pizza – and then only counting take-out orders from Domino’s, while leaving out every slice eaten in restaurants, at school cafeterias, or office parties.
Where it gets confusing is in the framing. The title and conclusion position it as How People Use ChatGPT – full stop – when in fact it’s really How Consumers Use ChatGPT. That missing qualifier changes how you read the findings:
“70% of usage is non-work” is true for Free/Plus/Pro users, but you can’t generalize that to all usage when a giant slice of the pie – enterprise accounts where work dominates – is off the table.
The “work vs non-work” trend is real within consumer accounts, but it doesn’t tell us what’s happening in offices, classrooms, or enterprise workflows – because those users also use it for both work and non-work.
So:
If the researchers had just titled it How Consumers Use ChatGPT, no problem.
Because they didn’t, the report risks being quoted as “proof” that ChatGPT is mostly non-work everywhere, which isn’t supported by their own sampling choices.
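To make the sampling problem concrete, here is a toy back-of-the-envelope calculation. Only the 70% consumer figure comes from the report; the size and work share of the excluded business plans are invented for illustration – the report gives no data on them, which is exactly the problem:

```python
# Toy illustration of the sampling bias. Only the 70% figure comes from the report;
# the size and work share of the excluded plans are made up for illustration.
consumer_queries = 100          # Free/Plus/Pro queries (indexed to 100)
consumer_non_work_share = 0.70  # the report's headline figure

# Hypothetical excluded segment: Business/Enterprise/Education accounts,
# assumed here to be half as large and overwhelmingly work-related.
business_queries = 50
business_non_work_share = 0.10

total = consumer_queries + business_queries
overall_non_work = (consumer_queries * consumer_non_work_share
                    + business_queries * business_non_work_share) / total
print(f"Non-work share across ALL usage: {overall_non_work:.0%}")  # ~50%, not 70%
```

The exact numbers don’t matter; the point is that the headline share is a property of the sampled plans only, and the overall work/non-work split can land almost anywhere once the excluded accounts are put back in.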
They say: “While most economic analysis of AI has focused on its impact on productivity in paid work, the impact on activity outside of work (home production) is on a similar scale and possibly larger.”
If they claim that, then they have to actually do the comparative work analysis to justify it. Otherwise, the comparison collapses into hand-waving.
They also say: “The fact that non-work usage is increasing faster suggests that the welfare gains from generative AI usage could be substantial.”
I don’t know why it’s so important for them to hammer that point, but it falls apart under scrutiny. And it got me worked up, because if you have millions of users and millions of readers, you have to be responsible for what you say. When a report is this sloppy, it’s just painful, and it raises questions about credibility.
*sigh* They also launched Codex the same day, though. Read about it below. It’s Top.
Ad moment (click to support us):
How Canva, Perplexity and Notion turn feedback chaos into actionable customer intelligence
Support tickets, reviews, and survey responses pile up faster than you can read.
Enterpret unifies all feedback, auto-tags themes, and ties insights to revenue, CSAT, and NPS, helping product teams find high-impact opportunities.
→ Canva: created VoC dashboards that aligned all teams on top issues.
→ Perplexity: set up an AI agent that caught revenue‑impacting issues, cutting diagnosis time by hours.
→ Notion: generated monthly user insights reports 70% faster.
Stop manually tagging feedback in spreadsheets. Keep all customer interactions in one hub and turn them into clear priorities that drive roadmap, retention, and revenue.
Links from the editorial:
OpenAI report “How people use ChatGPT”
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
After doing 3 Wow and 1 Promise for a few weeks, we asked ourselves: Who really needs more AI news? With so much out there, attention gets stretched too thin. What matters is holding focus on the things that shape the long-term horizon.
Introducing Attention Span – starting with an explanation of something new (and the first paper) from Thinking Machines Lab. Watch it here →
News from The Usual Suspects ©
Also from OpenAI the same day: GPT-5-Codex – from code suggestions to coding agents, with no wasted tokens
“GPT-5-Codex — big improvement for long-running agentic tasks” – Greg Brockman (@gdb), Sep 15, 2025
Some developers complain that Codex feels slower (though smarter) than Claude Code – but that’s actually the whole point. Codex has been trained to spend its effort where it matters. It doesn’t waste tokens on trivial autocomplete tasks; it answers those quickly. But when the problem is harder, it slows down, reasons more deeply, and works longer. That’s by design, and it’s a very interesting feature.

Image Credit: Introducing Updates to Codex
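The effort routing happens inside the model itself, but the idea is easy to picture. Below is a purely illustrative, hypothetical sketch – not how GPT-5-Codex is implemented and not an OpenAI API – of what “spend effort where it matters” means as a routing decision:

```python
# Purely illustrative toy router mimicking the idea of adaptive effort.
# This is NOT how GPT-5-Codex works internally; the model makes this call itself.

def estimate_difficulty(task: str) -> float:
    """Crude stand-in for a difficulty signal (length plus a few keywords)."""
    hard_markers = ("refactor", "race condition", "migrate", "design")
    score = min(len(task) / 200, 1.0)
    if any(marker in task.lower() for marker in hard_markers):
        score = max(score, 0.8)
    return score

def pick_effort(task: str) -> str:
    """Map difficulty to a reasoning budget: cheap for trivia, heavy for hard work."""
    difficulty = estimate_difficulty(task)
    if difficulty < 0.3:
        return "minimal"  # answer quickly, spend few reasoning tokens
    if difficulty < 0.7:
        return "medium"
    return "high"         # slow down, reason longer, iterate, run tests

print(pick_effort("rename this variable"))                                  # minimal
print(pick_effort("refactor the auth module and fix the race condition"))  # high
```

Codex makes this judgment internally per request, so trivial edits come back fast while gnarly refactors get the long, deliberate treatment described above.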
Anthropic’s MCP goes public
The MCP Registry has landed – an open catalog and API for discovering publicly available MCP servers. It’s designed as a single source of truth, enabling both public and private sub-registries to thrive without stepping on toes. With a community-moderated model and open-source foundation, it’s a foundational step toward scaling context-aware AI. A quiet launch, but one with deep roots and broad ambitions.
Oracle’s Loud Pivot
After a decade of quiet infrastructure work, Oracle just shouted its way into the AI big leagues. With a record-setting compute deal in the works and AI demand visibly swelling its backlog, Oracle looks less like a dusty database vendor and more like the connective tissue of enterprise AI. It skipped the model arms race and built the rails – data, governance, and distribution – for others to ride.
Devin Goes to Eleven
Cognition AI, the team behind the coding agent Devin, just raised $400M at a $10.2B valuation – up from $4B earlier this year. With ARR jumping from $1M to $73M in under a year and net burn under $20M, the numbers are as aggressive as the company culture. Long hours, layoffs, and buyouts haven’t scared investors – or slowed growth. It’s a hyperloop ride in both valuation and velocity. We just published a super detailed deep dive about them → read it here
We are reading/watching
Absolutely brilliant: Melanie Mitchell on Magical Thinking and AI
One of our favorite visionaries, Demis Hassabis, on AI, Creativity, and a Golden Age of Science | All-In Summit
Fully autonomous robots are much closer than you think – Sergey Levine with Dwarkesh Patel
Models to pay attention to
VaultGemma – train a 1B decoder-only Gemma variant fully under differential privacy, demonstrate practical DP scaling laws, and release open weights for privacy-preserving applications →read the paper (pdf)
Hunyuan-MT / Hunyuan-MT-Chimera – build multilingual translation models across 33 languages and aggregate multi-setting outputs at test time to boost robustness, achieving state-of-the-art WMT2025 performance →read the paper
mmBERT – pretrain a modern multilingual encoder on 3T tokens with annealed language learning to lift classification and retrieval in both high- and low-resource languages →read the paper
Qwen3-Next – combine gated DeltaNet and gated attention with an ultra-sparse MoE and native multi-token prediction to deliver long-context efficiency while activating ~3B of 80B parameters →read the paper
Interesting surveys
A Survey of Reinforcement Learning for Large Reasoning Models →read the paper

Image Credit: The original paper
Reinforcement learning foundations for deep research systems: A survey
Researchers from Huawei Technologies surveyed RL approaches for training deep research systems with hierarchical agents. They examined data synthesis methods like cross-document and obfuscated queries, RL techniques for long-horizon credit assignment, reward design, and multimodal reasoning, and frameworks such as GRPO and DUPO. The survey highlights system bottlenecks, coordination strategies, and benchmarks, offering a roadmap for building scalable, tool-using, and evaluation-ready agentic research systems →read the paper
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟
Agents, Tools & Environments
🌟 Tool-space interference in the MCP era: Designing for agent compatibility at scale (Microsoft) – analyze how tool catalogs interact in the Model Context Protocol ecosystem and propose ways to prevent cross-agent inefficiencies →read the paper
🌟 Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents (Stanford) – convert research papers into interactive MCP-based agents that can execute the original workflows and extend them →read the paper
🌟 Virtual Agent Economies (Google DeepMind) – conceptualize agent-to-agent markets and explore auction mechanisms, mission economies, and governance for steerable AI economies →read the paper
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents – generate challenging web navigation data and train agents with long contexts and tool calls for state-of-the-art browsing →read the paper
EnvX: Agentize Everything with Agentic AI – transform GitHub repositories into autonomous agents capable of natural interaction and cross-repository collaboration →read the paper
Agentic RL & Long-Horizon Execution
🌟Bootstrapping Task Spaces for Self-Improvement (Meta) – train models with exploratory iteration to grow task spaces and enable inference-time self-improvement across math, tool-use, and ML tasks →read the paper
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning – instill parallel reasoning through curriculum and RL, using multi-path exploration as a scaffold for stronger problem solving →read the paper
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning – provide a unified framework and scaling strategy to train LLM agents for multi-turn decision making across realistic environments →read the paper
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents – stabilize learning with uncertainty-aware gradient modulation, amplifying confident correct updates while dampening unstable ones →read the paper
Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding – dynamically adjust problem difficulty with adaptive hints to keep training efficient and aligned with model capacity →read the paper
SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents – reinforce reasoning-oriented agents with synthetic data to strengthen autonomous deep research skills →read the paper
ΔL Normalization: Rethink Loss Aggregation in RLVR – minimize gradient variance in verifiable reward training by normalizing losses for variable-length outputs →read the paper
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing – decentralize RL post-training with asynchronous rollout sharing to scale efficiently across heterogeneous hardware →read the paper
Reasoning, Hallucination & Reliability
🌟 The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs – show how compounding per-step accuracy yields exponential gains in long tasks, and why execution errors dominate over reasoning gaps (a quick numeric illustration follows this list) →read the paper
🌟 Why Language Models Hallucinate (OpenAI) – explain hallucinations as statistical pressures from training and evaluation incentives that reward guessing, not calibrated uncertainty →read the paper
The Majority is not always right: RL training for solution aggregation – train aggregators that reconcile multiple candidate solutions into a correct answer, outperforming majority voting →read the paper
Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet – reveal that longer reasoning often increases hallucinations in fact-heavy settings, limiting test-time scaling benefits →read the paper
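As promised above, a back-of-the-envelope illustration of the compounding effect behind the Illusion of Diminishing Returns entry, under the simplest possible assumption that steps succeed independently with probability p (the paper’s own analysis is more careful):

```python
import math

# Longest task (in steps) a model can finish with >= 50% end-to-end success,
# assuming each step succeeds independently with probability p: solve p**n = 0.5.
def horizon(p: float, target: float = 0.5) -> float:
    return math.log(target) / math.log(p)

for p in (0.99, 0.999):
    print(f"per-step accuracy {p:.1%} -> ~{horizon(p):.0f} steps at 50% success")
# 99.0% -> ~69 steps; 99.9% -> ~693 steps: a sub-1-point gain buys ~10x the horizon.
```

That is the sense in which marginal single-step improvements do not hit diminishing returns on long-horizon work.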
Safety, Security & Robustness
🌟 Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated (Google DeepMind) – demonstrate decomposed reasoning poison attacks that target chain-of-thought while also revealing emergent robustness →read the paper
🌟 All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching (Texas A&M University) – build an LLM-driven system that discovers and patches software vulnerabilities, validated in DARPA’s AIxCC →read the paper
🌟 R2AI: Towards Resistant and Resilient AI in an Evolving World (Tsinghua) – propose a safe-by-coevolution paradigm where AI develops immunity-like resistance and resilience through adversarial feedback loops →read the paper
🌟 Statistical Methods in Generative AI – survey how statistical tools can improve reliability, fairness, and safety in generative AI pipelines →read the paper
Architectures & Training Paradigms
Guided Decoding and Its Critical Role in Retrieval-Augmented Generation – compare decoding frameworks that constrain RAG outputs to structured formats, balancing hallucination control and usability →read the paper
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining – equip models with strong in-context ML capabilities via causal model–based pretraining and efficient serialization →read the paper
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models – introduce trajectory-aware RL for diffusion LMs, yielding smaller yet stronger reasoning models →read the paper
🌟 Language Self-Play For Data-Free Training (Meta) – use game-theoretic self-play to let models improve without external data, showing stronger task performance than data-driven baselines →read the paper
🌟 Causal Attention with Lookahead Keys – extend causal attention with lookahead keys to blend forward-looking context without breaking autoregressive constraints →read the paper
Multimodal Reasoning & Integration
🌟 Visual Representation Alignment for Multimodal Large Language Models (KAIST) – align multimodal LLMs’ vision pathways with pretrained VFMs to improve fine-grained visual reasoning →read the paper
Can Understanding and Generation Truly Benefit Together – or Just Coexist? – unify image understanding and generation through reconstruction-based RL, showing mutual improvements →read the paper
Visual Programmability: A Guide for Code-as-Thought in Chart Understanding – enable adaptive chart reasoning via code-as-thought pathways and RL-based strategy selection →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.