
Sycophancy and Leaderboard Illusion: A Bug, a Lesson

We discuss how warped incentives can rewrite reality in AI, plus a carefully curated list of news, models, surveys, and research papers.

This Week in Turing Post:

  • Wednesday, AI 101, Concept: What is Defense AI?

  • Friday, Agentic Workflow: let’s dive into multi-agent collaboration

Last week was packed – make sure to check every section of today’s newsletter. The first part covers trends, the second is more technical.

This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.

Remember that scene in old cartoons where the pitcher sticks a magnet under home plate and the ball zips off‑course? The AI season of 2025 feels a lot like that—only the magnets are baked into our feedback loops. Two stories made a lot of noise this week, both landing on the same point: the metrics we lean on – thumbs‑up data and non-transparent leaderboards – are nudging the whole AI field off balance.

How?

1. Sycophantic drift – “you are the best, master!”

A small post‑training change pushed OpenAI’s GPT‑4o to echo users rather than help them. Suddenly, ChatGPT started to flatter you and agree with you on everything. Agreement counted as “good,” so the model optimised for flattery. Internal spot‑checks felt the shift, automated tests did not, and the update went live. The AI community learned the word ‘sycophancy’ – and how to spell it. A rollback followed, but the episode showed how easily the reward function can slide away from accuracy.

2. The leaderboard illusion

It turned out (according to five months of research) that Chatbot Arena, the go‑to ranking for new models, isn’t as neutral as it looks. Large labs have been entering many private variants, keeping only the top score, and receiving more user prompts than everyone else. The table still reports who “won,” but the race it reflects is uneven.
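To make the selection effect concrete, here is a minimal, purely illustrative simulation (not the Arena’s actual methodology): if a lab submits several private variants of essentially the same model and each measured rating is just true skill plus noise, reporting only the best one inflates the number – and the inflation grows with the number of variants.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SKILL = 1200   # hypothetical "true" rating, identical for every private variant
NOISE_STD = 25      # hypothetical measurement noise in the leaderboard score
TRIALS = 10_000     # repeat the experiment many times to estimate the average outcome

for n_variants in (1, 5, 10, 20):
    # Submit n private variants, observe noisy ratings, report only the best one.
    ratings = TRUE_SKILL + NOISE_STD * rng.standard_normal((TRIALS, n_variants))
    best = ratings.max(axis=1)
    print(f"{n_variants:2d} variants -> reported rating ~ {best.mean():.0f} "
          f"(inflation {best.mean() - TRUE_SKILL:+.0f})")
```

Two labs with identical models can end up dozens of points apart simply because one of them tested more private variants and kept only the winner.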

One pattern

Both cases are symptoms of the same thing: the signal we optimise for is drifting from the outcome we actually want. In one loop the user’s praise stands in for truth; in the other, a public score stands in for genuine capability. When that gap widens, we get models that look better than they are – mostly models from the big players, since they, of course, have more resources.
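As a toy illustration of that gap (the numbers are invented, and this is not how OpenAI actually selects checkpoints): if you pick, among candidate post-training checkpoints, the one with the highest thumbs-up rate, and flattery raises thumbs-up while lowering factual accuracy, the selection rule happily trades the outcome you want for the proxy you measure.

```python
# Hypothetical checkpoints: (name, thumbs-up rate, factual accuracy).
# All numbers are made up to illustrate proxy drift, not real measurements.
candidates = [
    ("baseline",            0.71, 0.88),
    ("more helpful",        0.74, 0.89),
    ("slightly flattering", 0.79, 0.84),
    ("full sycophant",      0.86, 0.73),
]

best_by_proxy   = max(candidates, key=lambda c: c[1])  # what the feedback loop rewards
best_by_outcome = max(candidates, key=lambda c: c[2])  # what we actually want

print("Picked by thumbs-up rate:", best_by_proxy[0])    # -> full sycophant
print("Picked by accuracy:      ", best_by_outcome[0])  # -> more helpful
```

The wider the gap between proxy and outcome, the more confidently the loop picks the wrong model – which is exactly the drift both stories describe.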

What’s good though?

OpenAI’s reaction was immediate and thorough. They wrote a fascinating post that quite transparently explained what happened. Overall, I would say it was a great learning experience for all of us. (Check Nathan Lambert.)

Chatbot Arena’s team reacted with a detailed “what’s wrong with the research” response. But the broader discussion it sparked was the real good outcome (check Karpathy, Sara Hooker, Arvind Narayanan). The situation with Chatbot Arena demonstrates that we can’t rely on one leaderboard – and that, in general, we haven’t solved, or even come close to solving, accurate evaluation and benchmarking.

Extra Ripples in the Warp

Our feedback loops are skewing more than just model outputs. Two trends highlight the stakes:

  • Governance Blind Spot: A review of 9,400 GenAI papers shows 96% of “safety” research focuses on pre-deployment tweaks, leaving post-launch issues like hallucinations understudied. We optimize for clean lab results, not real-world reliability, creating a feedback gap that misplaces trust in deployed models →read the paper

  • Hyper-Persuasive Personas: A study from MIT, Cornell, and American University found GPT-4 debates can cut conspiracy beliefs by 80%. Pair this with the sycophantic flattery OpenAI briefly unleashed, and we risk models optimizing for persuasion over truth – a feedback loop ripe for exploitation →read the paper

When metrics miss these drifts, we amplify blind spots and biases in AI’s real-world impact.

Next steps worth cheering for

  1. Use many yardsticks. No single leaderboard can carry the field; rotate tasks, mix evaluators, publish raw data. A lot of work is happening in this area, but it’s a very tough task.

  2. Make vibe checks launch‑blocking. If five human prompts spot a weird persona shift, halt the rollout – same priority as a safety fail. This requires the will to do it.

  3. Keep every variant public. Tested means listed; hiding low scores erodes trust and tilts rankings toward well‑funded labs. I doubt it’s realistic though.

  4. Keep studying model behavior after deployment, applying the same care we bring to fine-tuning – while sharing telemetry data in a privacy-respecting way.

  5. Lean on open source.

    • Open metrics – release eval code, prompts, and scoring scripts with the model (a minimal sketch follows this list).

    • Open telemetry – provide redacted logs so outsiders can track drift early.

    • Open dialogue – support multiple transparent leaderboards instead of one opaque monolith.
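Here is a minimal sketch of what “open metrics” could look like in practice – hypothetical file names and a stubbed model call, not any particular lab’s harness: the prompts, the scoring rule, and the raw per-example results all ship together, so anyone can re-run or audit the headline number.

```python
import json

def exact_match(prediction: str, reference: str) -> int:
    # Toy scoring rule; a real release would ship whatever rule produced the headline score.
    return int(prediction.strip().lower() == reference.strip().lower())

def run_eval(model_fn, prompts_path="prompts.jsonl", out_path="raw_results.jsonl"):
    """Run every released prompt through the model and dump raw per-example results."""
    results = []
    with open(prompts_path) as f:
        for line in f:
            example = json.loads(line)              # {"prompt": ..., "reference": ...}
            prediction = model_fn(example["prompt"])
            results.append({**example,
                            "prediction": prediction,
                            "score": exact_match(prediction, example["reference"])})
    with open(out_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    return sum(r["score"] for r in results) / len(results)

# Usage sketch: plug in any model callable, e.g. run_eval(lambda prompt: "42")
```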

OpenAI’s swift autopsy and the debate around Chatbot Arena show we can course‑correct.

Welcome to Monday. Where the magnets are still under the field, but at least we’re mapping them – together.

Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗


News from The Usual Suspects ©

Meta and Yann LeCun: is it time to part?

  • It’s purely a feeling, but I wouldn’t be surprised if we soon hear about a friendly departure of Yann LeCun from Meta. While Mark is everywhere doing the Llama 4 world tour, Yann has been unusually quiet – barely any posts or reposts about this major update (he did repost Mark’s reels about the Meta app).

    And then there’s Joelle Pineau, who led Meta’s Fundamental AI Research (FAIR) lab and announced her departure in April 2025. Add to that the stark difference in how Zuckerberg and LeCun treat Trump. No hard proof – just signals. But if I had to bet, I’d say LeCun and Meta are about to part ways.

    Some links: Meta’s AGI plan (Mark’s interview with Dwarkesh Patel); AI and the evolution of social media (Mark’s interview with Stratechery); the first LlamaCon and its announcements (most interesting: the Llama API and the Meta app).

A lot of Anthropic (with a bite of Apple)

  • Anthropic's Claude just got a serious upgrade. With the new Integrations feature, Claude can now plug directly into tools like Jira, Asana, Zapier, and Intercom. On top of that, Claude’s Advanced Research mode now pulls from the web, Google Workspace, and connected apps to deliver deep-dive reports in under 45 minutes, complete with citations.

  • Anthropic has launched the AI for Science program, offering free API credits to researchers working on high-impact projects – especially in biology and life sciences.

  • Claude Goes to Washington. Anthropic has thrown its weight behind the U.S. government’s Diffusion Rule, advocating tougher export controls to maintain America's edge in AI chips. Their memo calls for tightening loopholes, boosting enforcement, and preventing a compute brain-drain to rivals like China’s DeepSeek. One smuggler reportedly packed GPUs with lobsters. Anthropic, it seems, prefers its chips without seafood – just secure, domestic, and strategically vital. Jensen Huang of NVIDIA says Anthropic is telling a ‘tall tale’. Also:

  • Apple is teaming up with Anthropic’s Claude Sonnet model to build a “vibe-coding” platform. Initially for internal use, it may reach third-party devs if Apple’s vibes are right. A quiet nod to reality: Apple desperately seeks help with AI models.

Hugging Face is hugging the planet with the LeRobot Hackathon. 

  • Join the fun – because wouldn’t it be lovely if our robots finally started doing the laundry and loading the dishwasher?

Surveys

  • 100 days after DeepSeek-R1 is a survey of open-source replication efforts for reasoning LLMs, covering supervised and RL methods, with discussions on generalization, safety, and multimodal extensions →read the paper

  • Taming the titans is a survey of LLM inference optimizations, from model-level tricks like KV cache reuse to cluster scheduling, plus niche topics like fairness and energy use →read the paper

  • A survey of interactive generative video is a roadmap of real-time video generation systems for gaming, embodied AI, and driving, framed around five key modules and core challenges →read the paper

Fresh Models

(do we really need this many?):

  • 2 Olmo 2 Furious from AI2 – a reproducible 1.48B parameter English language model that beats Llama 3.1 1B and Gemma 3 1B on reasoning benchmarks like GSM8K and MMLU using 4T tokens of pretraining and mid-training on a 50B curated mix →read the paper

  • Two Phi-4 models from Microsoft (reasoning and mini-reasoning) – a 14B LLM with 1.4M detailed reasoning traces and outcome-based RL, boosting math and spatial task performance and rivaling models 40–50× larger; and a 3.8B model with mid-training, DPO, and RL to surpass 7B–8B models on math reasoning tasks like MATH-500, showcasing effective small-model capabilities

  • Llama-Nemotron from NVIDIA (built on Meta’s Llama) – a family of reasoning-optimized open-source LLMs (8B to 253B) that outperform DeepSeek-R1 in speed and accuracy using FP8 inference and dynamic reasoning toggles →read the paper

  • DeepSeek-Prover-V2 advances formal theorem proving using a 671B model trained with recursive subgoal decomposition and RL, achieving state-of-the-art scores on MiniF2F and introducing ProverBench →read the paper

  • FoundationAI-SecurityLLM-Base-8B – a cybersecurity-specialized LLM using Llama 3.1 as a base, improving performance on domain-specific benchmarks like CTIBench while preserving general abilities →read the paper

  • Mellum-4b-base from JetBrains – an open-sourced 4B code-focused model for Python and Java, efficient enough for IDE use, scoring strongly on RepoBench and HumanEval infilling →read the paper

  • Amazon Nova Premier – a multimodal LLM with 1M-token context support across text, image, and video, designed as both a reasoning powerhouse and distillation teacher →read the paper

  • Granite 4.0 Tiny Preview from IBM – a 7B hybrid MoE model using a Mamba-Transformer mix, supporting unconstrained 128K contexts and efficient inference with just 1B active parameters →read the paper

  • X-Fusion is a plug-and-play architecture that adds vision understanding and generation to frozen LLMs without retraining them →read the paper

The freshest research papers, categorized for your convenience

There were quite a few TOP research papers this week; we mark them with 🌟 in each section.

Alignment & Evaluation

  • Toward evaluative thinking is a framework that evolves reward prompts during training to improve alignment and reduce reward hacking in LLMs — read the paper.

  • 🌟 Beyond one-size-fits-all is a method that generates model-specific evaluation prompts from one human-rated sample to better align with human judgment — read the paper.

  • 🌟 Beyond the last answer is an evaluation strategy that uses intermediate reasoning traces to boost final answer accuracy and interpretability — read the paper.

  • 🌟 Real-world gaps in AI governance research is an empirical study showing how corporate labs underemphasize real-world deployment risks in AI safety work — read the paper.

Reasoning & Prompting Techniques

  • From Long-CoT to Hybrid-CoT is a bi-level training approach that adapts between long and short reasoning styles to reduce inference cost while preserving accuracy — read the paper.

  • Chain-of-defensive-thought is a prompting strategy that defends LLMs against reference corruption attacks without degrading clean input performance — read the paper.

  • 🌟 Reinforcement learning for reasoning is a 1-shot training method that drastically improves math performance in LLMs using verifiable reward signals — read the paper.

  • Softpick is a sparse attention mechanism that replaces softmax to avoid unstable activations and boost performance, especially in quantized models — read the paper.

Memory, Agents & Decision-Making

  • 🌟 Mem0 is a long-term memory system for LLM agents that compresses and persists conversational knowledge across sessions — read the paper.

  • 🌟 Self-generated in-context examples is a technique where agents improve themselves by storing and reusing their own successful decision traces — read the paper.

  • WebThinker is a framework that gives LLMs web navigation tools for autonomous research and scientific report generation — read the paper.

Retrieval & RAG Systems

  • UniversalRAG is a routing-based RAG system that selects among text, image, and video corpora to improve retrieval across modalities — read the paper.

  • ReasonIR is a retriever trained on synthetic, reasoning-focused data that boosts RAG quality with minimal compute — read the paper.

Language Modeling & Synthetic Data

  • 🌟 Tf1-en-3m is a dataset of 3 million moral fables built to train small language models on structured storytelling — read the paper.

Recommendation & Planning Systems

  • X-Cross is a cross-domain recommendation system that merges domain-specific LLMs using adaptive integration for better efficiency — read the paper.

  • TeLoGraF is a graph-based planner that generates fast and correct action plans under temporal logic constraints — read the paper.

That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

Leave a review!
