Turing Post
Posts
FOD#104: AI “Holy Shit!” Moments of the Year – and What’s Still Not There Yet

FOD#104: AI “Holy Shit!” Moments of the Year – and What’s Still Not There Yet

short interviews from the AI Engineer World Fair, plus our regular curated selection of the most relevant AI news and research papers

Ksenia Se
June 09, 2025

This Week in Turing Post:

Wednesday – AI 101 / Concept: Let’s discuss the role of synthetic data
Friday – Interview: Olga Megorskaya on Human-in-the-Loop

Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. If you want to support us without getting a subscription – do it here.

Two main topics today!

1) The “Holy Shit” Moments of AI vs “Not There Yet” Hype

It’s June – halfway through the year. The pace of AI development is neck-breaking, and many things get forgotten. What was once impossible is now a commodity.

So, while at the AI Engineer World Fair last week in San Francisco, I decided to ask a few builders: what was their “holy shit” moment with AI this year?

It’s funny but a significant “wow” moment for one person might still feel like a “not there yet” to another. We also talked about what parts of their jobs they’d actually be happy to hand over to AI.

Simon Willison (an independent AI engineer everyone knows, aka the author of the 'pelican riding a bicycle' benchmark), swyx (Latent Space podcast and AI Engineer conferences), Jerry Liu (LlamaIndex), Solomon Hykes (Docker and Dagger), Stefania Druga (AI Educator) and a few others shared their views: watch and subscribe →

(I publish text summary of the survey online, you can read it here →read online

AI "Holy Shit!" Moments This Year (as recorded at AI Engineer World Fair, Jun 3-5, SF):

AI practitioners shared with me diverse moments that surprised or impressed them:

Multimodality: The rapid improvement in image and video generation quality and speed was a significant moment.
Agent Autonomy: Waking up to find an agent had completed coding tasks better than they could have was a total WOW moment.
Code Generation & Planning: The ability of models (specifically LLMs) to generate entire programs from scratch, write code for planning, and use tools effectively unlocks new applications.
Exceeding Human Capability: A specific moment when Gemini 2.5 Pro immediately solved a problem a human couldn't figure out led to the realization that the AI was "smarter than me".
Capable Local Models: The significant improvement in local models (like Mistral Small 3, Llama 70B, Gemma 3) that can run on a laptop and feel "GPT4 class" was exciting.
Models Joining the Team: Realizing models are now smart enough to join a software engineering team, write code, understand assignments, and make a plan was a "holy shit" moment indeed.
Improved Model Reasoning/Tool Use: The ability of models like O3 and O4 mini to use tools (like search) as part of their thinking process and refine their approach was seen as a revelation.
Emergence of Young Builders: Meeting 19-20 year old teams building AI companies was a significant moment.
Field Expansion: The presence of talks on science, multimedia, and voice for the first time indicated the field expanding beyond code and digital applications.
Image Generation Advancements: Seeing auto-regressive image generation and the ability to prompt and edit photos was a clear sign of the future.

What's Hyped But Not There Yet:

While many areas show promise, practitioners noted things being widely discussed but not fully realized:

Reliable Coding Agents: While agents are exciting, coding agents still make many mistakes and require babysitting. Some believe they might be ready this year.
Autonomous Agents & Multi-Agent Systems: Fully autonomous workflows and making agents communicate with each other introduce significant complexities and require more work. Multi-agent conversation and planning (MCP) is the largest track at the conference, but most people haven't put it in production yet and are skeptical.
Reinforcement Learning for Workflows: Applying reinforcement learning to personal workflows without being an expert is exciting but not yet widely accessible.
Security/Prompt Injection: The security side, particularly prompt injection, remains a significant unsolved problem, despite being an issue for two and a half years. Models are still fundamentally gullible.
Connecting Data to Reasoning Models: The necessary ability to go from logs and user interaction data into valuable datasets, features, or directly piping back into reasoning models is talked about but not yet available, although some believe it will be there in the next year.

Are Agents Part of the Workflow or Still Hype?

There was a division in perspectives on agents:

Definitely in Workflow: For some, agents are not hype; they are using them for most of their coding and writing their own agents, having decided to "go all in". Agents are seen as real, new, and happening very quickly, already working for some people. Coding agents specifically are expected to become much more mainstream by the end of this year. Search agents (like research assistants) and coding agents that can write, execute, and debug code are considered amazingly effective tools.
Slowly Becoming Part: For others, agents are slowly starting to become part of the workflow, particularly in coding and using tools like deep research or O3 as primitive forms of agents. It's believed they will become a huge part of workflows later this year, within 6 months, pending better integrations and reliability.
Skepticism/Varied Definitions: The term "agent" itself lacks a single definition. While the fundamental idea of a model running in a loop and performing actions makes sense and is believed to get better, some views are based more on demos than widespread practical application.

Excitement for AI Taking Over Job Parts:

When asked what parts of their job they'd be excited for AI to take over, common themes emerged:

Repetitive Tasks: There is excitement for AI to handle the "long tail of tasks" and "endless endless tasks" that are repetitive and often don't get done. This includes things like preparing memos, emailing people, and communicating ideas manually.
Talking to Investors: One person specifically mentioned wanting AI to replace the task of talking to investors.
Automating Workflows: Generally, automating workflows to focus on more interesting things like coming up with research questions and experiments was desired. This is seen more as augmentation than replacement in the scientific process.

Anyway, it’s better to watch the whole video → watch it on YouTube

2) Everyone talks about these papers:

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity by Apple

And How much do language models memorize? – a collaboration between researchers from FAIR, Google DeepMind, Cornell University, and NVIDIA

But what do many people miss about them?

The papers seem to study different phenomena. One investigates the limits of reasoning, the other of memorization. But what everybody missed is this: they describe the same underlying breakdown – a model’s coping mechanism when pushed past its fundamental capacity.

In the Illusion of Thinking paper, the model’s processing capacity – its effective “CPU” – is overloaded by puzzles that demand deep, multi-step reasoning. The model’s response is to abort the task mid-thought, reducing reasoning effort as complexity increases. This leads to a visible reasoning collapse.

In the Memorization paper, the model’s storage capacity – its “Hard Drive” – is saturated by large training sets. The model can’t memorize everything, so it begins to compress aggressively by generalizing. This triggers the double descent phenomenon and reduces the model’s ability to recall specifics.

Same failure, different modality. Whether the pressure comes from too many steps or too much data, the result is the same: the model simplifies, guesses, or shuts down – all while still outputting something that looks fluent and confident.

Reasoning collapse and forced generalization aren’t separate problems. They’re two faces of the same coin: how finite architectures break under load.

Welcome to Monday. Don’t you worry, the models are still just a technology with a bunch of technological bottlenecks.

Curated Collections

LLM, SLM, VLM, MLLM, LAM… There are a lot of model abbreviations out there. We decided to help you learn them — or at least make them a bit clearer.

Follow us on 🎥 YouTube Twitter Hugging Face 🤗

We are reading/watching

The Perfect Week of Exercise - Rick Rubin talking to Jack Clark (Anthropic)
Disrupting malicious uses of AI: June 2025 by Open AI (read it like a thriller).
Some thoughts on human-AI relationships from Joanne Jang, OpenAI’s lead model behavior & policy
The last six months in LLMs, illustrated by pelicans on bicycles by Simon Willison

News from The Usual Suspects ©

Apple opens the AI gates – but keeps Siri on mute
At WWDC 2025, Apple finally cracked open its AI vault. The new “Apple Intelligence” suite is now available to third-party developers, promising features like image-aware suggestions and real-time translation. But Siri, expected to headline the show with a reboot, was curiously absent. That evolution is now delayed till 2026. For now, devs get the toys—users still get the same old assistant.
Yoshua Bengio’s LawZero
AI legend Yoshua Bengio has launched LawZero, a nonprofit devoted to building AI that doesn't go rogue. Based in Montréal and incubated at Mila, the lab rejects agentic designs in favor of “Scientist AI” — models that understand rather than act. Think oversight over ambition. Backers include Open Philanthropy and Jaan Tallinn. The aim? Guardrails for an accelerating world.
Anthropic’s Guide to Claude Code
Anthropic is dogfooding Claude Code across the board – from growth marketers building Figma-integrated ad generators to legal teams prototyping accessibility tools in an afternoon. Whether it's Kubernetes debugging, React dashboard generation, or Terraform reviews, Claude Code is their new universal colleague.
OpenAI’s Voice
OpenAI has rolled out improvements to ChatGPT’s Advanced Voice Mode for paid users, enhancing naturalness and expressiveness in speech. The updated system now handles subtleties like tone, cadence, and emotional inflection more effectively. It also introduces live translation capabilities across languages – useful for both travel and global collaboration. Some minor audio inconsistencies remain, but overall, voice interactions take another step forward.

Models and datasets to pay attention to:

SmolVLA: A vision-language-action model for affordable and efficient robotics
Researchers from Hugging Face and Sorbonne University developed SmolVLA, a compact VLA model with just 0.45B parameters that rivals 10× larger systems in robotic control tasks. Trained on 22.9K episodes from 481 community datasets, it supports single-GPU training and CPU deployment. SmolVLA uses an asynchronous inference stack to decouple action prediction and execution, enabling 30% faster control. It outperformed larger baselines in real-world and simulation benchmarks while maintaining efficiency and reproducibility →read the paper

The freshest research papers, categorized for your convenience

We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟

Reasoning with RL & Self-Improvement

Self-Challenging Language Model Agents
Trains agents to create and solve their own tool-use tasks using code-based problem generation and RL → read the paper
Reflect, Retry, Reward
Enhances model performance by rewarding useful self-reflection after incorrect answers, using only binary feedback → read the paper
ProRL
Prolonged RL training discovers novel reasoning strategies beyond what’s visible in the base model → read the paper
Beyond the 80/20 Rule
Finds that a small number of high-entropy tokens drive RL improvements and reasoning diversity → read the paper
REASONING GYM
A synthetic RL environment generator for diverse reasoning domains, offering infinite task variation → read the paper
AlphaOne
Adds a test-time control mechanism for toggling between slow and fast reasoning to improve answer generation efficiency → read the paper
Unleashing the Reasoning Potential...Critique Fine-Tuning
Improves reasoning by fine-tuning on teacher critiques of a single problem — a cheaper alternative to RL → read the paper
ARIA
Reduces reward variance by clustering similar actions into a low-dimensional intention space during RL → read the paper
Incentivizing Reasoning...Instruction Following
Improves complex instruction-following by training on decomposed instructions and rule-based rewards → read the paper
OThink-R1
Dynamically prunes unnecessary reasoning steps by switching between fast and slow thinking modes → read the paper

Domain-Specific Reasoning & Long Contexts

Reasoning Like an Economist
Fine-tunes LLMs on economic problems to improve multi-agent reasoning and game-theoretic thinking → read the paper
A Controllable Examination for Long-Context LLMs
Introduces LongBioBench: a synthetic benchmark for evaluating long-context reasoning with interpretable control → read the paper
SuperWriter
Uses planning, reflection, and tree search to improve the quality of long-form generation in LLMs → read the paper

Training & Optimization at Scale

Protocol Models
Enables highly compressed model-parallel training (up to 99%) by compressing both forward/backward activations → read the paper
AReaL
Introduces a fully asynchronous RL training system that boosts throughput without stalling on long outputs → read the paper
StreamBP
Reduces memory use during long-sequence training by decomposing backpropagation across sequence steps → read the paper
Taming LLMs by Scaling Learning Rates
Improves optimizer stability via grouped gradient scaling, helping both full and parameter-efficient fine-tuning → read the paper

Memory & Inference Efficiency

Diagonal Batching
Makes recurrent memory transformers parallelizable without retraining by reordering compute at runtime → read the paper
Inference-Time Hyper-Scaling with KV Cache Compression
Compresses the KV cache to generate longer outputs at same compute cost with minimal quality loss → read the paper
Unified Scaling Laws for Compressed Representations
Establishes performance scaling rules for sparse and quantized models, enabling direct capacity comparisons → read the paper

Multimodal & GUI Agents

GUI-Actor
Achieves coordinate-free visual grounding for GUI agents via a dedicated action token and grounding verifier → read the paper
Surfer-H Meets Holo1
Combines a lightweight web agent (Surfer-H) with a new open-weight VLM suite for efficient web navigation → read the paper

Embeddings & Representations

Qwen3 Embedding
Introduces new multilingual embedding and reranking models trained with self-generated data and model merging → read the paper
Aligning Latent Spaces with Flow Priors
Uses pre-trained flow models to align latent representations without needing ODE solvers or likelihood computation → read the paper
Large Language Models are Locally Linear Mappings
Shows that LLMs behave like locally linear systems, allowing layer-by-layer interpretation via the Jacobian → read the paper

Evaluation & Benchmarking

Establishing Trustworthy LLM Evaluation
Proposes analyzing and patching shortcut neurons to detect and mitigate benchmark contamination → read the paper
Evaluation is All You Need
Demonstrates that tiny variations in evaluation design can drastically inflate perceived model performance → read the paper
Datasheets Aren't Enough
Proposes DataRubrics — a rubric-based, automated dataset evaluation framework using LLMs as judges → read the paper

That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve

Leave a review!

Reply

or to participate.