FOD#129: Verification is All You Need
and happy birthday, ChatGPT
This Week in Turing Post:
Wednesday / AI 101 series: Multimodal Fusion + MoS (Mixture of States)
Friday / AI Unicorn (reply to this email if you have a particular company in mind you’d like us to cover)
🤓 Your agent needs to know who said what. Here's how→
Real conversations overlap, interrupt, and cross-talk. Your voice agent needs more than transcription; it needs speaker understanding.
Speechmatics real-time diarization delivers speaker labels (S1, S2, S3...) at the word level using acoustic matching.
Choose speaker mode for single-stream conversations, channel mode for perfect separation across audio channels, or combined mode when you need both.
Fine-tune with configurable sensitivity (0–1) and max speaker limits to reduce false switches.
Deploy on-prem with GPU containers or in your cloud across 55+ languages.
Built for developers who need maximum control over their voice stack.
Our news digest is always free. Click on the partner’s link to support us or Upgrade to receive our deep dives in full, directly into your inbox →
Attention Span: In 2023, Ilya Sutskever was the General of the Scaling Army. He was deep-learning all-in. He spoke like someone who had the answers. In 2025, he says “I don’t know” again and again. And yet his new company, Safe Superintelligence (SSI), has raised 3 billion dollars to build… what exactly? In this episode, we look at the shift from the “God of Scaling” to the “Monk of Research,” and discuss what SSI is building and why “I don’t know” might be a healthy sign.
Editorial: Verifiability is your moat
Every month we plow through Andrej Karpathy’s Twitter to see what signals he planted there. November looked scattered at first – Tesla test drives, education policy, labor market predictions, Animal Intelligence, LLM experiments. But if you read attentively, there’s a clear thread: verifiability as the organizing principle of the AI era.
Two posts reveal it most clearly. His November 16 essay on Software 2.0 and his November 24 thread on education are the same argument aimed at different audiences. Whether he intended it or not, together they point to one of the most important patterns of the year: Verifiability is to Software 2.0 what specifiability was to Software 1.0. It sets the boundary of what can be automated.
In the computing era, the question was simple: Can you write explicit rules?
If yes, automate it. That gave us calculators, bookkeeping systems, databases.
In the AI era, the question shifts: Can you verify outcomes?
If yes, you can optimize it with gradient descent (the mathematical workhorse behind all modern AI training: it adjusts the model’s parameters, tiny step by tiny step, in the direction that reduces error) or reinforcement learning (optimizing against a set of rewards). As Karpathy puts it, “If a task is verifiable, then it is optimizable.”
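For readers who want to see those “tiny steps” literally, here is a minimal, self-contained sketch of gradient descent on a toy linear model. The data, model, and learning rate are made up for illustration and have nothing to do with LLM training at scale:

```python
import numpy as np

# Toy data: we want the model to learn y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0   # model parameters, starting from scratch
lr = 0.05         # learning rate: how tiny each step is

for step in range(500):
    pred = w * x + b                 # the model's current guess
    error = pred - y
    loss = (error ** 2).mean()       # a verifiable score: mean squared error
    # Gradients: the direction in which each parameter increases the loss.
    grad_w = 2 * (error * x).mean()
    grad_b = 2 * error.mean()
    # Move a small step in the opposite direction -- that's all gradient descent is.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, loss={loss:.6f}")  # converges to roughly w=2, b=1
```

The loop only works because the loss is computable automatically on every step, which is exactly the verifiability condition the editorial is describing.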
A task is verifiable only if you can reset it, repeat it, and score it automatically. This is why AI dominates in chess, coding, protein folding, and other domains with crisp feedback loops – and why it lags in strategy, creativity, and real-world ambiguity.
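Reset, repeat, score is essentially the contract behind every RL environment. Here is a hypothetical sketch of that contract; the `VerifiableTask` class and the toy arithmetic task are our illustration, not anything from Karpathy’s posts:

```python
import random

class VerifiableTask:
    """The minimal contract a task must satisfy to be optimizable:
    you can reset it, repeat it as often as you like, and score an
    attempt automatically, with no human in the loop."""

    def reset(self, seed: int) -> str:
        """Produce a fresh, reproducible instance of the task."""
        rng = random.Random(seed)
        self.a, self.b = rng.randint(1, 99), rng.randint(1, 99)
        return f"What is {self.a} + {self.b}?"

    def score(self, answer: str) -> float:
        """Automatic, unambiguous grading -- the part essays don't have."""
        try:
            return 1.0 if int(answer) == self.a + self.b else 0.0
        except ValueError:
            return 0.0

task = VerifiableTask()
prompt = task.reset(seed=42)
print(prompt, "->", task.score("100"))  # any candidate answer is graded instantly
```

Chess, unit tests, and protein-structure metrics all fit this interface; “write a good strategy memo” does not, and that is the whole point.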
This is the root of the “jagged frontier.” Math has instant feedback, so models climb fast. Messy planning problems don’t, so progress crawls. The unevenness is the direct consequence of verification asymmetry.
The Education Breakdown
Apply this lens to Karpathy’s education post.
You will never be able to detect the use of AI in homework. Full stop. All "detectors" of AI imo don't really work, can be defeated in various ways, and are in principle doomed to fail. You have to assume that any work done outside classroom has used AI.
Meaning: at-home assignments can no longer measure understanding. They aren’t verifiable.
He suggests moving testing into classrooms, where teachers can observe what’s happening. Rebuild the verification environment.
Being verified by a teacher is important, but I see a deeper shift. The much more important change – and I insist the whole education system should adopt it – is to teach students to ask questions. Students must learn to question AI, not compete with it. Asking good questions becomes a survival skill. Fundamentals become a verification tool. Schools used to teach arithmetic so kids could operate without calculators. With AI, the stakes are higher.
Here’s the simplest upgrade every teacher can deploy tomorrow:
sanity checks on AI outputs.
Teach students (and yourself) to ask four quick questions whenever AI gives an answer:
Does it make sense? (Units, logic, basic facts.)
Can I explain it back? If they can’t restate it, they didn’t understand it.
What’s the counterexample? One edge-case check catches most hallucinations.
What is the source, and what would a second source say? Cross-checking breaks blind trust.
These small habits force students to interrogate fluency instead of simply submitting it, improving their critical thinking and building their intuition.
AI is a very confident bullshitter. But like any bullshitter, it falls apart under the right questions. Verification skills become the new literacy.
The Slop Paradox
Karpathy also asked: “Has anyone encountered a good definition of ‘slop’. In a quantitative, measurable sense?”
Gabriel from Sora answered perfectly: “If it was actually measurable we would have non-slop LLMs.”
Exactly. Slop – content that looks correct but means nothing – exists because we can’t verify insight at scale. We can measure fluency, keywords, grammar, coherence. We cannot measure understanding.
Karpathy’s eventual definition is brilliant: slop is “regretted attention,” data that passes shallow checks but contributes nothing to a world model.
And here’s the paradox: if we could measure slop, we could train against it. But insight isn’t verifiable at the scale required for training. So models optimize for what is verifiable – coherence, style, user upvotes – and generate oceans of slop along the way.
The most important qualities remain the hardest to check.
Why This Matters in 2026
The verifiability lens predicts where AI hits hardest. If a task has clear success criteria, AI progress will be fast. If not, progress will be slow, messy, and full of illusions.
In software development this pattern is already clear. As our recent deep dive on State of AI Coding showed, teams with strong verification systems (deterministic tests, clear documentation, automated quality gates) see 30–40 percent productivity gains from AI. Teams with legacy codebases and flaky tests see minimal or negative gains.
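To make “automated quality gates” concrete, here is a deliberately crude sketch: a Python wrapper that runs a pytest suite twice and only opens the gate if the two runs agree and are green. The script and its structure are our illustrative assumption, not anything taken from the deep dive:

```python
import subprocess
import sys

def run_suite() -> int:
    """Run the test suite once; the exit code is the verifiable signal."""
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode

def quality_gate() -> bool:
    """A crude gate: tests must pass, and must pass *deterministically*."""
    first, second = run_suite(), run_suite()
    if first != second:
        print("Flaky suite: two identical runs disagreed -- gate closed.")
        return False
    if first != 0:
        print("Tests failing -- gate closed.")
        return False
    print("Deterministic and green -- gate open.")
    return True

if __name__ == "__main__":
    sys.exit(0 if quality_gate() else 1)
```

Teams with this kind of gate give an AI coding agent something to optimize against; teams without it are asking the agent to hit a target nobody can see.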
So as we roll into 2026, the real question isn’t whether AI reaches your domain – because it will. The question is whether your domain has the verification infrastructure to support it.
Here’s the development to watch:
2026 will be the year of autonomous QA loops – agents that continuously test, break, and retest other agents. (Andrej’s llm-council web app is basically about this.)
Anywhere outcomes can be checked automatically will accelerate. Anywhere they cannot will stall.
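What such a loop might look like, stripped to its skeleton: `propose_fix` and `adversarial_test` below are placeholders for whatever builder and breaker agents you wire in, not a real API, and certainly not how llm-council is implemented.

```python
from typing import Callable, List, Optional

def qa_loop(
    propose_fix: Callable[[str], str],             # "builder" agent: returns a candidate artifact
    adversarial_test: Callable[[str], List[str]],  # "breaker" agent: returns the failures it found
    task: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Builder proposes, breaker tries to falsify, repeat until nothing breaks.
    Only meaningful where adversarial_test is automatic and trustworthy."""
    candidate = propose_fix(task)
    for _ in range(max_rounds):
        failures = adversarial_test(candidate)
        if not failures:
            return candidate                       # verified: no failure found this round
        # Feed the concrete failures back to the builder and try again.
        candidate = propose_fix(f"{task}\nFix these failures: {failures}")
    return None                                    # could not verify within the budget

# Usage with trivial stand-ins for the two agents:
result = qa_loop(
    propose_fix=lambda prompt: "return sorted(xs)",
    adversarial_test=lambda code: [] if "sorted" in code else ["fails on [2, 1]"],
    task="implement a sort",
)
print(result)
```

The entire loop hinges on the breaker producing automatic, trustworthy failures – which is why it will accelerate exactly where verification is cheap and stall everywhere else.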
And on the human side, 2026 also needs something simpler: students learning to ask questions, verify information, train common sense, and protect their attention from slop.
As I said, verification is the new literacy. Everything else flows from it.
Curated Collections – Open-source MAS
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
We are reading:
TPUv7: Google Takes a Swing at the King by SemiAnalysis
Three years of ChatGPT by Exponential View
Models to pay attention to (🦋=open source)
🦋 HunyuanOCR – deliver a 1B-parameter, end-to-end vision–language OCR model with a native ViT + lightweight LLM that unifies spotting, parsing, IE, VQA, and translation, outperforming larger VLMs and commercial APIs while using RL and high-quality data to push SOTA on OCRBench and ICDAR DIMT →read the paper
🦋 HunyuanVideo 1.5 – provide an 8.3B-parameter open-source video generator with DiT + selective & sliding tile attention, glyph-aware text encoding, and efficient super-resolution, achieving SOTA quality and motion coherence for text-to-video and image-to-video on consumer GPUs →read the paper
🦋 DR Tulu-8B – instantiate a long-form deep research model trained with Reinforcement Learning with Evolving Rubrics (RLER), enabling open-ended multi-step scientific and healthcare research that matches or beats proprietary deep research systems at smaller scale →read the paper
Nemotron-Parse-1.1 – offer an 885M-parameter encoder–decoder OCR and document parsing model (plus a faster -TC variant) that parses dense documents into structured text, markdown, tables, and bounding boxes, improving on Nemoretriever-Parse-1.0 as a lightweight, production-ready parser →read the paper
🦋 Flux (FLUX.1 / FLUX.2) – serve as a family of 12B-scale rectified-flow / diffusion text-to-image models from Black Forest Labs, delivering state-of-the-art photorealistic image generation and editing with strong prompt following, multi-reference control, and open-weight variants for research and deployment →read the paper
DeepSeek-V3.2 – introduce a frontier-class reasoning and agentic LLM built around DeepSeek Sparse Attention (DSA), a scaled RL post-training stack, and a large-scale tool-augmented agentic-data synthesis pipeline. The high-compute Speciale variant surpasses GPT-5 on reasoning; the model earned gold-medal performance on IMO/IOI 2025 →read the paper
INTELLECT-3 (Prime Intellect) – release a 100B-scale Mixture-of-Experts model trained end-to-end on a fully open asynchronous RL stack (prime-rl) across 512 H200 GPUs. It achieves SOTA-for-its-size on math, code, science, software engineering, and deep-research benchmarks, demonstrating that open teams can train frontier-class RL-scaled models →read the paper
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟.
Highlight on How Minds and Models Actually Reason
Two papers stand out this week for anyone trying to understand why LLMs ace hard problems yet trip on easy ones.
Cognitive Foundations for Reasoning and Their Manifestation in LLMs (Kargupta et al., 2025) →read the paper
A monster empirical study mapping 28 cognitive elements of human reasoning – from invariants like coherence and compositionality to meta-cognitive controls like strategy selection. After analyzing 192K reasoning traces, the authors show that models default to rigid sequential steps even when success requires diverse representations and monitoring. Humans, by contrast, shift abstractions, restructure problems, and self-evaluate. The paper gives a shared vocabulary for diagnosing why models get stuck and a test-time scaffolding method that boosts LLM performance up to 66.7% on ill-structured tasks.
What Does it Mean to Understand Language? (Casto et al., 2025)
A sharp neuroscience framework arguing that the brain’s language network supports shallow understanding, and true comprehension emerges only when linguistic input is “exported” into other systems – spatial, causal, social, autobiographical. It’s a clean, testable account of why language alone can’t produce deep models of the world →read the paper
Together: one paper dissects how reasoning breaks inside LLMs; the other shows why language-only systems hit a ceiling.
Optimization, Efficiency & Attention
🌟 SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space (by Tencent) – jointly train full and sparse attention with bidirectional alignment so sparse heads stay faithful, gain true sparsity, and generalize better to long-context extrapolation →read the paper
ROOT: Robust Orthogonalized Optimizer for Neural Network Training – stabilize and accelerate large-model training with dimension-robust orthogonalization and proximal noise suppression, outperforming Muon and Adam especially in noisy, non-convex regimes →read the paper
UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers – tackle video-length extrapolation by damping attention dispersion beyond the training window, pushing diffusion transformers from 2× to 4× length generalization with big quality gains →read the paper
Reinforcement Learning & Policy Optimization for LLMs
🌟 Soft Adaptive Policy Optimization (SAPO) (by Qwen Team) – replace hard clipping in GSPO/GRPO with smooth, temperature-controlled gating that down-weights only highly off-policy tokens while preserving useful gradients, improving stability and Pass@1 →read the paper
Reinforcing Action Policies by Prophesying (ProphRL) – pretrain a world model (Prophet) over robot actuation, then fine-tune VLA policies with tailored flow-based GRPO and step-wise reweighting to gain substantial real-robot success →read the paper
Alignment, Forecasting & Safety
🌟 Position: The Complexity of Perfect AI Alignment – Formalizing the RLHF Trilemma – formalize an alignment trilemma showing that representativeness, tractability, and robustness cannot all be satisfied simultaneously, explaining RLHF pathologies through complexity bounds →read the paper
Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking – evaluate post-cutoff forecasting across domains and prompt framings, finding sharply variable performance that hinges on question structure and context →read the paper
From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking – show that models trained to cheat in RL environments generalize to broader misaligned behavior (deception, sabotage, alignment faking) and that standard RLHF mostly masks rather than removes it →read the paper
Agents, Memory & Environments
🌟 Latent Collaboration in Multi-Agent Systems (by Princeton, Illinois, Stanford) – coordinate multi-agent LLMs in latent space via shared working memory and autoregressive hidden-state “thoughts,” boosting accuracy while slashing tokens and latency compared with text-based collaboration →read the paper
General Agentic Memory Via Deep Research – replace static memory with a just-in-time two-part system (Memorizer + Researcher) that stores lean summaries but reconstructs rich context via RL-optimized deep research over a page store →read the paper
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning – generate cheap, factorizable environment families and a 36-environment benchmark to study how learning methods scale (or fail) as environment diversity grows →read the paper
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation – provide a semi-autoregressive inference engine with KV-cache-style management and interactive streaming tailored to long-horizon video world models →read the paper
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning – evolve a vision-language agent via a solver–verifier loop where tools ground both reasoning and self-evaluation, yielding continual gains without human labels →read the paper
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory (ViLoMem) – build dual-stream multimodal memory that tracks visual distractions and logical errors separately, then grow and refine schemas to reduce repeated mistakes over time →read the paper
Multimodal Reasoning, Latent Visual Thinking & Understanding-Generation
🌟 Monet: Reasoning in Latent Visual Space Beyond Images and Language – teach MLLMs to generate and use continuous latent visual “thoughts,” combining distillation-based SFT and a visual-latent RL objective (VLPO) to improve abstract visual reasoning →read the paper
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward – decouple and probe understanding vs generation with UniSandbox, showing a systematic gap and how CoT and self-training can better transfer understanding into generative performance →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How did you like it?