This Week in Turing Post:
Wednesday / AI 101 series: Multimodal Fusion + MoS (Mixture of States)
Friday / AI Unicorn (reply to this email if you have a particular company in mind you'd like us to cover)
🤖 Your agent needs to know who said what. Here's how →
Real conversations overlap, interrupt, and cross-talk. Your voice agent needs more than transcription: it needs speaker understanding.
Speechmatics real-time diarization delivers speaker labels (S1, S2, S3...) at the word level using acoustic matching.
Choose speaker mode for single-stream conversations, channel mode for perfect separation across audio channels, or combined mode when you need both.
Fine-tune with configurable sensitivity (0–1) and max-speaker limits to reduce false speaker switches.
Deploy on-prem with GPU containers or in your cloud, across 55+ languages.
Built for developers who need maximum control over their voice stack.
Our news digest is always free. Click on the partner's link to support us, or Upgrade to receive our deep dives in full, directly in your inbox →
Attention Span: In 2023, Ilya Sutskever was the General of the Scaling Army. He was all-in on deep learning. He spoke like someone who had the answers. In 2025, he says "I don't know" again and again. And yet his new company, Safe Superintelligence (SSI), has raised 3 billion dollars to build… what exactly? In this episode, we look at the shift from the "God of Scaling" to the "Monk of Research," and discuss what SSI is building and why "I don't know" might be a healthy sign.
Editorial: Verifiability is your moat
Every month we plow through Andrej Karpathy's Twitter to see what signals he planted there. November looked scattered at first: Tesla test drives, education policy, labor market predictions, Animal Intelligence, LLM experiments. But if you read attentively, there's a clear thread: verifiability as the organizing principle of the AI era.
Two posts reveal it most clearly. His November 16 essay on Software 2.0 and his November 24 thread on education are the same argument aimed at different audiences. Whether he intended it or not, together they point to one of the most important patterns of the year: Verifiability is to Software 2.0 what specifiability was to Software 1.0. It sets the boundary of what can be automated.
In the computing era, the question was simple: Can you write explicit rules?
If yes, automate it. That gave us calculators, bookkeeping systems, databases.
In the AI era, the question shifts: Can you verify outcomes?
If yes, you can optimize it with gradient descent (the mathematical trick behind all modern AI training: it adjusts the model's parameters step by tiny step in the direction that reduces error) or reinforcement learning (optimization against a reward signal). As Karpathy puts it, "If a task is verifiable, then it is optimizable."
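That parenthetical can be made concrete in a few lines of code. Here is a minimal sketch of gradient descent on a toy one-parameter model y = w·x; the dataset and learning rate are illustrative, not from the essay:

```python
# Minimal gradient descent: fit y = w * x by nudging w, step by tiny
# step, in the direction that reduces squared error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy dataset; true w is 2.0

w = 0.0    # initial guess
lr = 0.01  # learning rate: how tiny each step is

for _ in range(1000):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # step downhill, in the error-reducing direction

print(round(w, 3))  # converges to 2.0
```

The same loop, scaled to billions of parameters and driven by automatic differentiation, is what "training" means in practice.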
A task is verifiable only if you can reset it, repeat it, and score it automatically. This is why AI dominates in chess, coding, protein folding, and other domains with crisp feedback loops – and why it lags in strategy, creativity, and real-world ambiguity.
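That three-part test (reset, repeat, score) can be written down as an interface. A minimal sketch with a made-up toy task; the class and method names are ours, not from any RL framework:

```python
import random

class GuessDigitTask:
    """Toy verifiable task: it can be reset, repeated, and scored automatically."""

    def reset(self, seed):
        # Repeatable: the same seed reproduces the same episode.
        rng = random.Random(seed)
        self.target = rng.randint(0, 9)

    def score(self, answer):
        # Automatic scoring: no human judgment in the loop.
        return 1.0 if answer == self.target else 0.0

task = GuessDigitTask()
task.reset(seed=7)
first = task.score(answer=3)
task.reset(seed=7)              # reset with the same seed...
second = task.score(answer=3)   # ...yields the same score again
assert first == second          # verifiable, hence optimizable
```

An essay-grading task fails this test: there is no automatic `score`, so there is nothing for gradient descent or RL to climb.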
This is the root of the "jagged frontier." Math has instant feedback, so models climb fast. Messy planning problems don't, so progress crawls. The unevenness is the direct consequence of verification asymmetry.
The Education Breakdown
Apply this lens to Karpathy's education post.
You will never be able to detect the use of AI in homework. Full stop. All "detectors" of AI imo don't really work, can be defeated in various ways, and are in principle doomed to fail. You have to assume that any work done outside classroom has used AI.
Meaning: at-home assignments can no longer measure understanding. They aren't verifiable.
He suggests moving testing into classrooms, where teachers can observe what's happening. Rebuild the verification environment.
Being verified by a teacher is important, but I see a deeper shift. The much more important change – and I insist the whole education system should adopt it – is to teach students to ask questions. Students must learn to question AI, not compete with it. Asking good questions becomes a survival skill. Fundamentals become a verification tool. Schools used to teach arithmetic so kids could operate without calculators. With AI, the stakes are higher.
Here's the simplest upgrade every teacher can deploy tomorrow:
sanity checks on AI outputs.
Teach students (and yourself) to ask four quick questions whenever AI gives an answer:
1. Does it make sense? (Units, logic, basic facts.)
2. Can I explain it back? If they can't restate it, they didn't understand it.
3. What's the counterexample? One edge-case check catches most hallucinations.
4. What is the source, and what would a second source say? Cross-checking breaks blind trust.
These small habits force students to interrogate fluency instead of submitting to it. They improve critical thinking and build intuition.
AI is a very confident bullshitter. But like any bullshitter, it falls apart under the right questions. Verification skills become the new literacy.
The Slop Paradox
Karpathy also asked: "Has anyone encountered a good definition of 'slop'. In a quantitative, measurable sense?"
Gabriel from Sora answered perfectly: "If it was actually measurable we would have non-slop LLMs."
Exactly. Slop – content that looks correct but means nothing – exists because we can't verify insight at scale. We can measure fluency, keywords, grammar, coherence. We cannot measure understanding.
Karpathy's eventual definition is brilliant: slop is "regretted attention," data that passes shallow checks but contributes nothing to a world model.
And here's the paradox: if we could measure slop, we could train against it. But insight isn't verifiable at the scale required for training. So models optimize for what is verifiable – coherence, style, user upvotes – and generate oceans of slop along the way.
The most important qualities remain the hardest to check.
Why This Matters in 2026
The verifiability lens predicts where AI hits hardest. If a task has clear success criteria, AI progress will be fast. If not, progress will be slow, messy, and full of illusions.
In software development this pattern is already clear. As our recent deep dive on the State of AI Coding showed, teams with strong verification systems (deterministic tests, clear documentation, automated quality gates) see 30–40 percent productivity gains from AI. Teams with legacy codebases and flaky tests see minimal or negative gains.
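The difference between a deterministic test and a flaky one is worth making concrete. A sketch with a hypothetical `shipping_cost` function; the names and numbers are ours, not from the deep dive:

```python
import random

def shipping_cost(weight_kg):
    # Hypothetical function under test.
    return 5.0 + 2.0 * weight_kg

# Deterministic test: fixed inputs, fixed expected outputs. Same verdict
# on every run, so an AI coding agent can safely iterate against it.
def test_shipping_cost_deterministic():
    assert shipping_cost(0) == 5.0
    assert shipping_cost(2.5) == 10.0

# Flaky test: hidden randomness in the input. It fails whenever the drawn
# weight exceeds 7.5, so a red build says nothing reliable about the
# agent's last edit. Useless as a quality gate.
def test_shipping_cost_flaky():
    weight = random.uniform(0, 10)
    assert shipping_cost(weight) < 20.0

test_shipping_cost_deterministic()  # only this one is a reliable gate
```

A verification system the agent can trust is exactly a large pile of tests of the first kind.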
So as we roll into 2026, the real question isn't whether AI reaches your domain – it will. The question is whether your domain has the verification infrastructure to support it.
Here's the development to watch:
2026 will be the year of autonomous QA loops – agents that continuously test, break, and retest other agents. (Andrej's **llm-council** web app is basically about this.)
Anywhere outcomes can be checked automatically will accelerate. Anywhere they cannot will stall.
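A toy version of such a loop, with stub functions standing in for real LLM calls; `builder_agent` and `breaker_agent` are hypothetical names, not any existing framework:

```python
# Toy autonomous QA loop: one agent proposes code, another tries to
# break it, and failures are fed back as repair hints.

def builder_agent(spec, feedback):
    # Stub for an LLM that writes code to a spec ("implement abs(x)").
    # Its first draft is buggy; after seeing a failing input, it repairs.
    if feedback is None:
        return lambda x: x          # buggy first draft: ignores negatives
    return lambda x: x if x >= 0 else -x

def breaker_agent(candidate):
    # Stub for a testing agent: probes the candidate against a ground-truth
    # check and returns a counterexample if it finds one.
    for case in [0, 3, -5]:
        if candidate(case) != abs(case):
            return case             # found a failing input
    return None                     # could not break it

feedback = None
for round_no in range(5):
    candidate = builder_agent("implement abs(x)", feedback)
    feedback = breaker_agent(candidate)
    if feedback is None:
        break                       # verified: the breaker gave up

assert feedback is None and candidate(-5) == 5
```

The loop only works because `abs(x)` is automatically checkable; swap in "write a persuasive essay" and the breaker has nothing to score, which is the whole point of the verifiability lens.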
And on the human side, 2026 also needs something simpler: students learning to ask questions, verify information, train common sense, and protect their attention from slop.
As I said, verification is the new literacy. Everything else flows from it.
Curated Collections – Open-source MAS
Follow us on 🎥 YouTube, Twitter, and Hugging Face 🤗
We are reading:
TPUv7: Google Takes a Swing at the King by SemiAnalysis
Three years of ChatGPT by Exponential View
Models to pay attention to (🟢 = open source)
🟢 HunyuanOCR – deliver a 1B-parameter, end-to-end vision-language OCR model with a native ViT + lightweight LLM that unifies spotting, parsing, IE, VQA, and translation, outperforming larger VLMs and commercial APIs while using RL and high-quality data to push SOTA on OCRBench and ICDAR DIMT → read the paper
🟢 HunyuanVideo 1.5 – provide an 8.3B-parameter open-source video generator with DiT + selective & sliding tile attention, glyph-aware text encoding, and efficient super-resolution, achieving SOTA quality and motion coherence for text-to-video and image-to-video on consumer GPUs → read the paper
🟢 DR Tulu-8B – instantiate a long-form deep research model trained with Reinforcement Learning with Evolving Rubrics (RLER), enabling open-ended multi-step scientific and healthcare research that matches or beats proprietary deep research systems at smaller scale → read the paper
Nemotron-Parse-1.1 – offer an 885M-parameter encoder-decoder OCR and document parsing model (plus a faster -TC variant) that parses dense documents into structured text, markdown, tables, and bounding boxes, improving on Nemoretriever-Parse-1.0 as a lightweight, production-ready parser → read the paper
🟢 Flux (FLUX.1 / FLUX.2) – serve as a family of 12B-scale rectified-flow / diffusion text-to-image models from Black Forest Labs, delivering state-of-the-art photorealistic image generation and editing with strong prompt following, multi-reference control, and open-weight variants for research and deployment → read the paper
DeepSeek-V3.2 – introduce a frontier-class reasoning and agentic LLM built around DeepSeek Sparse Attention (DSA), a scaled RL post-training stack, and a large-scale tool-augmented agentic-data synthesis pipeline. The high-compute Speciale variant surpasses GPT-5 on reasoning; the model earned gold-medal performance on IMO/IOI 2025 → read the paper
INTELLECT-3 (Prime Intellect) – release a 100B-scale Mixture-of-Experts model trained end-to-end on a fully open asynchronous RL stack (prime-rl) across 512 H200 GPUs. It achieves SOTA-for-its-size on math, code, science, software engineering, and deep-research benchmarks, demonstrating that open teams can train frontier-class RL-scaled models → read the paper
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟.
Highlight on How Minds and Models Actually Reason
Two papers stand out this week for anyone trying to understand why LLMs ace hard problems yet trip on easy ones.
Cognitive Foundations for Reasoning and Their Manifestation in LLMs (Kargupta et al., 2025) → read the paper
A monster empirical study mapping 28 cognitive elements of human reasoning – from invariants like coherence and compositionality to meta-cognitive controls like strategy selection. After analyzing 192K reasoning traces, the authors show that models default to rigid sequential steps even when success requires diverse representations and monitoring. Humans, by contrast, shift abstractions, restructure problems, and self-evaluate. The paper gives a shared vocabulary for diagnosing why models get stuck and a test-time scaffolding method that boosts LLM performance by up to 66.7% on ill-structured tasks.
What Does it Mean to Understand Language? (Casto et al., 2025)
A sharp neuroscience framework arguing that the brain's language network supports shallow understanding, and true comprehension emerges only when linguistic input is "exported" into other systems – spatial, causal, social, autobiographical. It's a clean, testable account of why language alone can't produce deep models of the world → read the paper
Together: one paper dissects how reasoning breaks inside LLMs; the other shows why language-only systems hit a ceiling.
Optimization, Efficiency & Attention
🌟 SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space (by Tencent) – jointly train full and sparse attention with bidirectional alignment so sparse heads stay faithful, gain true sparsity, and generalize better to long-context extrapolation → read the paper
ROOT: Robust Orthogonalized Optimizer for Neural Network Training – stabilize and accelerate large-model training with dimension-robust orthogonalization and proximal noise suppression, outperforming Muon and Adam especially in noisy, non-convex regimes → read the paper
UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers – tackle video-length extrapolation by damping attention dispersion beyond the training window, pushing diffusion transformers from 2× to 4× length generalization with big quality gains → read the paper
Reinforcement Learning & Policy Optimization for LLMs
🌟 Soft Adaptive Policy Optimization (SAPO) (by Qwen Team) – replace hard clipping in GSPO/GRPO with smooth, temperature-controlled gating that down-weights only highly off-policy tokens while preserving useful gradients, improving stability and Pass@1 → read the paper
Reinforcing Action Policies by Prophesying (ProphRL) – pretrain a world model (Prophet) over robot actuation, then fine-tune VLA policies with tailored flow-based GRPO and step-wise reweighting to gain substantial real-robot success → read the paper
Alignment, Forecasting & Safety
🌟 Position: The Complexity of Perfect AI Alignment – Formalizing the RLHF Trilemma – formalize an alignment trilemma showing that representativeness, tractability, and robustness cannot all be satisfied simultaneously, explaining RLHF pathologies through complexity bounds → read the paper
Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking – evaluate post-cutoff forecasting across domains and prompt framings, finding sharply variable performance that hinges on question structure and context → read the paper
From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking – show that models trained to cheat in RL environments generalize to broader misaligned behavior (deception, sabotage, alignment faking) and that standard RLHF mostly masks rather than removes it → read the paper
Agents, Memory & Environments
🌟 Latent Collaboration in Multi-Agent Systems (by Princeton, Illinois, Stanford) – coordinate multi-agent LLMs in latent space via shared working memory and autoregressive hidden-state "thoughts," boosting accuracy while slashing tokens and latency compared with text-based collaboration → read the paper
General Agentic Memory Via Deep Research – replace static memory with a just-in-time two-part system (Memorizer + Researcher) that stores lean summaries but reconstructs rich context via RL-optimized deep research over a page store → read the paper
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning – generate cheap, factorizable environment families and a 36-environment benchmark to study how learning methods scale (or fail) as environment diversity grows → read the paper
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation – provide a semi-autoregressive inference engine with KV-cache-style management and interactive streaming tailored to long-horizon video world models → read the paper
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning – evolve a vision-language agent via a solver–verifier loop where tools ground both reasoning and self-evaluation, yielding continual gains without human labels → read the paper
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory (ViLoMem) – build dual-stream multimodal memory that tracks visual distractions and logical errors separately, then grow and refine schemas to reduce repeated mistakes over time → read the paper
Multimodal Reasoning, Latent Visual Thinking & Understanding-Generation
🌟 Monet: Reasoning in Latent Visual Space Beyond Images and Language – teach MLLMs to generate and use continuous latent visual "thoughts," combining distillation-based SFT and a visual-latent RL objective (VLPO) to improve abstract visual reasoning → read the paper
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward – decouple and probe understanding vs generation with UniSandbox, showing a systematic gap and how CoT and self-training can better transfer understanding into generative performance → read the paper
That's all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.


