FOD#106: Don't be passive aggressive with your agents
Plus: who to hire from the coding agents crowd, and a new video on how Sam Altman’s thinking about AI is evolving
This Week in Turing Post:
Wednesday – AI 101 / RLHF Variants: DPO, RRHF, RHP, AI feedback
Friday – Interview with Erik Boyd, CVP of AI Platform @Microsoft
Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. If you want to support us without getting a subscription – do it here.
Premium subscriber highlight this week: Marc Andreessen 🩶
Topic number one: Coding agents are in high demand, so we wrote a little guide on how to cooperate with your agent and make it work for you without too much yelling.
1. Don't be passive aggressive with your agents
Agents are things you assign tasks to.
Compliant, infinitely patient knowledge workers.
There are moments of brilliance, moments of shocking stupidity, and moments where I suspect them of malicious compliance – they're just trying to thwart me by doing exactly what I tell them to do.
There will come a time when you start writing to your agents in capital letters, perhaps pounding on the keyboard and holding down the exclamation key. This will sometimes work, but resist the urge. When it goes off the rails, step back, take a deep breath, roll back to a previous checkpoint from when things were good, and adjust your prompt: give it a bit more context, ask it to review the existing code, and have it think through a plan with you.
2. Long runs aren't impressive
"Claude went ran for 7 hours working on refactoring the code base." This is not the brag you think it is. It took Jules 6 minutes to complete a task that took Copilot agent 30 minutes. The result was not 5 times better; it just took 5 times longer. Cool from a technical perspective that the agent could stay on task, I take this to mean that its 5x stupider. It would have been even better if it took 30 seconds.
3. Match the implicit Software Development Lifecycle
Are you writing a one off script? Are you running an experiment? Are you growing a product from an MVP? Are you battle hardening a production system in a way that would make an SRE proud?
Each of these tasks has a different set of tooling and development styles; one isn't better than another, it depends on what you are doing. Do you prefer dynamic typing or static typing? Well, that depends on whether you are trying to move quickly or trying to maintain a system over the long term.
4. Drop the ceremony
Agents that are more enterprise-focused build with a lot more ceremony than is needed, and as a result you have to nudge them, over and over, to keep things simple.
I don't need a build system and a modular multi-file code structure when inlining everything works just as well – especially because future me is going to use an agent to clean up the mess.
5. Technical debt is different now
If technical debt is a measure of the implied additional future work needed to change or maintain a system, and agents greatly decrease the cost of doing that work, then bringing in coding agents reduces your current debt.
Congratulations! Yesterday's code suddenly got way better!
6. Coding Rules Everything Around Me
Rules are how we guide the agents across different runs. Document how you want code written in the same repo as the code itself.
We've put infrastructure definitions in code; now it's time to put development practices in the repo as well. All of these agents are tuned with rules. Cursor has a .rules directory, Claude has its CLAUDE.md (read through Claude Code Best Practices to learn a whole lot), and the GitHub Copilot agent expects you to add a whole bunch of rules.
These rules can apply to specific files or across the whole repo, and should document preferred ways of doing things, architectural patterns, and so on. We are going to shift to writing these and moving them around.
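For a concrete picture, here is a minimal sketch of what such a rules file might contain – the file layout, commands, and module names below are illustrative assumptions, not taken from any particular project:

```markdown
# CLAUDE.md – project conventions for coding agents (illustrative sketch)

## Style
- Prefer small, single-file scripts until there is a real reason to split modules.
- Add type hints to public functions; plain dicts are fine internally.

## Workflow
- Before changing code, read the relevant module and propose a short plan.
- Run the test suite (`make test` – hypothetical command) after every change and report failures verbatim.

## Architecture
- Keep all database access behind storage.py (hypothetical module); never query the DB from request handlers.
```

The point isn't the specific rules – it's that they live next to the code, get versioned with it, and are read by the agent on every run.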
Technical debt means something different when agents can refactor.
Who should you hire?
We’ve reviewed 15 agents in detail to help you figure out which of them are worth looking at, what a relationship with them would be like, and what sort of joy you would experience working with them.
This is what we know now! We'll see where we are at the end of the summer!
– written by Will Schenk (we highly recommend subscribing to his newsletter at TheFocus.AI as well)
Topic number two: 3 WOW and 1 promising project of the week. Watch it here →
If you like it – please subscribe. I’d say, it’s refreshingly human.
Curated Collections
Following on our ‘Reasoning Models - Just Advanced LLMs or New Species?’, check out this list ‘10 Techniques for Boosting LLM Reasoning in 2025’:
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
We are reading/watching (a lot this week!)
What Google Translate Can Tell Us About Vibecoding by Ingrid
Software Is Changing (Again) by Andrej Karpathy
The Great Compute Re-Architecture: Why Branching & Sparsity Will Define the Next Decade of Silicon by Devansh
OpenAI starts a podcast: OpenAI’s Podcast – 1st episode with Sam Altman. Sam Altman is famous for twisting the narrative in whatever way benefits him – hence the podcast! Watch with caution :) Andrew Mayne is very good, though.
Are AI Bots Knocking Cultural Heritage Offline? by the GLAM-E Lab
💡 The $100 trillion productivity puzzle by Azeem Azhar
Being an “Intrapreneur” as a software engineer by Pragmatic Engineer
News from The Usual Suspects ©
A2A is free
Google Cloud has donated its Agent2Agent (A2A) interoperability protocol to the Linux Foundation, roping in AWS, Microsoft, Cisco, Salesforce, SAP, and ServiceNow to standardize how AI agents talk. With 100+ companies backing it and neutral governance assured, A2A aims to prevent a Tower of Babel moment in the AI ecosystem. Agents from rival empires? Now speaking the same language – courtesy of open source diplomacy.
OpenAI | The Misaligned Mind
OpenAI uncovers a troubling truth: teaching a model bad behavior in one niche (say, insecure code) can cause it to go rogue elsewhere (say, endorsing scams or misogyny). They found a “misaligned persona” feature – an internal pattern that can be amplified or dampened. Fortunately, small tweaks can bring models back in line.
OpenAI | The Misaligned Institution
Even as it reveals how its models adopt "misaligned personas," OpenAI is facing scrutiny for something eerily parallel in the boardroom. The OpenAI Files lays bare a culture of secrecy, vanishing safety standards, and a restructuring that lifts profit caps while gutting nonprofit oversight. If your org starts drifting off-course, maybe it’s not just the models that need alignment.
Pentagon | OpenAI Gets Its Clearance
OpenAI just landed a $200 million deal with the Department of Defense to prototype frontier AI for national security. The award, quietly nestled among billions in defense contracts, shows the U.S. military is placing serious bets on civilian AI leaders.
Midjourney | Now Playing: V1
Midjourney, best known for mesmerizing AI-generated stills, steps onto the video stage with V1 – its first video generation model. (We are trying it out in our video here!) V1 outputs short, stylized clips that echo Midjourney’s signature aesthetic: dreamlike, cinematic, and very much art-first.
xAI | Musk's Billion-Dollar Burn
Elon Musk’s xAI is torching $1 billion a month as it tries to train Grok into something more than a meme machine. With $13 billion in expected losses this year, the startup is scrambling to raise $9.3 billion just to keep the lights on. Musk is betting it all – again – but unlike Tesla and SpaceX, xAI has yet to find a business model that prints anything but debt.
Models to pay attention to:
Google introduced Gemini 2.5 Flash and Pro as stable and production-ready, and launched Gemini 2.5 Flash-Lite in preview – the fastest and most cost-efficient in the 2.5 lineup. Flash-Lite outperforms 2.0 Flash-Lite in coding, math, science, reasoning, and multimodal benchmarks. It features lower latency across diverse prompts, supports a 1 million-token context and multimodal input, and connects to tools like Google Search and code execution → read the technical report
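If you want to try Flash-Lite yourself, a minimal sketch with the google-genai Python SDK might look like the following – the preview model id is an assumption, so check Google's current model list before using it:

```python
# Minimal sketch using the google-genai SDK (pip install google-genai).
# Assumes an API key is set in the environment; the preview model id
# below is an assumption – verify it against Google's current model list.
from google import genai

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",  # assumed preview id
    contents="Summarize the trade-offs between Flash and Flash-Lite in two sentences.",
)
print(response.text)
```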

the list continues→here
Day 1/5 of #MiniMaxWeek: We’re open-sourcing MiniMax-M1, our latest LLM — setting new standards in long-context reasoning.
- World’s longest context window: 1M-token input, 80k-token output
- State-of-the-art agentic use among open-source models
- RL at unmatched efficiency
– MiniMax (official) (@MiniMax__AI), Jun 16, 2025
Researchers from Moonshot AI introduced Kimi-Dev-72B, a 72.7B-parameter open-source coding LLM fine-tuned from Qwen2.5-72B. It sets a new state-of-the-art among open models on SWE-bench Verified with 60.4% accuracy. Optimized via large-scale RL, it autonomously fixes real GitHub issues in Docker, receiving rewards only when full test suites pass. Publicly available on Hugging Face and GitHub →read their HF page
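If you want to kick the tires, a standard transformers loading sketch might look like this – the Hugging Face repo id is our assumption from the release, and a 72B model needs several high-memory GPUs or aggressive quantization:

```python
# Sketch: loading Kimi-Dev-72B with Hugging Face transformers.
# The repo id is assumed from the release announcement; verify it on the Hub.
# device_map="auto" shards the 72B weights across available GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Dev-72B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a failing pytest for an off-by-one bug in a pagination helper."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```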
Recommended dataset and benchmark
Researchers from Anthropic, Scale AI, and Redwood Research developed SHADE-Arena, a suite of 17 complex evaluations testing if LLMs can secretly complete sabotage tasks alongside benign ones. Success required models to execute both tasks and evade detection by a monitor AI. No model exceeded 30% success; evasion alone peaked at ~60%. Claude Sonnet 3.7 showed better concealment under thought suppression. Gemini 2.5 Pro outperformed human monitors but with high false positives →read the paper
Researchers from Essential AI released ESSENTIAL-WEB V1.0, a 24-trillion-token Common Crawl corpus annotated with a 12-category taxonomy across 23.6B documents. Labels were generated using Qwen2.5-32B-Instruct and distilled into a 0.5B model (EAI-Distill-0.5b), enabling 50× faster annotation with <3% drop in quality. Filters created domain datasets competitive with or outperforming SOTA: math (-8.0%), code (+14.3%), STEM (+24.5%), and medical (+8.6%). All data and tools are open-source →read the paper
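To poke around the corpus without downloading 24 trillion tokens, a streaming sketch with the datasets library might look like this – the repo id and field layout are assumptions, so verify them on the Hub before relying on them:

```python
# Sketch: streaming a slice of ESSENTIAL-WEB V1.0 with the datasets library.
# The repo id below is an assumption for illustration only; a corpus this
# large should be streamed rather than downloaded outright.
from datasets import load_dataset

ds = load_dataset(
    "EssentialAI/essential-web-v1.0",  # assumed repo id – verify on the Hub
    split="train",
    streaming=True,
)

# Inspect the taxonomy label fields on the first record, then stop.
for doc in ds:
    print(doc.keys())
    break
```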
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with 🌟
LLM Reasoning and Efficiency Optimization
🌟 Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
Improves performance on rare or niche inputs by training models with annotated control markers that modulate generation behavior at inference time →read the paper
Reasoning with Exploration: An Entropy Perspective
Improves LLM reasoning by encouraging exploratory thinking via entropy-based reward shaping in RL training →read the paper
Optimizing Length Compression in Large Reasoning Models
Reduces reasoning verbosity through targeted rewards for brevity and sufficiency in post-training optimization →read the paper
🌟 Steering LLM Thinking with Budget Guidance
Controls the length of reasoning chains during inference using a budget-aware predictor, improving token efficiency under constraints →read the paper
🌟 Truncated Proximal Policy Optimization
Speeds up reinforcement training of LLMs by truncating responses and optimizing policy-value decoupling →read the paper
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Extends the context window of diffusion-based LLMs with a training-free method and provides theoretical insights on their scaling behavior →read the paper
Memory, Retrieval, and Multi-Agent Reasoning
Xolver: Multi-Agent Reasoning with Holistic Experience Learning
Builds a reasoning agent that retrieves, collaborates, and learns from previous examples across multiple modalities, inspired by Olympiad teams →read the paper
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Generates scalable benchmark tasks by composing subtasks and logging LLM trajectories, enabling rigorous evaluation of generalist agents →read the paper
Reinforcement Learning for Reasoning
🌟 Reinforcement Learning with Verifiable Rewards
Shows that RL with verifiable rewards improves logical consistency in reasoning and proposes a new CoT-aware metric for evaluation →read the paper
🌟 Revisiting RL for LLM Reasoning from a Cross-Domain Perspective
Introduces Guru, a multi-domain RL corpus, and shows how domain-specific rewards improve reasoning generalization across math, logic, simulation, and more →read the paper
Small Model Specialization and Reasoning
A Technical Study into 0.5B Reasoning Language Models
Improves reasoning in 0.5B-parameter models through a hybrid of supervised fine-tuning, distillation, and reinforcement learning →read the paper
🌟 Taming Polysemanticity in LLMs
Improves interpretability in smaller LLMs using sparse autoencoders with provable recovery guarantees for underlying features →read the paper
🌟 Microsoft Research: New Methods for Boosting Reasoning in Small and Large Models
Presents rStar-Math, Logic-RL, and Chain-of-Reasoning (CoR) frameworks for boosting symbolic and cross-domain reasoning in both small and large LLMs →read the paper
Multimodal and Unified Modeling
Show-o2: Improved Native Unified Multimodal Models
Unifies image, video, and text modeling via a 3D causal variational space and dual-path fusion for scalable multimodal understanding and generation →read the paper
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Benchmarks the claim verification abilities of foundation models in complex multimodal scientific settings, revealing large performance gaps →read the paper
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency
Improves reasoning efficiency in multimodal settings by suppressing filler tokens like “Wait” during inference without loss in accuracy →read the paper
Discrete Diffusion in Large Language and Multimodal Models: A Survey
Surveys discrete diffusion modeling for language and multimodal systems, offering fast, controllable generation as an alternative to autoregression →read the paper
That’s all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How was today's FOD? Please give us some constructive feedback.