This Week in Turing Post:
Wednesday, AI 101, Technique: What are Chain-of-Agents and Chain-of-RAG?
Friday, Agentic Workflow: we explore Reasoning
Turing Post is a reader-supported publication. To be the first to receive new posts, consider becoming a paid subscriber. Your support helps us keep news digests free →
The main topic: DeepResearch from OpenAI makes me rethink my work routine
Turing Post is a lean operation, run full-time by just two people. While I work with a few trusted contributors, the heavy lifting is done by Alyona and me. I wasn't actively looking to add someone new to the team, yet here we are, and I couldn't be happier about it.
Meet our newest hire: DeepResearch ($200/month).
Despite the ongoing controversies surrounding OpenAI, their latest release, DeepResearch (seriously, you guys need to up your naming game), is a game-changer. It's not replacing anyone at Turing Post, but it has significantly cut down the hours we spend on research-heavy tasks. What used to take ages now gets done in a fraction of the time, to the point that I'm now rethinking my well-established workflow.
What do people say on the web?
I didn't have to give up a Perplexity subscription; I never had one. Instead, I've been using a combination of cross-prompting Gemini Deep Research and ChatGPT o1 or o3-mini, but DeepResearch might simplify that routine. It shifts the workflow from active searching to supervising an AI-generated research process. It's a different level of actual help: like having a virtual research assistant to whom you give a prompt, step away while it works, and return to a finished analysis.
My summary is that DeepResearch is amazing as a well-organized starting point. What I also liked is that it feels like a promise of a working agent. DeepResearch will ask clarifying questions if your prompt is ambiguous, then proceed step by step. The result is a more robust and context-aware research process compared to a single-turn question-answer system. Or, if you don't know the answers to its questions, just say, "Do as you see fit, knowing what I'm working on." It does a pretty good job of figuring things out on its own. Pretty dope.
It also understands time frames: if you need fresh materials from February 3 to February 10, it will search specifically within that range.
Andrew Maynard, a professor and author, wrote that after using DeepResearch, he's "beginning to wonder when research and scholarship that isn't augmented by AI will be seen as an anachronism." "Using it feels like giving a team of some of the best minds around PhD-level questions, and having them come back with PhD-level responses, all within a few hours." (from Does OpenAI's Deep Research signal the end of human-only scholarship?)
This means DeepResearch can identify cross-domain links or examples that might otherwise be overlooked, offering fresh perspectives. In professional settings, this can support more well-rounded decision-making: for example, a product manager can quickly gather insights from scientific research, market data, and consumer opinions in one place, rather than relying on multiple teams or lengthy research processes. It makes you multifaceted!
Ethan Mollick was impressed by the depth, but he and others, like economist Kevin Bryan, pointed out current limitations in data access: notably, that access to better search and paywalled content would make such agents far more useful.
How does DeepResearch work?
OpenAI's breakthrough with DeepResearch lies in its ability to take coherent actions throughout its reasoning process. Unlike traditional AI agents that struggle with long-term focus, this model maintains progress without getting distracted. And unlike Gemini's approach, which searches for sources first and then compiles a report, OpenAI's version dynamically searches and takes actions as needed. This makes it more adaptable and efficient. Under the hood, it's powered by the o3 model with reinforcement learning, allowing it to act within its reasoning process. The quality of the results depends on the chosen research target.
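To make the "acts within its reasoning" idea concrete, here is a minimal Python sketch of an interleaved reason-search-act loop. OpenAI hasn't published DeepResearch's internals, so the function names, the stop condition, and the stubbed model and search tool are all assumptions for illustration:

```python
# A deliberately tiny reason-search-act loop. Everything here is an
# assumption for illustration, not DeepResearch's actual implementation.

def model_step(context: str) -> dict:
    """Stand-in for one reasoning step of the underlying model.

    A real system would call the model here; this stub searches once
    and then finishes, so the sketch runs end to end.
    """
    if "RESULTS:" not in context:
        return {"action": "search", "query": "DeepResearch technical details"}
    return {"action": "finish", "report": "Synthesized findings based on:\n" + context}

def web_search(query: str) -> str:
    """Stand-in for a search tool; a real agent would call a search API."""
    return f"RESULTS: placeholder snippets for '{query}'"

def research(task: str, max_steps: int = 10) -> str:
    """Interleave reasoning with tool use until the model decides to stop."""
    context = f"TASK: {task}"
    for _ in range(max_steps):
        step = model_step(context)
        if step["action"] == "search":
            # The model gathers evidence mid-reasoning, as needed,
            # instead of collecting all sources up front.
            context += "\n" + web_search(step["query"])
        else:
            return step["report"]
    return "Step budget exhausted; returning partial context:\n" + context

print(research("How does DeepResearch from OpenAI work?"))
```

The point of the loop structure is the contrast drawn above: search happens inside the reasoning process, wherever the model decides it needs evidence, rather than as a separate up-front collection phase.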
Here is a lengthy research report produced by DeepResearch following the prompt: "I need technical details about how DeepResearch from OpenAI works. Give me model architecture, system architecture, and deeper insights into proprietary aspects":
Not without limitations
Occasional Inaccuracies and Hallucinations: Like every LLM, it can misstate facts, confuse similar terms, or generate incorrect information. ALWAYS verify.
Difficulty Assessing Source Credibility: Doesn't always distinguish authoritative sources from unreliable ones, sometimes including outdated or low-quality information.
Outdated or Stale Information: May cite old data, especially in fast-changing fields, unless explicitly prompted for the latest updates.
Inconsistent Instruction Adherence: Sometimes includes topics it was told to exclude or doesn't fully follow user guidance.
Potentially Incomplete in Niche Depth: Might miss important details or references that an expert would consider essential.
Overwhelming Length and Irrelevant Details: Tends to provide exhaustive reports, sometimes including excessive or tangential information.
High Cost and Limited Access: Available only to ChatGPT Pro users at $200/month, making it inaccessible to many casual users.
Opaque "Black Box" Reasoning: Users don't see how it selects or evaluates sources, making its conclusions harder to fully trust without verification.
But you know the saying: this is the worst it will ever be.
Best Practices for Using DeepResearch Efficiently
Craft a Detailed, Focused Prompt: Be clear and specific in your query to avoid irrelevant results. Use ChatGPT to refine your prompt before submitting it. (A sketch of assembling such a prompt follows this list.)
Provide Context or Examples: Giving background information or specifying the desired answer format helps guide the AI's research. A lot. Here is an example from Ben Thompson, author of Stratechery.

Engage with Clarification Questions: Answer any follow-up questions from DeepResearch to fine-tune its direction before it starts searching. Its questions are helpful on their own, prompting you to think things through and clarify what you really want.
Specify Scope and Bias Preferences: Direct the AI on preferred sources, date ranges, or perspectives (e.g., "focus on peer-reviewed studies" or "exclude politically biased sources").
Verify and Refine the Output: Treat the AI's report as a first draft, fact-check key claims, and run follow-up queries to clarify or correct missing details.
Request Summaries or Actionable Insights: After a long report, ask for a concise summary, key takeaways, or recommendations to make the information more digestible.
Manage Time: Plan around DeepResearch's processing time (5-30 minutes) and work on other tasks while it's "thinking".
Maintain a "Job Description" for Your New Employee: Create a list of tasks that DeepResearch can assist with or automate right now. Keep track of how others use it. Try incorporating it into your routine and adjust as needed.
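Pulling these practices together, here is a small Python sketch of how you might assemble a prompt that bakes in context, scope, date range, source preferences, and output format. The helper and its structure are my own convention for illustration, not an official DeepResearch template:

```python
# A hypothetical helper that assembles a research prompt from the
# ingredients above: task, context, date range, source preferences,
# and output format.

def build_research_prompt(task, context, date_range=None,
                          prefer=None, exclude=None, output_format=None):
    parts = [f"Task: {task}", f"Context: {context}"]
    if date_range:
        start, end = date_range
        parts.append(f"Only use materials published between {start} and {end}.")
    if prefer:
        parts.append("Prefer these sources: " + ", ".join(prefer) + ".")
    if exclude:
        parts.append("Exclude: " + ", ".join(exclude) + ".")
    if output_format:
        parts.append(f"Format the answer as {output_format}.")
    # Invite clarifying questions up front (see "Engage with
    # Clarification Questions" above).
    parts.append("Ask clarifying questions before you start if anything is ambiguous.")
    return "\n".join(parts)

print(build_research_prompt(
    task="Survey recent open-source LLM agent frameworks",
    context="For an AI newsletter aimed at practitioners",
    date_range=("February 3", "February 10"),
    prefer=["peer-reviewed studies", "official project docs"],
    exclude=["politically biased sources"],
    output_format="a ranked list with one-paragraph summaries and links",
))
```

Even if you never script it, the checklist embodied in the helper (task, context, dates, sources, format, clarifications) is a decent template to run through before hitting submit.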
Have you tried it? What are your recommendations? Will it change your work routine?
From Our Partners: Everything you need to know about AI agents
Galileo just dropped a 100-page ebook on AI agents, so you can create powerful, reliable agents like an expert:
Match the right agentic framework for your use case
Evaluate and improve performance
Identify failure points and production issues
Curated Collections
Turing Post is now on 🤗 Hugging Face! You can read the rest of this article there (it's free!) →
We are reading/watching
Three Observations from Sam Altman. I don't usually analyze texts in this section. But here are a few highlights you can't miss from Altman's text:
"AGI is a weakly defined term, but generally speaking we mean it to be a system that can tackle increasingly complex problems, at human level, in many fields."
"AGI is just another tool in this ever-taller scaffolding of human progress we are building together."
Observations:
1. "The intelligence of an AI model roughly equals the log of the resources used to train and run it."
2. "The cost to use a given level of AI falls about 10x every 12 months, and lower prices lead to much more use."
3. "The socioeconomic value of linearly increasing intelligence is super-exponential in nature."
(A toy numeric illustration of observations 1 and 2 follows these highlights.)
"We are now starting to roll out AI agents, which will eventually feel like virtual co-workers."
"Many of us expect to need to give people more control over the technology than we have historically, including open-sourcing more, and accept that there is a balance between safety and individual empowerment that will require trade-offs."
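Observations 1 and 2 compose into simple arithmetic: if capability scales like the log of resources, each extra unit of capability costs roughly 10x more compute, while the price of a fixed capability drops roughly 10x per year. A back-of-the-envelope sketch, with every number invented purely for illustration:

```python
import math

# Toy illustration of Altman's observations 1 and 2; no real measurements.

# Obs. 1: intelligence ~ log(resources), so each additional "unit" of
# capability needs roughly 10x the resources of the previous one.
for resources in (10, 100, 1_000, 10_000):
    print(f"resources {resources:>6} -> capability ~ {math.log10(resources):.0f}")

# Obs. 2: the cost of a fixed capability falls ~10x every 12 months,
# so a hypothetical $200/month capability today costs pennies in three years.
cost = 200.0
for year in range(4):
    print(f"year {year}: ~${cost / 10 ** year:,.2f}/month")
```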
The End of Programming as We Know It by Tim O'Reilly
Must-see course about LLMs by Andrej Karpathy (a 3+ hour video)
Top models to pay attention to
SmolLM2: When Smol Goes Big trains a 1.7B-parameter model on 11T tokens and specialized datasets (FineMath, Stack-Edu, SmolTalk). It outperforms larger models on benchmarks, showing the power of data-centric training and efficient instruction tuning.
Sundial: A Family of Highly Capable Time Series Foundation Models introduces a time-series foundation model using flow-matching loss and large-scale tokenization. Achieves state-of-the-art zero-shot forecasting and 11.34× faster inference, addressing mode collapse issues.
Satori: RL with Chain-of-Action-Thought enhances LLM reasoning with reinforcement learning and autoregressive search. Achieves 93.2% on GSM8K math benchmarks, demonstrating improved self-reflection and generalization to out-of-domain tasks.
Ola: Omni-Modal Language Model presents an open-source LLM that progressively learns to integrate text, vision, audio, and video.
Llasa: LLaMA-based Speech Synthesis at Scale develops a Transformer-based text-to-speech model inspired by LLM scaling principles.
The freshest research papers, categorized for your convenience
There were quite a few super interesting research papers this week; we mark the ones we recommend the most with 🌟 in each section.
LLM Techniques and Optimizations
Activation-Informed Merging of LLMs proposes Activation-Informed Merging (AIM) to merge fine-tuned LLMs by preserving key activation-space weights, boosting performance without retraining → read the paper
Content-Format Integrated Prompt Optimization (CFPO) introduces a prompt design method optimizing both content and format to enhance LLM responses → read the paper
Reasoning and Multi-Step Problem Solving
🌟 AlphaGeometry2 (Olympiad Geometry Solver) enhances AlphaGeometry to solve IMO-level geometry problems with a broader formal language → read the paper
BOLT: Bootstrapping Long Chain-of-Thought presents a method to train LLMs for long reasoning chains without relying on distillation from larger models → read the paper
Token Assorted (Mixing Latent & Text Tokens) proposes compressing early reasoning steps into latent tokens to shorten chain-of-thought sequences → read the paper
ScoreFlow (Optimizing LLM Agent Workflows) develops a Score-based Direct Preference Optimization (Score-DPO) method for optimizing multi-agent workflows → read the paper
🌟 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning evaluates LLMs on logic grid puzzles, revealing how complexity diminishes accuracy despite enhanced inference strategies → read the paper
The Jumping Reasoning Curve? assesses GPT-[N] and o-[N] models on multimodal puzzles, highlighting major reasoning performance gaps and inference costs → read the paper
Demystifying Long Chain-of-Thought Reasoning in LLMs analyzes how supervised fine-tuning and reinforcement learning impact reasoning performance → read the paper
🌟 Limo: Less is More for Reasoning demonstrates that curated reasoning samples outperform massive datasets in LLM training → read the paper
Model Efficiency and Scaling
MAGA: Massive Genre-Audience Data Augmentation reformulates existing text into different genres and audiences to generate synthetic pretraining data → read the paper
ParetoQ: Low-Bit Quantization Scaling Laws studies extreme LLM quantization and identifies critical phase changes in low-bit regimes → read the paper
Alignment and Safety Improvements
Feature Flow for Steerable LLMs introduces a causal cross-layer interpretability framework to track semantic features → read the paper
🌟 "Great Models Think Alike" (AI Oversight Risks) examines the risks of using one LLM to oversee another, revealing bias towards models that "think alike" → read the paper
PILAF (Optimal Preference Sampling for RLHF) introduces a preference-sampling strategy that focuses ranking queries on maximizing the true reward → read the paper
DuoGuard: Multilingual LLM Guardrails via Two-Player RL uses adversarial training between a generator and a guard model to improve safety across languages → read the paper
Domain-Specific Applications of LLMs
Clinical Reasoning Limitations (M-ARC Benchmark) tests LLMs on medical reasoning tasks, exposing their limitations in open-ended clinical decision-making → read the paper
LLMs for Legal Analysis (IRAC in Law) evaluates how LLMs handle legal reasoning using the IRAC framework, revealing issues like hedging and hallucinations → read the paper
HackerRank-ASTRA (Code Generation Evaluation) benchmarks LLMs on complex multi-file coding tasks, assessing consistency and robustness → read the paper
Open-Source vs. Proprietary LLM Innovations
The Open-Source Advantage in LLMs argues that open-source models, despite trailing in raw performance, offer the best path for research and ethical AI → read the paper
UltraIF: Closing the Instruction-Following Gap develops a method to train open-source models to match proprietary instruction-following abilities using sub-query decomposition → read the paper
That's all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.