This Week in Turing Post:
Wednesday, AI 101, Technique: What are Chain-of-Agents and Chain-of-RAG?
Friday, Agentic Workflow: we explore Reasoning
Turing Post is a reader-supported publication. To be the first to receive new posts, consider becoming a paid subscriber. Your support helps us keep news digests free →
The main topic: DeepResearch from OpenAI makes me rethink my work routine
Turing Post is a lean operation, run full-time by just two people. While I work with a few trusted contributors, the heavy lifting is done by Alyona and me. I wasn't actively looking to add someone new to the team, yet here we are, and I couldn't be happier about it.
Meet our newest hire: DeepResearch ($200/month).
Despite the ongoing controversies surrounding OpenAI, their latest release, DeepResearch (seriously, you guys need to up your naming game), is a game-changer. It's not replacing anyone at Turing Post, but it has significantly cut down the hours we spend on research-heavy tasks. What used to take ages now gets done in a fraction of the time, to the point that I'm now rethinking my well-established workflow.
What do people say on the web?
I didn't have to give up a Perplexity subscription; I never had one. Instead, I've been using a combination of cross-prompting Gemini Deep Research and ChatGPT o1 or o3-mini, but DeepResearch might simplify that routine. It shifts the workflow from active searching to supervising an AI-generated research process. It's a different level of actual help: like having a virtual research assistant to whom you give a prompt, step away while it works, and return to a finished analysis.
My summary is that DeepResearch is amazing as a well-organized starting point. What I also liked is that it feels like a promise of a working agent. DeepResearch will ask clarifying questions if your prompt is ambiguous, then proceed step by step. The result is a more robust and context-aware research process compared to a single-turn question-answer system. Or, if you don't know the answers to its questions, just say, "Do as you see fit, knowing what I'm working on." It does a pretty good job of figuring things out on its own. Pretty dope.
It also understands time frames: if you need fresh materials from February 3 to February 10, it will search specifically within that range.
Andrew Maynard, a professor and author, wrote that after using DeepResearch, he's "beginning to wonder when research and scholarship that isn't augmented by AI will be seen as an anachronism." "Using it feels like giving a team of some of the best minds around PhD-level questions, and having them come back with PhD-level responses, all within a few hours." (from Does OpenAI's Deep Research signal the end of human-only scholarship?)
This means DeepResearch can identify cross-domain links or examples that might otherwise be overlooked, offering fresh perspectives. In professional settings, this can support more well-rounded decision-making: for example, a product manager can quickly gather insights from scientific research, market data, and consumer opinions in one place, rather than relying on multiple teams or lengthy research processes. It makes you multifaceted!
Ethan Mollick was impressed by the depth, but he and others, like economist Kevin Bryan, pointed out current limitations in data access: notably, that access to better search and paywalled content would make such agents far more useful.
How does DeepResearch work?
OpenAI's breakthrough with DeepResearch lies in its ability to take coherent actions throughout its reasoning process. Unlike traditional AI agents that struggle with long-term focus, this model maintains progress without getting distracted. And unlike Gemini's approach, which searches for sources first and then compiles a report, OpenAI's version dynamically searches and takes actions as needed. This makes it more adaptable and efficient. Under the hood, it's powered by the o3 model with reinforcement learning, allowing it to act within its reasoning process. The quality of the results depends on the chosen research target.
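To make the "acts within its reasoning" idea concrete, here is a minimal Python sketch of an interleaved reason-search-act loop. OpenAI hasn't published DeepResearch's internals, so the function names, the stop condition, and the stubbed model and search tool are all assumptions for illustration:

```python
# A deliberately tiny reason-search-act loop. Everything here is an
# assumption for illustration, not DeepResearch's actual implementation.

def model_step(context: str) -> dict:
    """Stand-in for one reasoning step of the underlying model.

    A real system would call the model here; this stub searches once
    and then finishes, so the sketch runs end to end.
    """
    if "RESULTS:" not in context:
        return {"action": "search", "query": "DeepResearch technical details"}
    return {"action": "finish", "report": "Synthesized findings based on:\n" + context}

def web_search(query: str) -> str:
    """Stand-in for a search tool; a real agent would call a search API."""
    return f"RESULTS: placeholder snippets for '{query}'"

def research(task: str, max_steps: int = 10) -> str:
    """Interleave reasoning with tool use until the model decides to stop."""
    context = f"TASK: {task}"
    for _ in range(max_steps):
        step = model_step(context)
        if step["action"] == "search":
            # The model gathers evidence mid-reasoning, as needed,
            # instead of collecting all sources up front.
            context += "\n" + web_search(step["query"])
        else:
            return step["report"]
    return "Step budget exhausted; returning partial context:\n" + context

print(research("How does DeepResearch from OpenAI work?"))
```

The point of the loop structure is the contrast drawn above: search happens inside the reasoning process, wherever the model decides it needs evidence, rather than as a separate up-front collection phase.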
Here is a lengthy research report produced by DeepResearch following the prompt: "I need technical details about how DeepResearch from OpenAI works. Give me model architecture, system architecture, and deeper insights into proprietary aspects":
Not without limitations
Occasional Inaccuracies and Hallucinations: Like every LLM, it can misstate facts, confuse similar terms, or generate incorrect information. ALWAYS verify.
Difficulty Assessing Source Credibility: Doesn't always distinguish authoritative sources from unreliable ones, sometimes including outdated or low-quality information.
Outdated or Stale Information: May cite old data, especially in fast-changing fields, unless explicitly prompted for the latest updates.
Inconsistent Instruction Adherence: Sometimes includes topics it was told to exclude or doesn't fully follow user guidance.
Potentially Incomplete in Niche Depth: Might miss important details or references that an expert would consider essential.
Overwhelming Length and Irrelevant Details: Tends to provide exhaustive reports, sometimes including excessive or tangential information.
High Cost and Limited Access: Available only to ChatGPT Pro users at $200/month, making it inaccessible to many casual users.
Opaque "Black Box" Reasoning: Users don't see how it selects or evaluates sources, making its conclusions harder to fully trust without verification.
But you know the saying: this is the worst it will ever be.
Best Practices for Using DeepResearch Efficiently
Craft a Detailed, Focused Prompt: Be clear and specific in your query to avoid irrelevant results. Use ChatGPT to refine your prompt before submitting it. (A sketch of assembling such a prompt follows this list.)
Provide Context or Examples: Giving background information or specifying the desired answer format helps guide the AI's research. A lot. Here is an example from Ben Thompson, author of Stratechery.

Engage with Clarification Questions: Answer any follow-up questions from DeepResearch to fine-tune its direction before it starts searching. Its questions are helpful on their own, prompting you to think things through and clarify what you really want.
Specify Scope and Bias Preferences: Direct the AI on preferred sources, date ranges, or perspectives (e.g., "focus on peer-reviewed studies" or "exclude politically biased sources").
Verify and Refine the Output: Treat the AI's report as a first draft, fact-check key claims, and run follow-up queries to clarify or correct missing details.
Request Summaries or Actionable Insights: After a long report, ask for a concise summary, key takeaways, or recommendations to make the information more digestible.
Manage Time: Plan around DeepResearch's processing time (5-30 minutes) and work on other tasks while it's "thinking".
Maintain a "Job Description" for Your New Employee: Create a list of tasks that DeepResearch can assist with or automate right now. Keep track of how others use it. Try incorporating it into your routine and adjust as needed.
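Pulling these practices together, here is a small Python sketch of how you might assemble a prompt that bakes in context, scope, date range, source preferences, and output format. The helper and its structure are my own convention for illustration, not an official DeepResearch template:

```python
# A hypothetical helper that assembles a research prompt from the
# ingredients above: task, context, date range, source preferences,
# and output format.

def build_research_prompt(task, context, date_range=None,
                          prefer=None, exclude=None, output_format=None):
    parts = [f"Task: {task}", f"Context: {context}"]
    if date_range:
        start, end = date_range
        parts.append(f"Only use materials published between {start} and {end}.")
    if prefer:
        parts.append("Prefer these sources: " + ", ".join(prefer) + ".")
    if exclude:
        parts.append("Exclude: " + ", ".join(exclude) + ".")
    if output_format:
        parts.append(f"Format the answer as {output_format}.")
    # Invite clarifying questions up front (see "Engage with
    # Clarification Questions" above).
    parts.append("Ask clarifying questions before you start if anything is ambiguous.")
    return "\n".join(parts)

print(build_research_prompt(
    task="Survey recent open-source LLM agent frameworks",
    context="For an AI newsletter aimed at practitioners",
    date_range=("February 3", "February 10"),
    prefer=["peer-reviewed studies", "official project docs"],
    exclude=["politically biased sources"],
    output_format="a ranked list with one-paragraph summaries and links",
))
```

Even if you never script it, the checklist embodied in the helper (task, context, dates, sources, format, clarifications) is a decent template to run through before hitting submit.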
Have you tried it? What are your recommendations? Will it change your work routine?
From Our Partners: Everything you need to know about AI agents
Galileo just dropped a 100-page ebook on AI agents, so you can create powerful, reliable agents like an expert:
Match the right agentic framework for your use case
Evaluate and improve performance
Identify failure points and production issues
Curated Collections
Turing Post is now on 🤗 Hugging Face! You can read the rest of this article there (it's free!) →
We are reading/watching
Three Observations from Sam Altman. I don't usually analyze texts in this section. But here are a few highlights you can't miss from Altman's text:
"AGI is a weakly defined term, but generally speaking we mean it to be a system that can tackle increasingly complex problems, at human level, in many fields."
"AGI is just another tool in this ever-taller scaffolding of human progress we are building together."
Observations:
1. "The intelligence of an AI model roughly equals the log of the resources used to train and run it."
2. "The cost to use a given level of AI falls about 10x every 12 months, and lower prices lead to much more use."
3. "The socioeconomic value of linearly increasing intelligence is super-exponential in nature."
(A toy numeric illustration of observations 1 and 2 follows these highlights.)
"We are now starting to roll out AI agents, which will eventually feel like virtual co-workers."
"Many of us expect to need to give people more control over the technology than we have historically, including open-sourcing more, and accept that there is a balance between safety and individual empowerment that will require trade-offs."
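Observations 1 and 2 compose into simple arithmetic: if capability scales like the log of resources, each extra unit of capability costs roughly 10x more compute, while the price of a fixed capability drops roughly 10x per year. A back-of-the-envelope sketch, with every number invented purely for illustration:

```python
import math

# Toy illustration of Altman's observations 1 and 2; no real measurements.

# Obs. 1: intelligence ~ log(resources), so each additional "unit" of
# capability needs roughly 10x the resources of the previous one.
for resources in (10, 100, 1_000, 10_000):
    print(f"resources {resources:>6} -> capability ~ {math.log10(resources):.0f}")

# Obs. 2: the cost of a fixed capability falls ~10x every 12 months,
# so a hypothetical $200/month capability today costs pennies in three years.
cost = 200.0
for year in range(4):
    print(f"year {year}: ~${cost / 10 ** year:,.2f}/month")
```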
The End of Programming as We Know It by Tim O'Reilly
Must-see course about LLMs by Andrej Karpathy (a 3+ hour video)
Top models to pay attention to
SmolLM2: When Smol Goes Big trains a 1.7B-parameter model on 11T tokens and specialized datasets (FineMath, Stack-Edu, SmolTalk). It outperforms larger models on benchmarks, showing the power of data-centric training and efficient instruction tuning.
Sundial: A Family of Highly Capable Time Series Foundation Models introduces a time-series foundation model using flow-matching loss and large-scale tokenization. Achieves state-of-the-art zero-shot forecasting and 11.34× faster inference, addressing mode collapse issues.
Satori: RL with Chain-of-Action-Thought enhances LLM reasoning with reinforcement learning and autoregressive search. Achieves 93.2% on GSM8K math benchmarks, demonstrating improved self-reflection and generalization to out-of-domain tasks.
Ola: Omni-Modal Language Model presents an open-source LLM that progressively learns to integrate text, vision, audio, and video.
Llasa: LLaMA-based Speech Synthesis at Scale develops a Transformer-based text-to-speech model inspired by LLM scaling principles.
The freshest research papers, categorized for your convenience
There were quite a few super interesting research papers this week; we mark the ones we recommend the most with 🌟 in each section.
LLM Techniques and Optimizations
Activation-Informed Merging of LLMs proposes Activation-Informed Merging (AIM) to merge fine-tuned LLMs by preserving key activation-space weights, boosting performance without retraining → read the paper
Content-Format Integrated Prompt Optimization (CFPO) introduces a prompt design method optimizing both content and format to enhance LLM responses → read the paper
Reasoning and Multi-Step Problem Solving
🌟 AlphaGeometry2 (Olympiad Geometry Solver) enhances AlphaGeometry to solve IMO-level geometry problems with a broader formal language → read the paper
BOLT: Bootstrapping Long Chain-of-Thought presents a method to train LLMs for long reasoning chains without relying on distillation from larger models → read the paper
Token Assorted (Mixing Latent & Text Tokens) proposes compressing early reasoning steps into latent tokens to shorten chain-of-thought sequences → read the paper
ScoreFlow (Optimizing LLM Agent Workflows) develops a Score-based Direct Preference Optimization (Score-DPO) method for optimizing multi-agent workflows → read the paper
🌟 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning evaluates LLMs on logic grid puzzles, revealing how complexity diminishes accuracy despite enhanced inference strategies → read the paper
The Jumping Reasoning Curve? assesses GPT-[N] and o-[N] models on multimodal puzzles, highlighting major reasoning performance gaps and inference costs → read the paper
Demystifying Long Chain-of-Thought Reasoning in LLMs analyzes how supervised fine-tuning and reinforcement learning impact reasoning performance → read the paper
🌟 Limo: Less is More for Reasoning demonstrates that curated reasoning samples outperform massive datasets in LLM training → read the paper
Model Efficiency and Scaling
MAGA: Massive Genre-Audience Data Augmentation reformulates existing text into different genres and audiences to generate synthetic pretraining data → read the paper
ParetoQ: Low-Bit Quantization Scaling Laws studies extreme LLM quantization and identifies critical phase changes in low-bit regimes → read the paper
Alignment and Safety Improvements
Feature Flow for Steerable LLMs introduces a causal cross-layer interpretability framework to track semantic features → read the paper
🌟 "Great Models Think Alike" (AI Oversight Risks) examines the risks of using one LLM to oversee another, revealing bias towards models that "think alike" → read the paper
PILAF (Optimal Preference Sampling for RLHF) introduces a preference-sampling strategy that focuses ranking queries on maximizing the true reward → read the paper
DuoGuard: Multilingual LLM Guardrails via Two-Player RL uses adversarial training between a generator and a guard model to improve safety across languages → read the paper
Domain-Specific Applications of LLMs
Clinical Reasoning Limitations (M-ARC Benchmark) tests LLMs on medical reasoning tasks, exposing their limitations in open-ended clinical decision-making → read the paper
LLMs for Legal Analysis (IRAC in Law) evaluates how LLMs handle legal reasoning using the IRAC framework, revealing issues like hedging and hallucinations → read the paper
HackerRank-ASTRA (Code Generation Evaluation) benchmarks LLMs on complex multi-file coding tasks, assessing consistency and robustness → read the paper
Open-Source vs. Proprietary LLM Innovations
The Open-Source Advantage in LLMs argues that open-source models, despite trailing in raw performance, offer the best path for research and ethical AI → read the paper
UltraIF: Closing the Instruction-Following Gap develops a method to train open-source models to match proprietary instruction-following abilities using sub-query decomposition → read the paper
That's all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.