AI 101: What's New in Test-Time Scaling?
The test-time scaling journey into world models, agents, and diffusion, with a reminder to stay cautious and keep it controllable
At the beginning of the year, we dove into one of the main concepts that became central to taking models to the next level of reasoning. Yes, it’s test-time compute and its scaling. That article became one of our most popular, showing just how much interest there is in what happens at the inference stage of AI models.
Test-time compute is the key to influencing a model's behavior after the training stage – during inference, when it puts all its learned capabilities to work. But scaling test-time compute can unlock even more from models. This strategy also rose alongside Language Reasoning Models (LRMs), which now share the spotlight with agents.
A lot has happened in test-time compute research since the beginning of the year – from new strategies for smarter scaling to studies revealing its limits and occasional drawbacks. It’s still a bit of a messy topic, but its role is hard to overstate. Test-time compute sits at the heart of a model’s real-world performance, and the discussions around it echo what’s happening in many parts of AI – from multimodal systems and world models to agents and the behavior of reasoning models.
Today we’re going to look at three of the latest outstanding approaches to test-time compute scaling:
Chain-of-Layers (CoLa) – a method that allows for better control and optimization of reasoning models.
Blending Vision-Language Models (VLMs) and world models with the MindJourney test-time scaling framework.
Applying the diffusion process for test-time scaling to build a better deep research agent – Google Cloud’s TTD-DR.
We will also look at the other side of the test-time scaling strategy, to see what can go wrong and to stay aware of its negative effects.
So, let’s investigate the real state of test-time compute scaling – now more proven over time.
In today’s episode, we will cover:
A quick reminder about test-time compute and its scaling
Controllable test-time compute
Chain-of-Layers (CoLa) method
How CoLa actually influences the performance
World models, VLMs and test-time scaling
How does MindJourney work?
Results of MindJourney
Strengths and limitations
Google Cloud’s Test-Time Diffusion Deep Researcher (TTD-DR)
TTD-DR workflow
Why TTD-DR is better than other Researchers
Benefits and issues
Where test-time scaling falls short: 5 key failures
Conclusion
Sources and further reading
A quick reminder about test-time compute and its scaling
Test-time compute (TTC) is the processing power an AI model uses when generating responses after training. In other words, it’s the compute cost of using the model during the inference stage. This includes all the operations the model performs, such as passing inputs through its layers, reasoning step-by-step, and generating outputs.
Test-time scaling is the strategy of using more compute at inference time. This can mean increasing the number of reasoning steps by implementing Chain-of-Thought (CoT) reasoning, using larger context windows, running multiple model passes and aggregating results, etc. This approach prioritizes slower reflective “thinking” over fast responses.
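To make the “multiple passes plus aggregation” flavor of test-time scaling concrete, here is a minimal sketch of best-of-N sampling with majority voting (often called self-consistency). The `generate_answer` stub is a hypothetical stand-in for a real model call, not any specific API:

```python
import random
from collections import Counter

# Hypothetical stand-in for a call to a reasoning model (e.g. an API client).
# In a real setup this would send `prompt` to the model and return its answer.
def generate_answer(prompt: str, temperature: float = 0.8) -> str:
    # Dummy behaviour for illustration only: pretend the model is right ~70% of the time.
    return random.choices(["42", "41"], weights=[0.7, 0.3])[0]

def best_of_n(prompt: str, n: int = 16) -> str:
    """Simple test-time scaling: sample n independent answers and majority-vote."""
    answers = [generate_answer(prompt, temperature=0.8) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{n} samples agreed on: {winner}")
    return winner

if __name__ == "__main__":
    best_of_n("What is 6 * 7?", n=16)
```

Spending 16 model calls instead of one is the whole trade-off: no retraining, just more inference compute in exchange for a more reliable answer.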
The main idea behind this is that performance can improve without retraining, because the model “works harder” on each task. Here is what the most up-to-date approaches can offer to apply this scaling strategy more smartly and effectively to various models and agents.
Let’s start with how we can actually control the reasoning length in models.
Controllable test-time compute
Test-time scaling usually means spending more compute – deeper models, more layers, multiple forward passes – at inference time to improve performance. But as this paradigm spread, many started asking: why always use a bigger compute budget if some tasks are handled better without explicit “thinking”? Many advanced reasoning and agentic models – Claude 4, Qwen3, GLM-4.5 – already include the ability to switch between fast responses and extended step-by-step thinking.
Following this shift, a recent paper from the Department of Computer Science at the University of Maryland offers its own take on dynamic and adaptive “thinking”. Instead of using the full model in a fixed way for every input, it explores how to adjust the model's depth and layer composition per input, at test time.
The approach selectively uses more compute (for example, repeating layers for harder problems) or less compute (skipping layers for easier ones) – what the authors call test-time depth adaptation. To make this concrete, they propose the Chain-of-Layers (CoLa) method. It’s a challenge to naive test-time scaling, because it shows you can often do better with less if you’re smart about layer use. So let’s dive into how it works.
Chain-of-Layers (CoLa) method
The main idea behind the research from the University of Maryland is to make a pretrained model smarter at test time by customizing which layers it uses, without any retraining. The researchers treat the layers of the model like building blocks that can be rearranged. For each test example, they build a custom version of the model from those blocks. This flexible way of building a model for each input is called CoLa (Chain-of-Layers). It allows you to:
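A minimal, hypothetical sketch of the layer-composition idea described above – toy residual blocks stand in for a pretrained transformer's frozen layers; this is not the paper's actual implementation. Harder inputs get a longer chain (repeated layers), easier ones a shorter chain (skipped layers):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a pretrained stack of transformer blocks.
# A real CoLa-style setup would reuse the frozen layers of an existing LLM;
# here each "layer" is a small residual MLP so the sketch runs anywhere.
class ToyBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)  # residual connection, as in transformer blocks

def run_chain(layers: nn.ModuleList, x: torch.Tensor, chain: list[int]) -> torch.Tensor:
    """Execute a custom 'chain of layers': indices may skip or repeat blocks."""
    for idx in chain:
        x = layers[idx](x)
    return x

layers = nn.ModuleList(ToyBlock(64) for _ in range(6))
x = torch.randn(1, 64)

# Easy input: use less compute by skipping layers 2-4.
shallow = run_chain(layers, x, chain=[0, 1, 5])

# Hard input: use more compute by repeating layers 2 and 3.
deep = run_chain(layers, x, chain=[0, 1, 2, 2, 3, 3, 4, 5])
```

The per-input choice of which chain to run is what a method like CoLa has to search for at test time; the sketch only shows why executing such a chain needs no retraining.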