Turing Post
Topic 19: Inside LLaVA-o1

Let's explore a smarter Vision-Language Model (VLM) that thinks step-by-step

Alyona Vert.
November 27, 2024

🍁 First of all: Happy Thanksgiving to everyone who celebrates! We’re truly grateful to have such incredible readers like you. Knowing that our work might be helpful to you is already the greatest reward. Thank you for being part of the Turing Post community! 🦃

Now let’s dissect one of the most promising Vision-Language Models (VLMs)!

VLMs face unique challenges compared to Large Language Models (LLMs), which are more extensively developed and reliable. While LLMs often excel in reasoning tasks, VLMs tend to struggle with systematic reasoning, frequently skipping structured thought processes and arriving at final answers prematurely, which leads to errors. To address these gaps, researchers from leading Chinese universities and prominent research labs collaborated to develop LLaVA-o1, a VLM designed to reason step-by-step and maintain a well-organized thought process. The model also employs a specific inference-time scaling technique to produce more reliable, verified answers.

Headlines across social networks and media declared: “LLaVA-o1 to challenge OpenAI's o1 model.” Is that so? And what sets LLaVA-o1 apart? Let’s find out!

In today’s episode, we will cover:

VLM limitations
Here comes LLaVA-o1
How does LLaVA-o1 work?
The secret sauce: Stage-level beam search
How well does LLaVA-o1 perform?
How does LLaVA-o1 stand out?
Not without limitations
Conclusion
Bonus: Resources

VLM limitations

Vision-Language Models (VLMs) are gaining popularity because they can understand the world through both words and images. But they often struggle with challenging problems, like answering complex questions about images. Most VLMs give quick answers without working through the problem in a structured way, which is not what complex tasks require. Even Chain-of-Thought (CoT) reasoning isn’t a perfect solution: VLMs still make mistakes, jump to conclusions, or produce nonsense, and their reasoning process remains poorly organized at its core.

Compared to VLMs, LLMs are more reliable for tasks that need logical thinking. For example, OpenAI's o1 is great at reasoning thanks to inference-time scaling, which includes breaking down complex problems step-by-step, making multiple attempts at a problem, and reasoning iteratively.
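To make the "multiple attempts" idea concrete, here is a minimal Python sketch of one generic form of inference-time scaling: ask for several answers and keep the most common one. This is only an illustration of the general idea, not a description of how o1 works internally. The names `majority_vote` and `toy_solver` are hypothetical, and the canned answers stand in for real model calls.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[str], str], question: str, attempts: int = 5) -> str:
    """Collect several independent answers and return the most frequent one."""
    answers = [sample_answer(question) for _ in range(attempts)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a model: replays canned answers, some of them wrong.
canned = iter(["4", "5", "4", "3", "4"])


def toy_solver(question: str) -> str:
    return next(canned)


print(majority_vote(toy_solver, "What is 2 + 2?", attempts=5))  # prints 4
```

Spending more compute at inference time (more attempts) lets occasional wrong answers get outvoted, at the cost of extra model calls.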

So how can we enhance VLM reasoning? Can we borrow these o1 techniques to improve VLM performance?

Here comes LLaVA-o1

A group of researchers from Peking University, Tsinghua University, Peng Cheng Laboratory, Peking University Shenzhen Graduate School, Alibaba DAMO Academy, and Lehigh University, motivated by the idea of making VLMs more capable at complex reasoning, introduced LLaVA-o1.

It’s a VLM that can reason step-by-step in a clear, structured way. Instead of jumping to answers, it divides reasoning into four stages and uses stage-level beam search, a specialized inference-time scaling method, to generate multiple candidate answers for each stage. This brings LLaVA-o1 closer to OpenAI’s o1 reasoning process.
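The stage-by-stage idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the stage names, the `generate` and `score` callables, and the toy stand-ins are all hypothetical, and this version keeps only the single best candidate per stage. How a real system compares candidates is not covered by the text above.

```python
import itertools
from typing import Callable, List

# Assumed stage names for illustration; the text above only says there are four stages.
STAGES = ["summary", "caption", "reasoning", "conclusion"]


def stage_level_beam_search(
    question: str,
    generate: Callable[[str, str], str],
    score: Callable[[str, str], float],
    candidates_per_stage: int = 3,
) -> List[str]:
    """For each stage, sample several candidates and keep the best one before moving on."""
    context = question
    chosen: List[str] = []
    for stage in STAGES:
        candidates = [generate(context, stage) for _ in range(candidates_per_stage)]
        best = max(candidates, key=lambda c: score(context, c))
        chosen.append(best)
        # The winning candidate becomes part of the context for the next stage.
        context = context + "\n" + best
    return chosen


# Hypothetical stand-ins for a VLM call and a candidate scorer.
counter = itertools.count(1)


def toy_generate(context: str, stage: str) -> str:
    return f"{stage} draft {next(counter)}"


def toy_score(context: str, candidate: str) -> float:
    # Toy rule: prefer the draft with the highest number.
    return float(candidate.rsplit(" ", 1)[1])


for line in stage_level_beam_search("What is in the image?", toy_generate, toy_score):
    print(line)
```

The design point is where the selection happens: candidates are compared after each stage rather than only at the end, so a weak early stage is filtered out before later stages build on it.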

Let's take a closer look at this model.

How does LLaVA-o1 work?

The rest of this article, with detailed explanation and relevant resources, is available to our Premium users only →
