Let’s dissect one of the most promising Vision-Language Models (VLMs)!

VLMs face unique challenges compared to Large Language Models (LLMs), which are more extensively developed and reliable. While LLMs often excel in reasoning tasks, VLMs tend to struggle with systematic reasoning, frequently bypassing structured thought processes and arriving at final answers prematurely, which leads to errors. To address these gaps, researchers from leading Chinese universities and prominent research labs collaborated to develop LLaVA-o1, a VLM designed to reason step-by-step and maintain a well-organized thought process. The model also employs a specific inference-time scaling technique to ensure more reliable, verified answers.

Headlines across social networks and the media proclaimed: “LLaVA-o1 to challenge OpenAI's o1 model.” Is that so? And what sets LLaVA-o1 apart? Let’s find out!

In today’s episode, we will cover:

  • VLMs limitations

  • Here comes LLaVA-o1

  • How does LLaVA-o1 work?

  • The secret sauce: Stage-level beam search

  • How well does LLaVA-o1 perform?

  • How does LLaVA-o1 stand out?

  • Not without limitations

  • Conclusion

  • Bonus: Resources

Why VLMs Struggle with Systematic Reasoning

Vision-Language Models (VLMs) are gaining popularity as they can understand the world from both sides – words and images. But they often struggle with challenging problems, like answering complex questions about images. The problem is that most VLMs just give quick answers without working through the problem in a structured way, which falls short of what complex tasks require. Even Chain-of-Thought (CoT) reasoning isn’t a perfect solution: VLMs still make mistakes, jump to conclusions, or produce nonsense, and their reasoning process remains poorly organized at its core.

Compared to VLMs, LLMs are more reliable for tasks that need logical thinking. For example, OpenAI's o1 is great at reasoning thanks to inference-time scaling, which includes breaking down complex problems step-by-step, making multiple attempts at a problem, and reasoning iteratively.

So how can we enhance VLMs' reasoning? Can we borrow these ideas from o1 to improve VLM performance?

What Is LLaVA-o1? A Step-by-Step Vision-Language Model

A group of researchers from Peking University (including its Shenzhen Graduate School), Tsinghua University, Peng Cheng Laboratory, Alibaba DAMO Academy, and Lehigh University set out to make VLMs better at complex reasoning and introduced LLaVA-o1.

It’s a smart VLM that can reason step-by-step in a clear, structured way. Instead of jumping to answers, it divides reasoning into four stages and uses stage-level beam search, a specialized inference-time scaling method, to create multiple answers for each stage. This brings LLaVA-o1 closer to OpenAI’s o1 reasoning process.

Let's take a closer look at this model.

How LLaVA-o1 Works: 4 Structured Reasoning Stages

To make reasoning more structured and systematic, LLaVA-o1 breaks the process into four stages:

  • Summary: The model gives a brief overview of the question or task, focusing on the main problem.

  • Caption: If there’s an image, it describes the important parts related to the question.

  • Reasoning: It carefully thinks through the question and comes up with a preliminary answer.

  • Conclusion: The model provides a final answer based on the reasoning.

Image Credit: Original paper

Each stage is clearly marked with special tags like <SUMMARY>...</SUMMARY>, so the model stays focused and avoids going off track. The stages happen automatically, and the user doesn’t need to guide the whole process. The first three stages are hidden from the user, while the conclusion is what the user sees. Whether you need a short answer or a detailed explanation, LLaVA-o1 will provide it.
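To make the tagging idea concrete, here is a minimal sketch of how a client could split a tagged response into stages and show only the conclusion. The tag names other than <SUMMARY> and <REASONING> are assumed to follow the same pattern, the function names are hypothetical, and the sample response is invented for illustration; this is not the authors' code.

```python
import re

# Stage names follow the four stages described above. The exact spelling
# of the CAPTION and CONCLUSION tags is an assumption.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_stages(text: str) -> dict[str, str]:
    """Return a mapping from stage name to the text inside its tag pair."""
    stages = {}
    for name in STAGES:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


def visible_answer(text: str) -> str:
    """Show only the conclusion; fall back to the raw text if it is missing."""
    return split_stages(text).get("CONCLUSION", text.strip())


# Hypothetical model output, invented for illustration.
sample = (
    "<SUMMARY>Count the apples in the picture.</SUMMARY>"
    "<CAPTION>A bowl holding three red apples.</CAPTION>"
    "<REASONING>Each apple is distinct; there are three.</REASONING>"
    "<CONCLUSION>There are 3 apples.</CONCLUSION>"
)
print(visible_answer(sample))
```

Keeping the intermediate stages in a dictionary also makes it easy to log or inspect them when an answer looks wrong.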

This staged processing gives the reasoning a clear structure, but on its own it isn’t enough to lift a VLM to a new level of reasoning accuracy. This is where another feature of LLaVA-o1 comes into play.

Stage-Level Beam Search: LLaVA-o1's Inference-Time Scaling

Inference-time scaling means spending extra compute while the model is answering, rather than during training, to get better results. LLaVA-o1 uses a new method called stage-level beam search to improve its reasoning process. At every reasoning stage, it generates, compares and refines multiple possible answers before moving forward to the next stage.

Image Credit: Original paper

If we break this process down, it will look like this:

  1. Generating options:

    • For the first reasoning stage (for example, summarizing), the model creates multiple possible answers.

  2. Picking the best option:

    • The model compares two randomly chosen answers, decides which one is better, and keeps the better one.

    • This step repeats until only the best response from all the options is left.

  3. Moving to the next stage:

    • The process is repeated for the next reasoning stage, ensuring that only the best responses are carried forward.

  4. Finish all stages:

    • By the time the model reaches the final conclusion stage, it has refined its reasoning through every step.

The stage-level beam search approach makes it more likely that the final answer rests on solid logic and accurate intermediate steps, since weaker candidates are filtered out at every stage.
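The four steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: `generate` and `better` are hypothetical stand-ins for calls to the model (one drafts a candidate for a stage, the other judges a pair), and the toy versions at the bottom exist only to make the sketch runnable.

```python
import random
from typing import Callable, List

STAGES = ["summary", "caption", "reasoning", "conclusion"]


def pick_best(candidates: List[str],
              better: Callable[[str, str], str],
              rng: random.Random) -> str:
    """Compare two randomly chosen candidates, drop the loser, repeat."""
    pool = list(candidates)
    while len(pool) > 1:
        a, b = rng.sample(pool, 2)
        winner = better(a, b)  # assumed to return either a or b
        loser = b if winner == a else a
        pool.remove(loser)
    return pool[0]


def stage_level_beam_search(question: str,
                            generate: Callable[[str, str, int], str],
                            better: Callable[[str, str], str],
                            n_candidates: int = 4,
                            seed: int = 0) -> str:
    """Run every stage in order, carrying only the winning text forward."""
    rng = random.Random(seed)
    context = question
    for stage in STAGES:
        # Step 1: generate several candidates for the current stage.
        candidates = [generate(context, stage, i) for i in range(n_candidates)]
        # Step 2: pairwise elimination until one candidate is left.
        best = pick_best(candidates, better, rng)
        # Step 3: the winner becomes part of the context for the next stage.
        context = context + "\n" + best
    return context


# Toy stand-ins (assumptions): a real system would call the VLM here.
def toy_generate(context: str, stage: str, i: int) -> str:
    return f"<{stage}> draft-{i}"


def toy_better(a: str, b: str) -> str:
    return max(a, b)  # pretend the lexicographically larger draft is better


print(stage_level_beam_search("Q", toy_generate, toy_better))
```

Because selection happens per stage, a weak summary or caption is discarded before it can mislead the later reasoning stage.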

LLaVA-o1 Benchmarks: vs GPT-4o-mini, Gemini-1.5-pro, and Open-Source Models

Let's examine LLaVA-o1's performance results and the factors they depend on.

When researchers trained the model on the original question-answer pairs alone, the results were underwhelming. Since existing datasets don’t spell out the reasoning behind their answers, the researchers created a dedicated LLaVA-o1-100k dataset to teach LLaVA-o1 to reason logically. This dataset includes about 100,000 question-answer pairs with step-by-step reasoning examples generated by GPT-4o, and it made a significant difference.

Even though LLaVA-o1 was trained on only 100k examples, it showed clear improvements. On average, it scored 6.9% higher than the base model. This means it’s much better at general question answering tasks and avoiding hallucinations. The model improved the most on tasks that need reasoning, such as:

  • Instance reasoning (figuring out specific details from the input)

  • Logical reasoning (solving problems step-by-step)

  • Math and science

LLaVA-o1’s performance on six benchmarks

LLaVA-o1’s performance in six skills

The structured tags, such as <SUMMARY> and <REASONING>, are essential because they help the model organize its thoughts. When these tags were removed, performance dropped significantly.

When researchers compared inference approaches, they found that LLaVA-o1 with stage-level beam search significantly outperformed other inference-time scaling methods:

  • Best-of-N: Slight improvement (+0.6%).

  • Sentence-level beam search: Performance dropped by 1.9% because breaking down tasks too much, sentence by sentence, doesn't work well for open-ended reasoning.

  • Stage-level beam search: Performance improved by 2.6%, showing it’s the most effective method. As the number of candidate responses increased, performance of LLaVA-o1 improved, confirming that stage-level beam search scales well.

And finally, LLaVA-o1 was tested against both open-source and closed-source models, including some much larger ones, and the experiments showed that:

  • LLaVA-o1 outperformed baseline models like Llama-3.2-11B-Vision-Instruct.

  • It showed better results than several open-source models, such as InternVL2-8B and VILA-1.5-40B.

  • It even beat certain advanced closed-source models, like GPT-4o-mini and Gemini-1.5-pro.

Image Credit: Original paper

Image Credit: Original paper

LLaVA-o1's advantages have come up throughout the previous sections, so let's gather them in one place.

How does LLaVA-o1 stand out?

Here are the advantages of the LLaVA-o1 model compared to other VLMs:

  • Better thinking process and problem solving: LLaVA-o1 doesn’t guess the answer – it works through the problem logically and thoroughly, organizing its reasoning into clear steps.

  • Adaptable answers: The model adjusts the level of detail in its answers based on what the user needs.

  • Smart scaling: It works well even on difficult tasks by using stage-level beam search to generate multiple options and choose the best one. It also achieves higher accuracy as the number of candidate responses in beam search increases.

  • Accurate answers: It avoids mistakes by keeping only the best reasoning paths.

  • Performance: LLaVA-o1 performs well across tasks requiring general question answering, mathematical reasoning, and hallucination control. It also outperforms larger and even closed-source models in advanced reasoning tasks.

But what about the disadvantages of LLaVA-o1?

LLaVA-o1 Limitations: What It Can't Do Yet

Several limitations could get in the way of using LLaVA-o1 in practice:

  • Limited dataset size: The LLaVA-o1-100k dataset is relatively small compared to datasets used by other state-of-the-art models, which might limit the model's generalization to diverse and unseen tasks.

  • Resources for scaling: Generating and evaluating multiple candidate answers for improved performance requires significant computational resources.

  • Dependency on structured tags: Performance drops significantly when structured tags are removed, limiting LLaVA-o1’s flexibility in unstructured environments.

  • Focus on reasoning-heavy tasks: While LLaVA-o1 excels in reasoning-heavy tasks, it shows smaller improvements in areas like coarse perception and fine-grained perception, where less systematic reasoning is required.
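To get a feel for the compute cost of scaling, the sketch below counts model calls for stage-level beam search. It rests on stated assumptions rather than figures from the paper: each candidate costs one generation call, each pairwise comparison costs one call, and eliminating down to a single winner from N candidates takes N - 1 comparisons. The function name is hypothetical.

```python
def stage_level_cost(n_candidates: int, n_stages: int = 4) -> dict:
    """Count model calls under the assumptions stated in the lead-in."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    generations = n_candidates * n_stages
    # Pairwise elimination removes one candidate per comparison.
    comparisons = (n_candidates - 1) * n_stages
    return {
        "generations": generations,
        "comparisons": comparisons,
        "total_calls": generations + comparisons,
    }


for n in (1, 2, 4):
    print(n, stage_level_cost(n))
```

Under these assumptions the call count grows linearly with the number of candidates, so doubling the beam roughly doubles the inference cost.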

LLaVA-o1 vs OpenAI o1: Are They Rivals?

LLaVA-o1's ability to think through problems in a structured way makes it more accurate and reliable than older models. It sets a new benchmark for models that combine language and vision, especially for tasks requiring deep reasoning. It also proposes an alternative approach to step-by-step reasoning (thinking through stages) and an efficient, scalable stage-level beam search method for refining the best answers. The future will show how these approaches can be expanded and what else can be added to VLMs to enhance their complex multimodal reasoning.

As for the question of whether it rivals OpenAI's o1, here is the answer:

While OpenAI’s o1 dominates textual reasoning with advanced chain-of-thought techniques, LLaVA-o1 excels in visual reasoning, introducing structured, step-by-step processes and stage-level beam search.

Though LLaVA-o1 outperforms larger models in reasoning-heavy tasks, its smaller dataset and reliance on structured tags limit generalizability. It’s not a direct rival to OpenAI’s o1 but a significant step forward for Vision-Language Models, highlighting their complementary roles in advancing AI reasoning.

Bonus: Resources
