Step-by-step thinking is one of the most effective ways to improve the accuracy of Large Language Models (LLMs). It also makes the reasoning process more transparent. In our previous articles, we discussed the Chain-of-Thought and Chain(s)-of-Knowledge techniques, which focus on reasoning in text. However, words and symbols aren’t enough for every task. Some problems require visual thinking, similar to how humans imagine pictures while reasoning. To bring this kind of visual thinking to models, the Whiteboard-of-Thought prompting technique was proposed. It lets models create simple drawings, so they can handle problems that require visual thinking better than with words alone. Today, we will dive into this fascinating idea and explore its capabilities.

In today’s episode, we will cover:

  • Limitations of LLMs in visual reasoning

  • Here comes Whiteboard-of-Thought (WoT)

  • How does WoT work?

  • Is WoT prompting really good? 

  • Advantages of WoT

  • Not without limitations

  • Conclusion

  • Bonus: Resources

Limitations of LLMs in visual reasoning

LLMs handle tasks involving logic, like math or symbolic reasoning, very well when they write out their steps in text, an approach known as the Chain-of-Thought technique. However, they struggle with problems that require visual thinking, even though they've been trained on many different types of data, including images.

Even top AI models like GPT-4 can give wrong answers because they don't actually reason about the visual details. For example, when asked about “a lowercase letter that looks like a circle with a line coming down on the right side,” they might answer “b” instead of “q.”

Let’s look at how we, humans, solve these visual tasks. We naturally switch to imagining pictures in our minds or drawing things out to help us understand the problem. So why don’t we apply this way of thinking to AI models?

Here comes Whiteboard-of-Thought (WoT)

“Our key idea is that visual reasoning tasks demand visuals.”

To solve this problem, researchers from Columbia University decided to utilize multimodal large language models (MLLMs) that can process both images and text. They came up with a solution called "Whiteboard-of-Thought" (WoT) prompting. This approach allows the model to "draw" its thinking process and reasoning steps as images on a metaphorical "whiteboard," similar to how a person might use a real whiteboard. The model then uses these images to continue reasoning and solve the problem.

Image Credit: Project page

What’s more, this method doesn’t require any special training or examples. It relies on tools the models already know, such as Python drawing libraries like Matplotlib and Turtle. So, let’s look at how the method works in practice.
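To make the idea concrete, here is a minimal sketch of the kind of Matplotlib code a model might write for the letter question from the previous section ("a circle with a line coming down on the right side"). The function name, coordinates, and file name are our own illustrative choices, not taken from the paper.

```python
import os

import matplotlib
matplotlib.use("Agg")  # render off-screen, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.patches import Circle


def draw_circle_with_right_stem(path="whiteboard.png"):
    """Draw a circle with a vertical line on its right side, going down."""
    fig, ax = plt.subplots(figsize=(2, 3))
    # The round part of the letter: a unit circle centred at the origin.
    ax.add_patch(Circle((0, 0), radius=1, fill=False, linewidth=4))
    # The stem: touches the circle's right edge (x = 1) and extends downward.
    ax.plot([1, 1], [1, -3], color="black", linewidth=4)
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-3.5, 1.5)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=100)
    plt.close(fig)
    return os.path.abspath(path)
```

Once the shape is drawn, the question becomes a simple recognition task: the saved image shows a "q", which is much harder to get wrong than the verbal description.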

How does WoT work?

In simple terms, the process works like this: the model writes code to create a drawing based on a question, then the drawing is processed, and the model uses it to find the answer.

Here is how it works step by step:

  1. Creating visuals with MLLMs:
    MLLMs typically can't output images directly, but they can produce them indirectly by writing simple Python code with libraries like Matplotlib or Turtle. These tools create minimal, abstract visuals, like simple shapes or symbols, that the model can use to think through the problem.

  2. The process:
    When given a question, the model is told to create a visualization using code before trying to answer. For example, the model gets a prompt like:

    “You write code to create visualizations using the {Matplotlib/Turtle} library in Python, which the user will run and provide as images. Do NOT produce a final answer to the query until considering the visualization.”

    The model then writes code based on the question.

  3. Making the drawing:
    The code the model writes is run to create an image, and that image is given back to the model to help it continue thinking or to produce the final answer.
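The three steps above can be sketched as a short two-turn loop. This is only an illustration under our own assumptions: `ask_model` is a hypothetical callable standing in for whatever multimodal model API you use, it is assumed to return bare Python code on the first turn, and that code is assumed to save its figure as `whiteboard.png`. The prompt text is the one quoted above, with Matplotlib filled in.

```python
import os
import subprocess
import sys
import tempfile

SYSTEM_PROMPT = (
    "You write code to create visualizations using the Matplotlib library in "
    "Python, which the user will run and provide as images. Do NOT produce a "
    "final answer to the query until considering the visualization."
)


def run_drawing_code(code, image_name="whiteboard.png", timeout=30):
    """Run model-written code in a scratch directory and return the image bytes.

    Assumption: the code saves its figure as `image_name` in the working directory.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "draw.py")
        with open(script, "w", encoding="utf-8") as f:
            f.write(code)
        result = subprocess.run(
            [sys.executable, script],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            raise RuntimeError(f"drawing code failed:\n{result.stderr}")
        image_path = os.path.join(workdir, image_name)
        if not os.path.isfile(image_path):
            raise FileNotFoundError(f"code did not produce {image_name}")
        with open(image_path, "rb") as f:
            return f.read()


def whiteboard_of_thought(question, ask_model):
    """Two-turn loop; `ask_model(system, text, image)` is a hypothetical wrapper."""
    # Turn 1: the model replies with drawing code instead of a final answer.
    code = ask_model(SYSTEM_PROMPT, question, None)
    # The code runs outside the model and produces the "whiteboard" image.
    image = run_drawing_code(code)
    # Turn 2: the image goes back to the model, which now gives its answer.
    return ask_model(SYSTEM_PROMPT, question, image)
```

Because the code comes from a model, it makes sense to run it in a separate process with a timeout, as here, and ideally inside a proper sandbox.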

Another option is to use text-to-image models, which can create more diverse visuals, but they currently struggle with the precise drawings that visual reasoning tasks need. As they improve, they could be integrated into the WoT approach.

Now, it’s interesting to analyze how giving models the ability to "draw" their thinking affects their performance.

Is WoT prompting really good? 

Let’s look at the actual results of the WoT approach across different tasks. The researchers ran all experiments in the zero-shot setting and compared the results to two baselines: direct prompting and zero-shot chain-of-thought (CoT).

  • ASCII understanding tasks:

    • These tasks require recognizing shapes made from text characters. WoT outperformed the text-only baselines by letting the model turn the characters into a simple visualization first. For tasks like recognizing ASCII digits, words, and kanji characters, WoT achieved up to 73.8% accuracy, while CoT scored much lower, with accuracy of around 1.1%-27.2% on the same tasks.

Image Credit: Project page

The researchers also analyzed the errors on the MNIST task (MNIST is a collection of handwritten digits). They found that most errors stem from the model's visual perception rather than from image creation: even when shown the actual MNIST images, the model reached 80.8% accuracy.

  • Spatial navigation:

    • In spatial tasks like map navigation, the WoT approach improved performance on non-grid structures (hexagons, circles) to 61% accuracy, compared to 8% with CoT. However, in simpler grid-based structures, traditional methods still performed better.

Image Credit: Project page

  • General performance:

    • In specific settings such as calligrams (poems whose text is arranged to form a picture) and video game art, where CoT failed (with 0% accuracy in some cases), WoT achieved up to 92% accuracy:

      • For the calligram, WoT correctly interpreted the poem by first creating a visual representation of it.

      • In video game art, text descriptions alone make it difficult for AI models to understand a player’s visual creations, but WoT addresses this by generating the final visual result, which is easier to evaluate.
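For the ASCII tasks in particular, the "drawing" can be very simple. Below is a minimal sketch of one possible way to turn ASCII art into an image: every non-blank character becomes a dark pixel. The helper names are our own, and this is not necessarily the code the models in the paper produced.

```python
def ascii_to_grid(art, blank=" "):
    """Map each character to 1 (ink) or 0 (blank), padding rows to equal width."""
    lines = art.strip("\n").splitlines()
    width = max(len(line) for line in lines)
    return [
        [0 if ch == blank else 1 for ch in line.ljust(width, blank)]
        for line in lines
    ]


def save_grid_image(grid, path="whiteboard.png"):
    """Save the grid as a black-on-white image."""
    # Imported here so that ascii_to_grid works even without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render off-screen
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.imshow(grid, cmap="gray_r", vmin=0, vmax=1, interpolation="nearest")
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    return path


# An ASCII "7": hard to read as a character stream, easy to read as a picture.
SEVEN = """
#####
    #
   #
  #
  #
"""
grid = ascii_to_grid(SEVEN)
```

Calling `save_grid_image(grid)` writes the picture that would be handed back to the model in place of the raw characters.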

Advantages of WoT

Here is a brief summary of WoT’s key benefits, which make it a useful and even revolutionary technique:

  • Unlocks visual reasoning: WoT enables AI models to solve visual tasks by creating and reasoning with images. This is especially helpful for tasks like understanding shapes, diagrams, and spatial navigation.

  • Simple and flexible: The method uses existing tools, like Python's Matplotlib or Turtle libraries, to create visuals without needing specialized training or additional modules. This makes it easy to implement.

  • Improves performance on complex tasks: WoT enhances the model's performance on tasks requiring visual or spatial reasoning, where text alone isn't enough, achieving better results than text-only prompting.

  • Error analysis and adjustments: WoT allows easier identification of errors, such as issues in visualizing or interpreting an image. This makes it easier to improve the model's performance over time.

  • Scalability: As models and computer vision systems improve, WoT's effectiveness will likely grow, making it a powerful tool for future AI developments.

Not without limitations

Despite its many advantages, the WoT approach has several limitations, most of which come from the current limitations of the models themselves:

  • Reliance on accurate vision systems: WoT relies on the model's visual interpretation accuracy. Many errors arise from visual perception issues, and despite improvements in computer vision, limitations still remain.

  • Struggles with complex visuals: WoT handles simple visuals well but has trouble with complex figures like geometric shapes or intricate diagrams, as the models aren't advanced enough to fully understand detailed visuals.

  • Code execution issues: Sometimes, errors occur during the process of generating visuals, such as coding mistakes or failures in producing the correct image. This limits the effectiveness of WoT in certain cases.

  • Limited to current model capabilities: WoT relies on the existing abilities of MLLMs to write code and interpret images. If these underlying abilities are not well-developed, WoT's performance will also be limited.

Conclusion

Whiteboard-of-Thought represents a shift towards more diverse prompting of MLLMs. Because it depends on the capabilities of computer vision and MLLMs, which continue to improve rapidly, WoT has not yet shown its full potential. Overall, WoT demonstrates that visual thinking is possible and helps solve a wider range of tasks more accurately. It's amazing how WoT gives AI models the ability to 'think' visually by drawing, so they can solve problems when words are not enough. It will be very interesting to see how this method performs on the ARC (Abstraction and Reasoning Corpus) benchmark.

Bonus: Resources

Thank you for reading 🩶
