Vision Language Models (VLMs) represent a significant advancement in AI, bridging the gap between visual perception and natural language understanding. These models are designed to understand and generate language in the context of visual inputs, such as images or videos.
VLMs are used in various applications, including image captioning, visual question answering, and image generation from text descriptions. They are also used in tasks such as object detection and scene understanding, where both visual and textual context are important.
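To get a hands-on feel for these applications, a quick way to experiment is with an off-the-shelf pipeline. The sketch below assumes the Hugging Face transformers library and publicly available checkpoints; the model names are examples and "photo.jpg" is a placeholder path, not something prescribed by any of the papers below:

```python
# A minimal sketch of trying off-the-shelf VLMs with the Hugging Face
# transformers library; model names and the image path are placeholders.
from transformers import pipeline

# Image captioning: the model generates a textual description of the image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg"))  # e.g. [{'generated_text': 'a dog lying on a couch'}]

# Visual question answering: the model answers a question grounded in the image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="What animal is in the picture?"))
```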
Here is a list of research papers that will help you better understand how VLMs work:
"An Introduction to Vision-Language Modeling" by Meta covers the definition, functions, and training methods of VLMs, as well as approaches for their evaluation. It describes existing families of VLMs, such as CLIP, FLAVA, MaskVLM, generative-based VLMs, and more. Additionally, it explores the extension of VLMs to video content. → Read more
"An image is worth 16x16 words: Transformers for image recognition at scale" demonstrates that a pure transformer, applied to image patches, excels at image classification. It shows how the Vision Transformer (ViT), when pre-trained on large datasets and then transferred to benchmarks like ImageNet, CIFAR-100, and VTAB, matches or exceeds the best CNNs while using substantially fewer computational resources to train. → Read more
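To make the "16x16 words" idea concrete, here is a minimal sketch, assuming PyTorch, of how an image can be turned into a sequence of patch tokens for a Transformer encoder; the sizes mirror a common ViT-Base configuration, but the code is illustrative rather than the paper's exact implementation:

```python
# A minimal sketch (PyTorch assumed) of ViT-style patch embedding:
# split an image into fixed-size patches and linearly embed each one, which is
# equivalent to a strided convolution with kernel = stride = patch size.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, images):                      # (B, 3, 224, 224)
        x = self.proj(images)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)            # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)              # prepend the [CLS] token
        return x + self.pos_embed                   # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), ready for a Transformer encoder
```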
"Learning Transferable Visual Models From Natural Language Supervision" explains how learning from raw text about images offers a broader source of supervision. This research shows that training a model to match captions with images is effective and can be done at large scale, using 400 million image-text pairs collected from the internet. → Read more
This paper also introduces OpenAI's CLIP approach, so it is worth exploring this post as well: → CLIP: Connecting text and images
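The core of this approach is a symmetric contrastive objective over a batch of image-caption pairs. Below is a minimal sketch, assuming PyTorch and any pair of image/text encoders that output embeddings of the same width; it illustrates the idea rather than reproducing CLIP's exact implementation:

```python
# A minimal sketch (PyTorch assumed) of CLIP-style contrastive training:
# matched image-caption pairs are pulled together, and all other pairings in
# the batch are pushed apart via a symmetric cross-entropy over similarities.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize embeddings so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the true image-text pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: classify the right caption per image and vice versa.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Usage with any image/text encoders that produce same-width embeddings:
loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```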
"ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" introduces the Vision-and-Language Transformer (ViLT), a minimal Vision-and-Language Pre-training (VLP) model that simplifies visual input processing to a convolution-free approach, similar to textual input processing. → Read more
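A rough sketch of that convolution-free design, assuming PyTorch: image patches are flattened and linearly projected, then concatenated with text embeddings and processed by a single shared Transformer. The dimensions and layer counts are arbitrary toy values, not ViLT's actual configuration:

```python
# A minimal sketch (PyTorch assumed) of ViLT's convolution-free design:
# text tokens and linearly projected image patches share one Transformer,
# with no CNN backbone or region detector in front of the image.
import torch
import torch.nn as nn

class TinyViLT(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=16 * 16 * 3, dim=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_embed = nn.Linear(patch_dim, dim)          # linear projection only
        self.modality_embed = nn.Embedding(2, dim)            # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, patches):
        t = self.text_embed(token_ids) + self.modality_embed.weight[0]
        v = self.patch_embed(patches) + self.modality_embed.weight[1]
        return self.encoder(torch.cat([t, v], dim=1))         # joint multimodal encoding

out = TinyViLT()(torch.randint(0, 30522, (2, 12)), torch.randn(2, 196, 16 * 16 * 3))
```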
"Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" explains the ALIGN method, which uses a noisy dataset of over one billion image and alt-text pairs. Its simple dual-encoder architecture aligns visual and language representations with a contrastive loss. → Read more
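Once such a dual encoder is trained, cross-modal retrieval reduces to comparing independently computed embeddings. A minimal sketch, assuming PyTorch and precomputed image/text embeddings of the same width (the sizes are illustrative):

```python
# A minimal sketch (PyTorch assumed) of using a dual encoder such as ALIGN for
# retrieval: images and texts are embedded independently, and cosine similarity
# between the two sets ranks candidates.
import torch
import torch.nn.functional as F

def retrieve(image_embeddings, text_embeddings, top_k=5):
    # Independently computed embeddings; no cross-attention between modalities.
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    similarity = txt @ img.t()                      # (num_texts, num_images)
    return similarity.topk(top_k, dim=-1).indices   # best images for each text query

# e.g. 4 text queries against a gallery of 1,000 image embeddings
indices = retrieve(torch.randn(1000, 640), torch.randn(4, 640))
```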
"BEiT: BERT Pre-Training of Image Transformers" introduces BEiT, a self-supervised vision representation model that uses a masked image modeling task to pretrain vision Transformers. → Read more
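The masked image modeling task can be sketched as follows, assuming PyTorch: a fraction of patch embeddings is replaced with a learned mask embedding, and the model predicts the discrete visual token of each masked patch. In BEiT those tokens come from a pretrained image tokenizer; here they are random stand-ins, and the sizes are toy values:

```python
# A minimal sketch (PyTorch assumed) of BEiT-style masked image modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedImageModel(nn.Module):
    def __init__(self, dim=256, codebook_size=8192, num_patches=196):
        super().__init__()
        self.mask_embed = nn.Parameter(torch.zeros(dim))      # learned [MASK] embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)             # predicts the visual token id

    def forward(self, patch_embeddings, mask):                # mask: (B, N) boolean
        x = torch.where(mask.unsqueeze(-1), self.mask_embed, patch_embeddings)
        return self.head(self.encoder(x))                     # (B, N, codebook_size)

patches, mask = torch.randn(2, 196, 256), torch.rand(2, 196) < 0.4
visual_tokens = torch.randint(0, 8192, (2, 196))              # ids from an image tokenizer (stand-ins here)
logits = MaskedImageModel()(patches, mask)
loss = F.cross_entropy(logits[mask], visual_tokens[mask])     # loss only on masked patches
```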
"Flamingo: a Visual Language Model for Few-Shot Learning". Flamingo's architectural innovations include bridging pretrained vision-only and language-only models, handling sequences of arbitrarily interleaved visual and textual data, and seamlessly ingesting images or videos as inputs. → Read more
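One of those bridging mechanisms can be sketched as a gated cross-attention layer placed in front of a frozen language-model block, assuming PyTorch; the zero-initialised gate means training starts from the unchanged language model. This is a simplified illustration, not Flamingo's actual code:

```python
# A minimal sketch (PyTorch assumed) of Flamingo's bridging idea: text tokens
# attend to visual features from a frozen vision encoder through a gated
# cross-attention layer added next to a frozen language-model block.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh gate, initialised to zero

    def forward(self, text_hidden, visual_features):
        attended, _ = self.cross_attn(query=text_hidden,
                                      key=visual_features,
                                      value=visual_features)
        # At initialisation tanh(0) = 0, so the frozen LM's behaviour is preserved.
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
fused = block(torch.randn(2, 32, 512), torch.randn(2, 64, 512))  # (2, 32, 512)
```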
"BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" proposes a new VLP framework that effectively handles both vision-language understanding and generation tasks. BLIP uses a captioner to generate synthetic captions and a filter to remove noisy ones, making better use of noisy web data. → Read more
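The captioner-and-filter (CapFilt) loop can be summarised in a few lines of illustrative Python; captioner, filter_score, and threshold are placeholders for the trained modules and a chosen cutoff, not BLIP's actual interface:

```python
# A minimal sketch of BLIP's CapFilt idea: a captioner proposes synthetic
# captions for web images, and a filter keeps only pairs whose image-text
# match score is high enough. All names here are illustrative placeholders.
def bootstrap_captions(web_pairs, captioner, filter_score, threshold=0.5):
    clean_pairs = []
    for image, web_text in web_pairs:
        candidates = [web_text, captioner(image)]        # original alt-text + synthetic caption
        for text in candidates:
            if filter_score(image, text) >= threshold:   # keep only well-matched pairs
                clean_pairs.append((image, text))
    return clean_pairs                                   # used to pre-train the final model
```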
"Language Is Not All You Need: Aligning Perception with Language Models". In this research paper, Microsoft introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) trained on web-scale multimodal data. Explore how it excels in zero-shot and few-shot learning, handling tasks such as language understanding and generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and image recognition with text instructions. → Read more
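A rough sketch, assuming PyTorch, of how an interleaved multimodal input sequence can be assembled for such an MLLM: image features are projected into the language model's embedding space and spliced between text embeddings with boundary markers. All names and shapes here are illustrative, not KOSMOS-1's actual interface:

```python
# A minimal sketch of assembling an interleaved text-and-image token stream
# for a multimodal language model; everything here is an illustrative stand-in.
import torch
import torch.nn as nn

def build_interleaved_sequence(segments, text_embed, image_proj, boundary_embed):
    """segments: list of ('text', LongTensor of ids) or ('image', FloatTensor of features)."""
    parts = []
    for kind, data in segments:
        if kind == "text":
            parts.append(text_embed(data))                 # (len, dim) text embeddings
        else:
            parts.append(boundary_embed[0:1])              # <image> marker
            parts.append(image_proj(data))                 # projected vision features
            parts.append(boundary_embed[1:2])              # </image> marker
    return torch.cat(parts, dim=0)                         # one token stream for the decoder

dim = 512
seq = build_interleaved_sequence(
    [("text", torch.randint(0, 1000, (5,))),
     ("image", torch.randn(16, 768)),
     ("text", torch.randint(0, 1000, (7,)))],
    text_embed=nn.Embedding(1000, dim),
    image_proj=nn.Linear(768, dim),
    boundary_embed=nn.Parameter(torch.zeros(2, dim)),
)                                                          # (30, 512)
```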
"DeepSeek-VL: Towards Real-World Vision-Language Understanding". DeepSeek-VL integrates extensive pretraining, curated data, and high-resolution processing capabilities to achieve high performance across various applications. Researchers show how this approach addresses common limitations in other multimodal models. → Read more
"Chameleon: Mixed-Modal Early-Fusion Foundation Models" presents a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any sequence. Chameleon introduces a stable training approach, an alignment recipe, and architectural parameterization tailored for mixed-modal settings. → Read more
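The early-fusion idea can be sketched as follows, assuming PyTorch: an image tokenizer maps images to discrete tokens that share one vocabulary with text tokens, and a single Transformer models the combined stream. The tokenizer and sizes are stand-ins, and the causal attention mask is omitted for brevity:

```python
# A minimal sketch of early fusion: image tokens get their own id range in a
# shared vocabulary and are modeled in the same stream as text tokens.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192

def to_shared_ids(text_ids, image_token_ids):
    # Offset image token ids so text and image share one vocabulary.
    return torch.cat([text_ids, image_token_ids + TEXT_VOCAB], dim=-1)

class MixedModalLM(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, TEXT_VOCAB + IMAGE_VOCAB)  # predicts text or image tokens

    def forward(self, ids):
        # Next-token logits over both modalities (causal mask omitted for brevity).
        return self.head(self.blocks(self.embed(ids)))

ids = to_shared_ids(torch.randint(0, TEXT_VOCAB, (1, 10)), torch.randint(0, IMAGE_VOCAB, (1, 64)))
logits = MixedModalLM()(ids)   # (1, 74, TEXT_VOCAB + IMAGE_VOCAB)
```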
