Vision Language Models (VLMs) represent a significant advancement in AI, bridging the gap between visual perception and natural language understanding. These models are designed to understand and generate language in the context of visual inputs, such as images or videos.
VLMs are used in various applications, including image captioning, visual question answering, and image generation from text descriptions. They are also used in tasks such as object detection and scene understanding, where both visual and textual context are important.
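To get a hands-on feel for these applications, a quick way to experiment is with an off-the-shelf pipeline. The sketch below assumes the Hugging Face transformers library and publicly available checkpoints; the model names are examples and "photo.jpg" is a placeholder path, not something prescribed by any of the papers below:

```python
# A minimal sketch of trying off-the-shelf VLMs with the Hugging Face
# transformers library; model names and the image path are placeholders.
from transformers import pipeline

# Image captioning: the model generates a textual description of the image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg"))  # e.g. [{'generated_text': 'a dog lying on a couch'}]

# Visual question answering: the model answers a question grounded in the image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="What animal is in the picture?"))
```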
Here is a list of research papers that will help you better understand how VLMs work:
"An Introduction to Vision-Language Modeling" by Meta covers the definition, functions, and training methods of VLMs, as well as approaches for their evaluation. It describes existing families of VLMs, such as CLIP, FLAVA, MaskVLM, generative-based VLMs, and more. Additionally, it explores the extension of VLMs to video content. → Read more
"An image is worth 16x16 words: Transformers for image recognition at scale" demonstrates that a pure transformer, applied to image patches, excels at image classification. It shows how the Vision Transformer (ViT), when pre-trained on large datasets and then transferred to benchmarks like ImageNet, CIFAR-100, and VTAB, matches or exceeds the best CNNs while using substantially fewer computational resources to train. → Read more
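To make the "16x16 words" idea concrete, here is a minimal sketch, assuming PyTorch, of how an image can be turned into a sequence of patch tokens for a Transformer encoder; the sizes mirror a common ViT-Base configuration, but the code is illustrative rather than the paper's exact implementation:

```python
# A minimal sketch (PyTorch assumed) of ViT-style patch embedding:
# split an image into fixed-size patches and linearly embed each one, which is
# equivalent to a strided convolution with kernel = stride = patch size.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, images):                      # (B, 3, 224, 224)
        x = self.proj(images)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)            # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)              # prepend the [CLS] token
        return x + self.pos_embed                   # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), ready for a Transformer encoder
```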
"Learning Transferable Visual Models From Natural Language Supervision" explains how learning from raw text about images offers a broader source of supervision. This research shows that training a model to match captions with images is effective and can be done at large scale, using 400 million image-text pairs collected from the internet. → Read more
This paper also introduces OpenAI's CLIP approach, so it is worth exploring this post as well: → CLIP: Connecting text and images
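The core of this approach is a symmetric contrastive objective over a batch of image-caption pairs. Below is a minimal sketch, assuming PyTorch and any pair of image/text encoders that output embeddings of the same width; it illustrates the idea rather than reproducing CLIP's exact implementation:

```python
# A minimal sketch (PyTorch assumed) of CLIP-style contrastive training:
# matched image-caption pairs are pulled together, and all other pairings in
# the batch are pushed apart via a symmetric cross-entropy over similarities.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize embeddings so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the true image-text pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: classify the right caption per image and vice versa.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Usage with any image/text encoders that produce same-width embeddings:
loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```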
"ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" introduces the Vision-and-Language Transformer (ViLT), a minimal Vision-and-Language Pre-training (VLP) model that simplifies visual input processing to a convolution-free approach, similar to textual input processing. → Read more
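A rough sketch of that convolution-free design, assuming PyTorch: image patches are flattened and linearly projected, then concatenated with text embeddings and processed by a single shared Transformer. The dimensions and layer counts are arbitrary toy values, not ViLT's actual configuration:

```python
# A minimal sketch (PyTorch assumed) of ViLT's convolution-free design:
# text tokens and linearly projected image patches share one Transformer,
# with no CNN backbone or region detector in front of the image.
import torch
import torch.nn as nn

class TinyViLT(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=16 * 16 * 3, dim=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_embed = nn.Linear(patch_dim, dim)          # linear projection only
        self.modality_embed = nn.Embedding(2, dim)            # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, patches):
        t = self.text_embed(token_ids) + self.modality_embed.weight[0]
        v = self.patch_embed(patches) + self.modality_embed.weight[1]
        return self.encoder(torch.cat([t, v], dim=1))         # joint multimodal encoding

out = TinyViLT()(torch.randint(0, 30522, (2, 12)), torch.randn(2, 196, 16 * 16 * 3))
```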
"Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" explains the ALIGN method, which uses a noisy dataset of over one billion image and alt-text pairs. Its simple dual-encoder architecture aligns visual and language representations with a contrastive loss. → Read more
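Once such a dual encoder is trained, cross-modal retrieval reduces to comparing independently computed embeddings. A minimal sketch, assuming PyTorch and precomputed image/text embeddings of the same width (the sizes are illustrative):

```python
# A minimal sketch (PyTorch assumed) of using a dual encoder such as ALIGN for
# retrieval: images and texts are embedded independently, and cosine similarity
# between the two sets ranks candidates.
import torch
import torch.nn.functional as F

def retrieve(image_embeddings, text_embeddings, top_k=5):
    # Independently computed embeddings; no cross-attention between modalities.
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    similarity = txt @ img.t()                      # (num_texts, num_images)
    return similarity.topk(top_k, dim=-1).indices   # best images for each text query

# e.g. 4 text queries against a gallery of 1,000 image embeddings
indices = retrieve(torch.randn(1000, 640), torch.randn(4, 640))
```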
"BEiT: BERT Pre-Training of Image Transformers" introduces BEiT, a self-supervised vision representation model that uses a masked image modeling task to pretrain vision Transformers. → Read more
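The masked image modeling task can be sketched as follows, assuming PyTorch: a fraction of patch embeddings is replaced with a learned mask embedding, and the model predicts the discrete visual token of each masked patch. In BEiT those tokens come from a pretrained image tokenizer; here they are random stand-ins, and the sizes are toy values:

```python
# A minimal sketch (PyTorch assumed) of BEiT-style masked image modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedImageModel(nn.Module):
    def __init__(self, dim=256, codebook_size=8192, num_patches=196):
        super().__init__()
        self.mask_embed = nn.Parameter(torch.zeros(dim))      # learned [MASK] embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)             # predicts the visual token id

    def forward(self, patch_embeddings, mask):                # mask: (B, N) boolean
        x = torch.where(mask.unsqueeze(-1), self.mask_embed, patch_embeddings)
        return self.head(self.encoder(x))                     # (B, N, codebook_size)

patches, mask = torch.randn(2, 196, 256), torch.rand(2, 196) < 0.4
visual_tokens = torch.randint(0, 8192, (2, 196))              # ids from an image tokenizer (stand-ins here)
logits = MaskedImageModel()(patches, mask)
loss = F.cross_entropy(logits[mask], visual_tokens[mask])     # loss only on masked patches
```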
"Flamingo: a Visual Language Model for Few-Shot Learning". Flamingo's architectural innovations include bridging pretrained vision-only and language-only models, handling sequences of arbitrarily interleaved visual and textual data, and seamlessly ingesting images or videos as inputs. → Read more
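One of those bridging mechanisms can be sketched as a gated cross-attention layer placed in front of a frozen language-model block, assuming PyTorch; the zero-initialised gate means training starts from the unchanged language model. This is a simplified illustration, not Flamingo's actual code:

```python
# A minimal sketch (PyTorch assumed) of Flamingo's bridging idea: text tokens
# attend to visual features from a frozen vision encoder through a gated
# cross-attention layer added next to a frozen language-model block.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh gate, initialised to zero

    def forward(self, text_hidden, visual_features):
        attended, _ = self.cross_attn(query=text_hidden,
                                      key=visual_features,
                                      value=visual_features)
        # At initialisation tanh(0) = 0, so the frozen LM's behaviour is preserved.
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
fused = block(torch.randn(2, 32, 512), torch.randn(2, 64, 512))  # (2, 32, 512)
```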
"BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" proposes a new VLP framework that effectively handles both vision-language understanding and generation tasks. BLIP uses a captioner to generate synthetic captions and a filter to remove noisy ones, making better use of noisy web data. → Read more
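The captioner-and-filter (CapFilt) loop can be summarised in a few lines of illustrative Python; captioner, filter_score, and threshold are placeholders for the trained modules and a chosen cutoff, not BLIP's actual interface:

```python
# A minimal sketch of BLIP's CapFilt idea: a captioner proposes synthetic
# captions for web images, and a filter keeps only pairs whose image-text
# match score is high enough. All names here are illustrative placeholders.
def bootstrap_captions(web_pairs, captioner, filter_score, threshold=0.5):
    clean_pairs = []
    for image, web_text in web_pairs:
        candidates = [web_text, captioner(image)]        # original alt-text + synthetic caption
        for text in candidates:
            if filter_score(image, text) >= threshold:   # keep only well-matched pairs
                clean_pairs.append((image, text))
    return clean_pairs                                   # used to pre-train the final model
```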
"Language Is Not All You Need: Aligning Perception with Language Models". In this research paper, Microsoft introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) trained on web-scale multimodal data. Explore how it excels in zero-shot and few-shot learning, handling tasks such as language understanding and generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and image recognition with text instructions. → Read more
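A rough sketch, assuming PyTorch, of how an interleaved multimodal input sequence can be assembled for such an MLLM: image features are projected into the language model's embedding space and spliced between text embeddings with boundary markers. All names and shapes here are illustrative, not KOSMOS-1's actual interface:

```python
# A minimal sketch of assembling an interleaved text-and-image token stream
# for a multimodal language model; everything here is an illustrative stand-in.
import torch
import torch.nn as nn

def build_interleaved_sequence(segments, text_embed, image_proj, boundary_embed):
    """segments: list of ('text', LongTensor of ids) or ('image', FloatTensor of features)."""
    parts = []
    for kind, data in segments:
        if kind == "text":
            parts.append(text_embed(data))                 # (len, dim) text embeddings
        else:
            parts.append(boundary_embed[0:1])              # <image> marker
            parts.append(image_proj(data))                 # projected vision features
            parts.append(boundary_embed[1:2])              # </image> marker
    return torch.cat(parts, dim=0)                         # one token stream for the decoder

dim = 512
seq = build_interleaved_sequence(
    [("text", torch.randint(0, 1000, (5,))),
     ("image", torch.randn(16, 768)),
     ("text", torch.randint(0, 1000, (7,)))],
    text_embed=nn.Embedding(1000, dim),
    image_proj=nn.Linear(768, dim),
    boundary_embed=nn.Parameter(torch.zeros(2, dim)),
)                                                          # (30, 512)
```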
"DeepSeek-VL: Towards Real-World Vision-Language Understanding". DeepSeek-VL integrates extensive pretraining, curated data, and high-resolution processing capabilities to achieve high performance across various applications. Researchers show how this approach addresses common limitations in other multimodal models. → Read more
"Chameleon: Mixed-Modal Early-Fusion Foundation Models" presents a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any sequence. Chameleon introduces a stable training approach, an alignment recipe, and architectural parameterization tailored for mixed-modal settings. → Read more
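The early-fusion idea can be sketched as follows, assuming PyTorch: an image tokenizer maps images to discrete tokens that share one vocabulary with text tokens, and a single Transformer models the combined stream. The tokenizer and sizes are stand-ins, and the causal attention mask is omitted for brevity:

```python
# A minimal sketch of early fusion: image tokens get their own id range in a
# shared vocabulary and are modeled in the same stream as text tokens.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192

def to_shared_ids(text_ids, image_token_ids):
    # Offset image token ids so text and image share one vocabulary.
    return torch.cat([text_ids, image_token_ids + TEXT_VOCAB], dim=-1)

class MixedModalLM(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, TEXT_VOCAB + IMAGE_VOCAB)  # predicts text or image tokens

    def forward(self, ids):
        # Next-token logits over both modalities (causal mask omitted for brevity).
        return self.head(self.blocks(self.embed(ids)))

ids = to_shared_ids(torch.randint(0, TEXT_VOCAB, (1, 10)), torch.randint(0, IMAGE_VOCAB, (1, 64)))
logits = MixedModalLM()(ids)   # (1, 74, TEXT_VOCAB + IMAGE_VOCAB)
```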
