Are multimodal systems the future of AI?

Next Week in Turing Post:

Wednesday, AI 101: What is KAN?
Friday: A new investigation into the world of AI Unicorns – xAI.

The short answer is yes. Our senses are the foundation of our knowledge. We learn by interacting with the world through sight, sound, touch, taste, and smell. This sensory input not only allows us to navigate our environment and survive, but it also plays a crucial role in our intellectual and cognitive development.

Consider the Cambrian explosion, a period of rapid evolutionary diversification. As zoologist Andrew Parker has suggested, the emergence of vision in early animals during this time may have been a major catalyst for that diversification. The ability to see opened up a whole new world of information, leading to advancements in hunting, predator avoidance, and overall survival strategies. Similarly, our senses act as a gateway to learning, enabling us to gather information, make connections, and build upon our existing knowledge. This interplay between sensory experience and learning is not limited to the biological world – AI researchers increasingly recognize the importance of incorporating multi-sensory input into language models.

Enter "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs". This paper highlights the importance of vision-centric approaches in developing Multimodal Large Language Models (MLLMs). While language models have shown strong scaling behaviors, the design choices for vision components are often insufficiently explored. Cambrian-1 addresses this gap by systematically evaluating various visual representations within MLLMs, ranging from language-supervised models like CLIP to self-supervised learning approaches like DINO.

A key innovation introduced in Cambrian-1 is the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with language models while reducing the number of tokens. This allows for more efficient processing of visual information and better performance on tasks requiring detailed image understanding. (For background, see our explainer "What is Spatial Intelligence".)
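To make the token-reduction idea concrete, here is a minimal sketch of a spatially-aware aggregator in plain Python. It is not the Cambrian-1 implementation: it assumes a single vision feature map, no learned projections, and a small grid of query vectors where each query attends only to its own local window. The names `aggregate`, `features`, and `queries` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, queries):
    """Compress an H x W grid of feature vectors into G x G tokens.

    features: H x W nested list of C-dimensional vectors (assumed input).
    queries:  G x G nested list of C-dimensional query vectors (assumed).
    Each query attends only to its own (H/G) x (W/G) window, so the
    output keeps a coarse spatial layout while using fewer tokens.
    """
    height, width = len(features), len(features[0])
    grid = len(queries)
    assert height % grid == 0 and width % grid == 0
    win_h, win_w = height // grid, width // grid
    channels = len(queries[0][0])
    scale = 1.0 / math.sqrt(channels)

    tokens = []
    for gi in range(grid):
        for gj in range(grid):
            query = queries[gi][gj]
            # Collect the feature vectors inside this query's window.
            window = [
                features[i][j]
                for i in range(gi * win_h, (gi + 1) * win_h)
                for j in range(gj * win_w, (gj + 1) * win_w)
            ]
            # Scaled dot-product attention restricted to the window.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in window]
            weights = softmax(scores)
            token = [sum(w * vec[c] for w, vec in zip(weights, window))
                     for c in range(channels)]
            tokens.append(token)
    return tokens


# Toy input: a 4 x 4 feature map with 2 channels, reduced to 2 x 2 tokens.
features = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
queries = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
tokens = aggregate(features, queries)
print(len(tokens))  # 16 input positions are summarized by 4 tokens
```

With all-zero queries the attention weights are uniform, so each token is simply the mean of its window. In a trained system the queries would be learned, which is what would make the aggregation dynamic.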

Applications of MLLMs extend far beyond simple image captioning or visual question answering. These systems can now engage in complex reasoning about visual scenes, interpret charts and diagrams, and even solve mathematical problems presented visually. In healthcare, for instance, an MLLM could analyze both medical images and patient records, providing more comprehensive and accurate diagnostic insights.

The future of AI lies in creating systems that understand our world in all its complexity. By bridging the gap between language and vision, MLLMs are bringing us closer to AI systems that can truly see, understand, and communicate about the world as humans do.

From the “Cambrian-1” paper: “We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.

Website https://cambrian-mllm.github.io
Code https://github.com/cambrian-mllm/cambrian
Models https://huggingface.co/nyu-visionx/
Data https://huggingface.co/datasets/nyu-visionx/Cambrian-10M
CV-Bench https://huggingface.co/datasets/nyu-visionx/CV-Bench
Evaluation https://github.com/cambrian-mllm/cambrian”

News from The Usual Suspects ©

Microsoft and Skeleton Key
- The company has identified a new AI jailbreak technique, Skeleton Key, which uses a multi-turn strategy to bypass AI model guardrails, allowing harmful content production. This attack affects various generative AI models. Microsoft has implemented mitigations, such as Prompt Shields and Azure AI Content Safety, to detect and block such attacks. Recommendations for developers include input filtering, output filtering, and abuse monitoring. The research was shared with other AI providers for broader industry mitigation efforts.

Hugging Face and new LLM leaderboard

LangChain: LangGraph v0.1 + LangGraph Cloud = scalable and reliable agent deployment
- LangGraph v0.1 offers precise control for building agentic applications, while LangGraph Cloud, now in beta, provides fault-tolerant infrastructure. Companies like Klarna and Replit are already leveraging these tools to enhance AI initiatives. LangGraph Cloud features integrated monitoring and streamlined deployment.
Amazon almost devours Adept
- Amazon is strengthening its AI capabilities by hiring executives from Adept, a startup specializing in automating enterprise workflows. Adept’s co-founder and CEO David Luan and his team will join Amazon’s AGI division, led by Rohit Prasad. Adept will continue operating independently, and Amazon will license some of its technology.
Google DeepMind makes Gemma 2 available
- Gemma 2, Google’s high-performance open model, is now available in 9B and 27B parameter sizes. It delivers strong performance and efficiency, even on a single NVIDIA H100 GPU. Gemma 2 integrates easily with major AI frameworks and supports budget-friendly deployments. Researchers can access Gemma 2 for free on Kaggle or via Colab notebooks, with academic credits available for cloud services.
Imbue on training a 70B parameter model
- Imbue reports that its model outperforms GPT-4 on reasoning tasks. The team detailed the end-to-end infrastructure setup process, shared scripts for health checks, and thanked partners for their support. The write-up emphasizes reproducibility, automated solutions, and a deep understanding of infrastructure, and invites interested readers to join the team.
Anthropic introduces Claude Projects and “Build with Claude” contest
- Pro and Team users can now organize chats and share knowledge effectively. Projects include a 200K context window for comprehensive document integration. Features like custom instructions and Artifacts enhance productivity.
- Anthropic's "Build with Claude" contest invites developers to create innovative projects using the Claude API. The virtual hackathon runs from June 26th to July 10th, 2024. The top three projects will win $10,000 in API credits.
- Also, read Time’s interview with Dario Amodei
Stability AI and the new CEO
- Prem Akkaraju becomes the new CEO. The company also secured a financial bailout from an investor group led by Sean Parker. This recapitalization, necessary due to financial troubles, will likely reduce the stakes of existing investors.
Baidu and upgraded Ernie 4.0
- Baidu unveiled the Ernie 4.0 Turbo model and says Ernie now has 300 million users, as the company aims to maintain its competitive edge in China. Baidu also enhanced its PaddlePaddle AI ecosystem. In response to OpenAI blocking its API in China, Baidu and other domestic AI firms are offering migration services to attract affected users.
Different approaches by the leading AI researchers
- While Ilya Sutskever launches Safe Superintelligence Inc. (SSI) to build – drumroll, please – safe superintelligence, Andrej Karpathy is fun and hands-on with LLM101n: Let's build a Storyteller (GitHub: karpathy/LLM101n). It is a course teaching how to create a Storyteller AI LLM from scratch using Python, C, and CUDA. The practical approach seems to be a better way to understand the benefits and risks of fast-developing AI.

We are watching/reading:

AI scaling myths by AI Snake Oil
Why Your Generative AI Projects Are Failing by Gradient Flow
Interesting Content in AI, Software, Business, and Tech by Devansh
RLHF roundup by Nathan Lambert
An Interview with Marques Brownlee (MKBHD) About Being a YouTube Star by Stratechery

The freshest research papers, categorized for your convenience

Our top

LLM Critics Help Catch LLM Bugs – OpenAI researchers developed CriticGPT, an LLM critic trained via RLHF to find bugs in model-written code. Its critiques were preferred over human critiques in 63% of cases, and it caught more bugs than human reviewers. Combining it with human reviews enhanced overall reliability and reduced errors →read the paper
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization – Meta AI unveiled LLM Compiler, a suite of models for code optimization. Trained on extensive LLVM-IR and assembly code, these models predict and improve compiler optimizations. LLM Compiler is available in multiple model sizes and is aimed at compiler R&D →read the paper
Also, watch a recent interview with Mark Zuckerberg
WARP: On the Benefits of Weight Averaged Rewarded Policies – Google DeepMind's WARP merges LLMs using weight averaging to improve RLHF, addressing common issues like knowledge forgetting and reward hacking. This method balances reward optimization with alignment to human preferences, showing promise in real-world applications →read the paper
Can LLMs Learn by Teaching? A Preliminary Study – Researchers tested whether LLMs can self-improve through teaching, using feedback from student interactions. Results indicate notable enhancements in model accuracy and capabilities, suggesting a viable path for LLMs to learn independently of human data →read the paper
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale – Hugging Face introduced FineWeb, a massive dataset derived from web crawls, optimized for LLM training through advanced filtering and deduplication. The FineWeb-Edu subset specifically boosts performance on educational benchmarks, with all resources publicly available to further LLM research →read the paper
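
For readers new to the weight-averaging idea mentioned in the WARP summary above, here is a minimal sketch of the basic operation: averaging the parameters of several fine-tuned checkpoints element by element. This shows plain uniform averaging only, not the procedure from the paper, which is more involved. The `average_weights` helper and the toy checkpoints are hypothetical, and real checkpoints would hold tensors where this uses flat lists.

```python
def average_weights(checkpoints):
    """Uniformly average parameters across checkpoints.

    checkpoints: list of dicts mapping a parameter name to a flat list
    of floats. All checkpoints are assumed to share the same names and
    shapes (an assumption of this sketch).
    """
    count = len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        # zip(*) lines up the same position across every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(values) / count for values in columns]
    return merged


# Two toy "policies" fine-tuned from the same starting point.
policy_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
policy_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_weights([policy_a, policy_b])
print(merged)
```

Averaging only makes sense when the checkpoints come from the same initialization, since otherwise matching parameter positions do not correspond to matching functions.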

Optimization and Enhancement Techniques

STEP-DPO: Step-Wise Preference Optimization for Long-Chain Reasoning of LLMs – Optimizes each reasoning step individually in mathematical reasoning tasks, improving model accuracy by nearly 3% on MATH, outperforming several closed-source models. Read the paper
Adam-mini: Use Fewer Learning Rates To Gain More – Introduces an optimizer that matches or surpasses AdamW’s performance while reducing memory usage by 45-50%, improving efficiency and throughput in resource-constrained environments. Read the paper
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs – Addresses the imbalance between a heavy retriever and a light reader in Retrieval-Augmented Generation by grouping Wikipedia into long 4K-token retrieval units, which cuts the number of units to search and improves retrieval scores and performance in open-domain question answering. Read the paper
Unlocking Continual Learning Abilities in Language Models – Proposes MIGU, a rehearsal-free and task-label-free method to combat catastrophic forgetting in LMs, enhancing continual finetuning and pretraining performance. Read the paper
Confidence Regulation Neurons in Language Models – Explores how LLMs handle uncertainty in token predictions, identifying "entropy neurons" that regulate output confidence and "token frequency neurons" that adjust token logits based on token frequency. Read the paper

Benchmarks and Evaluation

LiveBench: A Challenging, Contamination-Free LLM Benchmark – Develops a benchmark for LLMs that updates frequently to prevent test set contamination, evaluating various tasks with objective ground-truth scoring. Read the paper
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions – Evaluates LLMs on challenging programming tasks requiring diverse function calls from 139 libraries, highlighting the need for further advancements in code generation. Read the paper
LongIns: A Challenging Long-context Instruction-based Exam for Large Language Models – Assesses LLMs' understanding of long contexts, revealing that top models struggle with long sequences and perform poorly even at 16k context length. Read the paper
OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far? – Introduces a benchmark to evaluate AI models' intelligence across multiple disciplines, highlighting the need for further advancements to achieve superintelligence. Read the paper
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs – Evaluates chart understanding in multimodal LLMs, revealing significant room for improvement in chart comprehension capabilities compared to human performance. Read the paper
Benchmarking Mental State Representations in Language Models – Assesses various LMs to determine their ability to represent mental states, finding that larger models and those fine-tuned with instruction-tuning or RLHF perform better. Read the paper

Data Generation and Enhancement

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data – Enhances LLMs' retrieval and reasoning capabilities with long-context inputs by finetuning on synthetic datasets, mitigating hallucination issues found in other methods. Read the paper
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets – Introduces a data generation pipeline for creating reliable function-calling datasets, resulting in state-of-the-art performance on the Berkeley Function-Calling Benchmark. Read the paper
Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts – Develops a semi-automatic approach for generating information-seeking dialog datasets using LLMs, improving response quality and reducing time and effort. Read the paper

Cultural and Ethical Considerations

How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions – Evaluates whether LLMs align their advice with the cultural values of different countries, highlighting the need for better training to ensure cultural sensitivity. Read the paper
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges – Evaluates the performance and biases of various LLMs as judges, revealing misalignments and biases that need to be addressed for more accurate evaluations. Read the paper

Advanced Techniques and New Models

Symbolic Learning Enables Self-Evolving Agents – Introduces a framework that allows language agents to self-optimize using symbolic networks, enabling autonomous learning and evolution post-deployment. Read the paper
Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model – Demonstrates that training language models to construct cognitive maps enhances their planning abilities, providing insights into developing more advanced AI systems. Read the paper
Segment Any Text: A Universal Approach for Robust, Efficient, and Adaptable Sentence Segmentation – Introduces SAT, a sentence segmentation model robust to missing punctuation, adaptable to new domains, and highly efficient, outperforming current models. Read the paper
Simulating Classroom Education with LLM-Empowered Agents – Presents SimClass, a multi-agent classroom simulation framework that enhances user experience by effectively simulating traditional classroom interactions. Read the paper

Safety and Security

WILDTEAMING at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models – Introduces an automated red-teaming framework to discover and compose unique jailbreak tactics, enhancing LLM robustness against adversarial queries. Read the paper
AUTODETECT: Towards a Unified Framework for Automated Weakness Detection in Large Language Models – Introduces a framework with three LLM-powered agents to systematically uncover weaknesses in LLMs, demonstrating significant improvements in model performance. Read the paper
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool – Introduces MemServe, an LLM serving system integrating optimizations to enhance context caching and disaggregated inference, improving job completion time and response time. Read the paper

FOD#58: Are multimodal systems the future of AI?