FOD#51: No AGI without Computer Vision

CV veterans are ready to enhance AI reasoning with spatial intelligence, plus groundbreaking papers and other relevant ML news

Next Week in Turing Post:

  • Wednesday: Computer Vision History Series: a new episode!

  • Friday: AI Unicorn: Moonshot AI

If you like Turing Post, consider becoming a paid subscriber. That way you help us keep going and bring our historical series to a wider audience→

We recently started the computer vision (CV) history series, believing that the next big breakthroughs in the pursuit of Artificial General Intelligence (AGI) critically depend on advancements in CV, a field spearheaded by pioneers like Stanford’s Professor Fei-Fei Li. And Professor Li didn’t make us wait long. Last week, Li – known for developing ImageNet, which has been foundational to spatial AI – launched a venture (already backed with funding from a16z) aimed at enhancing AI's reasoning through spatial intelligence. This approach lets AI comprehend three-dimensional spaces and dynamics, which is vital for complex tasks in diverse environments.

Fei-Fei Li wants to bridge gaps in AI's environmental interactions, similar to Yann LeCun’s efforts with his JEPA family. I-JEPA, Meta's advanced image processing model, leverages self-supervised learning to excel in tasks like object detection and image classification, without needing labeled datasets. Similarly, V-JEPA revolutionizes video analysis by predicting video sequence gaps and supporting applications in automated video editing, surveillance, and educational tools. LeCun always insists that despite advancements in natural language processing (NLP) with models like GPT, visual perception remains crucial for AI's interaction with the world. Having that “in mind,” an AI will be able to plan and reason based on visual inputs. With spatial intelligence, Fei-Fei Li plans to enhance AI's ability to emulate human cognitive skills in perceiving and engaging with the physical world.

The field's growth, driven by deep learning and convolutional neural networks, has made it possible for AI to process visual information akin to human sight, setting the stage for future breakthroughs that could seamlessly integrate AI into our daily life.

The rhetoric that comes from academics differs drastically from that of Sam Altman, who in a recent interview with another Stanford professor stated that it doesn’t matter to him whether the annual expenditure is $5 billion or $50 billion; his focus is on creating AGI. What AGI (or Superintelligence, which OpenAI recently adopted as its main term and goal) entails is never described. So far, it seems to involve the rollout of ever more sophisticated language models such as GPT-5 and GPT-6. For sure, both Altman and the GPTs are phenomenal at generating text, but as the push for spatial intelligence reminds us, human cognitive prowess isn't just about mastering language – it's about understanding the whole scene.

The AI Quality conference

Our friends from the MLOps Community are hosting a conference, and it’s a must-attend. First: the quality of speakers and content. Second: the vibe. You will learn, make important contacts, and enjoy your time.

As many people say: “The field is moving so fast, it’s hard to tell what is true vs. false, what is good practice vs. outdated.” The AI Quality Conference, hosted on June 25th in San Francisco, aims to spotlight common problems, answer questions, and outline solutions so you and your team can be more successful in your AI endeavors. Speakers include practitioners from OpenAI, Anthropic, LlamaIndex, W&B, Reddit, and others! →agenda


News from The Usual Suspects ©

Microsoft Expands Its AI Safety Roster from 350 to 400 personnel to enhance trust in AI-generated content. This initiative includes deploying 30 responsible AI features and aligns with the National Institute of Standards and Technology's guidelines →read more

Cohere

  • Joins MongoDB’s Enterprise AI Program: Cohere has become a part of MongoDB’s AI Applications Program, aiming to streamline the deployment of generative AI across enterprise platforms. The collaboration focuses on enhancing productivity while ensuring data privacy and security across various deployment environments →read more

  • Publishes a New Study: “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.” The researchers discovered that using a Panel of LLM evaluators (PoLL) composed of diverse smaller models to judge output quality is more efficient and accurate than using a single large model like GPT-4. Across multiple datasets, a PoLL not only reduced costs but also aligned better with human judgment, particularly by reducing intra-model bias. The method was over 7x cheaper and provided a broader perspective by integrating varied model assessments →read the paper
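The core aggregation idea behind a panel of judges can be sketched in a few lines. This is a minimal illustration, not the paper's exact scoring protocol: the `judge` callables here are toy stand-ins for API calls to distinct small judge models, and the panel's verdict is a simple majority vote.

```python
from collections import Counter

def poll_verdict(judges, question, answer):
    """Aggregate yes/no verdicts from a panel of judge models by majority vote.

    `judges` is a list of callables; each stands in for a small LLM judge
    that returns "correct" or "incorrect" for the candidate answer.
    """
    votes = Counter(judge(question, answer) for judge in judges)
    verdict, _count = votes.most_common(1)[0]
    return verdict

# Toy stand-in judges (real ones would be calls to distinct small models).
judge_a = lambda q, a: "correct"
judge_b = lambda q, a: "correct"
judge_c = lambda q, a: "incorrect"

print(poll_verdict([judge_a, judge_b, judge_c], "2+2?", "4"))  # majority says "correct"
```

Because no single model dominates the vote, one judge's idiosyncratic bias (e.g. favoring its own outputs) is diluted by the rest of the panel.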

JPMorgan Taps AI for Thematic Investment with IndexGPT, an AI-driven tool that utilizes OpenAI's GPT-4 for creating thematic investment baskets. This innovation reflects Wall Street's continued foray into AI-enhanced financial solutions, aimed primarily at institutional clients →explore details

Alibaba Unveils Qwen1.5-110B, marking its entry into the 100B+ parameter model echelon. The model boasts multilingual support, efficient serving, and a competitive edge against current SOTA models, promising enhanced scalability and performance →discover more

Additional reading: One Year of Ranking Chinese LLMs by ChinAI  

AI21's Enterprise Move with Jamba-Instruct: AI21 has rolled out Jamba-Instruct, an enterprise-optimized version of its Jamba model, now available for commercial use. The model stands out in tasks requiring extensive context and promises reliable performance for enterprise applications →read announcement

OpenAI Partners with Stack Overflow to Boost Developer Tools: In a strategic move, OpenAI teams up with Stack Overflow to integrate OverflowAPI into its services. This partnership will enrich OpenAI’s models with Stack Overflow’s trusted content, enhancing both developer productivity and AI accuracy. The planned OverflowAI project is set to launch in 2024, marking a significant advancement in developer resources →read more

DrEureka (Nvidia): an LLM-powered agent that automates reward design and domain-randomization tuning for transferring robot skills from simulation to the real world.

The freshest research papers, categorized for your convenience

Our top-3:

KAN: Kolmogorov–Arnold Networks

Researchers from the Massachusetts Institute of Technology, California Institute of Technology, Northeastern University, and The NSF Institute for Artificial Intelligence and Fundamental Interactions developed Kolmogorov-Arnold Networks (KANs), a new neural network model that replaces fixed activation functions on nodes with learnable activation functions on edges. Unlike traditional Multi-Layer Perceptrons (MLPs), KANs utilize learnable univariate functions parametrized as splines instead of linear weights, leading to superior performance in terms of accuracy and interpretability. This model demonstrates potential for scientific collaborations in discovering mathematical and physical laws. The ML community is very excited and looking forward to new developments in this area →read the paper

Better & Faster Large Language Models via Multi-token Prediction

Researchers from FAIR at Meta suggest training LLMs to predict multiple tokens at once to improve sample efficiency and speed. Multi-token prediction doesn't increase training time but enhances performance in code and natural language tasks. Their method also boosts inference speeds up to 3x and helps models perform better on challenging generative tasks →read the paper
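The data-side of the idea is easy to show: instead of one (context, next-token) pair per position, each position yields targets for the next n tokens, which n output heads predict in parallel from the same shared trunk. A minimal sketch of target construction (the heads and loss live in the model and are omitted here):

```python
def multi_token_targets(tokens, n_future=4):
    """Build training targets for multi-token prediction.

    For each position t, rather than only token t+1 (standard next-token LM),
    the model's n heads are trained to predict tokens t+1 ... t+n from the
    same shared hidden state at position t.
    """
    targets = []
    for t in range(len(tokens) - n_future):
        targets.append((tokens[t], tokens[t + 1 : t + 1 + n_future]))
    return targets

toks = ["the", "cat", "sat", "on", "the", "mat"]
for context_tok, future in multi_token_targets(toks, n_future=2):
    print(context_tok, "->", future)  # e.g. position 0: "the" -> ["cat", "sat"]
```

At inference, the extra heads can propose several tokens ahead for the model to verify in one pass, which is where the reported up-to-3x speedup comes from.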

A Careful Examination of LLM Performance on Grade School Arithmetic

Researchers from Scale AI critically examined LLMs' arithmetic capabilities by developing GSM1k, a new benchmark mirroring GSM8k, to rigorously separate true reasoning from potential dataset-contamination effects. Analyzing various LLMs, they observed performance drops of up to 13% on GSM1k compared to GSM8k, highlighting significant overfitting, particularly in models like Phi and Mistral. Further analysis indicated a correlation between models' exposure to GSM8k and their performance differential on the two benchmarks, suggesting partial memorization issues, although data contamination was not the sole cause of overfitting →read the paper


  • WildChat: 1M ChatGPT Interaction Logs in the Wild: Compiles a dataset of 1 million user interactions with ChatGPT, providing insights for studying conversational AI and toxicity →read the paper

  • OpenStreetView-5M: The Many Roads to Global Visual Geolocation: Introduces a dataset with over 5.1 million georeferenced street view images for enhancing global visual geolocation capabilities in computer vision models →read the paper

Improvements in Model Architecture and Efficiency

  • Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting: Enhances the inference speed of LLMs using a self-speculative decoding framework that utilizes internal sub-networks without a separate draft model, significantly reducing latency →read the paper

  • Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge: Develops a speculative decoding algorithm integrating sequential knowledge to improve efficiency and accuracy in LLMs →read the paper
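Both papers above build on the draft-then-verify skeleton of speculative decoding. Here is a deliberately simplified, greedy sketch of one step of that skeleton (real systems, including Kangaroo's self-drafting variant, use rejection sampling over token distributions rather than exact-match acceptance; the `draft_next`/`target_next` callables are toy stand-ins):

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One greedy draft-and-verify step of speculative decoding.

    A cheap draft model proposes k tokens; the target model checks them and we
    keep the longest agreeing prefix, plus the target's own token at the first
    disagreement, so output quality matches the target model alone.
    """
    # Draft phase: the cheap model runs k quick steps.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: the target model accepts or corrects the draft.
    accepted, ctx = [], list(context)
    for tok in draft:
        target_tok = target_next(ctx)
        if target_tok != tok:
            accepted.append(target_tok)  # correction from the target model
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy models over a fixed sequence: the draft deviates at position 2.
SEQ = ["a", "b", "c", "d", "e"]
target_next = lambda ctx: SEQ[len(ctx)]
draft_next = lambda ctx: SEQ[len(ctx)] if len(ctx) != 2 else "x"

print(speculative_step(draft_next, target_next, [], k=4))  # → ['a', 'b', 'c']
```

The latency win comes from the verify phase being parallelizable in one forward pass of the target model; Kangaroo's twist is that the "draft model" is an early-exiting sub-network of the target itself, so no separate model needs to be trained or served.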

Specialized Applications of LLMs

  • Capabilities of Gemini Models in Medicine: Discusses Med-Gemini, a multimodal model designed for medical applications, integrating web search capabilities and customizing for new medical modalities →read the paper

  • Extending Llama-3’s Context Ten-Fold Overnight: Enhances Llama-3-8B-Instruct's context length for better performance on long-context tasks using fine-tuning techniques →read the paper

Novel Methods and Techniques for Model Training and Alignment

  • Iterative Reasoning Preference Optimization: Improves reasoning in LLMs by iteratively training on preference pairs, enhancing accuracy on complex tasks →read the paper

  • Self-Play Preference Optimization for Language Model Alignment: Introduces a self-play method for aligning LLMs, using a two-player game framework to find the Nash equilibrium and improve response quality →read the paper

  • Is Bigger Edit Batch Size Always Better?: Examines the impact of edit batch sizes on LLM performance, advocating for smaller, sequential batches for effective scaling of model editing methods →read the paper

  • FLAME: Factuality-Aware Alignment for Large Language Models: Presents a new alignment method focusing on enhancing the factual accuracy of LLMs through specialized fine-tuning and reinforcement learning →read the paper

Integration and Optimization of LLMs Across Multiple Domains

  • Octopus v4: Graph of Language Models: Utilizes functional tokens to route queries to specialized LLMs, optimizing each for specific tasks and enhancing performance across domains →read the paper

  • PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models: Develops an open-source model for evaluating LLMs, closely aligning with human judgments and improving assessment accuracy →read the paper

  • LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report: Shows the efficacy of LoRA fine-tuning across multiple LLMs, demonstrating superior task-specific performance →read the paper

Theoretical and Survey-Based Insights

  • A primer on the inner workings of transformer-based language models: Offers a comprehensive introduction to the mechanisms of Transformer-based LLMs →read the paper

  • RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing: Surveys Retrieval-Augmented Language Models, detailing their impact and development in enhancing NLP tasks →read the paper

We are also reading:

If you decide to become a Premium subscriber, you can expense this subscription through your company. Please also forward this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. 🤍

Thank you for reading! We appreciate you.

Leave a review!
