FOD#42: Debunking Moravec's Paradox

Is 2024 the year of advanced robotics?

Next Week in Turing Post:

  • Wednesday, Token 1.21: Foundation models and Data Privacy. Your guide

  • Friday: AI Infrastructure Unicorn Series

If you like Turing Post, please consider supporting us. You will also get full access to our most interesting articles and investigations.

Are we on the brink of debunking Moravec’s Paradox? In the 1980s, AI and robotics researcher Hans Moravec highlighted a counterintuitive aspect of AI: tasks requiring high-level reasoning – like chess or Go – are easier for AI to master than basic sensory and motor skills – such as walking or identifying your mom’s face – which humans find instinctive. Adding complexity, these “simpler” skills actually demand much more computational power. This insight sheds light on the complexity of replicating human-like perception and dexterity, outcomes of millions of years of evolution, as opposed to logical reasoning, a more recent development. In today's AI and ML landscape, this paradox underscores the challenges in creating robots and AI systems capable of seamlessly navigating and interacting with the physical world.

However, last week, Bernt Børnich, CEO and founder of 1X, a humanoid robotics company, wrote, “New progress update on the droids dropping in 4 weeks, looks like Moravec's paradox might be debunked, and we just didn't have the data.” My suspicion is that this has something to do with the advancements in foundation models. Originally known for their ability to perform a wide range of tasks based on a single type of data (like text for language models), these models become "multimodal" when integrating and interpreting information across different sensory inputs, closely mirroring human-like understanding.

Could the embodiment of AI, with all its sensory inputs, plus reasoning-imitation algorithms like LLMs, be the pool of data that disproves Moravec’s paradox? 

Another intriguing development caught my attention. Jensen Huang, Nvidia’s CEO, responded to a question from Wired about what current development could change everything. Huang replied, “There are a couple of things. One doesn’t really have a name, but it’s part of the work we’re doing in foundational robotics. If you can generate text and images, can you also generate motion? The answer is probably yes. And if you can generate motion, you can understand the intent and generate a generalized version of articulation. Therefore, humanoid robotics should be right around the corner.”

Something to observe in the coming weeks!

In related news from the robotics universe, Figure AI, a humanoid robotics startup, made headlines by raising approximately $675 million in funding. What's more impressive is the list of backers: Amazon, NVIDIA, Microsoft, OpenAI, Intel, LG, and Samsung. This indicates a strong belief in the potential of humanoid robotics to disrupt various sectors.

Yet, there are skeptical voices. Rodney Brooks, who coined Nouvelle AI*, posted last week: “Tele-op robots presented as autonomous, like the Tesla Optimus humanoid folding a shirt, and 1X humanoid robots, are misrepresentations of what robots are actually doing, which can also be called LIES. Note that the Stanford robot cooking and cleaning videos are also tele-operated.”

If 2023 was the year of LLMs, are we ready to evolve to embodied AI and make 2024 the year of robots?

*Nouvelle AI is about learning from surroundings, not just following pre-set rules or using complex algorithms.

From our partners

SciSpace is a next-gen AI platform for researchers where you can easily discover 280 million+ papers, do effortless literature reviews, chat with, understand, and summarize PDFs with its AI copilot, and so much more.

Get unlimited access to SciSpace. For annual package use the promo code: TP40. For monthly package: TP20. Very useful tool!


News from The Usual Suspects ©

Magic and its Context Window

  • Nat Friedman and Daniel Gross invested $100 million in Magic, an AI startup developing a coding assistant with advanced capabilities beyond Microsoft's GitHub Copilot. Magic's AI can process 3.5 million words, offering an "unlimited context window" for understanding and generating code in a company's unique style. It also aims for "active reasoning" akin to OpenAI's models, suggesting it can reason through new problems.

Google – the good and the bad

  • Good: Google has launched Gemma – an open SLM

    It’s a set of open AI models that emphasize a lightweight design and responsible AI use. It supports major frameworks and hardware, including Google Cloud TPUs and NVIDIA GPUs. Don't be misled; it's not open-source, but it represents an important move by Google. Gemma can be categorized as a Small Language Model (SLM), with lower compute requirements and greater efficiency.

  • Bad: Gemini and the Diversity That Went Astray

Mistral goes Large

In other newsletters

  • The race to the first useful humanoid robot: 20 demos by James Darpinian

  • 20 examples of how people are using custom GPTs to make their teams more productive by Lenny Rachitsky 

  • I just liked this issue about Amazon's big speech model; fractal hyperparameters; and Google's open models by Jack Clark 

  • AI and Causality by Data Machina

  • The Most Important AI Model Is The Business Model by Matt McIlwain


The freshest research papers, categorized for your convenience

Enhancing Large Language Models (LLMs)

  • LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens: Expands the processing capability of LLMs to handle over 2 million tokens, pushing the boundaries of context window sizes for more comprehensive understanding and generation tasks. Read the paper.

  • OmniPred: Language Models as Universal Regressors: Demonstrates the versatility of LLMs in performing numerical regression tasks, suggesting their potential as universal tools for predictive modeling across a variety of domains. Read the paper.

  • Divide-or-Conquer? Which Part Should You Distill Your LLM?: Investigates efficient strategies for distilling large models into smaller, more manageable ones, particularly for reasoning tasks, emphasizing the importance of decomposition over problem-solving. Read the paper.

  • ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: Proposes a method to enhance the efficiency of self-attention mechanisms in LLMs, crucial for improving performance and reducing resource consumption. Read the paper.

  • USER-LLM: Efficient LLM Contextualization with User Embeddings: Introduces a framework for personalizing LLM interactions using user embeddings, enhancing the model's responsiveness to individual user preferences and histories.

Multimodal and Multi-Agent Systems

  • World Model on Million-Length Video and Language with RingAttention: Explores integrating video and language for advanced AI understanding and interaction, leveraging a novel RingAttention mechanism for efficient multimodal learning. Read the paper.

  • AgentScope: A Flexible yet Robust Multi-Agent Platform: Develops a multi-agent platform that enhances cooperation and flexibility among agents, addressing the complexity of multi-agent systems and their practical applications. Read the paper.

  • TinyLLaVA: A Framework of Small-scale Large Multimodal Models: Focuses on the design and analysis of small-scale multimodal models, proving that with strategic optimizations, smaller models can achieve or surpass the performance of larger counterparts. Read the paper.

  • AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling: Introduces a versatile multimodal language model that processes speech, text, images, and music, demonstrating the power of discrete representations in unifying various data modalities within a single framework. This approach simplifies the integration of new modalities without needing to modify the underlying architecture or training methodologies. Read the paper.

  • A Touch, Vision, and Language Dataset for Multimodal Alignment: Presents a novel dataset that enhances multimodal understanding by incorporating touch with vision and language, aiming to advance touch-vision-language alignment and understanding through a tactile encoder and text generation model. Read the paper.

Advancements in Specific Domains

  • YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information: Introduces a novel object detection model that leverages programmable gradient information for enhanced accuracy and efficiency in learning. Read the paper.

  • Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping: Demonstrates the application of Transformers in complex planning tasks, offering a method that surpasses traditional search algorithms in efficiency and effectiveness. Read the paper.

Developer Tools and APIs

  • API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs: Presents a vast dataset designed for training LLMs to interact with APIs, addressing the challenge of creating effective models for API usage and integration. Read the paper.

  • OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement: Develops an open-source system for code generation, execution, and refinement, facilitated by a dataset of multi-turn interactions, aiming to bridge the gap between code generation models and practical coding tasks. Read the paper.

Security and Adversarial Research

  • Coercing LLMs to do and reveal (almost) anything: Explores the susceptibility of LLMs to a wide range of adversarial attacks, highlighting the need for comprehensive security measures to protect against unintended behaviors and data extraction. Read the paper.

Model Efficiency and Quantization

  • OneBit: Towards Extremely Low-bit Large Language Models: Discusses a novel framework for quantizing LLM weight matrices to 1-bit to drastically reduce storage and computational demands while maintaining performance, enabling efficient deployment of LLMs on resource-constrained devices. Read the paper.

Instruction Tuning and Data Quality

  • Reformatted Alignment: Introduces REALIGN, a method for refining instruction data quality for LLMs to better align with human values, emphasizing the importance of instruction data quality in model alignment and suggesting areas for further exploration in LLM science. Read the paper.

  • Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models: Proposes a novel method for instruction tuning that generates synthetic instruction data across all disciplines, showcasing a scalable and customizable approach to instruction tuning without relying on specific training data. Read the paper.

  • Instruction-tuned Language Models are Better Knowledge Learners: Proposes a pre-instruction-tuning method to enhance LLMs' knowledge updating capabilities, demonstrating significant improvements in factual knowledge absorption and cross-domain generalization. Read the paper.

If you decide to become a Premium subscriber, remember that you can expense this subscription through your company! Join our community of forward-thinking professionals. Please also send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. 🤍 Thank you for reading
