Last week, we covered Zhipu AI, a Chinese GenAI unicorn with a $2.5 billion valuation and a unique LLM. While working on it, we kept coming across Moonshot AI, another Chinese AI unicorn with strong ties to Tsinghua University. Also valued at $2.5 billion, this startup is only one year old. Yet it claims that models like Google's PaLM, Meta's LLaMa, and Stable Diffusion have adopted many core technologies authored by the Moonshot AI team. On its website, the company also goes by several other names, including 月之暗面, transliterated as YueZhiAnMian, which translates directly to "Dark Side of the Moon," a homage to Pink Floyd’s legendary album and a personal favorite of co-founder Yang Zhilin. Why is the company so influential? How is it disrupting the long-context window? What are the three layers of AGI? And will Chinese companies be ready to rival their US counterparts? Curious to know what’s on that dark side of the moon? Let’s give it a shot!
How Moonshot AI Was Founded
Called one of the four new AI tigers of China (along with Zhipu AI, Baichuan, and MiniMax), Moonshot AI was established just over a year ago, in March 2023, by Yang Zhilin, an assistant professor in the School of Interdisciplinary Information at Tsinghua University. Yang, who studied computer science at Tsinghua University and holds a PhD from Carnegie Mellon University, invited Zhou Xinyu and Wu Yuxin, ex-students of Tsinghua University as well, to collaborate on large-scale AI models with a particular focus on long-context windows. For a deeper look at the latest techniques for handling long context in LLMs, see our overview of 10 newest ways for efficient processing of long context.
It’s truly remarkable that, in less than a year, Moonshot AI shocked everyone with a $1 billion round and a staggering valuation of $2.5 billion. To understand why investors believed in the company, we need to understand who the founders are. They are all in their early 30s, but their achievements are impressive and sit at the cutting edge of technology.
Who Are the Founders of Moonshot AI?
Moonshot AI is strongly influenced by the technical backgrounds of its founders, Yang Zhilin, Zhou Xinyu, and Wu Yuxin. Moonshot AI’s website states that the core members of the founding team have participated in the research and development of many large models like Google Gemini, Google Bard, Pangu NLP, and Wu Dao. Also, it says that products and models like Google PaLM, Meta LLaMa, and Stable Diffusion have adopted many core technologies authored by the Moonshot AI team. How so?
Yang Zhilin
Among his many publications, Yang’s most cited papers were published during his PhD at Carnegie Mellon University and his work at Google Brain. His co-authors include Quoc V. Le, who co-invented the doc2vec and seq2seq models in NLP:
“Transformer-XL: Attentive language models beyond a fixed-length context” in 2019 introduced a method to extend the context length beyond fixed limits in Transformer models, enhancing their ability to understand and generate more coherent long-form text by maintaining context across different segments. This advancement significantly improves performance on tasks involving lengthy documents or datasets.
“XLNet: Generalized autoregressive pretraining for language understanding” presented a novel approach to language model pre-training that combines the strengths of both autoregressive and autoencoding methods, enabling it to outperform existing models like BERT on several NLP benchmarks by capturing a broader range of data dependencies.
Interestingly, Yang contributed to the creation of Zhipu AI’s models as part of the Tsinghua University research group. He was among the authors of the first version of the General Language Model (GLM) published in March 2022 and contributed to CodeGeeX published in August 2023. Apparently, the work on CodeGeeX took place before Moonshot AI’s creation and was published a bit later. Yang also co-founded Recurrent AI, a company that develops algorithms to analyze speech.
Xinyu Zhou
Xinyu Zhou worked at Hulu, Tencent, and Megvii. He co-authored:
DoReFa-Net and ShuffleNet both address the challenge of deploying deep neural networks on hardware with limited computational resources. The papers for these methods count more than 10,000 citations.
The EAST framework with almost 2,000 citations presented a novel approach to detecting text in natural scenes. This method improves both the speed and accuracy of text detection in complex images, which is valuable for real-world applications like automated driving, augmented reality, and text analysis in natural environments.
Yuxin Wu
Yuxin Wu worked at Google Brain on foundation models and at Meta AI Research on computer vision. He created Detectron2, a platform for object detection, segmentation, and other visual recognition tasks, and one of the most popular Facebook AI projects.
Xinyu and Yuxin's expertise applies when a model must be optimized to run on a mobile device or on a developer’s personal, usually limited, compute. Text detection and experience with images are also valuable for developing multimodal foundation models. Both areas are at the cutting edge of current language and multimodal model development.
What Does Moonshot AI Do?
According to Bloomberg, Beijing Dark Side of the Moon Technology provides computer system services, technical consulting, technology transfer, technology promotion, and other services. Beijing Dark Side of the Moon Technology also sells computer equipment.
These are all important-sounding business terms, but what we know for sure is that the company is focused on the development of large language models (LLMs). Specifically, its unique selling point is that it’s working on processing long-form context and responses in Chinese, a research area of the founder, Yang Zhilin.
Its main product, Kimi Chat, introduced in October 2023, has become Moonshot AI’s shot to the moon (sorry, couldn’t help myself!). This LLM is distinguished by its ability to process long texts, managing up to 200,000 Chinese characters, a capacity that far exceeds its nearest competitors. Being able to digest and work with large text inputs without constant fine-tuning is pivotal for sectors like finance, law, and academia, where the ability to rapidly analyze and summarize extensive documents is invaluable. Unlike other models that may compromise in their design, Kimi Chat maintains high performance without shortcuts, thanks to innovative engineering approaches.
Moonshot AI is not only interested in leading with technological innovation; its main objective is to build products the market demands, which was certainly appealing to investors.
How Kimi Chat Works: Architecture and Technology
Kimi Chat is based on Transformer-XL, a neural architecture Zhilin Yang co-authored with Zihang Dai, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov in 2019.
This architecture significantly advances the field of language modeling by overcoming the fixed-length context limitation inherent in standard Transformer models.


How it works: Transformer-XL incorporates a novel segment-level recurrence mechanism and an innovative relative positional encoding scheme, allowing it to learn dependencies over longer contexts without losing coherence over time. Because the model retains a memory of previous segments, it learns dependencies that are about 80% longer than those of RNNs and up to 450% longer than those of vanilla Transformers.
Benchmarks: The transformative potential of Transformer-XL is further evidenced by its performance improvements. It achieves state-of-the-art results on various language modeling benchmarks, significantly reducing perplexity (or bits per character) across datasets like enwik8, text8, WikiText-103, One Billion Word, and Penn Treebank. Transformer-XL also enhances computational efficiency during model evaluation, running up to 1,800 times faster than vanilla Transformers, which is a substantial advancement in processing speed.
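To make the segment-level recurrence idea concrete, here is a minimal pure-Python sketch. It is a toy under stated assumptions, not Moonshot AI's or the paper's implementation: there are no learned projections, the relative positional encoding is omitted entirely, and the names `attend`, `run_segments`, `seg_len`, and `mem_len` are hypothetical. It only shows the core mechanic: each layer caches its inputs from earlier segments and lets the current segment attend to them.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(segment: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Causal attention: queries come from the segment only, while keys and
    values come from the cached memory followed by the segment."""
    dim = len(segment[0])
    context = memory + segment
    outputs = []
    for i, query in enumerate(segment):
        visible = context[: len(memory) + i + 1]  # all memory + positions <= i
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, visible))
                        for j in range(dim)])
    return outputs


def run_segments(tokens: List[Vector], seg_len: int, mem_len: int,
                 n_layers: int = 2) -> List[Vector]:
    """Process a long sequence segment by segment, reusing cached states."""
    mems: List[List[Vector]] = [[] for _ in range(n_layers)]
    outputs: List[Vector] = []
    for start in range(0, len(tokens), seg_len):
        hidden = tokens[start:start + seg_len]
        for layer in range(n_layers):
            layer_input = hidden
            hidden = attend(layer_input, mems[layer])
            # Cache this layer's input for the next segment. A trained model
            # would treat the cache as a constant (no gradient flows into it).
            combined = mems[layer] + layer_input
            mems[layer] = combined[max(0, len(combined) - mem_len):]
        outputs.extend(hidden)
    return outputs
```

Calling `run_segments(tokens, seg_len=2, mem_len=2)` on a list of equal-length vectors returns one output vector per input token. With `mem_len=0` every segment is processed in isolation, which is the fixed-length behavior the recurrence is meant to overcome.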
Using Transformer-XL enhancements, Kimi Chat became a robust solution for tasks requiring the management of long-term dependencies, setting a new standard for future developments in language models and potentially other sequential data applications. And lossless long context became Moonshot AI's north star.
Moonshot AI's Core Strategy: Lossless Long Context
At the heart of Moonshot AI's strategy is the concept of "lossless long-context." Traditionally, AI models require frequent fine-tuning to adapt to new data or user interactions. However, Moonshot AI aims to leverage the rich history of user interactions as a dynamic and evolving basis for personalization, potentially reducing the need for constant model adjustments.
And they move fast. Just six months after their groundbreaking launch with a 200,000-character context window, in March 2024 Moonshot AI announced support for an unprecedented 2 million-character dialogue window!
Their efforts prompted rapid responses from other industry giants:
In 2023, Baichuan announced Baichuan2-192K with a 192K-token context window, processing up to 350,000 Chinese characters at once.
These developments highlight a narrowing gap between China and the US in the realm of LLMs, underscoring China's growing influence and potential in AI technology.
Why does long context matter?
Long-text processing is critical because it allows for more nuanced and comprehensive interactions, from handling extensive customer service inquiries to managing complex data analysis tasks without the need for segmentation. This capability is becoming a key differentiator in the LLM market, where depth and context of understanding can greatly enhance performance.
Kimi Chat vs GPT-4, Claude, and Gemini: Context Window Comparison
OpenAI's GPT-4 Turbo offers a context window of 128K tokens.
Anthropic’s Claude 3 family offers a 200K-token context window at launch, with 1M tokens available for specific use cases.
Google's Gemini 1.5 Pro comes with a standard 128,000-token context window, but a private preview version supports a context window of up to 1 million tokens, with successful tests of up to 10 million tokens.
Cohere models like Command R and Command R+ support a 128K context length.
Mixtral 8x22B supports a 64K context window.
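Kimi Chat's window is quoted in Chinese characters while the others are quoted in tokens, so a direct comparison needs a conversion. The sketch below is a rough back-of-the-envelope aid, not a benchmark. It assumes "K" means 1,000 tokens and borrows a characters-per-token ratio from the Baichuan2-192K figure mentioned earlier (about 350,000 characters in 192K tokens). That ratio is tokenizer-specific and is only an assumption when applied to any other model, and the names `chars_to_tokens` and `models_that_fit` are hypothetical.

```python
from typing import List

# Context windows as quoted in the list above, in tokens ("K" taken as 1,000).
CONTEXT_WINDOWS_TOKENS = {
    "GPT-4 Turbo": 128_000,
    "Claude 3": 200_000,
    "Gemini 1.5 Pro (standard)": 128_000,
    "Gemini 1.5 Pro (private preview)": 1_000_000,
    "Command R / Command R+": 128_000,
    "Mixtral 8x22B": 64_000,
}

# Assumption: Chinese characters per token, derived from the Baichuan2-192K
# figure (about 350,000 characters in a 192K-token window). Other tokenizers
# will give different ratios.
CHARS_PER_TOKEN = 350_000 / 192_000


def chars_to_tokens(chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate how many tokens a Chinese text of `chars` characters needs."""
    return round(chars / chars_per_token)


def models_that_fit(chars: int) -> List[str]:
    """List models whose quoted window covers the estimated token count."""
    needed = chars_to_tokens(chars)
    return [name for name, window in CONTEXT_WINDOWS_TOKENS.items()
            if window >= needed]


if __name__ == "__main__":
    for chars in (200_000, 2_000_000):  # Kimi Chat's two quoted window sizes
        print(chars, "characters ~", chars_to_tokens(chars), "tokens;",
              "fits in:", models_that_fit(chars))
```

Changing `CHARS_PER_TOKEN` shows how sensitive the comparison is to the tokenizer, which is why character-based and token-based window sizes should be compared with caution.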
Moonshot AI's Mission: From Kimi to AGI
Yang, one of the most public figures among the founders of Moonshot, shares on his website that the ultimate goal of all his work is to maximize the value of artificial intelligence. He works on achieving general cognitive intelligence using natural language as a key interface between humans and AI.
For Moonshot AI, one of the goals is to achieve AGI. Yang Zhilin aims to surpass existing AI companies like OpenAI by focusing intensively on user-centric innovations and personalized interactions through AI. His mission is to develop advanced AI technologies that prioritize lossless long context and personalization, enabling AI-native products to deliver highly customized user experiences without the need for traditional model fine-tuning. Yang believes in integrating technical idealism with commercial pragmatism to drive both product excellence and utility.
Yang Zhilin's Three-Layer Framework for AGI
First Layer: Scaling Laws and Next-Token Prediction
This foundational layer is common across the industry, involving scaling laws combined with next-token prediction capabilities. OpenAI is currently leading in this area due to substantial investments over recent years.
Second Layer: Representation and Data Bottlenecks
Universal Representation: Challenges in representing the world comprehensively, such as encoding complex, multi-dimensional data beyond text.
Data Scarcity: Addressing the limited availability of data inputs through self-evolving AI systems, which aim to function with continuous power input without the need for constant data feeding.
Third Layer: Advanced Capabilities and Diverse Functionalities
Encompasses development of long-context processing, generation across multiple modalities, multi-step planning, enhanced instruction following, and diverse agent functions.
This layer offers significant potential for differentiation and innovation in AI technology, driven by advancements in the underlying technical variables.
Yang Zhilin thinks the third layer is Moonshot AI's opportunity to become the best and surpass OpenAI.
Moonshot AI Funding and Valuation
Investment rounds and valuation

Kimi Chat Price

Image Credit: Moonshot.cn
API documentation and pricing can be found here: https://platform.moonshot.cn/docs/intro#主要概念
Conclusion
The narrative around AGI in China is not overloaded with the kind of doomism so familiar to everyone in the U.S. Chinese AI startups, as illustrated by Zhipu AI and Moonshot AI, openly state AGI as their goal and regularly describe, in plain conversational terms, their pathways toward achieving it. This transparency seems to be an advantage of the Chinese tech scene. Additionally, an enormous amount of funding is readily invested in AI development. Many Chinese founders, a significant number of whom are scholars, are educated not only locally but also at American and European universities, which enriches their experience. We expect to see more AI startups emerging from China, characterized by abundant talent and unencumbered by the usual U.S. constraints.
Thank you for reading, please feel free to share with your friends and colleagues. 🤍









