Mistral AI Roadmap: Inside Les Ministraux Models

Previously, in our GenAI Unicorn series, we discussed the rise of the French company Mistral AI and its first models, Mistral 7B and Mixtral 8x7B. This year, many new Mistral models have been released, demonstrating Mistral’s commitment to rapidly improving the performance of its mostly open-source models. Even more notably, Mistral aims to make its models not only powerful but also as small as possible. Join us as we trace Mistral's strategic roadmap, explore their latest breakthroughs, and unpack the performance of les Ministraux: small models with outsized capabilities. This piece covers the technical details, practical applications, and Mistral’s influence on model democratization and edge computing.

In today’s episode, we will cover:

Beginning: From smaller models to larger ones
A shift to specialized models and bigger upgrades
Back to smaller models – why?
What tendencies can we notice through Mistral’s timeline?
Deep dive: how the newest Ministral models work
How good are Ministral models?
Benefits of “Les Ministraux”
Limitations
Implementation
Conclusion
Bonus: All links in one place

Mistral AI Model Timeline: From 7B to Mixtral 8x22B

As we wrote before, “This French startup, founded in April 2023 with the ambitious goal of challenging the European Union's technological supremacy, has earned both admiration and skepticism.” They move fast and act boldly. Let’s look at the timeline of Mistral’s releases.

It took them only about five months to introduce their first model! In September 2023, Mistral unveiled Mistral 7B, a compact but powerful language model with 7.3 billion parameters. It surpassed larger models like Llama 2 13B on many benchmarks and matched the performance of Llama 1 34B in various areas. Its key innovations were grouped-query attention and sliding-window attention, which gave Mistral 7B fast, efficient inference and let it handle longer sequences with less memory.

Image Credit: Mistral 7B blog post
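To make the "less memory" claim concrete, here is a rough back-of-the-envelope sketch of the key-value cache size for one sequence. The configuration values (32 layers, 32 query heads, 8 key-value heads, head dimension 128, window 4096) are our assumption based on the published Mistral 7B description, and the 16K prompt length is purely hypothetical. The function name `kv_cache_bytes` is ours, not part of any library.

```python
def kv_cache_bytes(cached_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (fp16 by default)."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_value

# Assumed configuration (from the published Mistral 7B description).
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM, WINDOW = 32, 32, 8, 128, 4096

seq_len = 16_384  # hypothetical prompt length

# Baseline: every query head has its own key/value head, all tokens cached.
full_mha = kv_cache_bytes(seq_len, N_LAYERS, N_HEADS, HEAD_DIM)
# Grouped-query attention: 32 query heads share 8 key/value heads.
gqa_only = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
# Sliding window on top: the cache never holds more than WINDOW tokens.
gqa_window = kv_cache_bytes(min(seq_len, WINDOW), N_LAYERS, N_KV_HEADS, HEAD_DIM)

for label, size in [
    ("multi-head, full cache", full_mha),
    ("grouped-query, full cache", gqa_only),
    ("grouped-query + sliding window", gqa_window),
]:
    print(f"{label}: {size / 2**30:.2f} GiB")
```

Under these assumptions, each technique cuts the cache by a factor of four for this prompt length; real deployments also need memory for weights and activations, which this sketch ignores.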

A bit later, in December 2023, they launched Mixtral 8x7B, a sparse mixture-of-experts (SMoE) model with 46.7 billion parameters, of which only 12.9 billion are used per token. A router selects a small subset of experts for each token, which keeps inference cost low. The model handled long texts (up to 32K tokens), multiple languages (English, French, Spanish, German, Italian), and code generation, with 6x faster inference than Llama 2 70B. Mixtral 8x7B already demonstrated significantly better performance than the first Mistral 7B.

Image Credit: Mixtral of Experts blog post
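The routing idea behind a sparse mixture-of-experts layer can be sketched in a few lines. This is a toy illustration, not Mistral's implementation: it assumes 8 experts with the top 2 chosen per token (as in Mixtral's published design), uses scalar "tokens" and trivial expert functions, and all function names are ours.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and normalize their gate weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax is taken over the selected logits only, so the weights sum to 1.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    output = 0.0
    for expert_id, weight in route_token(router_logits, top_k):
        output += weight * experts[expert_id](token)
    return output

# Toy experts: expert i simply multiplies a scalar "token" by i + 1.
experts = [lambda x, s=scale: s * x for scale in range(1, 9)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in experts]  # stand-in for learned router scores
print("selected experts and weights:", route_token(logits))
print("layer output:", moe_layer(1.0, experts, logits))
```

Because only two of the eight experts run for any given token, the compute per token tracks the active parameter count rather than the total.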

In February 2024, Mistral introduced its new flagship model, Mistral Large. It excelled in handling complex, multilingual tasks in English, French, Spanish, German, and Italian, with enhanced capabilities in text understanding, transformation, and code generation. Mistral also collaborated with Microsoft to make Mistral Large available through Microsoft Azure, ensuring that powerful AI could be widely accessible.

Mistral Small was released alongside the large model. It is optimized for low latency and cost, which makes it a good fit for efficient, real-time applications.

Image Credit: Mistral “Au Large” blog post

In April 2024, Mistral unveiled a new, larger SMoE model, Mixtral 8x22B. This version used only 39 billion active parameters out of 141 billion, and its key strengths were:

multilingual support (English, French, Italian, German, Spanish),
advanced capabilities in math, coding, and function calling,
a 64K token context window for large documents (instead of 32K).

Mixtral 8x22B demonstrated a significant leap in performance compared to its predecessors:

Image Credit: Mistral’s “Cheaper, Better, Faster, Stronger” blog post
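A quick calculation shows how sparse the two Mixtral models are. It uses only the parameter counts quoted in this article; the helper name `active_share` is ours.

```python
def active_share(active_billion, total_billion):
    """Fraction of a sparse model's parameters that are used for each token."""
    return active_billion / total_billion

# (active, total) parameter counts in billions, as quoted in the text.
models = {
    "Mixtral 8x7B": (12.9, 46.7),
    "Mixtral 8x22B": (39.0, 141.0),
}

for name, (active_b, total_b) in models.items():
    share = active_share(active_b, total_b)
    print(f"{name}: {active_b:g}B of {total_b:g}B parameters active per token ({share:.1%})")
```

Both models activate a little over a quarter of their weights per token, so the larger model scales up total capacity while keeping the same sparsity ratio.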

Codestral, Mathstral, Mistral Large 2: Specialized Models

Codestral 22B, released in May 2024, was Mistral AI's first model specialized for coding tasks, supporting over 80 programming languages, including Python, Java, SQL, and C++. With 22 billion parameters, it was fast, efficient, and cost-effective.

They said in the release: “We see Codestral as a new stepping stone towards empowering everyone with code generation and understanding.”

Image Credit: Mistral’s “Codestral: Hello, World!” blog post

July 2024 was exceptionally full of releases. Mistral introduced four models almost at once:

Mathstral was developed with Project Numina specifically for STEM applications. It excelled in advanced math and logical reasoning, supporting academic research with top performance in complex, multi-step tasks.
Codestral Mamba was built on the Mamba architecture with fast, linear-time inference, allowing it to handle extremely long inputs of up to 256,000 tokens, making it ideal for code and productivity tasks. As Mistral put it, "Codestral Mamba is another step in our effort to study and provide new architectures."
Mistral collaborated with NVIDIA to develop Mistral NeMo 12B, a powerful model with a 128k-token context window, excelling in reasoning, world knowledge, and coding. It featured enhanced multilingual capabilities with the Tekken tokenizer, supporting over 100 languages, including Chinese, French, Arabic, and more.
Image Credit: Mistral NeMo blog post
Mistral Large 2 combined all the advancements to upgrade the previous Mistral Large version:
- Enhanced multilingual processing (supporting over a dozen languages, including English, Spanish, Japanese, and Arabic)
- Handling extensive texts up to 128k tokens
- Strong performance on code, math, and reasoning benchmarks, matching leading models like GPT-4 and Claude 3.
Image Credit: Mistral “Large Enough” blog post

Why Mistral Returned to Small Models: Pixtral & Ministral

After a short break, in September 2024, Mistral introduced Pixtral 12B, their first multimodal model that processes images and text and is ideal for understanding charts, figures and documents. Featuring a 400M-parameter vision encoder and a 128K token context window, Pixtral handles various image sizes and formats, outperforming similar models in multimodal reasoning and document question answering.

Image Credit: “Announcing Pixtral 12B” blog post

At the same time, Mistral upgraded their Mistral Small. With 22 billion parameters, it now stands between Mistral NeMo 12B and Mistral Large 2, providing a balance of power and cost efficiency.

Finally, the latest release, in October 2024, on the anniversary of the groundbreaking Mistral 7B, included two powerful new models for on-device and edge use: Ministral 3B and Ministral 8B, also called "les Ministraux." They are optimized for low-latency use and can work alongside larger models like Mistral Large.

Key Trends in Mistral's Model Strategy

As this 1.5-year journey shows, Mistral started with smaller models, then built up the capabilities of its specialized models and gathered everything into larger, more powerful ones. It then returned to smaller models that inherit the gains of the larger ones, which brings us to les Ministraux.

This demonstrates Mistral’s commitment to making models as small as possible while retaining enough power to compete with larger models. By making their models fully or partly open-source and collaborating with companies like NVIDIA and Microsoft, they make powerful and efficient AI widely available.

This picture from Mistral’s blog perfectly demonstrates their goal to expand the development of small powerful models.

Image Credit: “Un Ministral, des Ministraux” blog post

Now, let’s explore more precisely what results Mistral researchers have achieved with their two newest Ministral models.

How Ministral 3B & 8B Work: Architecture & Attention

Ministral 3B and Ministral 8B are designed for on-device and edge computing. While staying under 10 billion parameters, they can be adapted for a range of tasks, from managing workflows to handling specialized tasks. They both support up to 128K context length, are trained on large amounts of multilingual and code data, and support function calling.

Ministral 8B employs a special feature, an interleaved sliding-window attention pattern, that makes it faster and more memory-efficient. Here is how it works:

We’ll begin by explaining the original sliding-window attention method used in Mistral 7B. This method processes tokens within a “window” rather than all at once. Each layer looks back at tokens within a limited window, but as layers stack, the model can see further back than the immediate window. This setup limits memory use because it only stores tokens within the window size.
Image Credit: Mistral 7B blog post
The interleaved sliding-window attention pattern is a variant of sliding-window attention designed to improve efficiency on long sequences. It works slightly differently:
- Tokens within one window partially overlap with tokens in the next, allowing each token to connect with more surrounding information.
- In each layer, attention is given to tokens from multiple windows, not just the previous one.

This interleaving structure increases the range of attention, enabling the model to access distant tokens across layers while maintaining computational efficiency.
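The original sliding-window scheme described above can be sketched with a simple attention mask. This is a minimal illustration under our own assumptions: each position attends to itself and the `window - 1` positions before it, the sizes are toy values, and the function names are ours. It shows the Mistral 7B-style window only, not the interleaved pattern of Ministral 8B.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j."""
    # Causal (j <= i) and limited to the last `window` positions.
    return [[j <= i and i - j < window for j in range(seq_len)] for i in range(seq_len)]

def earliest_visible(seq_len, window, n_layers):
    """Earliest position whose information can reach the last token after n_layers."""
    earliest = seq_len - 1
    for _ in range(n_layers):
        # Each layer lets information travel at most window - 1 positions further back.
        earliest = max(0, earliest - (window - 1))
    return earliest

# Print the mask for a toy sequence: 'x' marks an allowed attention link.
for row in sliding_window_mask(seq_len=6, window=3):
    print("".join("x" if allowed else "." for allowed in row))

# Stacking layers widens the effective view beyond a single window.
for layers in (1, 2, 5):
    print(f"{layers} layer(s): last of 16 tokens can see back to position",
          earliest_visible(seq_len=16, window=4, n_layers=layers))
```

The mask stays narrow, so memory per layer is bounded by the window size, while the reach grows roughly linearly with depth.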

Let’s see how these technological upgrades influenced Ministral models’ performance.

Ministral Benchmarks: How They Compare to Llama & Gemma

One of the most fascinating things about the Ministral models is that even Ministral 3B, the smallest model in the whole Mistral family, outperforms its predecessor Mistral 7B when fine-tuned and, in some cases, larger models like Llama 3.1 8B and Gemma 2 9B.

Pretrained models. Image Credit: “Un Ministral, des Ministraux” blog post

Instruct models. Image Credit: “Un Ministral, des Ministraux” blog post

Benefits of “Les Ministraux”

Here is a summary of the features described above that make Ministral models effective small models:

Efficiency and speed: Both models are compute-efficient with low latency, and Ministral 8B features interleaved sliding-window attention for faster, memory-efficient processing.
High context length: They support up to 128k tokens, enabling work with longer sequences or contexts.
Superior performance: They outperform similarly sized models on various benchmarks, with strong reasoning and task-handling capabilities even compared to larger models.
Edge optimization: Built for on-device and edge computing, they are ideal for privacy-sensitive and offline applications.
Various applications: Handle diverse tasks, from workflow management to specialized functions, suitable for hobbyists and large organizations.

Ministral Model Limitations

Because Ministral models are both small and new, they may have the following limitations:

Limited to specific use cases: Designed for edge computing, Ministral models may not be ideal for tasks requiring heavy computational power or complex deep learning tasks.
Context limitations on vLLM: Although Ministral models support a maximum context length of 128k tokens, serving them with vLLM currently limits this to 32k tokens.
Quantization assistance needed: To achieve optimal performance, users might need help with quantization, which can be an extra step in the deployment process.
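To see why quantization matters for on-device use, here is a rough estimate of the memory needed just to hold the weights at different precisions. It assumes the nominal parameter counts implied by the model names (3 and 8 billion) and ignores the KV cache, activations, and any per-format overhead, so treat the numbers as lower bounds. The helper name `weight_memory_gib` is ours.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}

def weight_memory_gib(n_params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Nominal sizes taken from the model names (an assumption, not exact counts).
for name, params_b in [("Ministral 3B", 3.0), ("Ministral 8B", 8.0)]:
    row = ", ".join(
        f"{fmt}: {weight_memory_gib(params_b, bits):.1f} GiB"
        for fmt, bits in BITS_PER_WEIGHT.items()
    )
    print(f"{name} -> {row}")
```

Going from 16-bit to 4-bit weights cuts the footprint by a factor of four, which is often the difference between fitting on a consumer device and not fitting at all.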

Implementation

"Les Ministraux" models were created to offer fast, efficient solutions for tasks such as offline translation, smart assistants without internet, local data analysis, and autonomous robots. While Ministral 3B is better suited to smaller devices like smartphones, Ministral 8B targets devices like laptops because it requires more GPU memory.

Both models can also be used in agentic workflows that involve multiple steps, serving as helpers for larger models like Mistral Large 2. They can be set up to handle tasks such as understanding inputs, directing tasks, and calling the right functions based on user needs.
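The "directing tasks and calling the right functions" role can be sketched as a small dispatch loop. Everything here is hypothetical: the tools are stubs, and `small_model_stub` fakes a model's function-calling reply with keyword rules instead of calling a real Ministral model. The sketch only shows the shape of the workflow.

```python
import json

# Hypothetical tools the small model is allowed to call.
def get_weather(city):
    return f"(stub) weather report for {city}"

def translate(text, target_language):
    return f"(stub) '{text}' translated to {target_language}"

TOOLS = {"get_weather": get_weather, "translate": translate}

def small_model_stub(user_message):
    """Stand-in for a small model's function-calling output.

    A real deployment would send the message and the tool schemas to the model
    and parse its reply; here a keyword rule fakes that reply as JSON.
    """
    text = user_message.lower()
    if "weather" in text:
        return json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})
    if "translate" in text:
        return json.dumps({"name": "translate",
                           "arguments": {"text": "hello", "target_language": "French"}})
    return json.dumps({"name": None, "arguments": {}})

def dispatch(user_message):
    """Run the tool the small model picked, or hand off to a larger model."""
    call = json.loads(small_model_stub(user_message))
    tool = TOOLS.get(call["name"])
    if tool is None:
        return "escalate to larger model"
    return tool(**call["arguments"])

print(dispatch("What's the weather like today?"))
print(dispatch("Please translate this greeting"))
print(dispatch("Write a long essay on European AI policy"))
```

The design choice is that the cheap, low-latency model handles parsing and routing on every request, and the larger model is only invoked when no tool matches.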

Is Mistral's Small-First Bet Paying Off?

In this episode, we explored the full range of Mistral models to understand the innovations that led to their compact yet competitive Ministral models. While many companies focus on scaling up, Mistral stands out by delivering powerful performance in smaller, more accessible models – a trend gaining traction in AI. Mistral’s approach exemplifies how efficiency and democratization can go hand in hand, redefining what 'small but mighty' means in AI.

Bonus: Resources