OpenAI Whisper Model: Speech Recognition Explained

We don’t often focus on audio AI systems, but today we’re diving into automatic speech recognition (ASR) using OpenAI’s groundbreaking Whisper model. ASR systems are designed to automatically convert spoken language into text, typically requiring fine-tuning for specific tasks. Whisper, however, breaks that mold with its open-source, multilingual capabilities, allowing it to handle transcription and translation tasks without additional tuning. OpenAI, known for keeping most of their advanced models proprietary, made an exception by releasing Whisper as open-source in 2022. Now, in 2024, its latest V3 Turbo version has made waves with speeds eight times faster than its predecessor, large-v3, and the ability to run on cloud servers or even locally, all while maintaining comparable accuracy. Let’s explore what makes Whisper so efficient for handling multiple speech recognition tasks.

In today’s episode, we will cover:

Limitations of existing speech recognition models
Here comes the Whisper model
Story of Whisper
How does Whisper work?
How good is Whisper?
It can be HOT
Whisper’s advantages
Limitations
Conclusion
Bonus: Resources

Limitations of existing speech recognition models

Recent advances in speech recognition come from unsupervised pre-training techniques, which learn from raw audio without labels. However, these models still require fine-tuning to handle specific tasks, which can be complex. Additionally, there is a risk that models learn patterns specific to the training data and can still make mistakes when faced with new data.

Despite improvements in encoders, the need for fine-tuning and the limitations of weak decoders continue to affect model performance. In an ideal world, speech recognition should perform well across various settings without constant adjustments. Some supervised training approaches across multiple datasets have shown more consistent results, but the amount of high-quality supervised data is still small.

What if there was an approach that allows models to perform well across languages and speech recognition tasks without specific fine-tuning?

Here comes the Whisper model

In Whisper, OpenAI uses large-scale weak supervision to train the system for speech recognition. It is designed not only for speech recognition (transcribing speech into text in its original language) but also for translating speech into English, voice activity detection, and language identification.

The Whisper model comes in 9 variants of different sizes and capabilities:

Image credit: OpenAI Whisper Model Card

Earlier versions of Whisper used the following training data:

680,000 hours of labeled audio data,
with 117,000 hours covering 96 languages,
and 125,000 hours of translation data.

The latest Whisper large-v3-turbo was trained on 680,000 hours of audio and the corresponding transcripts collected from the internet where:

438,000 hours represent English-language audio and matched English transcripts,
126,000 hours (~18%) represent non-English audio and English transcripts,
117,000 hours (~17%) represent non-English audio and the corresponding transcript.

Non-English data now includes 98 different languages, so in total, Whisper works with 99 languages.

Released in September 2022, Whisper has become even more popular in 2024. But why did this happen?

Story of Whisper

First of all, we’ll give a brief timeline of Whisper releases:

September 2022: Whisper original series
December 2022: Whisper large-v2, an improved large model
November 2023: Whisper large-v3, which is a better and upgraded version of large-v2
September 2024: Whisper large-v3-turbo model, optimized for inference speed

The large-v3-turbo model is a pruned version of large-v3: it has fewer decoder layers (4 instead of 32), which contributes to its speed and efficiency. Whisper is now completely open-source and can even be run in your browser via Hugging Face. According to OpenAI, the open-source version of Whisper and the API version are the same, but the API offers an optimized inference process.

But why is Whisper more popular now than in 2022?

We have found several reasons that might have caused this trend. First, the rapid rise of OpenAI as a major player in AI draws more attention to all of its developments, including Whisper. Second, being fully open-source makes Whisper accessible to everyone and encourages more people to use it. And third (perhaps the most obvious reason), the updated version of Whisper demonstrates better performance and capabilities than the first versions of the model.

How does Whisper work?

The encoder-decoder Transformer architecture has been proven to scale well, which is why OpenAI chose it to build Whisper. This scalability is crucial for a large-scale weak supervision approach. Let’s explore all the parts of the model architecture.

Whisper architecture

Image credit: Whisper architecture, original paper

The audio is converted into a Mel spectrogram (a visual representation of sound), normalized, and passed to the model. In Whisper large-v3, the spectrogram input uses 128 Mel frequency bins.
The encoder processes the audio features through layers of transformations.
The decoder uses tokens (small units of data) to predict the transcription.
A byte-level tokenizer is used for text processing, based on the GPT-2 model, with adjustments for multilingual capabilities.
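
To make the input side of the architecture more concrete, here is a minimal sketch of how the size of the spectrogram fed to the encoder can be worked out. The 128 Mel bins come from the text above; the 16 kHz sample rate, 30-second chunks, and 10 ms stride are assumptions taken from the original Whisper paper rather than from this article, and the function name is purely illustrative.

```python
# Assumed audio settings (from the Whisper paper, not stated in this article).
SAMPLE_RATE_HZ = 16_000   # audio is resampled to 16 kHz
CHUNK_SECONDS = 30        # audio is processed in 30-second chunks
HOP_MS = 10               # stride between consecutive spectrogram frames


def spectrogram_shape(n_mels: int) -> tuple:
    """Return (mel_bins, frames) for one chunk of audio under the assumptions above."""
    samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_SECONDS
    samples_per_hop = SAMPLE_RATE_HZ * HOP_MS // 1000
    n_frames = samples_per_chunk // samples_per_hop
    return (n_mels, n_frames)


# large-v3 uses 128 Mel bins, as mentioned above.
large_v3_shape = spectrogram_shape(128)
print(large_v3_shape)
```

Under these assumptions, each 30-second chunk becomes a fixed-size grid of Mel bins by time frames, which is what the encoder consumes.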

Multitask setup

To realize the idea of a single multitask model, the researchers simplified the entire speech processing pipeline rather than splitting it into many components. The multitask setup allows Whisper to handle tasks like transcription, translation, voice activity detection, and language identification, which makes the overall system more efficient and easier to manage. So how did the researchers fit everything into one model?

The secret lies in tokens. The model is instructed on what task to perform by providing it with special tokens (markers) that guide it. These tokens are part of the input.

Here's how it works:

Image credit: Original paper

Language detection: Whisper starts by predicting the language being spoken, using a unique token for each language (99 languages in total).
Task selection: Then, it uses tokens to decide if it should transcribe the speech or translate it.
Timestamps: If needed, the model predicts the start and end times of each spoken segment, adding time markers before and after the segment’s text.
“No speech” token: If there's no speech in the audio, the model predicts a “<|nospeech|>” token. This tells the system that there is nothing to transcribe or translate in that part of the audio.
Once the task is done, the model adds a token to mark the end of the transcript or translation.
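
The token steps listed above can be sketched as a small helper that assembles the control tokens placed in front of the text the decoder generates. This is only an illustration: the function name is hypothetical, and the token spellings (such as <|startoftranscript|> and <|notimestamps|>) follow the notation of the original paper and are assumptions here, not a call into any real Whisper API.

```python
def build_decoder_prompt(language: str = "en",
                         task: str = "transcribe",
                         timestamps: bool = False) -> list:
    """Assemble the special tokens that tell the model what to do (illustrative)."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")

    tokens = [
        "<|startoftranscript|>",  # marks the beginning of a prediction
        f"<|{language}|>",        # language token, one per supported language
        f"<|{task}|>",            # task selection: transcribe or translate
    ]
    if not timestamps:
        # Signal that no time markers should be predicted.
        tokens.append("<|notimestamps|>")
    return tokens


# Example: translate French speech into English text without timestamps.
prompt = build_decoder_prompt(language="fr", task="translate")
print(" ".join(prompt))
```

In the real model the language token is normally predicted rather than supplied, but fixing it by hand like this shows why one set of weights can serve several tasks: only the prefix changes.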

Now, let’s look at the results that Whisper demonstrates across these multiple tasks.

How good is Whisper?

Speech recognition systems are typically evaluated using Word Error Rate (WER): the share of words that must be substituted, deleted, or inserted to turn the model output into the reference transcript, so lower is better. The results below come from the analysis of Whisper large-v2. The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction of errors compared to Whisper large-v2.

English speech recognition: Whisper models were tested on LibriSpeech benchmark. Although Whisper’s WER of 2.5% on LibriSpeech was modest, it performed much better on other datasets. This shows that Whisper has strong generalization abilities.
Whisper also excelled in out-of-distribution tasks (working on data different from what it was trained on), showing a 55.2% average relative error reduction across various speech recognition datasets.
Image credit: Original paper
Multilingual speech recognition: Whisper outperformed models like XLS-R and mSLAM in zero-shot evaluations, especially on Multilingual LibriSpeech (MLS) and VoxPopuli. However, its performance varies by language. It struggles with languages like Hebrew, Chinese, Telugu, and Korean due to unique scripts and linguistic differences.
Image credit: OpenAI Whisper GitHub
Speech translation: Whisper set a new state of the art on CoVoST2 without fine-tuning and is particularly strong in low-resource languages.
Robustness to noise: It exhibited high robustness to noise, degrading less in noisy environments (e.g., pub noise) compared to other models.
Long-form transcription: Whisper is competitive with state-of-the-art commercial systems, performing well on lengthy audio files like podcasts and interviews.
Image credit: Original paper
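
Since the results above are reported as Word Error Rate and relative error reductions, a minimal sketch of how these two numbers are usually computed may help. It uses the standard word-level edit distance; the function names are illustrative, and real evaluations (including Whisper's) also normalize text first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new model removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# One wrong word out of four reference words.
print(word_error_rate("the cat sat down", "the cat sat town"))
```

A relative reduction compares two error rates rather than subtracting them, so going from a WER of 10% to 5% counts as a 50% reduction.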

It can be HOT (running Whisper)

You can run Whisper, OpenAI's speech recognition model, on both CPU and GPU. While it's possible to use a CPU, this approach can be quite slow, particularly with larger models like Whisper Large V3. A GPU significantly speeds up transcription, but it comes with its own challenges. Several users have reported their systems getting seriously hot when running Whisper locally, especially with powerful GPUs working over extended periods. To avoid overheating issues, you can consider using cloud-based options.
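
For readers who want to try a local run, here is a minimal sketch using the open-source whisper Python package. It assumes the package has been installed (pip install openai-whisper), that ffmpeg is available on the system, and that the installed version recognizes the "turbo" model name; the wrapper function and the file name in the usage comment are hypothetical.

```python
def transcribe_file(path: str, model_name: str = "turbo", use_gpu: bool = False) -> str:
    """Transcribe one audio file locally and return the recognized text."""
    # Imported lazily so this sketch can be loaded even without the package installed.
    import whisper

    device = "cuda" if use_gpu else "cpu"
    model = whisper.load_model(model_name, device=device)

    # Half precision only makes sense on a GPU, so it is disabled on CPU.
    result = model.transcribe(path, fp16=use_gpu)
    return result["text"]


# Example usage (hypothetical file name):
#   text = transcribe_file("interview.mp3", use_gpu=True)
#   print(text)
```

Smaller models such as "base" or "small" are a gentler starting point on a CPU, at the cost of accuracy.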

A few options to consider are WhisperWebGPU on Hugging Face and Groq, which offers a cloud-based implementation of Whisper with its Whisper Large V3 Turbo model. Groq’s infrastructure, famous for its speed, provides extremely fast performance, clocking in at 216x real-time speed, making it a good solution for those who need rapid transcription without the worry of managing local hardware. Groq’s Whisper model is accessible via GroqCloud.
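
To get a feel for what a 216x real-time speed means in practice, here is a back-of-the-envelope sketch. The helper name is illustrative, and the calculation ignores upload time and network latency, so treat the result as a rough lower bound rather than a measured figure.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time needed to transcribe audio at a given multiple of real-time speed."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# A one-hour recording at the 216x real-time speed quoted above.
one_hour = 60 * 60
print(round(processing_seconds(one_hour, 216), 1), "seconds")
```

At that rate, an hour of audio needs well under half a minute of compute.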

Whisper’s Advantages

Multilingual capabilities: Supports speech recognition in multiple languages, making it versatile for global use. It also shows robustness to accents and technical language.
Zero-shot learning: Performs well in speech recognition without needing specific fine-tuning for new datasets.
Robustness to noise: Maintains accuracy even in noisy environments.
Multitask functionality: Can accurately handle transcription, translation, and language identification in one model.
Scalability: Performance improves with larger datasets and models.

Limitations

Despite Whisper’s improvements in many areas, it still has several noticeable limitations:

Hallucinations: Whisper, trained on large, noisy data with weak supervision, might generate text not found in the audio. This happens as it mixes language knowledge with transcription, sometimes predicting words that aren’t present.
Uneven language performance: It struggles with low-resource languages due to limited training data and shows varied accuracy across accents, dialects, and demographic groups (e.g., gender, race, or age). This causes higher word error rates for some speakers.
Repetitive text generation: Due to its sequence-to-sequence architecture, the model can generate repetitive outputs. Techniques like beam search and temperature scheduling help reduce this issue but cannot fully eliminate it.
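
The repetition problem and the temperature scheduling mentioned above can be illustrated with a small sketch. It assumes the heuristic described in the original paper: highly repetitive text compresses unusually well, so a high compression ratio (the 2.4 threshold is the paper's value, used here as an assumption) triggers a retry at a higher sampling temperature. The decode callable is a hypothetical stand-in for the model, not a real Whisper function.

```python
import zlib

# Assumed temperature schedule: start greedy, then sample more randomly on each retry.
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def compression_ratio(text: str) -> float:
    """Raw size divided by compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    return compression_ratio(text) > threshold


def decode_with_fallback(decode, temperatures=TEMPERATURES) -> str:
    """Call decode(temperature) until the output no longer looks repetitive.

    `decode` is a hypothetical callable standing in for the model's decoder.
    If every attempt looks repetitive, the last attempt is returned.
    """
    text = ""
    for temperature in temperatures:
        text = decode(temperature)
        if not looks_repetitive(text):
            break
    return text


# Toy stand-in for a decoder that loops at low temperatures.
def fake_decode(temperature: float) -> str:
    if temperature < 0.4:
        return "thank you " * 40
    return "Thank you all for joining today."


print(decode_with_fallback(fake_decode))
```

As the text notes, this kind of check reduces looping output but cannot remove it entirely, since a retry at a higher temperature may still repeat itself or drift from the audio.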

Conclusion

Whisper marks a significant step forward in automatic speech recognition, with its open-source availability making it accessible to a wide range of users, from individual developers to large companies. Its multilingual support, noise resilience, and ability to handle multiple tasks in one model give it an edge in a space where most ASR systems require specific fine-tuning.

That said, Whisper's progress also underscores ongoing challenges in ASR technology, such as its uneven performance across languages and occasional issues with repetitive text generation. Despite these limitations, Whisper demonstrates what's possible in audio AI, offering a versatile and efficient solution for many speech recognition tasks.

Bonus: Resources


Thank you for reading 🩶
