Text-to-audio models are AI systems designed to (you guessed it right!) convert written text into sound. These models are used for the following purposes:
Text-to-speech (TTS) models generate spoken language from text input. They are used in virtual assistants, audiobooks and navigation systems.
Music generation models create music from textual descriptions or instructions. They are employed in creative tools, entertainment and automated music composition.
Sound effect generation models produce specific sound effects based on textual descriptions. They are useful for video game development, movies and virtual environments.
Here is a list of the newest and classic text-to-audio models of different types:
JASCO, a text-to-music model developed by Meta, generates realistic and high-quality music clips from symbolic and audio-based inputs. It uses the flow matching generation technique to achieve high-quality sound. JASCO provides detailed control over musical elements and specific parts of the music, like when certain chords or beats should play. → Read more
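To make the technique concrete, here is a minimal sketch of one common conditional flow matching objective; `model`, `x1`, and `cond` are hypothetical placeholders rather than anything from JASCO's codebase:

```python
import torch

# One common instantiation of the conditional flow matching loss; `model`
# is assumed to take (noisy input, time, conditioning) and predict velocity.
def flow_matching_loss(model, x1, cond):
    """x1: batch of clean audio latents, shape (batch, channels, time)."""
    x0 = torch.randn_like(x1)                          # pure-noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device) # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1        # point on the straight noise->data path
    target = x1 - x0                  # velocity of that path (constant in t)
    pred = model(xt, t.flatten(), cond)   # the network's velocity estimate
    return ((pred - target) ** 2).mean()  # regress onto the target velocity
```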
Stable Audio Open is Stability AI's new open-weights text-to-audio model that generates up to 47 seconds of stereo audio at 44.1 kHz from text prompts. This model has three components: an autoencoder for waveform compression, a T5-based text embedding, and a transformer-based diffusion model (DiT). With Stable Audio Open you can generate realistic sounds and field recordings. → Read more
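Because the weights are open, the model can be run through the StableAudioPipeline shipped in recent diffusers releases. A hedged sketch (the exact arguments may differ between versions):

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the open-weights checkpoint from the Hugging Face Hub.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

# Generate a clip from a text prompt; length is capped at roughly 47 s.
audio = pipe(
    prompt="field recording of rain on a tin roof",
    num_inference_steps=100,
    audio_end_in_s=10.0,   # requested clip length in seconds
).audios[0]                # tensor of shape (channels, samples)

# The model outputs stereo audio at 44.1 kHz.
sf.write("rain.wav", audio.T.float().cpu().numpy(), 44100)
```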
MELLE demonstrates a fast and simple approach to speech synthesis that avoids vector quantization. Introduced by Microsoft and The Chinese University of Hong Kong, MELLE generates speech directly from written text by predicting continuous mel-spectrogram frames. → Read more
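As a rough illustration of codebook-free autoregressive mel prediction, here is a hypothetical sketch; it leaves out MELLE's latent sampling module, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative-only: predict the next continuous mel frame from text
# embeddings and past frames, with no discrete codebook anywhere.
class ARMelDecoder(nn.Module):
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.proj_out = nn.Linear(d_model, n_mels)  # regression head

    def forward(self, text_emb, prev_mels):
        # text_emb: (batch, text_len, d_model); prev_mels: (batch, t, n_mels)
        h = self.proj_in(prev_mels)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        h = self.decoder(h, text_emb, tgt_mask=mask)  # causal self-attention
        return self.proj_out(h)  # continuous mel frames, trained by regression
```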
VALL-E by Microsoft is a neural codec language model that treats text-to-speech (TTS) as a language modeling task. VALL-E converts phonemes to discrete codes and then to waveforms; these discrete codes represent both the text and the speaker's voice. It handles tasks like zero-shot TTS, where it generates speech from a 3-second recording of a new speaker, as well as speech editing and content creation together with models like GPT (a conceptual sketch of the pipeline follows the list of extended models below). → Read more
Extended models:
VALL-E X supports cross-lingual TTS: it can synthesize personalized speech in another language for a monolingual speaker.
VALL-E R improves on VALL-E by offering more accurate phoneme alignment, faster decoding speeds, and fewer errors like typos, making it more robust and efficient for text-to-speech tasks.
VALL-E 2 achieves human-level performance in zero-shot TTS. It reduces errors with Repetition Aware Sampling and speeds up processing with Grouped Code Modeling, making speech more robust and natural.
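Here is a conceptual sketch of the two-stage VALL-E pipeline described above; every callable is an injected placeholder, since this is not a released API:

```python
# Conceptual sketch of VALL-E's pipeline: text and a short voice prompt
# become discrete codec codes, which are then decoded back to audio.
def vall_e_tts(text, enrollment_audio,
               phonemize, codec_encode, codec_decode, ar_model, nar_model):
    phonemes = phonemize(text)               # text -> phoneme sequence
    prompt = codec_encode(enrollment_audio)  # 3-second sample of target voice
    # Stage 1: an autoregressive LM predicts the first codebook's codes.
    coarse = ar_model(phonemes, prompt)
    # Stage 2: a non-autoregressive model fills in the remaining codebooks.
    codes = nar_model(phonemes, prompt, coarse)
    return codec_decode(codes)               # discrete codes -> waveform
```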
Suno AI is an AI-driven music creation tool that generates melodies, harmonies, and full compositions from text prompts or lyrics. It provides high-quality instrumental tracks across various genres, making it suitable for musicians, professionals, enthusiasts, and educators to enhance their music projects. → Read more
Bark, a transformer-based text-to-audio model by Suno, generates realistic multilingual speech, music, background noise, and sound effects. It supports various languages, detects language from input text, and uses native accents for code-switched text. Bark can also produce nonverbal sounds like laughing, sighing, and crying. → Read more
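Bark checkpoints are published on the Hugging Face Hub, so a quick test can go through the transformers library; the voice preset below is just one of the presets that ship with the model:

```python
import scipy.io.wavfile
from transformers import AutoProcessor, BarkModel

# Load the small Bark checkpoint and its processor from the Hub.
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

# Bracketed cues like [laughs] trigger Bark's nonverbal sounds.
inputs = processor(
    "Hello! [laughs] Bark can add nonverbal sounds, too.",
    voice_preset="v2/en_speaker_6",
)
audio = model.generate(**inputs).cpu().numpy().squeeze()
scipy.io.wavfile.write("bark.wav", model.generation_config.sample_rate, audio)
```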
WaveNet, a deep neural network by Google DeepMind, efficiently handles high-resolution audio data to produce natural-sounding speech, outperforming other systems in English and Mandarin. It can mimic various speakers and generate realistic music fragments. WaveNet also shows promise in phoneme recognition, demonstrating versatility beyond text-to-speech applications. → Read more
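A minimal sketch of WaveNet's core building block, a stack of dilated causal convolutions with gated activations, written in PyTorch for illustration (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    def forward(self, x):
        # Left-pad so the convolution never sees future samples.
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(nn.functional.pad(x, (pad, 0)))

class WaveNetStack(nn.Module):
    def __init__(self, channels=64, layers=8):
        super().__init__()
        # Dilation doubles each layer, so the receptive field grows
        # exponentially with depth.
        self.convs = nn.ModuleList(
            CausalConv1d(channels, 2 * channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        )

    def forward(self, x):                       # x: (batch, channels, time)
        for conv in self.convs:
            f, g = conv(x).chunk(2, dim=1)      # gated activation unit
            x = x + torch.tanh(f) * torch.sigmoid(g)  # residual connection
        return x
```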
The Jukebox model by OpenAI creates music with singing. It uses a multi-scale VQ-VAE to compress raw audio into discrete codes and then an autoregressive Transformer to generate the music. Jukebox produces high-quality, varied songs and can be guided by artist, genre, and lyrics. → Read more
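To clarify the VQ step, here is a minimal sketch of vector quantization with a straight-through estimator; the codebook size and dimensions are illustrative, not Jukebox's actual settings:

```python
import torch
import torch.nn as nn

# Map each encoder output vector to its nearest codebook entry, producing
# discrete token ids that a Transformer can model autoregressively.
class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=2048, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                       # z: (batch, time, dim)
        w = self.codebook.weight                # (num_codes, dim)
        # Squared distance from every latent to every codebook entry.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ w.T
             + w.pow(2).sum(-1))                # (batch, time, num_codes)
        codes = d.argmin(dim=-1)                # discrete token ids
        z_q = self.codebook(codes)              # quantized latents
        # Straight-through estimator: gradients bypass the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, codes
```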
Voicebox is a speech generation model from Meta. It excels in various speech tasks by learning from large datasets. Voicebox can synthesize speech in six languages, remove noise, edit content, and transfer audio styles. It can generate speech up to 20 times faster than the most advanced auto-regressive models. → Read more
Audiobox by Meta is an advanced model for generating various types of audio, including speech and sound. It offers detailed control over audio styles and can create novel styles based on text descriptions. Audiobox sets new benchmarks in audio quality and speed, making audio creation more accessible and efficient. → Read more
MusicLM is a model that creates high-quality music from text descriptions like "a calming violin melody backed by a distorted guitar riff." Developed by Google, it generates consistent music at 24 kHz for several minutes. MusicLM can also transform hummed or whistled tunes based on text descriptions. → Read more
MusicFX by Google is an upgrade of MusicLM. MusicFX can create compositions up to 70 seconds long, as well as music loops. Google claims it produces higher-quality music faster. It also has a DJ mode and is generally available. → Read more
Deep Voice 3 is a TTS system by Baidu that uses a fully-convolutional attention-based neural network. It handles large datasets, with over 800 hours of audio from 2,000+ speakers. Deep Voice 3 reduces common errors, compares waveform synthesis methods, and can process 10 million queries daily on a single GPU server. → Read more
DITTO (Diffusion Inference-Time Optimization) optimizes initial noise latents to control pre-trained text-to-music models during inference. Using differentiable feature matching and gradient checkpointing, DITTO supports tasks like inpainting, outpainting, looping, and controlling intensity, melody, and structure without model fine-tuning. → Read more
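A conceptual sketch of DITTO-style inference-time optimization: the initial noise latent is the only trainable parameter, and a feature-matching loss is backpropagated through a frozen, differentiable sampler (placeholders throughout; the real system also uses gradient checkpointing to keep memory manageable):

```python
import torch

# `sampler` runs the frozen diffusion model end to end and must be
# differentiable; `feature_fn` extracts the features being matched
# (e.g. an intensity curve or melody). Both are placeholders.
def ditto_optimize(sampler, feature_fn, target_features, steps=100, lr=0.05):
    latent = torch.randn(1, 64, 256, requires_grad=True)  # initial noise
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        audio = sampler(latent)            # full (differentiable) sampling run
        loss = torch.nn.functional.mse_loss(feature_fn(audio), target_features)
        opt.zero_grad()
        loss.backward()                    # gradients flow back to the latent
        opt.step()
    return latent.detach()                 # optimized noise, model unchanged
```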
