
What AI Music Generators Can Do (And How They Do It)

What's this blog post about?

The article provides an overview of MusicLM, a text-to-music generation model developed by Google Research and DeepMind. It explains how MusicLM is trained on a large dataset of music paired with descriptive annotations so that it can generate high-fidelity audio from natural language prompts. Training involves three main stages: pretraining, fine-tuning, and aligning the model's outputs with human preferences. During pretraining, MusicLM learns general audio representations by predicting masked tokens in audio spectrograms; fine-tuning then adapts the model to generate music from text descriptions; finally, alignment with human judgments refines the model's performance.

The article also discusses several important aspects of the model, illustrated by short sketches after this summary:

1. Architecture: The model consists of a stack of transformer layers that process audio spectrograms and text embeddings. Two additional components improve controllability and output quality: residual vector quantization (RVQ) and token interleaving patterns.

2. RVQ: The model compresses audio data into discrete token streams using multiple codebooks. This technique captures complex musical structure while reducing computational complexity (see the RVQ sketch below).

3. Token interleaving patterns: These patterns determine how the model predicts tokens from the different codebooks during inference. The article points to the MusicGen authors' empirical evaluation of various interleaving strategies, which highlights the benefits of a simple delayed pattern (see the second sketch below).

4. Timing-conditioning: Unlike previous models, which were trained to produce audio of fixed lengths, timing-conditioning enables generating audio with a specified duration. Inspired by Stable Audio, it works by incorporating learned embeddings for the start time and total duration of the original audio into the model's inputs (see the final sketch below).

Overall, MusicLM represents a significant advancement in text-to-music generation, showing promising capabilities for creating high-quality, diverse, and controllable music from natural language prompts. It still faces challenges, however, such as maintaining coherent structure in extended outputs and accurately reproducing vocal sounds. Despite these limitations, the field is moving toward commercial deployment of such models, pointing to further exciting developments on the horizon.
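To make the RVQ idea in point 2 concrete, here is a minimal Python sketch of residual vector quantization. Each stage quantizes the residual left over by the previous stage, so a frame is represented by one token per codebook. The codebooks here are random stand-ins for illustration; real neural codecs learn them from data, and the shapes and sizes below are assumptions, not values from the article.

```python
# Minimal RVQ sketch: multi-stage quantization of a single frame embedding.
import numpy as np

rng = np.random.default_rng(0)

num_codebooks = 4      # number of quantization stages (assumed)
codebook_size = 256    # entries per codebook (assumed)
dim = 128              # dimensionality of each frame embedding (assumed)

# Stand-in "learned" codebooks, here randomly initialized.
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))

def rvq_encode(frame):
    """Quantize one frame into one token per codebook by coding residuals."""
    residual = frame
    tokens = []
    for cb in codebooks:
        # Pick the codebook entry nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        # Subtract the chosen entry; the next stage codes what remains.
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens):
    """Approximately reconstruct the frame by summing the chosen entries."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

frame = rng.normal(size=dim)
tokens = rvq_encode(frame)
approx = rvq_decode(tokens)
print(tokens, float(np.linalg.norm(frame - approx)))
```

Each added stage shrinks the reconstruction error, which is why a handful of small codebooks can stand in for one impractically large one.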
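The delayed interleaving pattern from point 3 can be sketched as a simple shift of the token grid. Assuming a grid of shape (num_codebooks, num_frames), codebook k is delayed by k steps, so at every transformer step one token per codebook can be predicted in parallel while each finer codebook still sees the coarser ones from earlier steps. This mirrors the idea described in the article, not any model's exact implementation; the PAD value is a placeholder of our choosing.

```python
# Sketch of the "delayed" codebook interleaving pattern.
PAD = -1  # filler for positions before a codebook's stream starts (assumed)

def delay_pattern(tokens):
    """Shift the k-th codebook's token stream right by k steps."""
    K = len(tokens)        # number of codebooks
    T = len(tokens[0])     # number of frames
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = tokens[k][t]
    return out

# Four codebooks, three frames: tokens[k][t] is the stage-k token at frame t.
tokens = [[10, 11, 12],
          [20, 21, 22],
          [30, 31, 32],
          [40, 41, 42]]
for row in delay_pattern(tokens):
    print(row)
# [10, 11, 12, -1, -1, -1]
# [-1, 20, 21, 22, -1, -1]
# [-1, -1, 30, 31, 32, -1]
# [-1, -1, -1, 40, 41, 42]
```

The appeal of this pattern is its cost: generating T frames with K codebooks takes roughly T + K - 1 steps instead of the T * K steps a fully flattened ordering would need.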
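Finally, a hedged sketch of the timing-conditioning from point 4, as the summary describes it: learned embeddings for the clip's start time and total duration are appended to the conditioning sequence the model attends to. The class name, the per-second discretization, and all dimensions are illustrative assumptions, not details from the article or the Stable Audio paper.

```python
# Hypothetical timing-conditioning module (names and shapes are assumptions).
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    def __init__(self, max_seconds=512, dim=768):
        super().__init__()
        # One learned embedding per whole second: a simple discretization.
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, text_cond, start_sec, total_sec):
        # text_cond: (batch, seq, dim) conditioning from the text prompt.
        timing = torch.stack(
            [self.start_emb(start_sec), self.total_emb(total_sec)], dim=1
        )  # (batch, 2, dim): start-time and total-duration embeddings
        return torch.cat([text_cond, timing], dim=1)

cond = TimingConditioner()
text = torch.randn(1, 16, 768)  # stand-in text conditioning
# Condition on a window starting at second 0 of a 30-second clip, which at
# inference time steers the model toward producing 30 seconds of audio.
out = cond(text, torch.tensor([0]), torch.tensor([30]))
print(out.shape)  # torch.Size([1, 18, 768])
```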

Company
AssemblyAI

Date published
Sept. 22, 2023

Author(s)
Marco Ramponi

Word count
2202

Hacker News points
6

Language
English
