MusicGEN#
In this section, we take MusicGEN [CKG+24] as an example to introduce auto-regressive modeling for music generation with the transformer architecture [VSP+17].
Neural Audio Codec#
MusicGEN is an auto-regressive text-to-music generative model. It uses Encodec [DefossezCSA23] to convert music time-domain signals into discrete neural audio codec tokens, which form its input sequence.
As illustrated in the figure above, the Encodec architecture consists of 1D convolutional blocks in its encoder and 1D deconvolutional blocks in its decoder. The bottleneck block features a multi-step residual vector quantization (RVQ) mechanism, which converts the continuous latent music embeddings from the encoder into discrete audio tokens. The decoder is trained to reconstruct the input time-domain signal from these audio tokens, using a combination of objectives: an L1 loss on the time-domain signal, L1 and L2 losses on the mel-spectrogram, and adversarial training with multi-scale STFT discriminators.
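The core idea of the RVQ bottleneck fits in a few lines: each quantizer stage snaps the current residual to its nearest codebook entry, and the next stage quantizes whatever error remains. Below is a minimal PyTorch sketch of this multi-step quantization; the codebook sizes, dimensions, and names are illustrative assumptions, not the actual Encodec implementation.

```python
import torch

def residual_vector_quantize(latents, codebooks):
    """Quantize continuous latents with multi-step residual VQ.

    latents:   (num_frames, dim) continuous encoder outputs.
    codebooks: list of (codebook_size, dim) tensors, one per RVQ stage.
    Returns the discrete token indices per stage and the quantized latents.
    """
    residual = latents
    quantized = torch.zeros_like(latents)
    tokens = []
    for codebook in codebooks:
        # Nearest codebook entry for the current residual, per frame.
        distances = torch.cdist(residual, codebook)   # (num_frames, codebook_size)
        indices = distances.argmin(dim=-1)            # discrete audio tokens for this stage
        selected = codebook[indices]                  # (num_frames, dim)
        quantized = quantized + selected
        residual = residual - selected                # the next stage quantizes the leftover error
        tokens.append(indices)
    return torch.stack(tokens), quantized

# Example: 4 RVQ stages with 1024 entries each, 128-dim latents for 50 frames.
codebooks = [torch.randn(1024, 128) for _ in range(4)]
tokens, quantized = residual_vector_quantize(torch.randn(50, 128), codebooks)
print(tokens.shape)  # torch.Size([4, 50]) -> one token stream per codebook
```

Each additional stage refines the approximation, which is why keeping more codebooks corresponds to a higher codec bandwidth.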
The pretrained Encodec model preprocesses the music time-domain signals into audio tokens, which serve as one part of the input for the MusicGEN model.
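As a concrete illustration, the snippet below tokenizes a waveform with the open-source `encodec` package. MusicGEN itself uses a 32 kHz Encodec variant; the 24 kHz checkpoint, target bandwidth, and file name here are placeholders for the sketch.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained Encodec (the 24 kHz checkpoint is used here purely as an example).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # bandwidth controls how many RVQ codebooks are kept

# Load a music clip and resample/remix it to the codec's expected format.
wav, sr = torchaudio.load("music_clip.wav")  # placeholder file name
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# Discrete audio tokens: one token stream per residual codebook.
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)  # (batch, n_codebooks, n_frames)
print(codes.shape)
```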
MusicGEN#
MusicGEN utilizes a transformer decoder architecture to predict the next audio token based on the preceding audio tokens, as illustrated by the following probability function:
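$$
p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})
$$

where $x_t$ denotes the audio token at time step $t$.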
To incorporate text into the music generation task, MusicGEN employs two methods of conditioning the generation on the text, as illustrated in the figure above:
Time-domain Concatenation: the text representations produced by the T5 encoder are prepended to the audio tokens as prefix tokens, serving as the conditioning signal.
Cross Attention: the keys and values (K, V) come from the text tokens and the queries (Q) from the audio tokens in the cross-attention modules of MusicGEN; a minimal sketch of both mechanisms follows below.
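The following is a small, self-contained PyTorch sketch of these two conditioning paths; the dimensions, module choices, and variable names are illustrative assumptions rather than the actual MusicGEN implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
text_emb = torch.randn(1, 12, d_model)    # (batch, text_len, dim) from the T5 encoder
audio_emb = torch.randn(1, 100, d_model)  # (batch, audio_len, dim) embedded audio tokens

# 1) Prefix (time-domain) concatenation: text embeddings are prepended to the
#    audio tokens along the time axis, so causal self-attention sees them as context.
prefix_input = torch.cat([text_emb, audio_emb], dim=1)  # (batch, text_len + audio_len, dim)

# 2) Cross-attention: audio tokens provide the queries (Q), text embeddings
#    provide the keys and values (K, V) inside each decoder block.
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
conditioned, _ = cross_attn(query=audio_emb, key=text_emb, value=text_emb)

print(prefix_input.shape, conditioned.shape)
```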
With text conditioning, the probability function becomes:
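$$
p(x_1, \ldots, x_T \mid C) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, C)
$$

where $C$ denotes the text condition produced by the T5 encoder.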