# Datasets

Below we review existing music datasets that pair music items with natural language text. This text can take the form of music descriptions such as captions, or of other text types such as question-answer pairs and reviews. Note that in many cases the audio is not directly distributed with the dataset and may be subject to copyright.
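Several of the datasets below distribute only YouTube IDs and clip timestamps rather than the audio itself. As a minimal sketch of how one such annotation could be resolved to an audio clip, the hypothetical helper below builds a download-and-trim shell command: it assumes the `yt-dlp` and `ffmpeg` command-line tools are installed, and uses the `ytid`/`start_s`/`end_s` column names from the MusicCaps CSV.

```python
def clip_command(ytid: str, start_s: float, end_s: float, out: str) -> str:
    """Build a shell command that downloads one YouTube video's audio
    and trims it to the annotated [start_s, end_s] segment.

    Hypothetical helper for illustration; assumes yt-dlp and ffmpeg
    are available on the PATH.
    """
    url = f"https://www.youtube.com/watch?v={ytid}"
    return (
        # Extract the audio track as WAV into a temporary file ...
        f'yt-dlp -x --audio-format wav -o tmp.wav "{url}" && '
        # ... then cut out the annotated segment.
        f"ffmpeg -i tmp.wav -ss {start_s} -to {end_s} {out}"
    )

print(clip_command("dQw4w9WgXcQ", 30.0, 40.0, "clip.wav"))
```

In practice one would also want rate limiting and error handling, since some videos referenced by older annotations are no longer available.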

Table 2: Music description datasets.

| Dataset | Content | Size (# annotations) | Accompanying Audio | Audio Length | Audio License | Text source | Dataset License |
|---|---|---|---|---|---|---|---|
| MusicCaps | Captions | 5.5k | ❌ (YT IDs from AudioSet) | 10s | - | Human-written (by musicians) | CC BY-SA 4.0 |
| Song Describer Dataset | Captions | 1.2k | ✅ (MTG-Jamendo) | 2min | CC (various) | Human-written (crowdsourced) | CC BY-SA 4.0 |
| YouTube8M-MusicTextClips | Captions | 3.2k | ❌ (YT IDs from YouTube8M) | 10s | - | Human-written (crowdsourced) | CC BY-SA 4.0 |
| WikiMute | Music descriptions | 9k | ❌ | ~29s | - | Human-written (crawled from Wikipedia) | CC BY-SA 3.0 |
| LP-MusicCaps | Captions | 2.2M / 88k / 22k | ❌ (MusicCaps, MagnaTagATune, and Million Song Dataset ECALS) | 30s / 10s | - | Synthetic (generated from tags via GPT-3.5) | CC BY-NC 4.0 |
| MusicQA | Question-answer pairs | 113k | ❌ (MusicCaps, MagnaTagATune) | 10s / 30s | - | Synthetic (generated from tags/captions via MPT-7B) | MIT |
| MusicInstruct | Question-answer pairs | 28k / 33k | ❌ (MusicCaps) | 10s | - | Synthetic (generated from captions via GPT-4) | CC BY-NC 4.0 |
| MusicBench | Captions | 53k | ❌ (MusicCaps) | 10s | - | Human-written captions expanded via text templates | CC BY-SA 3.0 |
| MUCaps | Captions | 22k | ❌ (YT IDs from AudioSet) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0 |
| MuEdit | Music editing instructions | 11k | ❌ (MusicCaps) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0 |
| FUTGA | Captions (fine-grained) | 51.8k | ❌ (MusicCaps, Song Describer Dataset) | 2-5min | - | Synthetic (generated from audio via FUTGA) | Apache-2.0 |
| MARD | Album reviews | 264k | ❌ | - | - | Human-written (Amazon customers) | MIT |

## Human-written text

Among the datasets containing music captions, only three feature fully human-written descriptions: MusicCaps [ADB+23], the Song Describer Dataset [MWD+23] and YouTube8M-MusicTextClips [MSSR23].

Some examples of music captions from the Song Describer Dataset (SDD) are shown below:

```python
import base64

import datasets
import pandas as pd
from IPython.display import HTML, display

sdd = datasets.load_dataset('renumics/song-describer-dataset')

# Keep the audio column as raw (undecoded) bytes rather than decoded
# waveform arrays, so the base64 embedding below has bytes to work with.
sdd = sdd.cast_column('path', datasets.Audio(decode=False))

df = pd.DataFrame({
    'Audio': [audio['bytes'] for audio in sdd['train'][:10]['path']],
    'Caption': sdd['train'][:10]['caption'],
})

# Wrap raw audio bytes in an inline HTML <audio> player
def create_audio_html(audio_bytes):
    audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
    return (
        f'<audio controls><source src="data:audio/wav;base64,{audio_base64}" '
        'type="audio/wav"></audio>'
    )

df['Audio'] = df['Audio'].apply(create_audio_html)

# Display the DataFrame as HTML, keeping the embedded players unescaped
display(HTML(df.to_html(escape=False)))
```