Datasets
Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn’t directly distributed with the dataset and may be subject to copyright.
| Dataset | Content | Size (# annotations) | Accompanying Audio | Audio Length | Audio License | Text source | Dataset License |
|---|---|---|---|---|---|---|---|
|  | Captions | 5.5k | ❌ | 10s | - | Human-written (by musicians) | CC BY-SA 4.0 |
|  | Captions | 1.2k | ✅ | 2min | CC (various) | Human-written (crowdsourced) | CC BY-SA 4.0 |
|  | Captions | 3.2k | ❌ | 10s | - | Human-written (crowdsourced) | CC BY-SA 4.0 |
|  | Music descriptions | 9k | ❌ | ~29s | - | Human-written (crawled from Wikipedia) | CC BY-SA 3.0 |
|  | Captions | 2.2M / 88k / 22k | ❌ | 30s / 10s | - | Synthetic (generated from tags via GPT-3.5) | CC BY-NC 4.0 |
|  | Question-answer pairs | 113k | ❌ | 10s / 30s | - | Synthetic (generated from tags/captions via MPT-7B) | MIT |
|  | Question-answer pairs | 28k / 33k | ❌ | 10s | - | Synthetic (generated from captions via GPT-4) | CC BY-NC 4.0 |
|  | Captions | 53k | ❌ | 10s | - | Human-written captions expanded via text templates | CC BY-SA 3.0 |
|  | Captions | 22k | ❌ (YT IDs from AudioSet) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0 |
|  | Music editing instructions | 11k | ❌ | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0 |
|  | Captions (fine-grained) | 51.8k | ❌ | 2-5min | - | Synthetic (generated from audio via FUTGA) | Apache-2.0 |
|  | Album reviews | 264k | ❌ | - | - | Human-written (Amazon customers) | MIT |
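When shortlisting datasets for a project, it can help to query the table rather than scan it by eye. The sketch below is only an illustration: it copies a subset of the table's columns into plain Python records and filters them by text source, audio availability and licence. The `DatasetRow` type, its field names and the two helper functions are hypothetical names chosen for this example, and the "NC" substring check is a rough heuristic, not a licence analysis.

```python
from typing import NamedTuple


class DatasetRow(NamedTuple):
    """One table row (subset of columns); field names are illustrative."""
    content: str
    size: str
    has_audio: bool
    text_source: str
    license: str


# Values copied from the table above, in the same order.
ROWS = [
    DatasetRow("Captions", "5.5k", False, "Human-written (by musicians)", "CC BY-SA 4.0"),
    DatasetRow("Captions", "1.2k", True, "Human-written (crowdsourced)", "CC BY-SA 4.0"),
    DatasetRow("Captions", "3.2k", False, "Human-written (crowdsourced)", "CC BY-SA 4.0"),
    DatasetRow("Music descriptions", "9k", False, "Human-written (crawled from Wikipedia)", "CC BY-SA 3.0"),
    DatasetRow("Captions", "2.2M / 88k / 22k", False, "Synthetic (generated from tags via GPT-3.5)", "CC BY-NC 4.0"),
    DatasetRow("Question-answer pairs", "113k", False, "Synthetic (generated from tags/captions via MPT-7B)", "MIT"),
    DatasetRow("Question-answer pairs", "28k / 33k", False, "Synthetic (generated from captions via GPT-4)", "CC BY-NC 4.0"),
    DatasetRow("Captions", "53k", False, "Human-written captions expanded via text templates", "CC BY-SA 3.0"),
    DatasetRow("Captions", "22k", False, "Synthetic (generated from audio via MU-LLaMA)", "CC BY-NC-ND 4.0"),
    DatasetRow("Music editing instructions", "11k", False, "Synthetic (generated from audio via MU-LLaMA)", "CC BY-NC-ND 4.0"),
    DatasetRow("Captions (fine-grained)", "51.8k", False, "Synthetic (generated from audio via FUTGA)", "Apache-2.0"),
    DatasetRow("Album reviews", "264k", False, "Human-written (Amazon customers)", "MIT"),
]


def is_fully_human_written(row: DatasetRow) -> bool:
    # Excludes the template-expanded captions, which are only partly human-written.
    return row.text_source.startswith("Human-written (")


def is_non_commercial(row: DatasetRow) -> bool:
    # Rough heuristic: Creative Commons non-commercial licences contain "NC".
    return "NC" in row.license


human_written = [r for r in ROWS if is_fully_human_written(r)]
with_audio = [r for r in ROWS if r.has_audio]
non_commercial = [r for r in ROWS if is_non_commercial(r)]

print(f"{len(human_written)} rows with fully human-written text")
print(f"{len(with_audio)} rows that distribute audio")
print(f"{len(non_commercial)} rows under a non-commercial licence")
```

The same records could equally be loaded into a pandas DataFrame; plain tuples are used here only to keep the example free of dependencies.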
Human-written text
Among the datasets containing music captions, only three feature fully human-written descriptions: MusicCaps [ADB+23], the Song Describer Dataset [MWD+23] and YouTube8M-MusicTextClips [MSSR23].
Some examples of music captions from the Song Describer Dataset (SDD) are shown below:
import base64

import datasets
import pandas as pd
from IPython.display import HTML, display

# Load the Song Describer Dataset from the Hugging Face Hub
sdd = datasets.load_dataset('renumics/song-describer-dataset')

# Keep the raw audio bytes and the caption for the first 10 examples
subset = sdd['train'][:10]
df = pd.DataFrame({
    'Audio': [audio["bytes"] for audio in subset['path']],
    'Caption': list(subset['caption']),
})


def create_audio_html(audio_bytes):
    """Embed raw audio bytes in an HTML <audio> player as a base64 data URI."""
    audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
    return f'<audio controls><source src="data:audio/wav;base64,{audio_base64}" type="audio/wav"></audio>'


# Replace the raw bytes with playable audio tags
df['Audio'] = df['Audio'].apply(create_audio_html)

# Render the DataFrame as HTML (escape=False keeps the <audio> tags intact)
display(HTML(df.to_html(escape=False)))
Audio | Caption | |
---|---|---|
0 |