MuChoMusic:
Evaluating Music Understanding in Multimodal Audio-Language Models

1Universitat Pompeu Fabra, 2Queen Mary University of London, 3Universal Music Group
ISMIR 2024

*Equal Contribution
Teaser

Overview

MuChoMusic is a benchmark for evaluating music understanding in multimodal audio-language models. It comprises 1,187 multiple-choice questions, all validated by human annotators, associated with 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Each question comes with three distractors composed to test different aspects of language and audio understanding. In the knowledge category, questions probe a model's ability to recognise pre-acquired knowledge across various musical aspects. Questions that test reasoning are instead designed to require the synthesis and analytical processing of multiple musical concepts.

Dataset

MuChoMusic includes 1,187 carefully crafted multiple-choice questions tied to 644 music tracks sourced from MusicCaps and the Song Describer Dataset (SDD). These tracks were chosen for their diverse genres and high-quality recordings, ensuring consistent and reliable evaluation of audio-language models.
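For illustration, each benchmark item can be thought of as a question, four answer options, and the associated track and category metadata. The sketch below uses hypothetical field names and values and is not the exact schema of the released data.

    # Illustrative benchmark item (Python); field names and values are hypothetical
    # and may not match the released dataset schema exactly.
    example_item = {
        "track_id": "sdd_000123",       # hypothetical ID of a Song Describer Dataset clip
        "source_dataset": "SDD",        # tracks come from either SDD or MusicCaps
        "question": "Which instrument carries the main melody throughout the excerpt?",
        "options": [
            "An electric guitar",       # correct answer
            "A trumpet",                # distractor
            "A solo violin",            # distractor
            "A synthesizer pad",        # distractor
        ],
        "answer_index": 0,
        "category": "knowledge",        # each question tests knowledge or reasoning
        "dimensions": ["melody", "instrumentation"],
    }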

QA generation example

We generated the questions with a large language model, Gemini 1.0 Pro, which transformed detailed human-written music captions into challenging multiple-choice questions. Each question includes one correct answer and three carefully designed distractors that test different aspects of music comprehension. To ensure the dataset’s accuracy, every question was validated by human annotators, who filtered out ambiguous or incorrect questions and answer options.
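As a rough sketch of this generation step, the snippet below formats a caption into an instruction asking for one question with a correct answer and three distractors. The prompt wording and the call_llm helper are hypothetical stand-ins rather than the exact prompt or API used for the benchmark.

    # Hypothetical caption-to-question prompt builder; not the exact prompt used.
    def build_qa_prompt(caption: str) -> str:
        """Turn a human-written music caption into an instruction asking the LLM for one
        multiple-choice question with one correct answer and three distractors."""
        return (
            "You are given a description of a piece of music.\n"
            f"Description: {caption}\n"
            "Write one multiple-choice question about the music that can be answered from "
            "the description alone, with exactly one correct answer and three plausible but "
            "incorrect distractors. Return the question, the four options, and the index of "
            "the correct option."
        )

    # prompt = build_qa_prompt("A slow acoustic folk song with fingerpicked guitar and soft vocals.")
    # raw_output = call_llm(prompt)  # hypothetical wrapper around the Gemini 1.0 Pro API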

Evaluation dimensions

MuChoMusic uses a structured taxonomy to automatically categorise questions into knowledge and reasoning dimensions. This process spans a diverse range of musical concepts — from melody and rhythm to mood, genre, and cultural context — providing a broad framework for evaluating music comprehension in audio-language models.
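A minimal sketch of this mapping is given below; the dimension names are illustrative approximations and may differ from the benchmark's exact taxonomy labels.

    # Two top-level categories with example dimensions; names are illustrative only.
    TAXONOMY = {
        "knowledge": ["melody", "harmony", "rhythm", "instrumentation", "structure", "genre"],
        "reasoning": ["mood", "temporal relations", "lyrics", "functional context", "cultural context"],
    }

    def category_of(dimension: str) -> str:
        """Return the top-level category ('knowledge' or 'reasoning') of a dimension."""
        for category, dimensions in TAXONOMY.items():
            if dimension in dimensions:
                return category
        raise ValueError(f"Unknown dimension: {dimension}")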

Results

Using MuChoMusic, we evaluate five open-source models (three specialised in the music domain and two general-purpose) and find that Qwen-Audio achieves the highest scores on most dimensions.

Fine-grained Results

We observe that even the best models answer fewer than 50% of the questions correctly. Surprisingly, models trained on music-specific tasks tend to perform worse overall than those trained on a wider variety of general-audio tasks, including speech and everyday sounds.

Overview of Results
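For reference, evaluation boils down to matching each model's free-form answer to one of the four options and averaging accuracy per dimension. The snippet below is a simplified sketch that assumes the item format from the earlier example; the benchmark's actual answer-matching procedure may be more involved than this token heuristic.

    from collections import defaultdict

    def parse_choice(model_output, letters=("A", "B", "C", "D")):
        """Naive heuristic: return the first standalone option letter found in the output."""
        cleaned = model_output.upper().replace("(", " ").replace(")", " ").replace(".", " ")
        for token in cleaned.split():
            if token in letters:
                return token
        return None

    def accuracy_by_dimension(items, predictions):
        """Aggregate accuracy per dimension; `predictions` are raw model outputs, one per item."""
        correct, total = defaultdict(int), defaultdict(int)
        for item, output in zip(items, predictions):
            gold = "ABCD"[item["answer_index"]]
            for dim in item["dimensions"]:
                total[dim] += 1
                correct[dim] += int(parse_choice(output) == gold)
        return {dim: correct[dim] / total[dim] for dim in total}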

Insights

To understand why models perform poorly, we analyse how results change when only a single distractor is provided (a) or when the audio input is perturbed (b).

Experiments with distractors and audio perturbations

Both experiments reveal an over-reliance on the language modality, pointing to the need for better multimodal integration.
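The audio-perturbation check can be sketched as follows, assuming a model exposes an answer(audio, question, options) callable and that waveforms are float arrays; the perturbations (silence and low-level Gaussian noise) illustrate the idea rather than reproduce the paper's exact setup.

    import numpy as np

    def perturb(audio: np.ndarray, kind: str = "noise") -> np.ndarray:
        """Replace the waveform with silence or low-level Gaussian noise of the same length."""
        if kind == "silence":
            return np.zeros_like(audio)
        if kind == "noise":
            return np.random.default_rng(0).normal(0.0, 0.05, size=audio.shape)
        raise ValueError(kind)

    def language_reliance_gap(model, items, load_audio):
        """Accuracy on original audio minus accuracy on perturbed audio; a small gap
        suggests the model answers from the text prompt alone rather than listening."""
        def run(transform):
            hits = 0
            for item in items:
                audio = transform(load_audio(item["track_id"]))
                choice = model.answer(audio, item["question"], item["options"])
                hits += int(choice == item["answer_index"])
            return hits / len(items)
        return run(lambda a: a) - run(perturb)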

BibTeX


    @inproceedings{weck2024muchomusic,
        title={MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models},
        author={Weck, Benno and Manco, Ilaria and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Bogdanov, Dmitry},
        booktitle={Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)},
        year={2024}
    }