tagtraum genre annotations
for the Million Song Dataset

Name: Genre Annotations for the MSD: CD2C (truth by consensus)
License: Research only, strictly non-commercial.
Keywords: Computing, Audio, Music Information Retrieval, MIR, Genre, MSD, AGR

The Million Song Dataset (MSD) is a collection of one million songs annotated with features from The Echo Nest (now part of Spotify). Additional annotations to the MSD are provided by datasets like The Last.fm Dataset, musiXmatch, or the Million Song Dataset Benchmarks by Schindler et al. Among other features, the latter also contains song-level genre annotations derived from the All Music Guide.

To increase the accuracy and granularity of MSD genre annotations, and thus facilitate music genre recognition research, the tagtraum genre annotations are based on multiple source datasets and allow for ambiguity. Details can be found in the publication cited in the FAQ below.
The slides for the oral presentation are available here.

A similar method was also used to learn genre ontologies from crowd-sourced genre labels.

Genre Ground Truth

These three ground truths were generated based on the Last.fm dataset, the Top-MAGD dataset and the beaTunes Genre Dataset (BGD).

Name	Labels	File	Description
CD1	133,676	msd_tagtraum_cd1.cls.zip	Constructed from BGD, LFMGD, and Top-MAGD, same labels as Top-MAGD, contains minority votes.
CD2	280,831	msd_tagtraum_cd2.cls.zip	Based on modified BGD and LFMGD. Additional labels Metal and Punk, International = World, removed Vocal. Some labels ambiguous.
CD2C	191,401	msd_tagtraum_cd2c.cls.zip	Same as CD2 without ambiguous annotations.
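The relationship between CD2 and CD2C can be illustrated with a small sketch. It assumes, without confirmation from this page, that each line of a .cls file holds a track ID followed by one or more tab-separated genre labels, that lines starting with # are comments, and that an annotation counts as ambiguous when it carries more than one label. The function names and the sample lines are hypothetical.

```python
def parse_cls(lines):
    """Map each track ID to its list of genre labels, in file order."""
    annotations = {}
    for line in lines:
        line = line.rstrip("\n")
        # Assumed: blank lines and '#' comment lines carry no annotations.
        if not line or line.startswith("#"):
            continue
        track_id, *genres = line.split("\t")
        annotations[track_id] = genres
    return annotations


def drop_ambiguous(annotations):
    """Keep only tracks that carry exactly one genre label."""
    return {tid: g[0] for tid, g in annotations.items() if len(g) == 1}


# Hypothetical excerpt, not real data from the dataset.
sample = [
    "# hypothetical excerpt",
    "TRACK0001\tRock",
    "TRACK0002\tPop\tRock",
    "TRACK0003\tJazz",
]
parsed = parse_cls(sample)
unambiguous = drop_ambiguous(parsed)
```

To use it on a real file, unzip the download first and pass the open file object to parse_cls, after checking the header comments of the file against the assumptions above.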

Classification Tasks

These tasks are constructed similarly to the ones published by Schindler et al. However, there is no correspondence at the identifier level, i.e. these are independent tasks.

Non-stratified splits
90% training data	CD1	CD2	CD2C
80% training data	CD1	CD2	CD2C
66% training data	CD1	CD2	CD2C
55% training data	CD1	CD2	CD2C

Stratified splits
90% training data	CD1	CD2	CD2C
80% training data	CD1	CD2	CD2C
66% training data	CD1	CD2	CD2C
55% training data	CD1	CD2	CD2C
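For readers who want to build comparable partitions themselves, the following is a minimal sketch of a stratified split, in which every genre contributes roughly the same fraction of its tracks to the training set. It is not the procedure used to create the files above; the function name, the rounding rule, and the seed handling are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Split track IDs so each genre keeps about `train_fraction` in training."""
    by_genre = defaultdict(list)
    for track_id, genre in labels.items():
        by_genre[genre].append(track_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for genre in sorted(by_genre):
        ids = sorted(by_genre[genre])
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


# Toy labels: 10 Rock tracks and 20 Jazz tracks.
labels = {f"R{i}": "Rock" for i in range(10)}
labels.update({f"J{i}": "Jazz" for i in range(20)})
train, test = stratified_split(labels, 0.8)
```

A non-stratified split would instead shuffle all track IDs together and cut once, so the genre proportions in training and test data may drift apart.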

Splits with fixed size per genre
1,000 samples training data / genre set	CD1	CD2	CD2C
2,000 samples training data / genre set	CD1	CD2	CD2C
3,000 samples training data / genre set	-	CD2	-
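A split with a fixed number of training samples per genre can be sketched in the same way. This is an illustration only, not the procedure behind the files above; the function name and the policy of raising an error when a genre has too few tracks are assumptions.

```python
import random
from collections import defaultdict


def fixed_size_split(labels, per_genre, seed=0):
    """Draw `per_genre` training tracks from every genre; the rest is test data."""
    by_genre = defaultdict(list)
    for track_id, genre in labels.items():
        by_genre[genre].append(track_id)

    rng = random.Random(seed)
    train, test = [], []
    for genre in sorted(by_genre):
        ids = sorted(by_genre[genre])
        if len(ids) <= per_genre:
            # Assumed policy: refuse rather than silently build a smaller set.
            raise ValueError(f"genre {genre!r} has only {len(ids)} tracks")
        rng.shuffle(ids)
        train.extend(ids[:per_genre])
        test.extend(ids[per_genre:])
    return train, test


# Toy labels: 50 Rock tracks and 12 Jazz tracks.
labels = {f"R{i}": "Rock" for i in range(50)}
labels.update({f"J{i}": "Jazz" for i in range(12)})
train, test = fixed_size_split(labels, per_genre=10)
```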

Co-occurrences & Trees

BGD and LFMGD are generated based on co-occurrences and derived genre trees (taxonomies). The files below contain both the relative co-occurrences (values below 0.0001 were dropped) and the generated genre trees in JSON format.
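The thresholding of relative co-occurrences can be sketched as follows. The exact definition used for the files is given in the publication; this sketch assumes one plausible reading, namely that the relative co-occurrence of genre B with genre A is the share of A-labelled items that also carry B. The function name and the toy input are hypothetical, and the JSON layout of the actual files is not modelled here.

```python
from collections import Counter
from itertools import permutations

THRESHOLD = 0.0001  # cut-off stated in the text above


def relative_cooccurrences(label_sets, threshold=THRESHOLD):
    """Return {a: {b: share of a-labelled items that also carry b}}."""
    genre_counts = Counter()
    pair_counts = Counter()
    for labels in label_sets:
        unique = set(labels)
        genre_counts.update(unique)
        # Count every ordered pair (a, b) of distinct labels on the same item.
        pair_counts.update(permutations(sorted(unique), 2))

    result = {}
    for (a, b), n in pair_counts.items():
        value = n / genre_counts[a]  # assumed normalisation: by count of a
        if value >= threshold:  # drop values below the cut-off
            result.setdefault(a, {})[b] = value
    return result


# Toy input: one list of genre labels per item.
submissions = [["Rock", "Pop"], ["Rock"], ["Rock", "Metal"], ["Pop"]]
cooc = relative_cooccurrences(submissions)
```

With this normalisation the values are asymmetric: in the toy input, Pop co-occurs with Rock in half of the Pop-labelled items, while Rock co-occurs with Pop in only a third of the Rock-labelled items.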

Note that by far the most user submissions came from English-speaking users, followed by German, French, and Spanish speakers. In the publication, only the labels submitted by English speakers were used.

Source	User-Language	File	Description
Last.fm	Unspecified	lastfm.json.zip	Used for CD1, CD2, CD2C.
beaTunes	English	beatunes_eng.json.zip	Based on 521,070,246 submissions. Used for CD1, CD2, CD2C.
beaTunes	German	beatunes_deu.json.zip	Based on 97,876,937 submissions. Informative only.
beaTunes	French	beatunes_fra.json.zip	Based on 43,316,474 submissions. Informative only.
beaTunes	Spanish	beatunes_spa.json.zip	Based on 27,142,179 submissions. Informative only.
beaTunes	Dutch	beatunes_nld.json.zip	Based on 21,164,860 submissions. Informative only.
beaTunes	Italian	beatunes_ita.json.zip	Based on 14,012,314 submissions. Informative only.
beaTunes	Japanese	beatunes_jpn.json.zip	Based on 11,034,788 submissions. Informative only.
beaTunes	Portuguese	beatunes_por.json.zip	Based on 8,440,576 submissions. Informative only.
beaTunes	Danish	beatunes_dan.json.zip	Based on 4,997,361 submissions. Informative only.
beaTunes	Russian	beatunes_rus.json.zip	Based on 4,521,323 submissions. Informative only.
beaTunes	Swedish	beatunes_swe.json.zip	Based on 4,569,080 submissions. Informative only.
beaTunes	Chinese	beatunes_zho.json.zip	Based on 3,311,139 submissions. Informative only.
beaTunes	Polish	beatunes_pol.json.zip	Based on 1,099,730 submissions. Informative only.
beaTunes	Korean	beatunes_kor.json.zip	Based on 805,969 submissions. Informative only.

Inferred Genre Annotations

Using co-occurrences and derived trees, we annotated both the Last.fm dataset and the matched beaTunes songs with seed-level genres.

Source	Name	File	Description
Last.fm	LFMGD	msd_lastfm_map.cls.zip	Last.fm dataset with additional inferred genre annotations.
beaTunes	BGD	msd_beatunes_map.cls.zip	beaTunes database matched with MSD. Original genre labels and inferred genre annotations.
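As an illustration of how a derived genre tree could map an arbitrary label to a seed-level genre, consider the sketch below. It is not the procedure from the publication: it assumes the tree is available as a child-to-parent mapping and that the seed genres are known as a set. The function name and the example tree are hypothetical.

```python
def seed_genre(label, parent, seeds):
    """Follow parent links from `label` until a seed genre is reached."""
    seen = set()
    current = label
    while current not in seeds:
        if current in seen or current not in parent:
            return None  # cycle, or label not connected to any seed genre
        seen.add(current)
        current = parent[current]
    return current


# Hypothetical tree fragment, child -> parent.
parent = {"Thrash Metal": "Metal", "Bebop": "Jazz", "Hard Bop": "Bebop"}
seeds = {"Metal", "Jazz", "Rock"}

inferred = {label: seed_genre(label, parent, seeds)
            for label in ["Hard Bop", "Thrash Metal", "Rock", "Polka"]}
```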

FAQ

What is the licensing?

Research only, strictly non-commercial.

How to cite the dataset?

Hendrik Schreiber. Improving Genre Annotations for the Million Song Dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015. [slides]

Other research.
