The Million Song Dataset (MSD) is a collection of one million songs annotated with features from The Echo Nest (now part of Spotify). Additional annotations for the MSD are provided by datasets such as the Last.fm Dataset, musiXmatch, and the Million Song Dataset Benchmarks by Schindler et al. Among other features, the latter also contains song-level genre annotations derived from the All Music Guide.
To increase the accuracy and granularity of MSD genre annotations, and thus facilitate
music genre recognition research, the tagtraum genre annotations are
based on multiple source datasets and allow for ambiguity. Details can be found in
this publication.
The slides for the oral presentation are available
here.
A similar method was also used to learn genre ontologies from crowd-sourced genre labels.
The following three ground truths were generated from the Last.fm dataset, the Top-MAGD dataset, and the beaTunes Genre Dataset (BGD).
Name | Labels | File | Description |
---|---|---|---|
CD1 | 133,676 | msd_tagtraum_cd1.cls.zip | Constructed from BGD, LFMGD, and Top-MAGD, same labels as Top-MAGD, contains minority votes. |
CD2 | 280,831 | msd_tagtraum_cd2.cls.zip | Based on modified BGD and LFMGD. Additional labels Metal and Punk, International = World, removed Vocal. Some labels ambiguous. |
CD2C | 191,401 | msd_tagtraum_cd2c.cls.zip | Same as CD2 without ambiguous annotations. |
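To work with one of these files, a small reader along the following lines may help. This is a minimal sketch, not a documented API: it assumes that each `.cls` file is plain text in which lines starting with `#` are comments and every other line holds a track ID followed by one or more tab-separated genre labels. The function name `parse_cls` and the sample IDs are hypothetical; check the header of the actual file before relying on this layout.

```python
from typing import Dict, Iterable, List


def parse_cls(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each track ID to its list of genre labels.

    Assumed layout (verify against the real file): '#' starts a comment,
    fields are tab-separated, and the first field is the track ID.
    """
    annotations: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        track_id, *labels = line.split("\t")
        annotations[track_id] = [label for label in labels if label]
    return annotations


# Hypothetical sample content, not taken from the real dataset.
sample = [
    "# comment line\n",
    "TRAAAAA000000000001\tRock\n",
    "TRAAAAA000000000002\tPop_Rock\tElectronic\n",
]
annotations = parse_cls(sample)

# Tracks with more than one label are the ambiguous (or minority-vote) cases.
ambiguous = {t: g for t, g in annotations.items() if len(g) > 1}
print(len(annotations), "tracks,", len(ambiguous), "with more than one label")
```

For a real file, the same function could be fed an open text file handle instead of the `sample` list.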
These tasks are constructed similarly to the ones published by Schindler et al. However, there is no correspondence at the identifier level, i.e., they are independent tasks.
Non-stratified splits | | | |
---|---|---|---|
90% training data | CD1 | CD2 | CD2C |
80% training data | CD1 | CD2 | CD2C |
66% training data | CD1 | CD2 | CD2C |
55% training data | CD1 | CD2 | CD2C |
Stratified splits | | | |
90% training data | CD1 | CD2 | CD2C |
80% training data | CD1 | CD2 | CD2C |
66% training data | CD1 | CD2 | CD2C |
55% training data | CD1 | CD2 | CD2C |
Splits with fixed size per genre | | | |
1,000 samples training data / genre set | CD1 | CD2 | CD2C |
2,000 samples training data / genre set | CD1 | CD2 | CD2C |
3,000 samples training data / genre set | - | CD2 | - |
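The difference between the split types can be illustrated with a small sketch. The code below is not the procedure used to produce the downloadable splits, which is not specified here; it only shows, under the assumption that each track carries a single label, how a stratified split preserves per-genre proportions and how a fixed-size split caps the number of training samples per genre. All function names, the seed, and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_genre(labels: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect track IDs per genre, in a deterministic order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for track_id in sorted(labels):
        groups[labels[track_id]].append(track_id)
    return groups


def stratified_split(
    labels: Dict[str, str], train_fraction: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split each genre separately so its share is the same in both parts."""
    rng = random.Random(seed)
    train: List[str] = []
    test: List[str] = []
    for _genre, tracks in sorted(group_by_genre(labels).items()):
        rng.shuffle(tracks)
        cut = round(len(tracks) * train_fraction)
        train += tracks[:cut]
        test += tracks[cut:]
    return train, test


def fixed_size_split(
    labels: Dict[str, str], per_genre: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Take at most `per_genre` training tracks from every genre."""
    rng = random.Random(seed)
    train: List[str] = []
    test: List[str] = []
    for _genre, tracks in sorted(group_by_genre(labels).items()):
        rng.shuffle(tracks)
        train += tracks[:per_genre]
        test += tracks[per_genre:]
    return train, test


# Toy data (hypothetical): 80 Rock tracks and 20 Jazz tracks.
toy = {f"T{i:03d}": ("Rock" if i < 80 else "Jazz") for i in range(100)}

strat_train, strat_test = stratified_split(toy, 0.9)
fixed_train, fixed_test = fixed_size_split(toy, 10)

jazz_in_train = sum(toy[t] == "Jazz" for t in strat_train)
print(len(strat_train), "training tracks, of which", jazz_in_train, "are Jazz")
print(len(fixed_train), "training tracks in the fixed-size split")
```

A non-stratified split would instead shuffle all tracks together and cut once, so rare genres can end up over- or under-represented in the training part.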
BGD and LFMGD are generated based on co-occurrences and derived genre trees (taxonomies). These files contain both the relative co-occurrences (values below 0.0001 were dropped) and the generated genre trees in JSON format.
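As an illustration of the thresholding mentioned above, the sketch below counts how often pairs of genre labels occur together, normalizes the counts, and drops values below 0.0001. The normalization used here (dividing by the number of items that carry the first label) is an assumption made for this example, as are the input format and the function name; the publication defines the actual measure, and the JSON layout of the downloadable files is not reproduced here.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, Set, Tuple

THRESHOLD = 0.0001  # values below this were dropped, as stated in the text


def relative_cooccurrences(
    label_sets: Iterable[Set[str]], threshold: float = THRESHOLD
) -> Dict[Tuple[str, str], float]:
    """Return {(a, b): share of items labeled `a` that are also labeled `b`}.

    Assumed normalization for illustration only: count(a and b) / count(a).
    """
    label_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for labels in label_sets:
        label_counts.update(labels)
        # Ordered pairs (a, b) with a != b, since the measure is asymmetric.
        pair_counts.update(permutations(sorted(labels), 2))

    result: Dict[Tuple[str, str], float] = {}
    for (a, b), n in pair_counts.items():
        value = n / label_counts[a]
        if value >= threshold:  # drop negligible co-occurrences
            result[(a, b)] = value
    return result


# Hypothetical toy input: the set of genre labels attached to each song.
songs = [{"rock", "indie"}, {"rock"}, {"rock", "indie"}, {"jazz"}]
cooc = relative_cooccurrences(songs)
strict = relative_cooccurrences(songs, threshold=0.7)
print(cooc[("indie", "rock")], len(strict))
```

With the toy input, every song labeled `indie` is also labeled `rock`, while only two of the three `rock` songs are labeled `indie`, which is why the measure is kept directional in this sketch.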
Note that by far the most user submissions came from English-speaking users, followed by German, French, and Spanish speakers. In the publication, only the labels submitted by English speakers were used.
Source | User-Language | File | Description |
---|---|---|---|
Last.fm | Unspecified | lastfm.json.zip | Used for CD1, CD2, CD2C. |
beaTunes | English | beatunes_eng.json.zip | Based on 521,070,246 submissions. Used for CD1, CD2, CD2C. |
beaTunes | German | beatunes_deu.json.zip | Based on 97,876,937 submissions. Informative only. |
beaTunes | French | beatunes_fra.json.zip | Based on 43,316,474 submissions. Informative only. |
beaTunes | Spanish | beatunes_spa.json.zip | Based on 27,142,179 submissions. Informative only. |
beaTunes | Dutch | beatunes_nld.json.zip | Based on 21,164,860 submissions. Informative only. |
beaTunes | Italian | beatunes_ita.json.zip | Based on 14,012,314 submissions. Informative only. |
beaTunes | Japanese | beatunes_jpn.json.zip | Based on 11,034,788 submissions. Informative only. |
beaTunes | Portuguese | beatunes_por.json.zip | Based on 8,440,576 submissions. Informative only. |
beaTunes | Danish | beatunes_dan.json.zip | Based on 4,997,361 submissions. Informative only. |
beaTunes | Russian | beatunes_rus.json.zip | Based on 4,521,323 submissions. Informative only. |
beaTunes | Swedish | beatunes_swe.json.zip | Based on 4,569,080 submissions. Informative only. |
beaTunes | Chinese | beatunes_zho.json.zip | Based on 3,311,139 submissions. Informative only. |
beaTunes | Polish | beatunes_pol.json.zip | Based on 1,099,730 submissions. Informative only. |
beaTunes | Korean | beatunes_kor.json.zip | Based on 805,969 submissions. Informative only. |
Using co-occurrences and derived trees, we annotated both the Last.fm dataset and the matched beaTunes songs with seed-level genres.
Source | Name | File | Description |
---|---|---|---|
Last.fm | LFMGD | msd_lastfm_map.cls.zip | Last.fm dataset with additional inferred genre annotations. |
beaTunes | BGD | msd_beatunes_map.cls.zip | beaTunes database matched with MSD. Original genre labels and inferred genre annotations. |
Research only, strictly non-commercial.
How to cite the dataset?
Hendrik Schreiber. Improving Genre Annotations for the Million Song Dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015. [slides]