
OpenNap (music collections) Data

As a basis for ground truth in musical artist similarity, we collected lists of artists represented in the personal collections of music listeners. Many peer-to-peer file-sharing systems allow the list of files on a particular node to be queried. Since summer 2001, we (meaning Brian Whitman) have been running such queries on some 3,700 nodes of the OpenNap file-sharing network and inferring, from the file names, the musical artists present in the collection at each node.

The data presented here, which reflects queries up until February 2002, comprises a total of about 1.6 million user-to-song relations. Regularization to remove misspellings and exclude unknown artists left the data described below (317,470 user-to-song relations).

We defined a set of 400 highly-represented artists which we call the aset400. The list of artists is in aset400.txt.

One compact way to represent the data is as a similarity matrix giving a similarity between each pair of the 400 artists: a high similarity indicates that two artists are likely to co-occur in user collections, and a low similarity that they are unlikely to. This 400x400 matrix is available here as aset-opennap-sim.txt. All 160,000 values are smaller than 1, except for those on the leading diagonal (artists compared with themselves).
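As a rough illustration of how such a matrix might be used, the sketch below ranks the nearest neighbours of one artist. It assumes (this page does not state it) that aset-opennap-sim.txt holds one whitespace-separated row of numbers per line and that aset400.txt lists one artist per line in the same order as the matrix rows. The names parse_matrix, most_similar, and the toy artists are hypothetical, introduced only for this example.

```python
def parse_matrix(text):
    """Parse whitespace-separated numeric rows into a square list of lists.

    The file layout is an assumption; adjust the splitting if the real
    file uses a different delimiter.
    """
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("expected a square matrix")
    return rows


def most_similar(sim, names, artist, k=5):
    """Return the k artists with the highest similarity to `artist`.

    The artist's own entry (the leading diagonal) is skipped.
    """
    i = names.index(artist)
    ranked = sorted(
        ((sim[i][j], names[j]) for j in range(len(names)) if j != i),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(name, score) for score, name in ranked[:k]]


# Toy 3x3 stand-in for the real 400x400 matrix (made-up values).
toy_names = ["artist_a", "artist_b", "artist_c"]
toy_text = """1.0 0.4 0.1
0.4 1.0 0.7
0.1 0.7 1.0
"""
toy_sim = parse_matrix(toy_text)
print(most_similar(toy_sim, toy_names, "artist_b", k=2))
# [('artist_c', 0.7), ('artist_a', 0.4)]
```

With the real files, and under the same layout assumptions, the inputs would be read with open("aset400.txt").read().splitlines() and parse_matrix(open("aset-opennap-sim.txt").read()).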

Here is the data in various forms:

And, for relating this data to our 400-element artist set:

Total song-in-collection observations: 317,470
Total collections: 3,245
Unique artists identified: 4,591
Unique songs identified: 65,047
Unique collection-artist relations: 176,113
Average songs/collection: 97.8
Average artists/collection: 54.3
Maximum songs by a single artist in one collection: 216
(The vast majority of collection-artist relations consist of a single song)
Most popular song: 398 occurrences of "It Wasn't Me" by Shaggy
Artist with the most songs in collections: 2,822 songs (0.89%) by The Beatles (in 589 collections)
Artist appearing in most collections: 982 collections (30.3%) containing songs by Madonna
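The derived rows in these statistics (averages and percentages) follow directly from the raw counts. The sketch below shows one way such a summary could be computed from a list of (collection, artist, song) observations. The tuple layout, the summarize function, and the toy data are assumptions made for illustration, not the format of the released files. The last lines recompute the published averages and percentages from the published counts.

```python
from collections import Counter


def summarize(observations):
    """Summarize (collection_id, artist, song) tuples, one per observation."""
    obs = list(observations)
    collections = {c for c, _, _ in obs}
    # A collection-artist relation exists once, however many songs back it.
    pairs = {(c, a) for c, a, _ in obs}
    per_artist = Counter(a for _, a in pairs)
    top_artist, top_count = per_artist.most_common(1)[0]
    return {
        "observations": len(obs),
        "collections": len(collections),
        "collection_artist_relations": len(pairs),
        "songs_per_collection": len(obs) / len(collections),
        "artists_per_collection": len(pairs) / len(collections),
        "most_widespread_artist": (top_artist, top_count),
    }


# Toy data (made up): three collections, three artists.
toy = [
    ("u1", "A", "s1"), ("u1", "A", "s2"), ("u1", "B", "s3"),
    ("u2", "A", "s1"), ("u2", "C", "s4"),
    ("u3", "A", "s5"),
]
stats = summarize(toy)
print(stats["songs_per_collection"])      # 2.0
print(stats["most_widespread_artist"])    # ('A', 3)

# Consistency check of the derived figures against the published counts.
print(round(317470 / 3245, 1))            # 97.8 songs/collection
print(round(176113 / 3245, 1))            # 54.3 artists/collection
print(round(100 * 982 / 3245, 1))         # 30.3 (% of collections)
print(round(100 * 2822 / 317470, 2))      # 0.89 (% of observations)
```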

More information about this measure, and what we did with it, is in our paper for ISMIR-02, The Quest for Ground Truth in Musical Artist Similarity.

Last updated: $Date: 2003/08/07 13:41:52 $
Dan Ellis <dpwe@ee.columbia.edu>