In our ISMIR-2003 paper we considered the problem of evaluating and comparing different music similarity measures. Part of the problem is finding the ground truth against which to evaluate the measures, but even with ground truth in hand we still need to define a figure of merit for scoring a measure against it. We have come up with several measures directly related to the musicseer data, which collected artist similarity judgments directly from users over the web.
We also defined a metric to measure the agreement between any pair of similarity measures (one of which could be a similarity measure derived from the musicseer data, but that's not necessary). It calculates a weighted score of the agreement between the first few artists rated as most similar to each target artist in aset400. We call it the Top-N ranking agreement score, and it is defined by:

$$ s_i = \sum_{r=1}^{N} \alpha_r^{\,r} \, \alpha_c^{\,k_r} $$
where $s_i$ is the score for artist $i$, $N$ is how many similar artists are considered in each case (we use 10), $\alpha_r$ is the 'decay constant' for the reference ranking (we used $0.5^{0.33}$), $\alpha_c$ is the decay constant for the candidate ranking (we used $0.5^{0.67}$), and $k_r$ is the rank under the candidate measure of the artist ranked $r$ under the reference measure. The overall agreement score is obtained by averaging over all artists and normalizing by the maximum ideal score (which is 0.999 using our values). Thus the score varies from near 0 for measures giving unrelated rankings to 1 for measures giving identical rankings (at least for the top $N$ cases).
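As a rough illustration of how such a score could be computed, here is a minimal sketch. It is not the code of topNrankagree.m. It assumes the per-artist score is the sum over r = 1..N of αr^r · αc^kr, that the target artist is left out of its own ranking, and that ties are left in whatever order sort returns (no randomization). The toy matrices and all variable names are hypothetical.

```matlab
% Sketch of a Top-N ranking agreement score (NOT the original topNrankagree.m).
% Assumes s_i = sum_{r=1..N} ar^r * ac^(k_r), averaged over artists and
% divided by the ideal maximum sum_{r=1..N} (ar*ac)^r.
N  = 10;            % number of top-ranked artists considered
ar = 0.5^0.33;      % decay constant for the reference ranking
ac = 0.5^0.67;      % decay constant for the candidate ranking

% Hypothetical toy data: two 12x12 similarity matrices
nart = 12;
[ii, jj] = meshgrid(1:nart);
refsim  = -abs(ii - jj);             % reference: closer indices are more similar
candsim = refsim;                    % candidate identical to the reference

smax = sum((ar*ac).^(1:N));          % ideal score, about 0.999 for these values
s = zeros(nart, 1);
for i = 1:nart
    others = [1:i-1, i+1:nart];      % assumption: the target itself is excluded
    [~, ro] = sort(refsim(i, others),  'descend');   % reference ordering
    [~, co] = sort(candsim(i, others), 'descend');   % candidate ordering
    crank = zeros(1, nart-1);
    crank(co) = 1:(nart-1);          % crank(j): candidate rank of others(j)
    kr = crank(ro(1:N));             % candidate rank of the artist ranked r by the reference
    s(i) = sum(ar.^(1:N) .* ac.^kr);
end
score = mean(s) / smax;
fprintf('max ideal score = %.4f, agreement = %.4f\n', smax, score);
```

Because αr·αc = 0.5 for the values quoted above, the ideal score is 0.5 + 0.25 + ... + 0.5^10 = 1 − 2^(−10), which matches the 0.999 figure, and identical matrices give an agreement of 1.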
The Matlab script simvsgdtruth.m will compare a similarity matrix (e.g. a 400x400 matrix where each element is proportional to the similarity between the row and column artists) against musicseer survey-type data. Here's how it might be used:
>> % Load the sim matrix
>> ank = load('SIM-ank14C');
>> % Read the musicseer data
>> [tr, sg, uid, trg, cho, nch] = textread('musicseer-results-2002-10-15-nodups.txt','%d %c %s %d %d %d');
>> % Choose just the survey data (unfiltered)
>> Su = find(sg=='S');
>> % Build the mapping to convert musicseer artist IDs to aset400 indices
>> [name, sqlid] = textread('aset400.3-canon-musicseer.ids','%s %d');
>> sql2topset = zeros(1,7000);
>> % Make it so sql2topset(sql+2) will return the topset index, or 0 if not in aset400.
>> % ("+2" is so that sql can be -1, which it sometimes is)
>> sql2topset(sqlid+2) = 1:400;
>> % OK, build the ground truth matrix: trial number, target, chosen, notchosen
>> % for the unfiltered survey trials
>> gdtrSu = [tr(Su),sql2topset(trg(Su)+2)',sql2topset(cho(Su)+2)',sql2topset(nch(Su)+2)'];
>> % Now we can run the metric scoring:
>> p = simvsgdtruth(ank, gdtrSu);
10997 trials, 98964 triplets
10905 valid trials (0 with empty notchosen), 19.73% first place agreement, avrank=4.314
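To give a feel for what the figures in that output could mean, here is a small sketch of one plausible scoring scheme. It is not the code of simvsgdtruth.m. It assumes that each trial has one target and one chosen artist, with one row per not-chosen artist; that "first place agreement" is the share of trials in which the measure ranks the chosen artist above every not-chosen one; and that avrank is the mean rank of the chosen artist among those presented. It also assumes that a trial is dropped when the target or chosen index is 0 or no not-chosen artist remains. The toy data and variable names are hypothetical.

```matlab
% Sketch only: one plausible way to score a similarity matrix against
% survey trials; this is NOT the code of simvsgdtruth.m.
% Hypothetical toy data: 4 artists, rows are [trial target chosen notchosen]
sim = [1.0 0.8 0.3 0.1;
       0.8 1.0 0.5 0.2;
       0.3 0.5 1.0 0.6;
       0.1 0.2 0.6 1.0];
gdtr = [1 1 2 3;
        1 1 2 4;
        2 3 1 2;
        2 3 1 4;
        3 2 0 4];     % index 0: artist outside the set, so trial 3 is skipped

trials = unique(gdtr(:,1));
ranks = [];
for t = trials'
    rows = gdtr(gdtr(:,1) == t, :);    % all rows belonging to this trial
    trg = rows(1,2);
    cho = rows(1,3);
    nch = rows(rows(:,4) > 0, 4)';     % not-chosen artists that are in the set
    if trg == 0 || cho == 0 || isempty(nch)
        continue;                      % assumption: such trials are not valid
    end
    % Rank of the chosen artist among those presented: 1 + number of
    % not-chosen artists the measure rates as more similar to the target
    ranks(end+1) = 1 + sum(sim(trg, nch) > sim(trg, cho));
end
firstplace = 100 * mean(ranks == 1);
avrank = mean(ranks);
fprintf('%d valid trials, %.2f%% first place agreement, avrank=%.3f\n', ...
        numel(ranks), firstplace, avrank);
```

On this toy data the first trial puts the chosen artist in first place and the second puts it third of three, so the sketch reports 2 valid trials, 50.00% first place agreement and avrank=2.000.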
The Matlab script topNrankagree.m computes the top-N rank agreement score defined above. Here it is in use:
>> % Load the 400x400 aset400 sim matrices
>> playlst = load('SIM-aotm');
>> collctn = load('SIM-opennap');
>> % How well does playlst agree with collctn ground truth?
>> topNrankagree(collctn,playlst)
ans =
    0.2239
>> % What about the other way around (playlst as ground truth)?
>> topNrankagree(playlst,collctn)
ans =
    0.2254
>> % Note: 'tied' orderings are randomized, so there is a random component to the results:
>> topNrankagree(playlst,collctn)
ans =
    0.2273
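The run-to-run variation noted in the last comment comes from how ties are ordered. One common way to randomize tied orderings is to shuffle the items before sorting, as in the sketch below. This is an assumption about how it might be done, not something taken from topNrankagree.m, and the similarity values are made up.

```matlab
% Sketch: break ties at random when ranking (an assumption about how
% 'tied' orderings might be randomized; not taken from topNrankagree.m).
simrow = [0.9 0.5 0.5 0.5 0.1];             % hypothetical similarities with a three-way tie
n = numel(simrow);
perm = randperm(n);                         % shuffle the items first ...
[~, idx] = sort(simrow(perm), 'descend');   % ... then sort the shuffled values
order = perm(idx);                          % indices into simrow, most similar first
disp(order);                                % items 2, 3 and 4 fill positions 2..4 in random order
```

Item 1 always comes first and item 5 last, while the three tied items can appear in any order, so any score that depends on exact ranks will vary slightly between runs.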