MIDIAlign: You did what with MIDI?

We have aligned MIDI files with recordings of the songs they represent, so that each note in a MIDI file is mapped to the point in time in the original recording at which that note is played.
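To make the idea of a note-to-time mapping concrete, here is a minimal sketch in Python. It is not our alignment method. It assumes an alignment has already been computed and is available as a list of (MIDI time, audio time) anchor pairs with strictly increasing MIDI times. The names `map_time` and `align_notes` and the (onset, offset, pitch) note format are made up for illustration. Times between anchors are interpolated linearly, and times outside the anchored range are clamped to the nearest anchor.

```python
from bisect import bisect_right


def map_time(anchors, t):
    """Map a MIDI time t (seconds) to audio time by linear interpolation.

    anchors: list of (midi_time, audio_time) pairs, assumed to be sorted
    with strictly increasing midi_time (hypothetical alignment output).
    """
    midi_times = [m for m, _ in anchors]
    # Clamp times that fall outside the aligned region.
    if t <= midi_times[0]:
        return anchors[0][1]
    if t >= midi_times[-1]:
        return anchors[-1][1]
    # Find the pair of anchors that brackets t.
    i = bisect_right(midi_times, t)
    (m0, a0), (m1, a1) = anchors[i - 1], anchors[i]
    return a0 + (t - m0) * (a1 - a0) / (m1 - m0)


def align_notes(anchors, notes):
    """Map each (onset, offset, pitch) note from MIDI time to audio time."""
    return [
        (map_time(anchors, onset), map_time(anchors, offset), pitch)
        for onset, offset, pitch in notes
    ]


# Illustrative values only: the recording starts 0.5 s late and slows down.
anchors = [(0.0, 0.5), (1.0, 1.5), (2.0, 3.5)]
notes = [(0.5, 1.5, 60)]  # one note, MIDI pitch 60
aligned = align_notes(anchors, notes)
```

In this toy example the constant offset in the first segment and the tempo change in the second are both absorbed by the anchors, which is the kind of timing difference the alignment is meant to remove.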

Why would anybody do such a thing? Well, as bad as MIDI files sound, some of them are actually really good transcriptions of the songs they were meant to represent. The trouble is that there are slight timing differences (because of expressive freedom in performance), offsets caused by the MIDI file and the raw audio not starting at the same point, and block errors, even in good transcriptions. By aligning them, we find the regions where we can reasonably say we have a good transcription, for use in other things.

What kinds of stuff can you do with these transcriptions? Well, for starters there is polyphonic pitch extraction. We all know that transcription is an EXTREMELY difficult problem to solve... Over the past couple of years, researchers have discovered the dichotomy between mono and poly transcription: amateur listeners can transcribe (or at least transcribe the relative pitches of) simple monophonic melodies. Thus it makes sense that monophonic transcription be done by computer models of the ear (Meddis + Hewitt, etc). But most humans can't transcribe polyphonic music - the ones who can have learned how to do it through years of ear training and the application of musical knowledge. Experimental neuroscientists conjecture that expert knowledge is derived from pattern classification. Think about a chess master: does Kasparov really sit there trying all of the possible moves ("Well, if I move my knight here, then he will take my pawn, and...")? No. Instead he recognizes patterns that he's seen before and instantly knows how to react. Thus the appeal of speed chess. So many algorithms for polyphonic transcription are now presented as machine learning problems.

Some of these seem really REALLY promising (Walmsley, Godsill + Rayner), but their authors are unable to report results on anything beyond a few toy problems. This is because framewise accurate transcriptions of real-world music don't really exist. But once we have these MIDI alignments, all of a sudden we have a reasonable estimate of what the transcription is, and thus a measure of accuracy for a transcription algorithm.
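As a rough illustration of what such an accuracy measure could look like, the sketch below scores an estimated transcription against a reference, frame by frame. The choice of metric (precision, recall and F-measure over frame/pitch pairs) is an assumption made for this example, as are the function name `framewise_scores` and the representation of each frame as a set of active MIDI pitches.

```python
def framewise_scores(reference, estimate):
    """Compare two framewise transcriptions.

    reference, estimate: lists of sets, one set of active MIDI pitches
    per frame (assumed representation). Returns (precision, recall, f).
    """
    if len(reference) != len(estimate):
        raise ValueError("transcriptions must have the same number of frames")
    tp = fp = fn = 0
    for ref, est in zip(reference, estimate):
        tp += len(ref & est)   # pitches correctly reported in this frame
        fp += len(est - ref)   # pitches reported but not in the reference
        fn += len(ref - est)   # reference pitches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f


# Illustrative values only: three frames of reference vs. estimate.
reference = [{60}, {60, 64}, set()]
estimate = [{60}, {64, 67}, {72}]
precision, recall, f = framewise_scores(reference, estimate)
```

Here the reference would come from an aligned MIDI file and the estimate from the transcription algorithm under test.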

But let's take it one step further, shall we? All of the MPE methods using statistical pattern recognition rely either on unsupervised learning algorithms or on assumptions about what a "note" looks like. Now suppose we took hundreds of songs, along with their transcriptions, and used the labels of each frame as training data for a classifier. Then we would build something that learns explicitly from the data, and not from a note model.
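To sketch what "labels of each frame" might mean in practice: given notes that have already been mapped to audio time, each analysis frame can be labeled with the set of pitches sounding at that moment. Everything specific here is an assumption for illustration: the function name `frame_labels`, the (onset, offset, pitch) note format, the hop size, and the rule that a note is active in a frame when the frame's start time falls in the half-open interval from onset to offset.

```python
def frame_labels(notes, n_frames, hop=0.01):
    """Turn aligned notes into one label set per analysis frame.

    notes: list of (onset, offset, pitch) in audio seconds (assumed format).
    n_frames: number of frames in the recording.
    hop: time between frame starts, in seconds (assumed value).
    """
    labels = [set() for _ in range(n_frames)]
    for onset, offset, pitch in notes:
        for k in range(n_frames):
            t = k * hop  # start time of frame k
            if onset <= t < offset:
                labels[k].add(pitch)
    return labels


# Illustrative values only: two overlapping notes, four half-second frames.
notes = [(0.0, 1.0, 60), (0.5, 1.5, 64)]
labels = frame_labels(notes, n_frames=4, hop=0.5)
```

Pairing each label set with the audio features of the same frame would give the (features, labels) examples a supervised classifier needs.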

Obviously, most MIDI transcriptions are awful. We wouldn't want to corrupt our training / evaluation set with bad transcriptions, so we need an automatic method of evaluating each alignment, to see whether it is a reliable transcription or not. How we've done this is described in an article submitted for publication. However, you can see our evaluations here. Sort by song title or sort by feature.

Questions or comments? Please contact Rob.