MIDIAlign: You did what with MIDI?
We have aligned MIDI files with the songs that they represent, so that we
can map each note in the MIDI file to the point in time in the original
recording at which that note is played.
Why would anybody do such a thing? Well, as bad as MIDI files sound, some
of them are actually really good transcriptions of the songs they were
meant to represent. The trouble is that there are slight timing
differences (because of freedom of expression in performance), offsets
caused by the MIDI and the raw audio not starting at the same point, and
block errors, even in good transcriptions. By aligning them, we find the
regions where we can reasonably say we have a good transcription, for use
in other things.
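One common way to compute an alignment like this is dynamic time warping
(DTW). The sketch below is only an illustration under that assumption, not
a description of the actual MIDIAlign pipeline. It assumes each signal has
already been reduced to one number per frame (a stand-in for real spectral
features), and the names dtw_path, midi_frame_times and hop_seconds are
made up for the example.

```python
def dtw_path(midi_feats, audio_feats):
    """Return the minimum-cost DTW path as a list of (midi_frame, audio_frame)."""
    n, m = len(midi_feats), len(audio_feats)
    inf = float("inf")
    # cost[i][j] = cheapest way to align the first i MIDI frames
    # with the first j audio frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(midi_feats[i - 1] - audio_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance MIDI only
                                 cost[i][j - 1])      # advance audio only
    # Walk back from the end, always taking the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def midi_frame_times(path, hop_seconds):
    """Map each MIDI frame to the time of the first audio frame it aligns to."""
    times = {}
    for midi_frame, audio_frame in path:
        # setdefault keeps the earliest audio frame seen for this MIDI frame.
        times.setdefault(midi_frame, audio_frame * hop_seconds)
    return times
```

Calling dtw_path on the two feature sequences and passing the result to
midi_frame_times gives, for every MIDI frame, a time in the recording.
A constant start offset shows up as a run of "advance audio only" steps at
the beginning of the path, and expressive timing shows up as small
deviations from the diagonal.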
What kinds of stuff can you do with these transcriptions? Well, for
starters there is polyphonic pitch extraction. We all know that
transcription is an EXTREMELY difficult problem to solve... Over the past
couple of years, researchers have discovered the dichotomy between mono
and poly transcription: amateur listeners can transcribe (or, at least
relatively transcribe) simple monophonic melodies. Thus it makes sense
that monophonic transcription be done by computer models of the ear
(Meddis + Hewitt, etc.). But most humans can't transcribe polyphonic music;
the ones who can have learned how to do it over years of ear training and
the application of musical knowledge. Experimental neuroscientists
conjecture that expert knowledge is derived from pattern classification.
Think about a chess master: does Kasparov really sit there trying all of
the possible moves ("Well, what happens if I move my knight here? Then he
will take my pawn and...")? Instead, he recognizes patterns that he's seen
before and instantly knows how to react. Thus the appeal of speed chess.
Following the same reasoning, many algorithms for polyphonic transcription
are now presented as machine learning problems.
Some of these seem really REALLY promising (Walmsley, Godsill + Rayner),
except that they are unable to report results on anything but toy
problems. This is because framewise-accurate transcriptions of real-world
music don't really exist. But once we have these MIDI alignments, all of a
sudden we have a reasonable estimate of what the transcription is, and
thus a way to measure the accuracy of a transcription algorithm.
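To make "a measure of accuracy" concrete, here is a minimal sketch of one
possible framewise score. It is an illustration, not the evaluation we
actually use: it assumes the reference (from the aligned MIDI) and the
algorithm's output are both given as one set of MIDI pitch numbers per
frame, and the function name framewise_scores is hypothetical.

```python
def framewise_scores(reference, estimate):
    """Compare two framewise transcriptions.

    Each argument is a list with one set of MIDI pitch numbers per frame.
    Returns (precision, recall, accuracy), where accuracy is
    TP / (TP + FP + FN) counted over all frames.
    """
    if len(reference) != len(estimate):
        raise ValueError("both transcriptions need the same number of frames")
    tp = fp = fn = 0
    for ref, est in zip(reference, estimate):
        tp += len(ref & est)   # pitches reported and really there
        fp += len(est - ref)   # pitches reported but not in the reference
        fn += len(ref - est)   # pitches in the reference that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, accuracy
```

Counting per pitch per frame, rather than per note, means a note that is
found but held too long or cut too short is penalized only for the frames
where it is actually wrong.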
But let's take it one step further, shall we? All of the MPE methods using
statistical pattern recognition rely either on unsupervised learning
algorithms or on assumptions about what a "note" looks like. Now suppose
we took hundreds of songs, along with their transcriptions, and used the
labels of each frame as training data for a classifier. Then we would
build something that learns explicitly from the data, and not from a note
model.
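As a rough sketch of what "the labels of each frame" could look like, the
snippet below turns a list of aligned notes into one set of active pitches
per analysis frame. Everything here is an assumption made for
illustration: the (onset, offset, pitch) note format, the fixed hop size,
and the name notes_to_frame_labels.

```python
import math


def notes_to_frame_labels(notes, n_frames, hop_seconds):
    """Build per-frame pitch labels from aligned notes.

    notes: iterable of (onset_seconds, offset_seconds, midi_pitch), with
    times already mapped onto the recording's timeline.
    Returns a list of n_frames sets; frame k covers the time span
    [k * hop_seconds, (k + 1) * hop_seconds).
    """
    labels = [set() for _ in range(n_frames)]
    for onset, offset, pitch in notes:
        first = max(0, int(onset / hop_seconds))               # frame holding the onset
        last = min(n_frames, math.ceil(offset / hop_seconds))  # one past the last frame
        for frame in range(first, last):
            labels[frame].add(pitch)
    return labels
```

Each frame's label set, paired with that frame's audio features, would be
one training example for the classifier. Frames from regions where the
alignment is judged unreliable could simply be left out.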
Obviously, most MIDI transcriptions are awful. We wouldn't want to
corrupt our training / evaluation set with bad transcriptions, so we need
an automatic method of evaluating each alignment, to see if it is a
reliable transcription or not. How we've done this is described in an
article submitted for publication. However, you can see our evaluations here. Sort by song title or sort by feature.
Questions or comments? Please contact Rob.