The ability to transcribe notes from music is a rare and desirable talent. There is considerable demand for transcription in the world of music, so an automatic note transcription system would be valuable.
The system that I have developed does not try to solve the entire note transcription problem. Rather, it focuses on a specific transcription task where the notes in the music are known a priori. This is accomplished by generating the music audio from MIDI files, which serve as the ground truth note labels. Another constraint in my system is that the music must involve multiple instruments, each playing at most one note at a time. For simplicity's sake, I have kept the number of instruments in my experiments at two, although the system should be able to handle an arbitrary number. The system does not break down if an instrument plays a chord, but only one of the notes in the chord will be transcribed.
My procedure takes a model-based, additive approach. This means that a model is produced for each note separately. Note models distinguish themselves by their harmonic structure. Different notes will have different frequencies. Note models for the same note played by different instruments will have different spectral structures depending on the strength of harmonics produced by the instrument. Once the note models are produced, the music audio is approximated by the best linear combination of note models. Note models receiving the largest weights in this sum are the best candidates for being present in the music. Note onset times help to time-align the note approximations with the music.
I have evaluated my results on 10 Bach Inventions using the Dixon Success Formula, which is equal to 100 x Number Correct / (Number Correct + False Positives + Missing Notes). In the following sections, I will explain the details of my system.
The original MIDI song file has two tracks with the same instrument playing on both. We will need to make the instruments different and then split the MIDI into two separate instrument files with one track each. After converting the MIDI to WAV audio we will incorporate user feedback to help detect note onset events using MATLAB.
- Open the song MIDI file in Anvil Studio and set the instrument on the first track to "Vibraphone" and the instrument on the second track to "Acoustic Grand" (Piano) (of course, different instruments could also have been selected). Save the file.
- Open the newly saved MIDI file and delete each track one at a time, saving a "piano" version and a "vibes" version.
- Convert these three MIDI files to WAV audio using Apple's iTunes software.
- In MATLAB, read the song MIDI file into a note matrix using the readmidi function in the MIDI Toolbox.
Incorporate User Feedback to Detect Note Onset Events
- Create a 1024-window spectrogram of the entire song from the WAV audio. Take the amplitude of the lower half of the spectrogram (I assume that the frequency information in the top half of the frequency spectrum can be ignored).
- Sum the spectral amplitudes over the frequency axis to produce an amplitude vector.
- Compute the relative difference between every two adjacent values in this vector: subtract the first value from the second and divide the result by the first value.
- Plot these relative differences and allow the user to zoom in and select a vertical threshold. Points above this threshold will be considered note onset events.
- Use an iterative loop to merge adjacent note onsets with small durations. Durations are defined as the differences between the indices of consecutive note onsets.
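The onset steps above can be sketched in Python as follows. This is an illustration rather than the project's MATLAB code: the function names, the choice to place an onset at the later frame of each pair, and the `min_duration` parameter are all assumptions made for the example.

```python
def relative_differences(amplitude):
    """Return (a[i+1] - a[i]) / a[i] for each adjacent pair (0.0 where a[i] is zero)."""
    return [
        (nxt - cur) / cur if cur != 0 else 0.0
        for cur, nxt in zip(amplitude, amplitude[1:])
    ]


def detect_onsets(amplitude, threshold):
    """Return frame indices whose relative jump in summed amplitude exceeds threshold."""
    rel = relative_differences(amplitude)
    # rel[i] describes the jump from frame i to frame i + 1.
    # Assumption: the onset is attributed to the later frame, i + 1.
    return [i + 1 for i, r in enumerate(rel) if r > threshold]


def merge_close_onsets(onsets, min_duration):
    """Iteratively drop any onset that follows the previous one by fewer than min_duration frames."""
    merged = list(onsets)
    changed = True
    while changed:
        changed = False
        for k in range(1, len(merged)):
            if merged[k] - merged[k - 1] < min_duration:
                del merged[k]  # absorb the short segment into the previous note
                changed = True
                break
    return merged


# Toy summed-amplitude vector (made-up values, not project data).
amplitude = [1.0, 1.0, 2.0, 2.1, 1.0, 1.5, 1.5, 3.0]
onsets = detect_onsets(amplitude, threshold=0.2)
print(merge_close_onsets(onsets, min_duration=3))
```

In the interactive version the threshold comes from the user's selection on the plot; here it is simply passed in as a number.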
Of course, the exact threshold that the user selects to determine the note onset events has a direct impact on the quality of the transcription. The only way to know whether a threshold is good is to run the entire transcription process, then repeat it with a different threshold and compare the results.
Each note will be modeled by a type of averaged normalized frequency vector from the amplitude spectrogram. We have previously created WAV audio from the individual instrument MIDI files. The next step is to create an amplitude spectrogram from the WAV audio and then group the time slices for each unique note. The frames in these slice groupings are normalized and then averaged to create one model per note.
- Create an amplitude spectrogram from the WAV audio of a particular instrument in the song.
- Throw away the top half of the spectrogram (use the lower frequencies only).
- Load the MIDI for this instrument into a note matrix and find all the notes which play alone (not in chords for this instrument).
- Determine the number of seconds per spectrogram frame (with respect to the actual song length) and assign each frame to the corresponding note from the note matrix.
- Group the amplitude frames together by their corresponding notes.
- Sum the amplitudes in each group and determine which frames are significant. I used any frames that had more than 1/10th of the maximum amplitude in the group.
- Normalize each frame so that the sum of all amplitudes in the frame adds to one.
- Average the frames together, excluding frames which originally had significantly low total amplitude with respect to the maximum.
- The result will be one averaged normalized frequency vector for each unique note in the song played by this instrument.
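A minimal Python sketch of the grouping, normalizing, and averaging steps is given below. It is not the project's MATLAB code. The note tuple layout `(pitch, onset_sec, duration_sec)`, the function names, and the per-frame definition of "significant" (total frame amplitude above 1/10 of the loudest frame in the group) are assumptions for illustration.

```python
def group_frames_by_note(spectrogram, notes, seconds_per_frame):
    """Assign each frame to the note sounding at that frame's start time.

    spectrogram: list of amplitude vectors, one per frame.
    notes: list of (pitch, onset_sec, duration_sec) tuples (hypothetical layout).
    """
    groups = {}
    for index, frame in enumerate(spectrogram):
        t = index * seconds_per_frame
        for pitch, onset, duration in notes:
            if onset <= t < onset + duration:
                groups.setdefault(pitch, []).append(frame)
                break
    return groups


def build_note_model(frames, significance=0.1):
    """Average the normalized frames of one note, skipping insignificant frames."""
    totals = [sum(frame) for frame in frames]
    cutoff = significance * max(totals)
    # Keep only frames louder than 1/10 of the loudest frame in the group.
    kept = [f for f, t in zip(frames, totals) if t > cutoff and t > 0]
    if not kept:
        raise ValueError("no significant frames for this note")
    # Normalize each kept frame so its amplitudes sum to one.
    normalized = [[v / sum(f) for v in f] for f in kept]
    n_bins = len(normalized[0])
    # Average bin by bin across the kept frames.
    return [sum(f[b] for f in normalized) / len(normalized) for b in range(n_bins)]


# Toy example: two loud frames and one near-silent frame for the same note.
frames = [[2.0, 2.0], [1.0, 3.0], [0.01, 0.01]]
print(build_note_model(frames))
```

Because every kept frame is normalized before averaging, the resulting model also sums to one, so models from loud and quiet notes are directly comparable.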
Since this is an additive approach to transcription, we want to try to approximate the music by a linear combination of our note models. This means that we need the optimal set of weights with which to multiply each note model such that the grand sum is close to the actual song.
We can write this relationship in matrix notation as: W x M = A.
Here W is the matrix of weights, M is the matrix of note models from all instruments, and A is the matrix of the amplitude spectrogram of the song audio. Solving for the weights matrix, W, we arrive at:
W = A x pseudoinverse(M)
The resulting weights matrix contains one factor for each note model of each instrument for every frame in the amplitude spectrogram. We must isolate the weights for each instrument and then cluster the frames for each note in the music as approximated by the note onsets and durations gathered from the user feedback process previously mentioned.
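The pseudoinverse solution can be illustrated with a small NumPy sketch. This is not the project's MATLAB code, and the matrix layout is an assumption: each row of A is one spectrogram frame, each row of M is one note model, and each row of W holds the weights for one frame.

```python
import numpy as np


def solve_weights(song_frames, note_models):
    """Return W = A x pseudoinverse(M), the least-squares solution of W x M = A.

    song_frames (A): shape (n_frames, n_bins), one amplitude spectrum per row.
    note_models (M): shape (n_models, n_bins), one note model per row.
    """
    return song_frames @ np.linalg.pinv(note_models)


# Two toy note models over three frequency bins (each sums to one).
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])

# Build a synthetic "song" from known weights, then recover them.
true_W = np.array([[2.0, 0.0],
                   [0.0, 3.0],
                   [1.0, 1.0]])
A = true_W @ M
W = solve_weights(A, M)
print(np.round(W, 6))
```

With real audio the models will not reproduce the spectrogram exactly, so W is only the best fit in the least-squares sense, and individual entries can be negative.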
Once the weights have been grouped together for each detected note, the most representative model must be selected. One way to do this would be to add all the weights together for each model and take the one with the highest sum. I felt this would be too sensitive to models that receive one very big weight with the rest being small. Instead, I choose the note model with the largest median weight for each cluster.
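The difference between the two selection rules can be seen in a short Python sketch. The note names and weights are made up for illustration, and `pick_by_sum` and `pick_by_median` are hypothetical helper names.

```python
from statistics import median


def pick_by_sum(cluster_weights):
    """Choose the note model whose weights have the largest sum over the cluster."""
    return max(cluster_weights, key=lambda note: sum(cluster_weights[note]))


def pick_by_median(cluster_weights):
    """Choose the note model whose weights have the largest median over the cluster."""
    return max(cluster_weights, key=lambda note: median(cluster_weights[note]))


# Weights of two note models over the three frames of one detected note.
cluster = {
    "C4": [0.1, 0.1, 5.0],  # one very large weight, otherwise small
    "E4": [0.8, 0.9, 0.7],  # consistently moderate weights
}
print(pick_by_sum(cluster), pick_by_median(cluster))
```

Here the single large weight is enough for the sum to favor "C4", while the median favors "E4", the model that is consistently present across the cluster.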
One criticism of this method is that we end up with some negative weights. While the optimal linear approximation can involve some models being subtracted from other models, I would argue that if we start out with good models, then there will be at least one model that does not get a negative weight.
Next, we combine adjacent notes that are probably two parts of the same note. I do this by matching consecutive detected notes that have the same pitch and whose median weights differ by less than 50%. In this case, I eliminate the second note and add its duration to the first.
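A possible Python sketch of this merging step follows. It is not the project's MATLAB code. The dictionary keys are hypothetical, and measuring the 50% difference relative to the larger of the two median weights is an assumption, since the text does not say which weight the percentage refers to.

```python
def merge_split_notes(notes, max_relative_diff=0.5):
    """Merge consecutive notes with equal pitch and similar median weight.

    notes: time-ordered list of dicts with keys "pitch", "duration", "weight",
    where "weight" is the median weight of the chosen note model.
    """
    merged = []
    for note in notes:
        prev = merged[-1] if merged else None
        if prev is not None and prev["pitch"] == note["pitch"]:
            larger = max(abs(prev["weight"]), abs(note["weight"]))
            diff = abs(prev["weight"] - note["weight"])
            if larger > 0 and diff / larger < max_relative_diff:
                # Drop the second note and extend the first one.
                prev["duration"] += note["duration"]
                continue
        merged.append(dict(note))  # copy so the input list is left unchanged
    return merged


detected = [
    {"pitch": 60, "duration": 2, "weight": 1.0},
    {"pitch": 60, "duration": 3, "weight": 0.8},  # same pitch, similar weight
    {"pitch": 62, "duration": 2, "weight": 0.9},
    {"pitch": 62, "duration": 1, "weight": 0.2},  # same pitch, very different weight
]
for note in merge_split_notes(detected):
    print(note)
```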
The final step is to combine the transcriptions of each instrument into one large note matrix and then convert this back to MIDI using the convenient writemidi() script in the MIDI Toolbox.
To evaluate this system, I took ten Bach Inventions (1-10) from Bach Central, each involving just two instruments playing (mostly) one note at a time. For the most part, the tempo is moderate to fast. In each piece, I set the first instrument to Vibraphone and the second one to Acoustic Grand Piano.
I used the Dixon Success Formula which is equal to 100 x Number Correct / (Number Correct + False Positives + Missing Notes). The number of missing notes was modified so as to exclude additional notes in chords, as my system is not designed to find more than one note per instrument at a time.
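For reference, the formula can be written as a small Python function. The function name is hypothetical, and the example counts are those listed for Invention 1 in the results table.

```python
def dixon_score(correct, false_positives, missing):
    """Return 100 x correct / (correct + false positives + missing notes)."""
    total = correct + false_positives + missing
    if total == 0:
        raise ValueError("at least one note is required to compute a score")
    return 100.0 * correct / total


# Counts for Invention 1: 462 correct, 6 false positives, 9 missing notes.
print(round(dixon_score(462, 6, 9), 4))
```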
An average Dixon Success Score of 88 % was achieved.
The following systems are not directly comparable to mine because they use different experimental data and possibly allow for finding chord structures whereas my system does not. However, I present them nonetheless.
- In last year's E6820 class, Graham Poliner [project] found a 57 % success rate in polyphonic note transcription with a considerably more complicated system involving neural networks and auditory masking functions.
- In 2004, Michael Jordan [website] published a paper, Multi-Instrument Musical Transcription Using A Dynamic Graphical Model, in which he reports note error rates high enough that I will not reproduce them here. His system has the closest goal to mine; however, it is much more complex, using a Dynamic Graphical Model.
- In 2002, Christopher Raphael [website] published a paper, Automatic Transcription of Piano Music, in which he finds a "note error rate" of 39% with 184 substitutions, 241 deletions, and 108 insertions out of 1360 notes.
If we count a substitution as both a false positive and a missing note, then we get:
184 + 108 = 292 false positives (substitutions plus insertions) and 184 + 241 = 425 missing notes (substitutions plus deletions).
Total number of notes - number missing = 1360 - 425 = 935 correct notes.
Thus, we find a corresponding Dixon Success Score of 100 x 935 / (935 + 292 + 425) = 57 %
This system uses a Hidden Markov Model.
In the following table, I have listed my results. Onset threshold refers to the vertical threshold in the plot of relative differences between summed spectrogram amplitudes, chosen during the user feedback process. My qualitative assessment of the transcriptions is that they are all quite reasonable, although noticeably worse than the original.
| Composition | Onset Threshold | Number Correct | False Positives | Missing Notes | Dixon Score |
| Invention 1 | 0.17305 | 462 | 6 | 9 | 96.8553 % |
| Invention 2 | 0.10907 | 608 | 22 | 81 | 85.5134 % |
| Invention 3 | 0.15712 | 469 | 27 | 55 | 85.1180 % |
| Invention 4 | 0.14708 | 476 | 10 | 67 | 86.0759 % |
| Invention 5 | 0.15473 | 738 | 7 | 57 | 92.0200 % |
| Invention 6 | 0.17146 | 1027 | 40 | 67 | 90.5644 % |
| Invention 7 | 0.18309 | 503 | 12 | 128 | 78.2271 % |
| Invention 8 | 0.21818 | 536 | 2 | 60 | 89.6321 % |
| Invention 9 | 0.15558 | 537 | 9 | 46 | 90.7095 % |
| Invention 10 | 0.18484 | 485 | 4 | 99 | 82.4830 % |
Despite my system's simplicity, it achieves an average success score of 88%. My results support the theory that music can be approximated well by an additive note-based model.
It should be noted that the Dixon Success Score does not take into account the time alignment of the transcribed notes. My transcription procedure did alter the time alignment of the notes somewhat, and this further reduces transcription quality. In most cases, the timing differences are small.
This system was generally more effective at detecting the vibraphone notes in track one. This might be due to having better note models for the vibraphone and the fact that the first track plays higher pitches which could lead to better defined note models.
Part of the success of my system may be attributed to the fact that the note models I develop come from the original music, so I cannot possibly make the mistake of substituting a note that does not exist in the piece. A truly general system would build note models for all likely notes over several octaves in each instrument's range, and then use this nearly complete set of note models for transcription. Allowing notes that do not occur in the song to be found will likely lead to lower success scores.
The note onset identification process employing user feedback makes a huge difference in the success of the transcription. For most of the pieces I experimented on, I ended up repeating the procedure a couple of times to try to find the best threshold. Future work would involve making this an automatic process. However, the thresholds I settled on all fell between roughly 0.11 and 0.22, with most between 0.15 and 0.19, so a hard-coded value in that range would likely work reasonably well, at least on Bach Inventions.
Much time was spent converting file types from MIDI to WAV audio and then back. It would be very nice if MATLAB scripts were available to automate this process. This would also help speed up the evaluation process.
Future work can also include :
- Extracting chord structures from the weights matrix.
- Determining better parameters for detecting when instruments are playing or silent.
- Trying out different instruments and more of them playing simultaneously.
- Evaluating musical works in other styles, other composers, other genres.
Audio Files
Note that my MATLAB code produces a test.mid MIDI file. To produce the MP3 of the test.mid file, I set the instruments using Anvil Studio (link below) and also adjusted the tempo. After converting the MIDI to WAV audio using iTunes, I then used DBPowerAmp to create the mp3 file.
MATLAB Code
- Project MATLAB Code - run_project.m is the main script. It is a script rather than a function, so it takes no parameters; to change the input files, edit the paths to the song and instrument files near the top. Keep in mind that the MIDI and WAV files must have the same base name (e.g. invent1.wav and invent1.mid).
Software
- MATLAB by The Mathworks
- MIDI Toolbox for MATLAB - make sure you add this to your path
- Anvil Studio - MIDI editing software. Good for changing instruments.
- DB PowerAmp - Audio File Converter (WAV → MP3)
- Apple's iTunes - Audio Player, converts MIDI to WAV
Email me at : barry.rafkind @ gmail.com