
Final Project

January 25, 2007
February 1, 2007
February 8, 2007
February 15, 2007
February 22, 2007
March 1, 2007
March 8, 2007
(March 15, 2007 - Spring Break)
March 22, 2007
March 29, 2007
April 5, 2007
April 18, 2007
April 26, 2007

January 25, 2007: Ideas for a project

I am currently working on finding solo sections of music in musical recordings.  I started working on this last semester, with some success.  This semester, I would like to see if I can build a better system.  Assuming that I manage to get a system up and running, I'd then like to move on to discriminating between vocal soloists and other soloists.  My eventual goal is to automatically create a database of solo vocal music snippets.  Then, I can really have some fun looking into speech recognition as it applies to singing.


February 1, 2007: A little background research

It looks as though there are quite a few pitch detection algorithms out there.  Pitch Detection Methods Review, a paper on the CCRMA website, basically lists a whole lot of methods with references to the papers describing them.  So, should my current method not work, this paper has a whole slew of algorithms I could try.

A website on cnx.org, Pitch Detection Algorithms, has a more detailed explanation of two pitch detection algorithms.  The second algorithm, which the author calls Harmonic Product Spectrum, is quite an interesting idea.  Basically, you multiply the magnitude spectrum (fft) of your signal by down-sampled copies of that spectrum.  In theory, if you down sample by two, the first overtone (second harmonic) should line up with the fundamental.  If you down sample by three, the second overtone (third harmonic) will line up with the fundamental.  If you down sample several times, you should get a large peak at the fundamental.  The author does point out that you can only down sample so many times before there's not much of your spectrum left in the fft window.  It also seems to me that you'd reinforce the other peaks, too, although not as much.  You'd also be creating peaks below the fundamental.
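
Here is a minimal Python sketch of the Harmonic Product Spectrum idea, assuming NumPy.  It follows the usual formulation, in which down-sampled copies of the magnitude spectrum are multiplied together.  The function name, the synthetic test tone, and every parameter value are illustrative assumptions and are not taken from the cnx.org page.

```python
import numpy as np

def harmonic_product_spectrum(frame, sample_rate, num_harmonics=4):
    """Estimate the fundamental frequency (Hz) of one frame with HPS."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    # Only the first len(spectrum) // num_harmonics bins exist in every
    # down-sampled copy, which limits how far the down-sampling can go.
    usable = len(spectrum) // num_harmonics
    hps = spectrum[:usable].copy()
    for factor in range(2, num_harmonics + 1):
        # Keeping every factor-th bin moves harmonic number `factor`
        # onto the bin of the fundamental.
        hps *= spectrum[::factor][:usable]
    peak_bin = int(np.argmax(hps[1:])) + 1  # skip the DC bin
    return peak_bin * sample_rate / len(frame)

# Synthetic test tone (an assumption): 220 Hz plus three weaker harmonics.
sample_rate = 8192
t = np.arange(4096) / sample_rate
tone = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 5))
estimate = harmonic_product_spectrum(tone, sample_rate)
print(f"estimated fundamental: {estimate:.1f} Hz")
```

Because the spectra are multiplied, a bin only scores highly when every down-sampled copy has energy there, which is why the fundamental tends to stand out more than the other peaks.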


February 8, 2007: Further project ideas/progress

My pitch tracker/solo canceler seems to be working pretty well.  You can listen to the unfiltered and filtered versions of one of my samples.  Most of the power in the soloist's voice is removed.  There are a few things I need to work on, however.  First, I really need to improve my test and training sets.  I don't have enough samples and the samples I do have are rather short.  I also think that it's worth trying the yin pitch tracker to see if it does a better job than my pitch tracker.


February 15, 2007: Project work

This week I focused on my data set.  Last semester, my data set was pretty minimal.  I needed more labeled test and training data.  This semester I have tried to find more solo music to label.  This last week, I grabbed 20 1-minute sections of each solo/non-solo track and labeled them.  I actually found that I had trouble sometimes deciding when something was solo or not because musical sounds aren't instantaneous.  They sort of die away.  Depending on the instrument and the acoustics of the room in which the music was recorded, this "dying away" bit could take quite a while.  When did the sound of the soloists cease to be part of the solo and become part of the non-solo/quiet section?  It was a tough call a lot of the time and pretty arbitrary on my part.  I'm really not sure what a good way of approaching this problem is.


February 22, 2007: An exercise in frustration

I did not get as much done this week as I'd hoped to get done.  I spent a great deal of time being stupid with the pitch tracker Yin and with the code management tool Darcs.  That being said, I did finally get Yin working.  My initial conclusion is that it doesn't seem to be doing a much better job of tracking pitch than my simple autocorrelation routine did.  

I ran Yin and my autocorrelation routine on this music sample.  The sample has three sections: orchestra (t = 0 - 6 seconds), solo woman (t = 6 - 10 seconds), and woman plus orchestra (the rest).  I've graphed the outputs of Yin and my autocorrelation routine below.

pitch contour found by yin and by my autocorrelation routine

And just for reference, here's the spectrogram of my music sample:

spectrogram


As you can see, neither tracker did well with the full orchestra, which is to be expected.  They both did well with the solo.  Surprisingly, they both were able to capture a lot of the soloist's pitch when she sang with the orchestra behind her.  However, the key word here is "both."  Yin didn't really seem to do any better than autocorrelation.

I admit that I did not spend a lot of time mucking with Yin's parameters.  It may be that Yin would work better if I tweaked its parameters.  It may also be that Yin would work better than my autocorrelation routine on other samples.  I suppose I'll have to try that eventually, but right now I'm not terribly impressed.
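
For reference, a simple autocorrelation pitch estimator can be sketched in a few lines of Python with NumPy.  This is a generic textbook version and not the routine compared above (and it is not Yin).  The function name, the pitch range, and the test tone are assumptions made for illustration.

```python
import numpy as np

def autocorrelation_pitch(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Return a pitch estimate in Hz from the strongest autocorrelation lag."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep only the non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags that match the allowed pitch range.
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(acf) - 1)
    best_lag = min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))
    return sample_rate / best_lag

# Synthetic check (an assumption): a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(1000) / sample_rate
estimate = autocorrelation_pitch(np.sin(2 * np.pi * 200 * t), sample_rate)
print(f"estimated pitch: {estimate:.1f} Hz")
```

With a full orchestra there is no single dominant period, so the strongest lag is more or less arbitrary, which fits the poor tracking in the orchestra-only section.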


March 1, 2007: Thinking about next week's presentation

I've started working on a few slides for next week's presentation.  I need to explain what I'm trying to do, what I've done, and where I'm headed.


March 8, 2007

This week was the midterm presentation.


March 22, 2007: HMM

Over spring break, I basically did two things: fix some really bad default parameter settings and use Kevin Murphy's HMM toolbox to incorporate an HMM into my code.

As I mentioned in my presentation, I've been having problems with noisy data.  My current algorithm is (1) estimate a pitch for each window, (2) filter out that pitch (and its overtones), and (3) compare the energy in the window before and after filtering.  If there's only one pitch present and I found that pitch, the ratio of the energies (filtered/non-filtered) should go way down.  The problem is that this ratio is pretty noisy.  So, when I try to make a solo/non-solo decision based on the ratio, I also get a noisy decision.  Here's the example I used in my presentation:

Noisy power ratio = noisy decision
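
The three steps listed above can be sketched roughly as follows in Python with NumPy.  This is only an illustration under stated assumptions: the pitch is passed in as already known (step 1), the "filter" in step 2 simply zeroes FFT bins near each harmonic, and the function names, notch width, and synthetic voices are all invented for the example.

```python
import numpy as np

def power_ratio(frame, sample_rate, pitch_hz, notch_halfwidth_hz=30.0):
    """Fraction of frame energy left after removing pitch_hz and its overtones."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2
    # Index of the harmonic (1, 2, 3, ...) closest to each frequency bin.
    nearest_harmonic = np.maximum(np.round(freqs / pitch_hz), 1.0)
    # Crude "filter": drop every bin within the notch width of a harmonic.
    in_notch = np.abs(freqs - nearest_harmonic * pitch_hz) <= notch_halfwidth_hz
    return float(power[~in_notch].sum() / power.sum())

def voice(f0, t, num_harmonics=4):
    """Synthetic harmonic tone used only for this demonstration."""
    return sum(np.sin(2 * np.pi * f0 * k * t) / k
               for k in range(1, num_harmonics + 1))

sample_rate = 8000
t = np.arange(2048) / sample_rate
single_ratio = power_ratio(voice(220.0, t), sample_rate, 220.0)
mixed_ratio = power_ratio(voice(220.0, t) + voice(311.0, t), sample_rate, 220.0)
print(f"single voice: {single_ratio:.3f}, two voices: {mixed_ratio:.3f}")
```

On clean synthetic tones the ratio separates the two cases easily.  Real recordings have reverberation, pitch errors, and inharmonic content, which is where the noise in the ratio comes from.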

The solo is actually between 6 and 10 seconds, but it's impossible to get that perfectly using a single cutoff value for the power ratio.  So, I attacked the problem with a discrete HMM.  I used the Viterbi algorithm to find the most likely "path" of solo/non-solo states across the frames.  I trained the HMM on 10 of my samples and tested it on the other 10.
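
To make the smoothing idea concrete, here is a small Viterbi decoder for a two-state discrete HMM, written in Python with NumPy rather than with the HMM toolbox.  The start, transition, and emission probabilities are made-up numbers chosen for illustration, not trained values, and the observation sequence is a toy stand-in for a noisy per-frame classifier output (0 = non-solo, 1 = solo).

```python
import numpy as np

def viterbi(observations, start_prob, transition, emission):
    """Most likely hidden-state sequence for a discrete-output HMM."""
    log_start = np.log(start_prob)
    log_trans = np.log(transition)
    log_emit = np.log(emission)
    num_frames, num_states = len(observations), len(start_prob)
    score = np.zeros((num_frames, num_states))
    backpointer = np.zeros((num_frames, num_states), dtype=int)
    score[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, num_frames):
        # candidate[i, j]: best score in state i at t-1, then moving to state j.
        candidate = score[t - 1][:, None] + log_trans
        backpointer[t] = np.argmax(candidate, axis=0)
        score[t] = np.max(candidate, axis=0) + log_emit[:, observations[t]]
    # Trace the best path backwards from the best final state.
    path = np.zeros(num_frames, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(num_frames - 2, -1, -1):
        path[t] = backpointer[t + 1, path[t + 1]]
    return path

# Hypothetical parameters: states tend to persist, the classifier is right 80% of the time.
start_prob = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1], [0.1, 0.9]])
emission = np.array([[0.8, 0.2], [0.2, 0.8]])   # rows: state, columns: observed label

noisy = [0] * 10 + [1] * 10
noisy[3], noisy[14] = 1, 0                      # two isolated classifier errors
smoothed = viterbi(noisy, start_prob, transition, emission)
print(smoothed.tolist())
```

Because switching states is penalized by the transition probabilities, an isolated flipped frame is cheaper to explain as a classifier error than as two state changes, so the decoded path ignores it.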

The results weren't as good as I'd hoped.  Here are some stats:

Training Data
(6522 solo frames and 19339 non-solo frames)

Just the classifier:
% solo frames classified as solo: 55
% non-solo frames classified as non-solo: 83

Using the Viterbi algorithm:
% solo frames classified as solo: 66
% non-solo frames classified as non-solo: 84


Testing Data
(7814 solo frames and 18013 non-solo frames)

Just the classifier:
% solo frames classified as solo: 63
% non-solo frames classified as non-solo: 69

Using the Viterbi algorithm:
% solo frames classified as solo: 72
% non-solo frames classified as non-solo: 63


Clearly, my next move needs to be cutting out the middle-man here.  There's no real reason to run Viterbi on the classified data; I should just run it on the power ratio.  I knew this last week, but I figured I'd try the discrete model first because it seemed to be a lot simpler.


March 29, 2007

This week, I came to the realization that I need to stop using the word "solo" and use "single voice" instead.  It really was one of those duh moments.  After the midterm presentations, I got back several comments where my fellow students clearly did not understand my use of the term "solo."  Hopefully "single voice" is less likely to be misinterpreted.

In terms of actual project work this week, I concentrated on fitting GMMs to two pieces of data for each sample: the power in each frame of the original sample and the power ratio (filtered to unfiltered) in each frame.  Again, I used Kevin Murphy's HMM toolbox, along with Netlab for GMM fitting.  Running the Viterbi algorithm on my test data, I got some improvement over using the Viterbi algorithm with the discrete output.

The most important thing I realized this week, however, was that I've really been measuring recall when what I want is precision.  Because there are so many more multi-voice frames than single-voice frames, even a 70% recall on multi-voice frames means that my pool of frames classified as single-voice is swamped by incorrectly classified multi-voice frames.  My precision on test data is about 54% for the GMM-style Viterbi algorithm.  Because I want to build a database of solo samples, getting precision up is really more important than getting recall high.  So I need to think about ways of setting the bar higher before classifying a frame as single-voice.
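
A quick worked example shows how the class imbalance drags precision down.  The Python snippet below plugs in the March 22 test-set figures for the discrete Viterbi run (7814 solo frames, 18013 non-solo frames, 72% and 63% per-class recall), not the GMM run mentioned above.  Those percentages are rounded, so the result is only approximate, and the function name is just for illustration.

```python
def precision_from_recalls(positives, negatives, positive_recall, negative_recall):
    """Precision on the positive class, given class sizes and per-class recall."""
    true_positives = positive_recall * positives
    # Negatives that were not recognized as negative end up labeled positive.
    false_positives = (1.0 - negative_recall) * negatives
    return true_positives / (true_positives + false_positives)

# March 22 test data, discrete Viterbi run.
precision = precision_from_recalls(7814, 18013, 0.72, 0.63)
print(f"precision = {precision:.2f}")
```

Even with 72% of the solo frames found, fewer than half of the frames labeled solo in that run were actually solo, because 37% of a much larger non-solo pool leaks into the solo label.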


April 5, 2007

I spent most of this week refactoring my code (sadly).  My code had gone through so many changes that it sort of grew into an unwieldy monstrosity.  It is now much easier to run tests and vary parameters to see what works best.

As I mentioned last week, I'm really interested in improving the precision of my algorithm while getting the best recall I can.  So, this week I also ran several tests to see how using the Viterbi path algorithm helped my precision and recall:

precision and recall

It's amazing how much better my results are after running the raw decision through a Viterbi path calculation.  Consider that in order to get 60% precision on the raw decision, I have to settle for about 50% recall.  In contrast, 60% precision on the Viterbi path gives me about 65% recall.  That's much better.


April 18, 2007

Here I ran the auto-correlation routine with different power thresholds for the single voice/silence or multi-voice classification.  I also ran the classification results through the discrete Viterbi path.  The power thresholds ranged from 0.02:0.02:0.1 (from April 15) and 0.1:0.05:0.4 (from April 2).




This plot was generated by training GMMs on 13 MFCCs for the training data.  When running the Viterbi path calculation to come up with the classification, I manipulated the solo probability vector.  I multiplied it by values ranging from 0.05:0.05:1.  Note that it does about the same as the auto-correlation classifier.
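
The effect of scaling the single-voice probability can be illustrated with a toy example in Python with NumPy.  Everything here is an assumption made for the sketch: the per-frame likelihoods are invented numbers standing in for GMM outputs, the transition matrix is hypothetical, and the decoder is a plain Viterbi implementation rather than the toolbox code.

```python
import numpy as np

def viterbi_from_likelihoods(likelihoods, start_prob, transition):
    """Most likely state path given per-frame state likelihoods (frames x states)."""
    log_like = np.log(likelihoods)
    log_trans = np.log(transition)
    num_frames, num_states = log_like.shape
    score = np.log(start_prob) + log_like[0]
    backpointer = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        candidate = score[:, None] + log_trans  # rows: previous state, columns: next
        backpointer[t] = np.argmax(candidate, axis=0)
        score = np.max(candidate, axis=0) + log_like[t]
    # Trace back from the best final state.
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return np.array(path[::-1])

# Invented likelihoods: column 0 = multi-voice, column 1 = single voice.
likelihoods = np.array([[0.7, 0.3]] * 10 + [[0.4, 0.6]] * 10)
start_prob = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1], [0.1, 0.9]])

single_voice_counts = {}
for step in range(1, 21):                       # factors 0.05, 0.10, ..., 1.00
    factor = step * 0.05
    scaled = likelihoods * np.array([1.0, factor])
    path = viterbi_from_likelihoods(scaled, start_prob, transition)
    single_voice_counts[round(factor, 2)] = int(path.sum())
print(single_voice_counts)
```

Shrinking the factor makes the decoder demand stronger evidence before it labels a frame single-voice.  That is the precision-for-recall trade the curves trace out.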




Here I worked with the zero count classifier after optimizing parameters and using a sigmoid instead of a complete cutoff to decide if frequency bins are zero or not.  Again, I manipulated the probability of the single-voice state to produce these different plots.  Unfortunately, all that seemed to do was reduce recall without doing much to precision.
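
The difference between a complete cutoff and a sigmoid can be shown with a tiny Python example using NumPy.  The details of the zero count classifier are not spelled out in this log, so this sketch assumes it counts spectral bins whose magnitude is close to zero.  The function names, threshold, steepness, and magnitudes are all hypothetical.

```python
import numpy as np

def hard_zero_count(magnitudes, threshold):
    """Count bins strictly below the threshold (complete cutoff)."""
    return int(np.sum(magnitudes < threshold))

def soft_zero_count(magnitudes, threshold, steepness):
    """Sum of sigmoid weights: near 1 well below the threshold, near 0 well above."""
    weights = 1.0 / (1.0 + np.exp(steepness * (magnitudes - threshold)))
    return float(np.sum(weights))

# Hypothetical bin magnitudes; two of them sit right next to the threshold.
magnitudes = np.array([0.0, 0.0, 0.09, 0.11, 1.0])
hard = hard_zero_count(magnitudes, threshold=0.1)
soft = soft_zero_count(magnitudes, threshold=0.1, steepness=50.0)
print(hard, round(soft, 3))
```

With the hard cutoff, the bins at 0.09 and 0.11 land on opposite sides even though they are nearly identical.  The sigmoid gives them similar partial weights, so small changes in magnitude no longer flip the count.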




Here I fit GMMs to the auto-correlation classifier with several combinations of features.  I then manipulated the single-voice probability again.  Basically, the three combinations were: (1) original power, power ratio, original power over 500 Hz (break), (2) power ratio, original power over 500 Hz (just break), and (3) original power, power ratio.  They all fared pretty much the same.





April 26, 2007

I ran the MFCC code many times to see how different GMM initial states would affect the outcome.  To get the precision/recall curves, I manipulated the probability of the single-voice outcome.

Here are the mean precision/recall curves for training and test.  The lines on either side are one standard deviation away.




Here is a scatter plot of the results from all 100 tests.




Christine Smit
