Music Content Analysis:
A Practical Investigation of Singing Detection

These pages will lead you through a quick introduction to sound content analysis and classification using Matlab. In particular, we are going to use Ian Nabney's Netlab package to provide the basic statistical pattern recognition tools.

Our task will be to build a detector to find the portions of a piece of pop music where there is singing. We will start off with some training data that has been hand-marked to indicate where the singing starts and stops, then use that to build statistical models of the signal properties that differentiate segments of music with and without singing. We will then test the models on another example for which we have hand-marked ground truth, to get a quantitative measure of how well our system works, such as the proportion of time frames correctly labelled.
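As a rough illustration of the "proportion of time frames correctly labelled" measure, here is a minimal sketch. It is written in plain Python rather than the Matlab used in the practical, and the function name and the example label sequences are made up for illustration; they are not taken from the practical's data.

```python
def frame_accuracy(predicted, reference):
    """Fraction of time frames whose predicted label matches the hand-marked label."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same number of frames")
    if not reference:
        raise ValueError("need at least one frame")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


# Hypothetical labels, one per time frame: 1 = singing, 0 = no singing.
reference = [0, 0, 1, 1, 1, 1, 0, 0]  # hand-marked ground truth (invented)
predicted = [0, 1, 1, 1, 0, 1, 0, 0]  # classifier output (invented)

accuracy = frame_accuracy(predicted, reference)
print(f"{accuracy:.1%} of frames correctly labelled")
```

Here two of the eight frames disagree with the ground truth, so the score is 0.75. A real evaluation would compute the same ratio over every frame of the test example.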

The practical is broken up into several sections:

Data and features: Looking at the labelled music examples that we use to train our classifier, and calculating the cepstral features we will use for the classification.
Gaussian mixture models (GMMs): We will attempt to capture the distribution of feature values for each of our two classes by fitting a set of multidimensional Gaussian blobs to their scatter plots. Using these continuous approximations, we can then label a new feature vector with the appropriate class by seeing which class model has a greater likelihood at that point.
Neural network classification: As an alternative to modeling each distribution, we can try to estimate the posterior probabilities of each class directly. One way to do this is to train a neural network, which has some interesting differences from a GMM approach.
Evaluation: Rather than measuring performance on the training data, a better test of system quality is to measure it on separate, unseen test data: more complex systems sometimes perform worse on unseen cases because they 'overfit' the training data. By varying the number of model parameters, we can experiment to find where overfitting begins.
Temporal smoothing: So far, we have been classifying each time frame individually, but in fact there is a lot of temporal structure: if you know the frame at time N has label A, then the frame at time N+1 is most likely to have the same label. We can try to exploit this local correlation either by simple temporal smoothing of the probabilities, or with Markov models.

And finally, you will have the chance to build your own classifier using the choice of features, model types, and parameters that you think will work best. We can then compare the performance of all the systems on another separate test example to see how well we can do!
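To make the idea of simple temporal smoothing concrete, here is a minimal sketch under stated assumptions. It uses plain Python rather than Matlab. It assumes a classifier has already produced one posterior probability of singing per frame, and it smooths with a centered moving average, which is only one of several possible smoothers. The function names, the window width, the 0.5 threshold, and the probability values are all invented for illustration.

```python
def smooth_probabilities(probs, width=5):
    """Centered moving average; the window is truncated at the sequence ends."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    smoothed = []
    for i in range(len(probs)):
        window = probs[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def label_frames(probs, threshold=0.5):
    """Label a frame as singing (1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


# Hypothetical per-frame posterior probabilities of singing (invented values).
probs = [0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.3, 0.9, 0.8]

raw_labels = label_frames(probs)
smooth_labels = label_frames(smooth_probabilities(probs, width=3))

print("raw:     ", raw_labels)
print("smoothed:", smooth_labels)
```

In this toy sequence the isolated "singing" frame near the start and the isolated "no singing" frame near the end are both absorbed into their neighbours, leaving a single change of label. Wider windows smooth more aggressively, at the cost of blurring the true start and stop times.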

You can download all the files associated with the practical here: muscontent-practical.tar.gz (74 MB, 76031689 bytes), or as a zip file: muscontent-practical.zip (74 MB, 76045455 bytes).

Acknowledgments

This practical was originally developed in July 2003 for the CLSP Summer Workshop at Johns Hopkins. Thanks to Yonggang Deng and Vlasios Doumpiotis for help in debugging.


Last updated: $Date: 2003/07/02 15:39:48 $

Dan Ellis <[email protected]>
