Review of Recent Papers on Content Based Audio Classification and Retrieval

Columbia University

Manuel J. Reyes G.
 
 
1.-  "An overview of audio information retrieval".
      Jonathan Foote Multimedia Systems 1999

SYSTEMS USING STATIC MODELING (for dynamic modeling, see below)

2.- "Content-Based Classification, Search, and Retrieval of Audio"
      E. Wold, T. Blum, D. Keslar, and J. Wheaton IEEE Multimedia May 1999

  This is the "classic" paper in the field. Several of the more recent works use the performance obtained in this work as the baseline for content-based audio classification and retrieval systems.

  The total audio set is divided in two: the database, and the signals that are used as queries to the system.

 They calculate 4 perceptual features for all the audio signals in the database:

-Loudness: approximated by the signal's root-mean-square (RMS) level in decibels.
-Pitch: estimated from a series of short-time Fourier spectra. For each frame, the frequencies and amplitudes of the peaks are measured, and an approximate greatest-common-divisor algorithm is used to calculate an estimate of the pitch.
-Brightness: computed as the centroid of the short-time Fourier magnitude spectra, stored as a log frequency.
-Bandwidth: computed as the magnitude-weighted average of the differences between the spectral components and the centroid.
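
 As a rough illustration of three of these features (loudness, brightness and bandwidth), here is a minimal Python sketch for a single frame. It is not the authors' implementation: the naive DFT, the function names, the absolute-difference form of the bandwidth and the choice to return frequencies in Hz rather than log frequency are all assumptions made for the example. Pitch estimation is left out.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for a short example frame)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def loudness_db(frame, eps=1e-12):
    """RMS level of the frame in decibels, relative to an amplitude of 1.0."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms + eps)  # eps avoids log10(0) on silence

def brightness_and_bandwidth(frame, sample_rate):
    """Spectral centroid (Hz) and magnitude-weighted spread around it (Hz)."""
    mags = magnitude_spectrum(frame)
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    total = sum(mags)
    if total == 0.0:  # silent frame: no meaningful centroid
        return 0.0, 0.0
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    # Assumption: "difference" is taken as the absolute deviation from the centroid.
    bandwidth = sum(abs(f - centroid) * m for f, m in zip(freqs, mags)) / total
    return centroid, bandwidth
```

 For a pure tone that falls exactly on a DFT bin, the centroid should sit at the tone frequency and the bandwidth should be close to zero; noisier sounds give a larger bandwidth.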

 The system can be queried in several ways:

  Classification by model:
 Audio models (e.g., scratchy noises) can be obtained by supervised training. Each model consists of the mean vector and the correlation matrix of the feature vectors of the signals in the training set for that particular model.
 When a new sound needs to be classified, a weighted distance measure is calculated between the new sound's feature vector and each of the models. The class of the signal corresponds to the model for which the distance is minimal.
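
 Classification by model can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes each model keeps only a per-feature mean and variance (a diagonal matrix instead of the full matrix described above), so the weighted distance reduces to a variance-normalized Euclidean distance. The names train_model, weighted_distance and classify are hypothetical.

```python
import math

def train_model(feature_vectors):
    """Return (mean, variance) per feature for one class's training vectors."""
    n = len(feature_vectors)
    dims = len(feature_vectors[0])
    mean = [sum(v[d] for v in feature_vectors) / n for d in range(dims)]
    var = [sum((v[d] - mean[d]) ** 2 for v in feature_vectors) / n
           for d in range(dims)]
    return mean, var

def weighted_distance(x, model, eps=1e-9):
    """Distance from x to the model mean, each feature scaled by its variance.

    Assumption: diagonal weighting only; eps guards against zero variance.
    """
    mean, var = model
    return math.sqrt(sum((xi - mi) ** 2 / (vi + eps)
                         for xi, mi, vi in zip(x, mean, var)))

def classify(x, models):
    """models: dict mapping class name -> model. Returns the nearest class name."""
    return min(models, key=lambda name: weighted_distance(x, models[name]))
```

 Scaling by the variance means that a feature which varies widely within a class counts for less than one that is tightly clustered.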

  Classification and retrieval by similarity with signals in the database:
 The database is divided into 16 classes. When a new sound is presented to the system, a weighted distance measure is calculated between the new sound's feature vector and all the files in the database (in practice, a selective search is done). The input signal is assigned to the class of the retrieved signal, which is the one for which the distance is minimal.

 Queries can also be made by specifying the perceptual characteristics of the desired signals.

 The desirability of a particular perceptual feature can be expressed by changing its weight in the distance function.
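
 The effect of the weights on similarity retrieval can be illustrated with a small sketch. It assumes a plain weighted Euclidean distance and an exhaustive scan of the database (the selective search mentioned above is not modeled); the database layout and function names are hypothetical.

```python
import math

def weighted_euclidean(a, b, weights):
    """Euclidean distance in which each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def retrieve(query, database, weights, top_k=3):
    """database: list of (name, class_label, feature_vector) tuples.

    Returns the top_k entries closest to the query, nearest first.
    """
    ranked = sorted(database,
                    key=lambda e: weighted_euclidean(query, e[2], weights))
    return ranked[:top_k]

def classify_by_similarity(query, database, weights):
    """Assign the query the class of its single nearest database entry."""
    return retrieve(query, database, weights, top_k=1)[0][1]
```

 Setting a weight to zero removes that feature from the comparison, and raising it makes the feature dominate the ranking, so the same query can return different neighbors under different weights.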

3.- "Content-based Audio Classification and Retrieval Using the Nearest Feature Line Method"
         Stan Z. Li
         IEEE Transactions on Speech and Audio Processing September 2000

 Perceptual features (total spectrum power, subband powers, brightness, bandwidth and pitch frequency) and cepstral coefficients are used with the nearest feature line method.
 In this classification method, lines are drawn between every pair of prototypes from the same class within the training set. Classification is done by computing the distance between the input and its projection onto each of these lines in the feature space.
 The classifier finds the minimum distance and assigns the class of the corresponding line to the input file. When used for content-based retrieval, the two points (prototypes) that define the line are returned as the most similar signals in the database.
 The paper reports an error rate of 9.78%, compared with the 18.34% error rate obtained by Wold et al.
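
 The nearest feature line rule described above can be sketched in a few lines. This is an illustrative version only, assuming Euclidean geometry in the feature space and an exhaustive search over all prototype pairs; the function names and the data layout are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature_line_distance(x, p1, p2):
    """Distance from x to its projection on the line through prototypes p1 and p2."""
    direction = [b - a for a, b in zip(p1, p2)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:  # identical prototypes: fall back to point distance
        return euclidean(x, p1)
    # Position of the projection along the line (may fall outside [0, 1]).
    t = sum((xi - a) * d for xi, a, d in zip(x, p1, direction)) / norm_sq
    projection = [a + t * d for a, d in zip(p1, direction)]
    return euclidean(x, projection)

def nfl_classify(x, prototypes):
    """prototypes: dict class_label -> list of feature vectors (two or more each).

    Returns (distance, class_label, p1, p2) for the nearest feature line,
    or None if no class has at least two prototypes.
    """
    best = None
    for label, points in prototypes.items():
        for p1, p2 in combinations(points, 2):
            d = feature_line_distance(x, p1, p2)
            if best is None or d < best[0]:
                best = (d, label, p1, p2)
    return best
```

 The returned pair p1, p2 corresponds to the two prototypes that would be reported as the most similar signals in a retrieval setting.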

4.- "A Study on Content-Based Classification and Retrieval of Audio Database"
       Mingchun Liu and Chunru Wan
 
  A system that combines a probabilistic neural network with a nearest neighbor classifier is used.

Perceptual and coefficient-domain features are used in this work, for a total of 87 features. The sequential selection method is used to find the best (sub)optimal feature set. The paper compares the performance of four classifiers (nearest neighbor, k-nearest neighbors, Gaussian mixture models and probabilistic neural network) on 4 different classification tasks, the first one being the E. Wold et al. task. The results show that the k-NN classifier has the best average performance over the four tasks.
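
To make the feature selection step concrete, here is a small sketch of greedy feature selection wrapped around a k-NN classifier. It assumes the forward variant of sequential selection and uses leave-one-out accuracy as the selection criterion; both choices, and all names in the code, are assumptions for illustration rather than details taken from the paper.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3, features=None):
    """training: list of (feature_vector, label); features: indices to compare on."""
    idx = list(features) if features is not None else list(range(len(x)))

    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))

    neighbours = sorted(training, key=lambda s: dist(x, s[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def loo_accuracy(data, features, k=1):
    """Leave-one-out accuracy of k-NN restricted to the given feature indices."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_classify(vec, rest, k, features) == label
    return hits / len(data)

def sequential_forward_selection(data, n_select, k=1):
    """Greedily add the feature that gives the best leave-one-out accuracy."""
    selected = []
    remaining = list(range(len(data[0][0])))
    while remaining and len(selected) < n_select:
        best = max(remaining,
                   key=lambda f: loo_accuracy(data, selected + [f], k))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The search is greedy, so the resulting subset is only suboptimal in general: a feature chosen early is never dropped later.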

Finally, a probabilistic neural network is used to find the class of the input among a group of 3 general classes. Once this class is found, the distances between the feature vector of the input and each of the signals within the same general class are computed.
The classifier then returns the signals with the smallest distances as the most similar ones. They report improved performance on the E. Wold et al. task compared with Stan Z. Li's system.
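
The two-stage scheme can be sketched as below. The probabilistic neural network is approximated here by its usual Parzen-window form (an average of Gaussian kernels per class); the kernel width sigma, the class labels, the database layout and the function names are assumptions for the example, not values from the paper.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pnn_classify(x, database, sigma=1.0):
    """database: list of (name, general_class, feature_vector) tuples.

    Scores each class by the mean Gaussian kernel response of its samples
    and returns the class with the highest score.
    """
    sums, counts = {}, {}
    for _, label, vec in database:
        kernel = math.exp(-sq_dist(x, vec) / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda label: sums[label] / counts[label])

def hierarchical_retrieve(x, database, top_k=2, sigma=1.0):
    """Stage 1: pick the general class. Stage 2: rank only that class by distance."""
    general_class = pnn_classify(x, database, sigma)
    candidates = [e for e in database if e[1] == general_class]
    candidates.sort(key=lambda e: sq_dist(x, e[2]))
    return general_class, [name for name, _, _ in candidates[:top_k]]
```

Restricting the second stage to one general class reduces the number of distance computations and keeps signals from unrelated classes out of the result list.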
 
 

SYSTEM USING DYNAMIC MODELING.

 X.- "Reduced-Rank Spectra and Minimum-Entropy Priors as Consistent and Reliable Cues for Generalized Sound Recognition"
         Michael A. Casey
         MERL, Cambridge Research Laboratory

 The full frequency spectra obtained by the Fourier transform are reduced in dimensionality by projecting them onto low-dimensional subspaces via reduced-rank spectral basis functions.
 This reduced-rank set of features is then used to train an HMM using the minimum-entropy criterion instead of conventional EM. The idea is to optimize only those parameters that reduce the entropy of the model; the others are not considered.
 Results show that this approach performs better than conventional EM, with the reduced-rank set of features used in both cases.
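
 The dimensionality-reduction step can be illustrated with a small sketch. It assumes the basis functions are the leading right singular vectors of the frames-by-bins spectrum matrix, found here by power iteration with deflation so that only the standard library is needed. This is a stand-in for whatever decomposition the paper actually uses, and the minimum-entropy HMM training is not covered. All names are hypothetical.

```python
import math

def matvec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r * c for r, c in zip(row, v)) for row in rows]

def top_basis(frames, rank, iters=200):
    """Approximate the leading right singular vectors of the frames-by-bins matrix.

    Uses power iteration on X^T X, removing each found component (deflation)
    before searching for the next one.
    """
    residual = [list(row) for row in frames]
    n_bins = len(frames[0])
    basis = []
    for _ in range(rank):
        v = [1.0 / math.sqrt(n_bins)] * n_bins  # deterministic start vector
        for _ in range(iters):
            proj = matvec(residual, v)  # X v, one value per frame
            w = [sum(row[j] * p for row, p in zip(residual, proj))
                 for j in range(n_bins)]  # X^T (X v)
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:  # nothing left in this direction; keep current v
                break
            v = [c / norm for c in w]
        basis.append(v)
        proj = matvec(residual, v)
        residual = [[x - p * vj for x, vj in zip(row, v)]
                    for row, p in zip(residual, proj)]
    return basis

def project(frames, basis):
    """Reduced-rank features: coordinates of each frame on the basis vectors."""
    return [[sum(x * b for x, b in zip(row, vec)) for vec in basis]
            for row in frames]
```

 Each frame is then represented by rank numbers instead of one value per frequency bin, and these short vectors are what a sequence model such as an HMM would be trained on.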