SYSTEMS USING STATIC MODELING (for dynamic modeling, see below)
2.- "Content-Based Classification,
Search, and Retrieval of Audio"
E. Wold, T. Blum, D. Keislar, and
J. Wheaton, IEEE Multimedia, May 1999
This is the "classic" paper in the field; several more recent works use the performance reported here as the baseline for content-based audio classification and retrieval systems.
The full audio set is divided in two: the database, and the signals used as queries to the system.
The authors compute four perceptual features for every audio signal in the database:
-Loudness: approximated by the signal's root-mean-square (RMS) level in
decibels.
-Pitch: estimated from a series of short-time Fourier spectra.
For each frame, the frequencies and amplitudes of the peaks are
measured, and an approximate greatest-common-divisor algorithm is used to
calculate an estimate of the pitch.
-Brightness: computed as the centroid of the short-time Fourier
magnitude spectra, stored as a log frequency.
-Bandwidth: computed as the magnitude-weighted average of the differences
between the spectral components and the centroid.
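Three of the four features above can be sketched for a single frame as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a naive DFT stands in for a real FFT, the bandwidth uses absolute differences (an assumption, since the text does not say how the differences are measured), and the centroid is returned in Hz rather than as a log frequency. Pitch estimation is omitted.

```python
import cmath
import math


def rms_db(frame):
    """Loudness proxy: RMS level of one frame, in decibels."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12))  # floor avoids log(0) on silence


def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (fine for a sketch)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def brightness_and_bandwidth(frame, sample_rate):
    """Spectral centroid (Hz) and magnitude-weighted spread around it (Hz)."""
    mags = magnitude_spectrum(frame)
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    total = sum(mags)
    if total == 0.0:
        return 0.0, 0.0
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    # Assumption: "difference" is taken as the absolute deviation from the centroid.
    bandwidth = sum(abs(f - centroid) * m for f, m in zip(freqs, mags)) / total
    return centroid, bandwidth


# Usage: a pure 1 kHz sine sampled at 8 kHz, 64 samples (exactly 8 cycles).
sr = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(64)]
loudness = rms_db(frame)
centroid, bandwidth = brightness_and_bandwidth(frame, sr)
```

For a pure tone the centroid sits at the tone's frequency and the bandwidth is close to zero; a noisy signal would spread energy across bins and raise both values.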
The system can be queried in several ways:
Classification by model:
Audio models (e.g., scratchy noises) can be obtained by supervised
training. A model consists of the mean vector and the correlation
matrix of the feature vectors of the signals in the training set for that
particular model.
When a new sound needs to be classified, a weighted distance
measure is calculated between the new sound's feature vector and each model.
The sound is assigned to the class of the model for which the distance
is minimal.
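Classification by model can be sketched as below. This is a simplified illustration under stated assumptions: each model keeps only a mean and a per-feature variance (a diagonal stand-in for the full matrix described above), the distance is the variance-scaled distance to the mean, and all names and feature values are hypothetical.

```python
import math


def train_model(vectors):
    """Return (mean, variance) per feature for one class's training vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dims)]
    return mean, var


def weighted_distance(x, model):
    """Distance to the class mean, each feature scaled by its variance."""
    mean, var = model
    return math.sqrt(
        sum((xi - m) ** 2 / max(v, 1e-12) for xi, m, v in zip(x, mean, var))
    )


def classify(x, models):
    """Pick the class label whose model is closest to x."""
    return min(models, key=lambda label: weighted_distance(x, models[label]))


# Hypothetical 2-D feature vectors (e.g. loudness in dB, brightness in Hz).
training = {
    "scratchy": [[-20.0, 3000.0], [-22.0, 3200.0], [-18.0, 2900.0]],
    "hum": [[-30.0, 120.0], [-31.0, 100.0], [-29.0, 140.0]],
}
models = {label: train_model(vecs) for label, vecs in training.items()}
label = classify([-21.0, 3100.0], models)
```

Scaling by the variance keeps a feature with a large numeric range (here, brightness in Hz) from dominating one with a small range (loudness in dB).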
Classification and retrieval by similarity with signals in the
database:
The database is divided into 16 classes. When a new sound is presented
to the system, a weighted distance measure is calculated between the new sound's
feature vector and those of all the files in the database (in practice, a selective search
is done). The input signal is assigned to the class
of the retrieved signal for which
the distance is minimal.
Queries can also be made by specifying the perceptual characteristics of the desired signals.
The relative importance of a particular perceptual feature can be expressed by changing its weight in the distance function.
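The similarity search and the effect of the weights can be sketched as follows. This is an illustrative toy, not the system's code: the database entries, class names, and feature values are made up, the features are assumed to be already normalized, the distance is a plain weighted Euclidean distance, and the search is exhaustive rather than selective.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance with a per-feature weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def retrieve(query, database, weights, top=3):
    """Return the `top` database entries closest to the query, nearest first."""
    ranked = sorted(
        database, key=lambda e: weighted_distance(query, e["features"], weights)
    )
    return ranked[:top]


def classify(query, database, weights):
    """Give the query the class of its single nearest database entry."""
    return retrieve(query, database, weights, top=1)[0]["label"]


# Hypothetical database; features are (loudness, pitch, brightness, bandwidth),
# assumed to be already normalized to comparable ranges.
database = [
    {"name": "bell1", "label": "bells", "features": [0.2, 0.9, 0.8, 0.3]},
    {"name": "bell2", "label": "bells", "features": [0.3, 0.8, 0.9, 0.2]},
    {"name": "laugh1", "label": "laughter", "features": [0.7, 0.4, 0.5, 0.7]},
    {"name": "rain1", "label": "rain", "features": [0.8, 0.8, 0.3, 0.9]},
]
query = [0.75, 0.88, 0.4, 0.8]

equal = classify(query, database, [1.0, 1.0, 1.0, 1.0])       # all features count
pitch_only = classify(query, database, [0.0, 1.0, 0.0, 0.0])  # only pitch counts
```

With equal weights the query lands nearest the rain recording; with all the weight on pitch, the nearest entry is a bell, which shows how re-weighting changes what "similar" means.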
3.- "Content-based Audio
Classification and Retrieval Using the Nearest Feature Line Method"
Stan Z. Li
IEEE Transactions
on Speech and Audio Processing, September 2000
Perceptual features (total spectrum power, subband powers, brightness,
bandwidth, and pitch frequency) and cepstral coefficients are used with the
nearest feature line method.
In this classification method, a line is drawn through every pair
of prototypes from the same class within the training set. Classification
is done by computing the distance between the input and its projection onto
each of these lines in the feature space.
The classifier finds the minimum distance and assigns the class of the
corresponding line to the input file. When used for content-based retrieval, the two points (prototypes)
that define the line are returned as the most similar signals in the database.
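The geometry of the nearest feature line rule can be sketched as below. The function names and the toy prototypes are hypothetical; the projection formula is the standard orthogonal projection of a point onto the line through two points, and the line is treated as infinite, so the projection may fall outside the segment between the prototypes.

```python
import math
from itertools import combinations


def feature_line_distance(x, p1, p2):
    """Distance from x to its projection on the line through p1 and p2."""
    direction = [b - a for a, b in zip(p1, p2)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:  # identical prototypes: fall back to point distance
        return math.dist(x, p1)
    # Position of the projection along the line (may fall outside [0, 1]).
    mu = sum((xi - a) * d for xi, a, d in zip(x, p1, direction)) / norm_sq
    projection = [a + mu * d for a, d in zip(p1, direction)]
    return math.dist(x, projection)


def nearest_feature_line(x, prototypes):
    """Return (distance, label, (p1, p2)) for the closest same-class line."""
    best = None
    for label, points in prototypes.items():
        for p1, p2 in combinations(points, 2):
            d = feature_line_distance(x, p1, p2)
            if best is None or d < best[0]:
                best = (d, label, (p1, p2))
    return best


# Hypothetical 2-D prototypes for two classes.
prototypes = {
    "a": [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]],
    "b": [[0.0, 5.0], [2.0, 5.0]],
}
distance, label, pair = nearest_feature_line([5.0, 0.5], prototypes)
```

The returned pair of prototypes is what a retrieval system would present as the most similar database signals, while the label is the classification result.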
The paper reports an error rate of 9.78%, compared with
the 18.34% error rate obtained by Wold et al.
4.- "A Study on Content-Based Classification and Retrieval of Audio
Database"
Mingchun Liu and Chunru Wan
A system that combines a probabilistic neural network with a
nearest neighbor classifier is used.
Perceptual and coefficient-domain features are used in this work, for a total of 87 features. A sequential selection method is used to find a (sub)optimal feature subset. The paper compares the performance of four classifiers (nearest neighbor, k-nearest neighbors, Gaussian mixture models, and probabilistic neural network) on four different classification tasks, the first being the E. Wold et al. task. The results show that the k-NN classifier has the best average performance over the four tasks.
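Feature selection of this kind can be sketched as a greedy search. The sketch assumes the forward variant (start empty, add one feature at a time) and scores each candidate subset by leave-one-out nearest-neighbor accuracy; both choices are assumptions for illustration, not details taken from the paper, and the data is made up.

```python
import math


def loo_nn_accuracy(samples, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `subset`."""
    correct = 0
    for i, x in enumerate(samples):
        best_j = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(
                [x[f] for f in subset], [samples[j][f] for f in subset]
            ),
        )
        correct += labels[best_j] == labels[i]
    return correct / len(samples)


def sequential_forward_selection(samples, labels, max_features):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(range(len(samples[0])))
    while remaining and len(selected) < max_features:
        scored = [
            (loo_nn_accuracy(samples, labels, selected + [f]), f) for f in remaining
        ]
        score, feature = max(scored, key=lambda s: s[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical data: only feature 1 separates the two classes.
samples = [
    [0.9, 0.10, 0.50], [0.1, 0.20, 0.40], [0.5, 0.15, 0.90],
    [0.2, 0.90, 0.45], [0.8, 0.80, 0.10], [0.4, 0.95, 0.85],
]
labels = ["x", "x", "x", "y", "y", "y"]
selected, score = sequential_forward_selection(samples, labels, max_features=2)
```

Because the search is greedy, the result is only guaranteed to be locally good, which is why the selected set is described as (sub)optimal.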
Finally, a probabilistic neural network is used to assign the
input to one of 3 general classes. Once this class is found,
the distances between the feature vector of the input and those of each of the
signals within the same general class are computed.
The classifier returns the signals
with the smallest distances as the most similar ones. The authors report
an improvement in performance on the E. Wold et al. task compared with
Stan Z. Li's system.
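The two-stage scheme can be sketched as follows. The probabilistic neural network is written here in its usual form as a Gaussian-kernel (Parzen window) classifier; the kernel width `sigma`, the three class names, and the database entries are hypothetical, and plain Euclidean distance is assumed for the second stage.

```python
import math


def pnn_classify(x, training, sigma=0.5):
    """Probabilistic neural network: pick the class with the highest
    average Gaussian-kernel response over its training vectors."""
    def score(vectors):
        return sum(
            math.exp(-math.dist(x, v) ** 2 / (2.0 * sigma ** 2)) for v in vectors
        ) / len(vectors)
    return max(training, key=lambda label: score(training[label]))


def retrieve_in_class(x, database, general_class, top=2):
    """Rank only the database entries of the chosen general class."""
    candidates = [e for e in database if e["general_class"] == general_class]
    return sorted(candidates, key=lambda e: math.dist(x, e["features"]))[:top]


# Hypothetical database with three general classes and 2-D features.
database = [
    {"name": "talk1", "general_class": "speech", "features": [0.1, 0.2]},
    {"name": "talk2", "general_class": "speech", "features": [0.2, 0.1]},
    {"name": "piano1", "general_class": "music", "features": [0.9, 0.8]},
    {"name": "violin1", "general_class": "music", "features": [0.8, 0.95]},
    {"name": "door1", "general_class": "sound", "features": [0.1, 0.9]},
    {"name": "rain1", "general_class": "sound", "features": [0.2, 0.8]},
]
training = {}
for entry in database:
    training.setdefault(entry["general_class"], []).append(entry["features"])

query = [0.85, 0.9]
general = pnn_classify(query, training)                # stage 1: coarse class
matches = retrieve_in_class(query, database, general)  # stage 2: nearest signals
```

Restricting the distance computation to one general class reduces the number of comparisons and keeps signals from unrelated classes out of the ranked results.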
SYSTEMS USING DYNAMIC MODELING
X.- "Reduced-Rank Spectra and Minimum-Entropy Priors as Consistent
and Reliable Cues for Generalized Sound Recognition"
Michael A. Casey
MERL, Cambridge
Research Laboratory
The full frequency spectrum obtained by the Fourier transform is
reduced in dimensionality by projection onto low-dimensional subspaces
via reduced-rank spectral basis functions.
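The projection step alone can be illustrated as below. How the basis functions are obtained is not covered here; the two basis vectors are hand-made placeholders chosen to be orthonormal, and the spectrum values are made up.

```python
def project(spectrum, basis):
    """Coefficients of a spectrum on each basis function (inner products)."""
    return [sum(s * b for s, b in zip(spectrum, vec)) for vec in basis]


# Hypothetical orthonormal basis over an 8-bin spectrum: one function covers
# the low bins, the other the high bins. A real system would derive the basis
# from data; these two vectors are placeholders.
basis = [
    [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5],
]
spectrum = [4.0, 3.0, 2.0, 1.0, 0.5, 0.5, 0.0, 0.0]
features = project(spectrum, basis)  # 8 spectral values reduced to 2 coefficients
```

Each frame of the spectrogram would be projected this way, giving a sequence of short coefficient vectors to use as observations for the temporal model.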
This reduced-rank set of features is then used to train an HMM
using the minimum-entropy criterion instead of conventional EM. The
idea is to optimize only those parameters that reduce the entropy
of the model; the others are not considered.
Results show that this approach performs better than
conventional EM when the reduced-rank set of features is used in both cases.