EE6820    Paper Readings

Last update:    03/27/2001

Week1, Jan 22, 2001

Phase Vocoder
Flanagan & Golden, 1966

This paper presents a scheme for representing a speech signal by its short-time amplitude and phase spectra. The method splits the original signal into several bandpass signals and represents each as a carrier modulated by the short-time amplitude spectrum and by the time derivative of the phase spectrum, both of which are band-limited signals. An implementation and its results are then given, and applications such as multiplexing and time scaling are discussed.
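
To make the amplitude / phase-derivative representation concrete, here is a minimal sketch of the analysis for a single band. It is not the paper's filter bank: the band is isolated by shifting it down to 0 Hz and smoothing with a Hann window, and the names analyze_band, center_hz and window_len are my own assumptions.

```python
import cmath
import math


def analyze_band(x, center_hz, sample_rate, window_len):
    """Return (amplitude, freq_offset_hz) tracks for one analysis band.

    The band is isolated by shifting center_hz down to 0 Hz and smoothing
    with a Hann window; this lowpass stands in for the paper's bandpass
    filters and is an assumption of this sketch.
    """
    w = 2.0 * math.pi * center_hz / sample_rate
    win = [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / window_len)
           for i in range(window_len)]
    total = sum(win)
    win = [v / total for v in win]  # unit gain at 0 Hz

    # Heterodyne: move the band centre down to 0 Hz.
    shifted = [x[n] * cmath.exp(-1j * w * n) for n in range(len(x))]

    amplitude, freq_offset_hz = [], []
    prev = None
    for n in range(window_len - 1, len(x)):
        z = sum(win[i] * shifted[n - i] for i in range(window_len))
        # A real sinusoid splits its amplitude between +f and -f, hence 2x.
        amplitude.append(2.0 * abs(z))
        if prev is not None:
            # Phase advance over one sample, i.e. a discrete phase derivative,
            # expressed as a frequency offset from the band centre.
            dphi = cmath.phase(z * prev.conjugate())
            freq_offset_hz.append(dphi * sample_rate / (2.0 * math.pi))
        prev = z
    return amplitude, freq_offset_hz
```

For a steady sinusoid inside the band, both tracks are nearly constant, which is why they can be transmitted as slowly varying, band-limited signals.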

Possible reasons it did not become popular in telephone communication:
1. not resilient to noise
2. computationally intensive, so possibly unsuitable for real-time use
3. better methods such as PCM took over

Week2, Jan 29, 2001

W. Warren and R. Verbrugge, "Auditory Perception of Breaking and Bouncing Events: Psychophysics", Natural Computation, MIT Press, 1988

Applies the "ecological approach" to the acoustic consequences of dropping a glass object, comparing subsequent bouncing with breaking.
-    Physical analysis of the source event, identifying its structural-invariant and transformational-invariant features
-    Identification of higher-order acoustic properties
     for bouncing: a single damped quasi-periodic pulse train
     for breaking: an initial rupture burst dissolving into multiple damped quasi-periodic pulse trains
-    Empirical tests with natural and synthetic events
May be useful for synthesizing and identifying specific types of acoustic events. But how can this be extended to cover the diversity of even a small fraction of natural sounds?
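
The two pulse-train patterns can be sketched as event lists. This is an idealised model of my own, not the authors' stimulus generator: each impact shortens the next interval by a constant restitution factor, and scaling the amplitude by the same factor is an assumption. The names bounce_events and breaking_events are hypothetical.

```python
def bounce_events(first_interval, restitution, n_bounces, first_amp=1.0):
    """Impact (time, amplitude) pairs for one idealised bouncing object.

    Each flight lasts `restitution` times as long as the previous one, giving
    a damped quasi-periodic pulse train. Scaling the amplitude by the same
    factor is an assumption of this sketch.
    """
    events = []
    t, interval, amp = 0.0, first_interval, first_amp
    for _ in range(n_bounces):
        events.append((t, amp))
        t += interval
        interval *= restitution
        amp *= restitution
    return events


def breaking_events(pieces, n_bounces):
    """Superpose one bounce train per fragment, all starting at the rupture.

    pieces: list of (first_interval, restitution, first_amp), one per fragment.
    """
    events = []
    for first_interval, restitution, first_amp in pieces:
        events.extend(
            bounce_events(first_interval, restitution, n_bounces, first_amp))
    return sorted(events)
```

Bouncing yields one train with steadily shrinking gaps; breaking yields several such trains with different periods, interleaved after a common start.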

Week3, Feb 3, 2001

Julius O. Smith, "Physical Modeling using Digital Waveguides", Computer Music Journal, v16 n4 p74-91

This paper presents the solution of the one-dimensional wave equation as a sum of 2 traveling waves, discusses issues in realizing the digital delay lines, and covers the 1-D lossy wave equation together with its initial and boundary conditions. It illustrates the theory with specific examples: the ideal and damped plucked string, the struck string, and a single-reed instrument. Finally, it gives a sample implementation of a 1-D waveguide (a plucked string with controllable amplitude, pitch, duration and initial condition).
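
The traveling-wave picture can be sketched with two delay lines. This is my own minimal version, not the implementation in the paper: the loss is lumped into a single scalar at one termination, both ends reflect with inversion, and the names plucked_string, pluck_pos and pickup_pos are assumptions.

```python
def plucked_string(n_samples, delay_len, pluck_pos, pickup_pos, loss=0.995):
    """Simulate a plucked string with two delay lines of traveling waves.

    right[i] and left[i] hold the right- and left-going displacement waves at
    spatial sample i; the physical displacement is their sum. All loss is
    lumped into one scalar at the right-hand end (an assumption).
    """
    if not 0 < pluck_pos < delay_len - 1:
        raise ValueError("pluck_pos must lie strictly inside the string")

    # Triangular initial displacement with zero velocity: split it equally
    # between the two traveling waves.
    shape = []
    for i in range(delay_len):
        if i <= pluck_pos:
            shape.append(i / pluck_pos)
        else:
            shape.append((delay_len - 1 - i) / (delay_len - 1 - pluck_pos))
    right = [0.5 * s for s in shape]
    left = [0.5 * s for s in shape]

    out = []
    for _ in range(n_samples):
        out.append(right[pickup_pos] + left[pickup_pos])
        to_left = -loss * right[-1]  # inverting, lossy reflection (right end)
        to_right = -left[0]          # inverting, lossless reflection (left end)
        right = [to_right] + right[:-1]  # right-going wave moves one step
        left = left[1:] + [to_left]      # left-going wave moves one step
    return out
```

One round trip takes 2 * delay_len samples, so the pitch is sample_rate / (2 * delay_len), and each period is the previous one scaled by loss, which together set pitch and duration.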

Week4, Feb 10, 2001

Eric Scheirer, Malcolm Slaney, "Construction and evaluation of a robust multi-feature speech/music discriminator", ICASSP 97

This is a rather thorough examination of speech/music discrimination. The authors examine 13 features with differing latency and computational cost, and 4 main kinds of classifiers. The comparably good performance of the different classifiers suggests that the feature space has a simple topology. An evaluation of the feature choices based on performance and latency is also given, which raises the question of which features actually add value to the classification.

Week5, Feb 20, 2001

E. Zwicker, G. Flottorp, S. S. Stevens, "Critical bandwidth in loudness summation", Journal of the Acoustical Society of America, v29 n5, May 1957

Detailed experiments on subjective loudness. For line spectra, a single tone is compared with complexes of spaced tones as a function of their spacing; for bandpass noise, loudness is studied as a function of bandwidth. The critical bandwidth deltaF0 is defined as the overall tone spacing or noise bandwidth at which the perceived loudness begins to increase as deltaF increases; beyond that point the relationship is roughly linear. The point where the increase begins is relatively independent of SPL but depends on the center frequency. In addition to the experiments described here, values of the critical bandwidth can also be measured using threshold, masking, and phase experiments.

Week6, Feb 27, 2001

Yannis Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis", IEEE Transactions on Speech and Audio Processing, vol. 9 no. 1, Jan. 2001

A detailed treatment of concatenative speech synthesis. The harmonic plus noise model (HNM) is used to parameterize the speech segment library and facilitate search, and effective post-processing is introduced to smooth discontinuities at the points of concatenation. Results are good. I have yet to get into the mathematical details.

Week7, March 4, 2001

Antti Eronen, Anssi Klapuri, "Musical instrument recognition using cepstral coefficients and temporal features", IEEE ICASSP, June 2000

Uses a wide set of features to classify musical instruments in a hierarchy. Based on prior research on timbre, the authors chose more than 20 different features in 2 sets, one for the spectral and one for the temporal dimension of timbre. This is progress over prior work by Martin, Kaminsky et al., where only 1 set was used. Gaussian or k-NN classifiers are then used in a hierarchical decision tree, where the Gaussian classifier performs better at higher levels and k-NN performs better at lower levels. Quite accurate results are achieved.
Questions and random thoughts:
    1. What is the training set? Is it drawn in parallel with the validation set?
    2. Where does the "knowledge of the best feature in a given node" come from? Is it empirical, heuristic, or something else?
    3. If the dataset is expanded, does this kind of classifier need major modification?
    4. Harp and percussion instruments are missing from the hierarchy. The former could go in the pizzicato family with the piano; the latter may be trickier, because they do not occupy a consistent domain of audio features, e.g. they can have both tonal and non-tonal sounds, and the non-tonal sounds have another kind of structure we may be able to use, as mentioned in the book.
    5. Another interesting thing to classify is the human voice. As shown in this week's practical, vocal music also exhibits tonal structure, and we may be able to use vocal models and features to assist the decision.
    6. This is only the Western musical instrument tree. Can we try the following: given an unknown instrument (e.g. some flute from Latin America), the classifier would find which node (sustained-->reed) it belongs to and return its closest kin in a knowledge base.
    7. The validation set used here contains only solo tones. What if we want to use melodies, or identify the solo instrument in a sonata? Separating the instruments of a symphony may be too ambitious, but is, for example, distinguishing string trios from quartets worth doing (or, say, counting the number of instruments in a piece)?
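
The hierarchical decision can be sketched as a chain of k-NN votes, one per tree node. This is only an illustration of the structure: the paper selects features per node and also uses Gaussian classifiers, whereas this sketch uses one feature vector and k-NN everywhere. The function names and the tree layout are my assumptions.

```python
from collections import Counter


def knn_predict(train, x, k=3):
    """Majority label among the k training vectors nearest to x.

    train: list of (feature_vector, label) pairs.
    """
    by_dist = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]


def classify_hierarchical(tree, x, k=3):
    """Descend from "root", letting each node's k-NN vote pick the child.

    tree: dict mapping a node name to its training pairs, whose labels are
    child node names. A label with no entry in `tree` is a leaf.
    """
    node = "root"
    while node in tree:
        node = knn_predict(tree[node], x, k)
    return node
```

With such a tree, an unknown instrument as in question 6 would at least be assigned to an intermediate family even when no leaf matches it well.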

Week9, March 27, 2001

Charles R. Jankowski, Hoang-Doan H. Vo, Richard P. Lippmann, "A comparison of signal processing front ends for automatic word recognition", IEEE Transactions on Speech and Audio Processing, July 1995

- A thorough treatment of an important prerequisite of speech recognition: evaluation of the feature extraction front end.
- Focuses on isolated-word recognition in noisy environments
- A multi-angle comparison of various front ends such as MFB, the Seneff auditory model, EIH, and LPC-based ones, under variations in training procedure (clean vs. multi-style), spectral variability, and noise type (white vs. babble)
- Almost every method has its own pros and cons under different circumstances, yet multi-style training seems more effective than clean training.
- Extension to continuous speech is left as future work
- Why does MFB perform better in babble noise than in white noise? Does the noise spectrum also conform to the mel filterbank model?

Project Bibliography

[1]    Albert S. Bregman, "Auditory scene analysis: hearing in complex environments", Thinking in Sound: The Cognitive Psychology of Human Audition, Oxford University Press, 1993, pp. 10-36
    This paper offers interesting insights into human auditory perception in complex environments. Our auditory system is not just a simple LTIC recorder that "sums up" all the frequency components occurring at the same time; it uses pitch, timing, change, timbre, spatial and other information to segment auditory events and summarize them into a high-level percept. For example, phenomena such as "psychophysical complementarities", timing regularity, the gradualness of change, the harmonic nature of vibrating bodies, and comodulation masking release not only provide useful guidance for computational auditory scene analysis, but also help in the search for better solutions in audio and music synthesis.

[2]    John Saunders, "Real-time discrimination of broadcast speech/music", ICASSP 96
    This paper presents a simple and fast algorithm for real-time speech/music discrimination. Speech has a characteristic structure, a succession of syllables composed of short periods of frication followed by longer periods of vowels, while music displays a tonal nature. The author therefore looks at the zero-crossing rate (ZCR) contour frame by frame: the ZCR fluctuates relatively little for music, while it alternates between high and low values for speech. The pattern recognition problem on these contours is then solved using 4 features extracted from the curve: the 1st-order difference of the ZCR, the 3rd central moment about the mean, the total number of ZCR values above a threshold, and (number of ZCR values above the mean) - (number of ZCR values below the mean). A further improvement comes from adding another feature, the energy contour dip. The classification results are pretty good.
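
    The ZCR contour and the four contour features can be sketched as follows. This is my reading of the description above, not the paper's code: summarising the 1st-order difference by its standard deviation is an assumption, as are the names zcr_contour and zcr_features and the non-overlapping framing.

```python
import math


def zcr_contour(signal, frame_len):
    """Count sign changes in each non-overlapping frame of the signal."""
    contour = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        contour.append(crossings)
    return contour


def zcr_features(contour, threshold):
    """Four summary features of a ZCR contour (needs at least 2 frames)."""
    n = len(contour)
    mean = sum(contour) / n
    diffs = [b - a for a, b in zip(contour, contour[1:])]
    diff_mean = sum(diffs) / len(diffs)
    return {
        # Spread of the frame-to-frame change; using the standard deviation
        # here is an assumption of this sketch.
        "diff_std": math.sqrt(
            sum((d - diff_mean) ** 2 for d in diffs) / len(diffs)),
        # Third central moment: asymmetry of the contour about its mean.
        "third_moment": sum((c - mean) ** 3 for c in contour) / n,
        # Number of frames whose ZCR exceeds a fixed threshold.
        "above_threshold": sum(1 for c in contour if c > threshold),
        # Frames above the mean minus frames below the mean.
        "above_minus_below": (sum(1 for c in contour if c > mean)
                              - sum(1 for c in contour if c < mean)),
    }
```

    A contour that sits low with occasional high spikes, as expected for speech, shows up as a large positive third moment and a negative above-minus-below count.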

[3]    Dellaert, F.; Polzin, T.; Waibel, A., "Recognizing emotion in speech", ICSLP 1996

[4]    Droppo, J., Acero, A., "Maximum a posteriori pitch tracking", ICSLP 1998
    A rather mathematical approach to pitch tracking. The general idea is to estimate P_MAP = argmax_P f(P|X) from the prior f(P) and the likelihood f(X|P). f(P) is modeled as a 1st-order Gaussian Markov process, while f(X|P) is obtained from a predictable Gaussian energy density. Enhancements are also suggested to better approximate the cross-correlation factor in f(X|P), including bandpass filtering, forward-backward prediction, variable frame length, and logarithmic sampling.
    Pitch tracking is a rather old topic; why it remains occasionally active without new theoretical grounds being introduced is not yet clear to me.
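
    The MAP estimate over a whole utterance can be sketched as a dynamic-programming search over candidate pitches. This is not the paper's algorithm: the per-frame log-likelihoods log f(x_t|p) are taken as given inputs, the initial prior is flat, and the Gaussian transition term drops its normalising constant. The name map_pitch_track and its parameters are my assumptions.

```python
def map_pitch_track(log_likelihood, candidates, sigma):
    """Most probable pitch sequence under a 1st-order Gaussian Markov prior.

    log_likelihood[t][k] is log f(x_t | p = candidates[k]), supplied by the
    caller. sigma is the standard deviation of the frame-to-frame pitch
    change. The initial prior is flat (an assumption of this sketch).
    """
    n_cand = len(candidates)

    def log_trans(i, j):
        # Log of the Gaussian transition density, up to a constant.
        d = candidates[j] - candidates[i]
        return -0.5 * (d / sigma) ** 2

    score = list(log_likelihood[0])
    back = []
    for frame in log_likelihood[1:]:
        new_score, pointers = [], []
        for j in range(n_cand):
            best_i = max(range(n_cand),
                         key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j) + frame[j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final candidate.
    k = max(range(n_cand), key=lambda i: score[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[k] for k in path]
```

    A small sigma makes the prior reject isolated jumps such as octave errors, while a very large sigma reduces the result to frame-by-frame maximum likelihood.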

[5]    Arons, Barry, "Speechskimmer: A System for Interactively Skimming Recorded Speech", ACM CHI 1997

[6]    Chao Wang; Seneff, S., "Robust pitch tracking for prosodic modeling in telephone speech", ICASSP 2000

[7]    Jonathan Foote, "Visualizing music and audio using self-similarity", Proc. ACM Multimedia 1999
    The idea is novel, and the checkerboard structure does give new insight into how to represent similarity in music.
    But there are drawbacks:
    a) The similarity measure is not sufficient. In addition to exact repetition of the music, it should also capture similarity under tonal variation and in melody;
    b) The black-and-white checkerboard does not carry enough information: we can see that two Cs are similar and that two Ds are similar, but it cannot show that these notes are different from each other. If music transcription were a solved problem, it would be easier to add color or a similar cue.
    c) It is somewhat difficult to analyze this representation automatically and extract information for indexing and retrieval.
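
    The self-similarity representation itself is easy to sketch: compare every frame's feature vector with every other frame's and store the result in a square matrix. Cosine similarity is used here as one plausible measure, and the feature vectors are left to the caller; both choices, and the function names, are assumptions of this sketch rather than details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is 0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity(frames):
    """Square matrix S with S[i][j] = similarity of frame i and frame j."""
    return [[cosine_similarity(a, b) for b in frames] for a in frames]
```

    Runs of similar frames appear as bright square blocks along the diagonal, which is the checkerboard pattern discussed above.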
