e6820
speech and audio processing and recognition
spring 2002

page overview:

This page collects reviews of papers relevant to the e6820 final project. Click the title of a paper to download it. If you are one of the authors and would like your paper taken off this list for any reason, send me an email and I'll promptly remove it.

The papers are organized in the following conceptual groups for easier browsing.

  1. Parametric Modeling
  2. Temporal Noise Shaping (TNS)
  3. Warped Linear Prediction
  4. Modified Discrete Cosine Transform (MDCT)
  5. Harmonic plus Individual Lines and Noise (HILN)
  6. Miscellaneous
  7. Other Project Ideas

    I. Parametric Modeling

  1. X. Serra, "Musical Sound Modeling with Sinusoids plus Noise", Published in G. D. Poli, A. Picialli, S. T. Pope, and C. Roads Ed., Musical Signal Processing, Swets & Zeitlinger Publishers.

    Summary:
    This paper is a comprehensive summary of the ideas behind sound modeling with the sines+noise model. Magnitude and phase spectra are calculated. Phase is used only to determine the residual (by subtracting the deterministic part) and is discarded afterwards. The deterministic part is extracted using long windows for good frequency resolution, and the stochastic part using short windows for better performance during attacks. The stochastic part is modeled as white noise shaped by a line-segment approximation of its spectrum. This is preferred over LPC because it is easier to manipulate (with LPC one has to modify poles). Synthesis is performed using the IFFT for efficiency.
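    Sketch:
    A minimal illustration of the stochastic part as summarized above: one frame of noise whose magnitude spectrum follows a line-segment envelope, synthesized with an inverse FFT. The function name, breakpoint values, sampling rate and FFT size are my own assumptions, not taken from the paper, and a real synthesizer would overlap-add windowed frames.

```python
import numpy as np

def synth_noise_frame(break_freqs_hz, break_mags, sr=44100, n_fft=1024, rng=None):
    """One frame of noise shaped by a piecewise-linear magnitude envelope.

    break_freqs_hz / break_mags are hypothetical envelope breakpoints;
    phases are drawn at random, which is what makes the frame noise-like.
    """
    rng = np.random.default_rng() if rng is None else rng
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Line-segment envelope: linear interpolation between breakpoints.
    mags = np.interp(bin_freqs, break_freqs_hz, break_mags)
    phases = rng.uniform(-np.pi, np.pi, size=mags.shape)
    spectrum = mags * np.exp(1j * phases)
    # DC and Nyquist bins of a real signal must be real-valued.
    spectrum[0] = mags[0]
    spectrum[-1] = mags[-1]
    return np.fft.irfft(spectrum, n=n_fft)

# Example envelope (assumed values): falling from 1.0 at DC to 0.0 at Nyquist.
env_freqs = [0.0, 2000.0, 8000.0, 22050.0]
env_mags = [1.0, 0.5, 0.1, 0.0]
frame = synth_noise_frame(env_freqs, env_mags, rng=np.random.default_rng(0))
```

    Editing the sound then amounts to moving the envelope breakpoints, which is the ease of manipulation the summary refers to.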

  2. M. Goodwin, "Residual modeling in music analysis-synthesis", Proc. ICASSP'96, Vol. 2, pp 1005-1008, 1996

    Summary:
    This paper models the residual noise of the sines+noise model using Equivalent Rectangular Bands (ERBs). It improves the realism of frequency-domain additive synthesis. It is primarily effective for representing the breath noise that appears in the sinusoidal analysis-synthesis residuals of wind instruments. Because the phase is not treated carefully, the ERB model muddies the transients.
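    Sketch:
    A rough illustration of summarizing a residual frame by its energy in ERB-spaced bands. The ERB-rate formula used here (21.4*log10(1 + 0.00437 f), after Glasberg and Moore) is my assumption; the paper may define its bands differently. Function names, window and frame size are also hypothetical.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    # Assumed ERB-rate scale (Glasberg and Moore); not confirmed by the paper.
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f_hz, dtype=float))

def erb_rate_to_hz(erb):
    return (10.0 ** (np.asarray(erb, dtype=float) / 21.4) - 1.0) / 0.00437

def erb_band_energies(residual_frame, sr=44100):
    """Return (band edges in Hz, energy per band) for one residual frame."""
    n = len(residual_frame)
    power = np.abs(np.fft.rfft(residual_frame * np.hanning(n))) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    # One band per unit of ERB rate, covering 0 Hz up to at least Nyquist.
    n_bands = int(np.ceil(hz_to_erb_rate(sr / 2.0)))
    edges_hz = erb_rate_to_hz(np.arange(n_bands + 1))
    band_index = np.searchsorted(edges_hz, freqs, side="right") - 1
    band_index = np.clip(band_index, 0, n_bands - 1)
    # Sum the FFT bin powers that fall into each band.
    energies = np.bincount(band_index, weights=power, minlength=n_bands)
    return edges_hz, energies

residual = np.random.default_rng(0).standard_normal(1024)  # stand-in residual
edges, energies = erb_band_energies(residual)
```

    Only the per-band energies are kept, so the phase of the residual is not represented, which is consistent with the smeared transients noted in the summary.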

  3. M. Goodwin, "Adaptive Signal Models: Theory, Algorithms and Audio Applications", Ph.D. Dissertation, University of California - Berkeley, Fall 1997
  4. S. N. Levine, J. O. Smith III, "A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch Scale Modifications", 105th AES Conv., San Francisco, 1998

    Summary:
    This paper introduces a new method for audio compression based on segregation into sinusoids, transients and noise. The sinusoidal model is straightforward. Special care is taken with phase matching between sines and transients. The transient detector uses a conventional frame-based energy measure as well as the ratio of the short-time energies of the residual and the original signal. Above 5 kHz the signal is modeled either as a transform-coded transient or as Bark-band filtered noise, depending on the state of the transient detector. Noise is modeled using a six-channel Bark-band filter bank from 5 to 16 kHz, and its six envelopes are subsampled and encoded.
    Comments:
    Very good time-scale modification (TSM) results: no echoing on the transients and no smearing in the steady state. Maybe we can use the transient detector and the phase matching technique. I don't see an obvious way to look at this from a scene-analysis point of view.
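    Sketch:
    A minimal version of the two detector cues mentioned in the summary: a frame-based energy rise and the ratio of residual to original short-time energy. The frame size, hop and both thresholds are hypothetical values of mine, and the residual is assumed to come from some sinusoidal model that is not shown here.

```python
import numpy as np

def frame_energies(x, frame_len=256, hop=128):
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def transient_frames(original, residual, ratio_threshold=0.5,
                     rise_threshold=2.0, eps=1e-12):
    """Indices of frames flagged as transient (thresholds are assumptions)."""
    e_orig = frame_energies(original)
    e_res = frame_energies(residual)
    # Cue 1: the sinusoidal model explains little of the frame.
    ratio = e_res / (e_orig + eps)
    # Cue 2: frame energy jumps relative to the previous frame.
    rise = np.concatenate([[1.0], e_orig[1:] / (e_orig[:-1] + eps)])
    return np.flatnonzero((ratio > ratio_threshold) & (rise > rise_threshold))

# Toy input: a steady tone plus a short click; the residual is the click,
# i.e. what a perfect sinusoidal model would leave behind.
t = np.arange(4096)
tone = 0.1 * np.sin(2 * np.pi * 440.0 * t / 44100.0)
click = np.zeros(4096)
click[2000:2020] = 1.0
flagged = transient_frames(tone + click, click)
```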

  5. S. N. Levine, "Audio Representations for Data Compression and Compressed Domain Processing", Ph.D. Dissertation, Stanford University, December 1998

    Summary:
    During the attack portion of the signal, transform coding is used for about 66 milliseconds between 0 and 5 kHz, but for only 29 milliseconds between 5 and 16 kHz. The remainder of the 66 milliseconds from 5 to 16 kHz is encoded using noise modeling. During the non-transient regions, multiresolution sinusoidal modeling is used below 5 kHz and Bark-band noise modeling is used from 0 to 16 kHz. Given more time and more complexity in the software, the transient MDCT coefficients could be pruned in time-frequency so as to match the smoother characteristics of the impulse response of a continuous wavelet transform. The noise model is a combination of additive and residual noise models (only during non-transient regions). From 0 to 5 kHz, residual noise modeling is performed on the residual of the original signal minus the synthesized sinusoids. From 5 to 16 kHz, only additive noise modeling is performed, on the original signal itself. To maintain a low total bitrate, it is assumed that all non-transient regions from 5 to 16 kHz are noise-like. The Bark-band noise model is the only one whose noise seemed to fuse with the sinusoids.
    Comments:
    The most relevant part of the thesis is time-frequency pruning applied in transient modeling.

  6. T. Verma, S. Levine, T. Meng, "Transient Modeling Synthesis: a flexible analysis/synthesis tool for transient signals", International Computer Music Conference, Greece, 1997.

    Summary:
    This paper introduces the Transient Modeling Synthesis (TMS) model for transient analysis. It is designed to complement the existing Spectral Modeling Synthesis (SMS) framework. The basic idea underlying TMS is the duality between time and frequency: TMS is the frequency-domain dual of sinusoidal modeling. A large frame (1 second) containing the transient is first transformed to the frequency domain with a DCT. In this domain transients become sinusoids, which are easy to model using the existing sines+noise model. Transients at the beginning of the frame appear as low-frequency sinusoids and transients towards the end as high-frequency ones. The paper also defines a residual-based transient detector, which is optional since TMS itself detects transients quite well.
    Comments:
    Can we combine the DCT with the DFT? What kind of transform do we get? It goes back to "time", so the concept is similar to the cepstral transform but without the logarithmic nonlinearity. As a first step, maybe we can detect the transients, transform them with the DCT, and put them back into the original signal. Then we can pass the whole signal (the original plus the DCT-transformed parts) through the sines+noise model to get a continuous representation. I still think the additional flexibility of linear prediction in the spectral domain, which the TNS paper introduced, is more promising. Try the standard time-scale modification techniques on TNS.
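    Sketch:
    A small numerical check of the time/frequency duality described in the summary: the DCT of a single click is a cosine across the coefficient index, and the later the click sits in the frame, the faster that cosine oscillates. The frame length and click positions are arbitrary choices of mine, and the explicit O(N^2) DCT-II is only for demonstration.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II via an explicit matrix (fine for a small demo)."""
    n = len(x)
    k = np.arange(n)[:, None]   # coefficient index
    m = np.arange(n)[None, :]   # time index
    return np.cos(np.pi * (2 * m + 1) * k / (2 * n)) @ x

def zero_crossings(v):
    return int(np.count_nonzero(np.diff(np.sign(v)) != 0))

N = 256
onsets = [8, 64, 200]           # hypothetical click positions in the frame
counts = []
for onset in onsets:
    frame = np.zeros(N)
    frame[onset] = 1.0          # idealized transient: a single click
    coeffs = dct_ii(frame)      # a cosine in k with "frequency" ~ onset
    counts.append(zero_crossings(coeffs))

print(list(zip(onsets, counts)))
```

    The number of sign changes across the DCT coefficients grows with the click position, which is why a sinusoidal tracker applied in the DCT domain can recover transient locations.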

  7. T. Verma, T. Meng, "Extending Spectral Modeling Synthesis with Transient Modeling Synthesis", Computer Music Journal, Vol. 24, No. 2, Summer 2000.

    Summary:
    A more recent version of the original Verma paper. It also includes how to perform pitch shifting and time stretching.
    Comments:
    It is higher level, but it still doesn't really explain the math.

  8. T. Verma, T. Meng, "Time Scale Modification Using a Sines+Transients+Noise Signal Model" 1998 Digital Audio Effects Workshop (DAFX98). Barcelona, Spain, 19-21 November 1998
  9. T.S. Verma, "A Perceptually Based Audio Signal Model with Application to Scalable Audio Compression", Ph.D. Dissertation, Stanford University, October 1999
  10. P. Masri, "Computer Modeling of Sound for Transformation and Synthesis of Musical Signals", Ph.D. Thesis, University of Bristol, 1996

    Summary:
    Chapter 5 includes three transient detectors based on energy distribution, attack envelope and spectral dissimilarity. The peaks of those detectors are used for the STFT frame synchronization. For more info also read the other Masri paper.

  11. P. Masri, A. Bateman, "Improved Modeling of Attack Transients in Music Analysis-Resynthesis", Proc. ICMC, pp.100-103, Hong Kong, Aug 1996.

    Summary:
    This paper introduces a method to improve the representation of transients in the sines+noise model. The main principle is that the spectra before and after an attack should be treated as different and should not be included in the same analysis frame, which would lead to spectral averaging (smearing). As a preprocessing step, it first detects the transients using two spectrum-derived metrics from an STFT with a very short window (length 2.9 ms, hop 1.5 ms), which allows very fine localization of the transients. Analysis and synthesis then happen in sync with the transient locations. It still fails to capture transients like a booming bass drum or drum rolls.
    Comments:
    Following the paper's suggestion for future directions, we can work on modeling the time domain envelope of the transient and impose it on the synthesized output. This is very similar to Temporal Noise Shaping (TNS) as found in the MPEG-AAC coding scheme. Another future direction is the use of higher order spectra (Wigner-Ville, bispectrum) which might offer better representation for transients.
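    Sketch:
    A rough illustration of locating an attack with a very short STFT window. At an assumed 44.1 kHz sampling rate, 128 and 64 samples correspond to about 2.9 ms and 1.5 ms. The metric below (summed positive magnitude change between consecutive frames) is a generic stand-in of mine, not the two metrics defined in the paper.

```python
import numpy as np

def spectral_dissimilarity(x, win_len=128, hop=64):
    """Summed positive magnitude change between consecutive short frames."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    rise = np.maximum(mags[1:] - mags[:-1], 0.0)
    return rise.sum(axis=1)     # entry j compares frame j+1 with frame j

def detect_onset_sample(x, win_len=128, hop=64):
    """Start sample of the frame whose spectrum grew the most."""
    d = spectral_dissimilarity(x, win_len, hop)
    return (int(np.argmax(d)) + 1) * hop

# Toy input: faint noise with a decaying burst starting at sample 3000.
rng = np.random.default_rng(0)
x = 0.001 * rng.standard_normal(8192)
n = np.arange(400)
x[3000:3400] += np.sin(2 * np.pi * 0.1 * n) * np.exp(-n / 100.0)
onset = detect_onset_sample(x)
```

    The detected position could then be used to place analysis frame boundaries so that no frame straddles the attack.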

  12. P. Masri, A. Bateman, "Identification of Nonstationary Audio Signals Using the FFT, with Application to Analysis-based Synthesis of Sound", Proc. IEE Colloquium on Audio Engineering, pp.11/1 - 11/6, London, May 1995.

    Summary:
    Non-stationary intra-frame signals can be modeled as linear FM or exponential AM. One can extract second-order phase information from the main and side lobes of the frame's FFT. The frequency and log-amplitude derivatives calculated this way allow cubic interpolation synthesis, so some of the dynamic information lost by using long FFT windows can be regained.
    Comments:
    It's not clear how to apply this on very fast changing transient sounds.

  13. M. Link, "An Attack Processing of Audio Signals for Optimizing the Temporal Characteristics of a Low Bit-Rate Audio Coding System", Proc. 95th AES Conv., New York, Oct 1993

    Summary:
    This paper introduces the concept of Gain Modification for better coding of transients. During encoding the temporal envelope is extracted, and the transient is multiplied by the inverse of the envelope and thus smoothed. The amplitude modulation that is introduced helps code the coefficients more efficiently. The steepness of the extracted envelope is limited to 0.5 ms to shorten the resulting AM side lobes. At decoding, the frame is multiplied by the envelope and most of the transient shape is restored. Pre-echo artifacts are also minimized.
    Comments:
    A very simple and powerful method. In practice Temporal Noise Shaping gives better results because it operates in the frequency domain, so it can modify frequency-specific micro-transients.
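    Sketch:
    A toy round trip of the gain modification idea from the summary: flatten the frame by dividing by a temporal envelope, then restore it by multiplying at the decoder. The moving-average envelope, its length and the floor value are assumptions of mine; the paper's envelope extraction and its 0.5 ms steepness limit are not reproduced, and no actual coder sits between the two steps.

```python
import numpy as np

def temporal_envelope(x, smooth_len=32, floor=1e-3):
    """Crude envelope: moving average of |x|, floored to avoid division by ~0."""
    kernel = np.ones(smooth_len) / smooth_len
    env = np.convolve(np.abs(x), kernel, mode="same")
    return np.maximum(env, floor)

def encode_gain_modified(x):
    env = temporal_envelope(x)
    # The flattened signal would go to the coder; env is side information.
    return x / env, env

def decode_gain_modified(flattened, env):
    return flattened * env

# Toy transient: faint noise followed by a sudden, decaying burst.
rng = np.random.default_rng(1)
n = np.arange(1024)
gain = np.full(1024, 0.01)
gain[512:] += np.exp(-(n[512:] - 512) / 50.0)
x = gain * rng.standard_normal(1024)

flattened, env = encode_gain_modified(x)
restored = decode_gain_modified(flattened, env)
```

    Because the envelope is applied to the whole frame, the shaping is the same at every frequency, which is the limitation the comment above contrasts with TNS.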

  14. J. Nieuwenhuijse, R. Heusdens, Ed F. Deprettere, "Robust exponential modeling of audio signals", Proc. IEEE Acoustics, Speech and Signal Processing, Vol. 6, pp. 3581-3584, 1998

    Summary:
    This paper models transients as sums of exponentially decaying sinusoids. It derives the transformation matrix for the ideal case, that is, a signal that really is a sum of such sinusoids. It then proves that a good approximation is obtained for real-life audio signals. The algorithm is very sensitive to where the beginning of the transient is defined: if there is silence at the beginning of the frame, the expansion needs too many coefficients.
    Comments:
    Vafin's paper uses matching pursuit with an exponential decay dictionary, which is a better approach.

  15. R. Vafin, R. Heusdens, S. van de Par, W. B. Kleijn, "Improved Modeling of Audio Signals by Modifying Transient Locations", Proc. WASPAA'01, pp. 143-146, New Paltz, NY, USA, 2001
  16. R. Vafin, R. Heusdens, W. B. Kleijn, "Modifying Transients for Efficient Coding of Audio", Proc. IEEE ICASSP'01, Vol. 5, pp. 3285-3288, Salt Lake City, Utah, USA, 2001

    Summary:
    Modifies transient locations using a DCT.

  17. A.J.S. Ferreira, "Combined Spectral Envelope Normalization and Subtraction of Sinusoidal Components in the ODFT and MDCT Frequency Domains", WASPAA, NY, 21-24 October 2001
  18. G.H. Wakefield, L.M. Heller, L.H. Carney, M. Mellody, "On the Perception of Transients: Applying Psychophysical Constraints to Improve Audio Analysis and Synthesis", Proc. ICMC2000, Berlin, August, 2000

  19. X. Amatriain, J. Bonada, A. Loscos, X. Serra, "Spectral Modeling for Higher-level Sound Transformations", No publication information found
  20. X. Rodet, P. Depalle, "Spectral Envelopes and Inverse FFT Synthesis", Proc. 93rd  AES Conv., San Francisco, Oct 1992


    II. Temporal Noise Shaping (TNS)

  22. J. Herre, J. D. Johnston, "Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS)", Proc. 101st AES Conv., Los Angeles, Nov 1996

    Summary:
    High-quality coding of pseudo-stationary and transient-type signals places different requirements on the analysis filterbank: a high-resolution uniform filterbank is appropriate for the first case and a critical-band structure for the second. This paper introduces the concept of Temporal Noise Shaping, which adaptively matches the analysis filterbank to the signal's characteristics. In the same way that time-domain prediction efficiently encodes spiky spectra, frequency-domain prediction efficiently encodes spiky time envelopes, meaning transients. A good intuitive explanation is given through Hilbert envelopes. Another way to see it is to think of the LPC filtering of the spectral coefficients as a convolution, which corresponds to multiplication in the time domain, meaning time-domain envelope shaping. This method has been adopted in MPEG-2 AAC.
    Comments:
    Use Hilbert envelopes in different parts of the spectrum. They can be computed using the formula based on the inverse Fourier transform of the autocorrelation of the spectral coefficients. Gain modification (time-envelope preprocessing) is not as flexible as this, since it operates across all frequencies.
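    Sketch:
    A small experiment on the core claim in the summary, that linear prediction across spectral coefficients pays off for transients. It compares the prediction gain on the DCT coefficients of a short burst with that of stationary noise. A plain DCT-II stands in for the coder's filterbank, and the predictor order, frame length and burst position are assumptions of mine, not values from the paper.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II via an explicit matrix (fine for a small demo)."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * m + 1) * k / (2 * n)) @ x

def lpc_prediction_gain(c, order=8):
    """Autocorrelation-method LPC run along the coefficient index.

    Returns (predictor coefficients, input energy / residual energy).
    """
    r = np.array([np.dot(c[:len(c) - lag], c[lag:]) for lag in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)    # tiny regularization for stability
    a = np.linalg.solve(R, r[1:])
    pred = np.zeros_like(c)
    for lag in range(1, order + 1):
        pred[lag:] += a[lag - 1] * c[:len(c) - lag]
    residual = c - pred
    return a, np.dot(c, c) / np.dot(residual, residual)

rng = np.random.default_rng(0)
N = 512
stationary = rng.standard_normal(N)         # flat temporal envelope
transient = np.zeros(N)
transient[300:308] = rng.standard_normal(8)  # energy packed into 8 samples

_, gain_stationary = lpc_prediction_gain(dct_ii(stationary))
_, gain_transient = lpc_prediction_gain(dct_ii(transient))
print(gain_stationary, gain_transient)
```

    The stationary frame should give a gain near 1 (nothing to predict), while the burst should give a much larger gain, mirroring how time-domain LPC behaves on a peaky spectrum.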

  23. J. Herre, J. D. Johnston, "Continuously signal-adaptive filterbank for high-quality perceptual audio coding", IEEE ASSP Workshop, 1997

    Summary:
    This paper builds on the original TNS paper. It describes TNS as a continuously adaptive filterbank.
    Comments:
    It's basically a more condensed version of the original TNS paper.

  24. J. Herre, J. D. Johnston, "Exploiting Both Time and Frequency Structure in a System that Uses an Analysis/Synthesis Filterbank with High Frequency Resolution", Proc. 103rd  AES Conv., New York, Sept 1997
  25. J. Herre, "Temporal Noise Shaping, Quantization and Coding Methods in Perceptual Audio Coding: A Tutorial Introduction", AES 17th Int. Conf. on High Quality Audio Coding
  26. E. Allamanche, R. Geiger, J. Herre, T. Sporer, "MPEG-4 Low Delay Audio Coding based on the AAC Codec", Proc. 106th  AES Conv., Munich, Germany, May 1999

    Summary:
    The low-delay codec in MPEG-4 v2 does not rely on window switching for pre-echo control during transients, because switching decisions introduce additional delay, which is unacceptable in this case. Instead it uses a fixed-length window and incorporates TNS for pre-echo control. A new low-overlap TDAC window is also introduced, which gives better results on transient signals.
    Comments:
    I found the low overlap window equation in the MPEG-4 v2 specifications and it's available in Matlab.
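    Sketch:
    A generic low-overlap window with a check of the time-domain aliasing cancellation (Princen-Bradley) condition. This is not the equation from the MPEG-4 v2 specification; the zero/slope/flat layout, the sine slope and the parameter names are assumptions of mine, meant only to show what "low overlap" means for a TDAC window.

```python
import numpy as np

def low_overlap_window(frame_len, overlap_len):
    """Window of length 2*frame_len: zeros, a sine slope spanning
    overlap_len samples, a flat top, then the mirror image."""
    if not 0 < overlap_len <= frame_len or (frame_len - overlap_len) % 2:
        raise ValueError("need 0 < overlap_len <= frame_len with an even difference")
    n_zero = (frame_len - overlap_len) // 2
    slope = np.sin(0.5 * np.pi * (np.arange(overlap_len) + 0.5) / overlap_len)
    rising = np.concatenate([np.zeros(n_zero), slope, np.ones(n_zero)])
    return np.concatenate([rising, rising[::-1]])

def satisfies_princen_bradley(w):
    """TDAC condition: w[n]^2 + w[n + M]^2 == 1 over one hop of M samples."""
    m = len(w) // 2
    return bool(np.allclose(w[:m] ** 2 + w[m:] ** 2, 1.0))

w_low = low_overlap_window(512, 128)    # overlap reduced to 128 samples
w_sine = low_overlap_window(512, 512)   # full overlap: ordinary sine window
print(satisfies_princen_bradley(w_low), satisfies_princen_bradley(w_sine))
```

    With the shorter slope, quantization noise from a frame spreads over fewer samples of the neighbouring frame, which is the motivation for the reduced overlap on transient signals.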



    III. Warped Linear Prediction

  28. A. Harma, U.K. Laine, "A Comparison of Warped and Conventional Linear Predictive Coding", IEEE Tran. on Speech and Audio Processing, Vol. 9, No. 5, July 2001
  29. A. Harma, "Perceptual aspects and warped techniques in audio coding", Master's Thesis, Helsinki University of Technology, 1997
  30. A. Harma, "Audio coding with warped predictive methods", Licentiate's Thesis, Helsinki University of Technology, 1998
  31. A. Harma, M. Vaalgamaa, U.K. Laine, "A warped linear predictive stereo codec using temporal noise shaping", Proc. Nordic Signal Proc. Symposium, NORSIG'98, pages 229-232, Denmark, June 1998
  32. R. Yu, C.C. Ko, "A Warped Linear-Prediction-Based Subband Audio Coding Algorithm", IEEE Tran. on Speech and Audio Processing, Vol. 10, No. 1, January 2002
  33. A. Harma, T. Paatero, "Discrete representation of signals on a logarithmic frequency scale", Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'01), pages 135-138, New Paltz, NY, USA, September 2001.
  34. A. Harma, M. Juntunen, P. Kaipio, "Time-varying autoregressive modeling of speech and audio signals", In Signal Processing X: Theories and Applications, pages 2037-2040, Tampere, Finland, September 2000. EURASIP
  35. J.O. Smith III, J.S. Abel, "Bark and ERB Bilinear Transforms", IEEE Tran. on Speech and Audio Processing, Vol. 7, No. 6, November 1999


    IV. Modified Discrete Cosine Transform (MDCT)

  37. J. Princen, A. Johnson, A. Bradley, "Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation", Proc. of the ICASSP 1987, pp 2161-2164
  38. T. Sporer, K. Brandenburg, B. Edler, "The use of multirate filter banks for coding of high quality digital audio", 6th European Signal Processing Conference (EUSIPCO), Amsterdam, June 1992, Vol.1 pp. 211-214.
  39. R. Gluth, "Regular FFT-Related Transform Kernels for DCT/DST-based polyphase filter banks", ICASSP 91, pp. 2205-8 vol.3
  40. A.J. Ferreira, "Spectral Coding and Post-Processing of High Quality Audio",  Ph.D. Thesis, University of Porto, 1998
  41. Y. Wang, L. Yaroslavsky, M. Vilermo, M. Vaananen, "Some Peculiar Properties of the MDCT", WCC2000 - 16th IFIP World Computer Congress/ICSP 2000 - 5th International Conference on Signal Processing, August 21-25, 2000, Beijing, China
  42. L. Bosse, "An Experimental High Fidelity Perceptual Audio Coder", Stanford University CCRMA, Music 420 Project, March 1998
  43. V. Vanhoucke, "Block Artifact Cancellation in DCT Based Image Compression", Project Report, Stanford University, 2001
  44. H. Karmarkar, "Fast Algorithms for the Modified Discrete Cosine Transform", Master's Thesis, Indian Institute of Technology, Bombay (No year info)
  45. V. Nikolajevic, G. Fettweis, "New Recursive Algorithms for the Forward and Inverse MDCT", Proc. IEEE Workshop on Signal Processing Systems (SiPS), Antwerp, Belgium, 26-28 September 2001
  46. H.S. Malvar, "A modulated complex lapped transform and its applications to audio processing", Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, March 1999
  47. H.S. Malvar, "Biorthogonal and nonuniform lapped transforms for transform coding with reduced blocking and ringing artifacts", IEEE Trans. Signal Processing, April 1998, pp. 1043-1053
  48. H.S. Malvar, "Extended lapped transforms: properties, applications, and fast algorithms", IEEE Transactions on Signal Processing, vol. 40, pp. 2703-2714, Nov. 1992
  49. H. S. Malvar, "Extended lapped transforms: fast algorithms and applications", IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, pp. 1797-1800, May 1991


    V. Harmonic plus Individual Lines and Noise (HILN)

  51. B. Edler, H. Purnhagen, "Parametric Audio Coding", 5th International Conference on Signal Processing (ICSP 2000), Beijing, August 2000
  52. H. Purnhagen, "Advances in Parametric Audio Coding" IEEE WASPAA, New Paltz, New York, Oct. 17-20, 1999
  53. H. Purnhagen, N. Meine, "HILN - The MPEG-4 Parametric Audio Coding Tools", IEEE International Symposium on Circuits and Systems (ISCAS 2000), Geneva, May 2000
  54. H. Purnhagen, "An Overview of MPEG-4 Audio Version 2", AES 17th International Conference on High-Quality Audio Coding, Florence, Italy, 2-5 September 1999
  55. H. Purnhagen, N. Meine, B. Edler, "Speeding up HILN - MPEG-4 Parametric Audio Encoding with Reduced Complexity", 109th AES Convention, Los Angeles, September 2000


    VI. Miscellaneous

  57. D.P.W. Ellis, "Detecting Alarm Sounds", CRAC Workshop, Aalborg, Denmark, September 2001

    Comments:
    This is a very promising application for parametric modeling tools. I am planning to apply my model to it.

  58. T. Painter, A. Spanias, "Perceptual coding of digital audio", Proc. of the IEEE , Volume: 88 Issue: 4, pp. 451-515, April 2000

    Summary:
    This is a very comprehensive review of perceptual coding algorithms and standards, and is itself a summary. For the purposes of this project, the section to read is E, Pre-Echo Control Strategies. The methods we care about most are Window Switching, Gain Modification and Temporal Noise Shaping.

  59. P. Herrera, X. Amatriain, E. Batlle, X. Serra, "Towards instrument segmentation for music content description: a critical review of instrument classification techniques", International Symposium on Music Information Retrieval, Plymouth, Massachusetts, Oct 2000

  60. B. Logan, "Mel Frequency Cepstral Coefficients for Music Modeling", Proc. ISMIR 2000
  61. G. Tzanetakis, G. Essl, P. Cook, "Automatic Musical Genre Classification", Proc. ISMIR 2001


    VII. Other Project Ideas (Not pursued)

  63. Y. Stylianou, O. Cappe, E. Moulines, "Continuous Probabilistic Transform for Voice Conversion", IEEE Trans on Speech and Audio Processing, Vol. 6, No. 2, March 1998
  64. A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless, D.H. Salesin, "Image Analogies", Unknown publication info
