Reviews of papers relevant to the e6820 final project can be found on this page. You can click on the title of each paper to download it. If you are one of the authors and would like your paper taken off this list for any reason, send me an email and I'll promptly remove it.
The papers are organized in the following conceptual groups for easier browsing.
Summary:
This paper is a comprehensive summary of all the ideas behind sound modeling using the sines+noise model. Magnitude and phase spectra are calculated. Phase is only used to determine the residual part (deterministic subtraction) and is discarded after that. The deterministic part is extracted using long windows for good frequency resolution, and the stochastic part using short windows for better performance during attacks. The stochastic part is modeled as white noise modulated by a line-segment spectral approximation. This is preferred over LPC because of its ease of manipulation (in LPC one needs to modify poles). Synthesis is performed using the IFFT for efficiency.
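The stochastic part described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the breakpoint placement (one per equal-width segment, set to the segment's mean magnitude) and the names `line_segment_envelope` and `synth_noise_frame` are my own assumptions, and a windowed random frame stands in for a real residual.

```python
import numpy as np

def line_segment_envelope(mag, n_segments):
    """Approximate a magnitude spectrum by straight lines between breakpoints."""
    edges = np.linspace(0, len(mag), n_segments + 1).astype(int)
    # One breakpoint per segment: bin centre and mean magnitude (illustrative choice).
    centers = 0.5 * (edges[:-1] + edges[1:] - 1)
    values = np.array([mag[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])
    return np.interp(np.arange(len(mag)), centers, values)

def synth_noise_frame(envelope, rng):
    """White noise shaped by the envelope: envelope magnitudes, random phases, inverse FFT."""
    phases = rng.uniform(-np.pi, np.pi, len(envelope))
    spectrum = envelope * np.exp(1j * phases)
    spectrum[0] = envelope[0]    # DC bin of a real signal must be real
    spectrum[-1] = envelope[-1]  # so must the Nyquist bin
    return np.fft.irfft(spectrum)

rng = np.random.default_rng(0)
residual = rng.standard_normal(512) * np.hanning(512)  # stand-in for a residual frame
mag = np.abs(np.fft.rfft(residual))                    # 257 magnitude bins
env = line_segment_envelope(mag, n_segments=16)        # 16 numbers describe the frame
frame = synth_noise_frame(env, rng)                    # 512 samples of shaped noise
```

Only the breakpoints need to be stored or manipulated, which is the ease-of-manipulation argument made against LPC above. The phase of the residual is thrown away, as in the paper.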
Summary:
This paper models the residual noise of Sines+Noise using Equivalent Rectangular Bands (ERBs). It improves the realism of the frequency-domain additive synthesis. It is primarily effective for representing the breath noise that appears in the sinusoidal analysis-synthesis residuals of wind instruments. Because the phase is not carefully treated, the ERB model muddies the transients.
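As a rough illustration of band-wise residual modeling, the sketch below sums the power spectrum of a frame in bands that are equally spaced on an ERB-rate scale. The Glasberg and Moore (1990) ERB-rate formula, the band count and the function names are assumptions of mine; the paper may define its bands differently.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore, 1990)."""
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def erb_band_energies(frame, sr, n_bands):
    """Sum the power spectrum inside n_bands bands equally spaced in ERB-rate."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges_hz = erb_rate_to_hz(np.linspace(0.0, hz_to_erb_rate(sr / 2.0), n_bands + 1))
    # Index of the band each FFT bin falls into (last bin clipped into the top band).
    band = np.clip(np.searchsorted(edges_hz, freqs, side="right") - 1, 0, n_bands - 1)
    return np.bincount(band, weights=power, minlength=n_bands), edges_hz

rng = np.random.default_rng(0)
noise_frame = rng.standard_normal(1024)  # stand-in for a residual frame
energies, edges_hz = erb_band_energies(noise_frame, sr=44100, n_bands=24)
```

Resynthesis would then fill each band with noise of the stored energy. Since only one energy per band and frame is kept, the fine time structure inside a frame is lost, which is consistent with the smeared transients noted above.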
Summary:
This paper introduces a new method for audio compression based on segregation into sinusoids, transients and noise. The sinusoidal model is straightforward. Special care is taken for phase matching between sines and transients. The transient detector uses a conventional frame-based energy measure as well as the ratio of the short-time energies of the residual and the original signal. Above 5 kHz the signal is modeled as either a transform-coded transient or as Bark-band filtered noise, depending on the state of the transient detector. Noise is modeled using a six-channel Bark filter from 5-16 kHz, and its six envelopes are subsampled and encoded.
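A minimal sketch of a detector that combines the two measures mentioned above. The frame sizes, both thresholds and the way the measures are combined (a logical AND) are guesses for illustration and are not taken from the paper; the residual is assumed to be available from a sinusoidal analysis step, which is faked here.

```python
import numpy as np

def frame_energy(x, frame_len, hop):
    """Short-time energy of x in overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)])

def detect_transients(original, residual, frame_len=512, hop=256,
                      rise_thresh=4.0, ratio_thresh=0.5):
    """Flag frames where energy jumps AND the residual carries most of the energy."""
    eps = 1e-12
    e_orig = frame_energy(original, frame_len, hop)
    e_res = frame_energy(residual, frame_len, hop)
    # Measure 1: frame energy relative to the previous frame.
    rise = np.ones_like(e_orig)
    rise[1:] = (e_orig[1:] + eps) / (e_orig[:-1] + eps)
    # Measure 2: share of the energy the sinusoidal model failed to capture.
    ratio = e_res / (e_orig + eps)
    return (rise > rise_thresh) & (ratio > ratio_thresh)

# Toy signal: a steady 440 Hz tone with a decaying click starting at sample 4000.
sr, n = 16000, 8192
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(n) / sr)
k = np.arange(200)
click = np.zeros(n)
click[4000:4200] = 4.0 * np.exp(-k / 50.0) * (-1.0) ** k
original = tone + click
residual = click  # assumes the sinusoidal model removed the tone exactly
flags = detect_transients(original, residual)
print(np.flatnonzero(flags) * 256)  # start samples of the flagged frames
```

Only the frame in which the click first appears is flagged, because the energy-rise measure falls back to about one as soon as the click is also present in the previous frame.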
Comments:
Very good TSM results. No echoing on the transients and no smearing on the steady state. Maybe we can use the transient detector and the phase matching technique. I don't see any obvious way to look at this from a scene analysis point of view.
Summary:
During the attack portion of the signal, transform coding is used for about 66 milliseconds between 0 and 5 kHz, but for only 29 milliseconds between 5 and 16 kHz. The remainder of the 66 msec from 5-16 kHz is encoded using noise modeling. During the non-transient regions, multiresolution sinusoidal modeling is used below 5 kHz and Bark-band noise modeling is used from 0-16 kHz. Given more time and more complexity in the software, the transient MDCT coefficients in time-frequency could be pruned in such a way as to match the smoother characteristics of the impulse response of a continuous wavelet transform. The noise model used is a combination of additive and residual noise models (only during non-transient regions). From 0 to 5 kHz, residual noise modeling is performed on the residual of the original signal minus the synthesized sinusoids. From 5 to 16 kHz, only additive noise modeling is performed on the original signal itself. To maintain a low total bitrate, it is assumed that all non-transient regions from 5 to 16 kHz are noise-like. The Bark-band noise model is the only one whose noise seemed to fuse with the sinusoids.
Comments:
The most relevant part of the thesis is time-frequency pruning applied in transient modeling.
Summary:
This paper introduces the Transient Modeling Synthesis (TMS) model for transient analysis. It is designed to enhance / complement the existing Spectral Modeling Synthesis (SMS) framework. The basic idea underlying TMS is the duality between time and frequency. TMS is the frequency domain dual to sinusoidal modeling. A large frame (1 second) containing the transient is first transformed to the frequency domain with the DCT. In this domain transients become sinusoids, which are easy to model using the existing Sines+Noise model. Transients at the beginning of the frame appear as low-frequency sinusoids and transients towards the end as high-frequency ones. The paper also defines a residual-based transient detector, which is optional since TMS detects transients quite well on its own.
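The time/frequency duality is easy to check numerically. The sketch below uses an unnormalized DCT-II written out with NumPy (an assumption; the paper may use a different DCT variant or scaling) and counts sign changes in the DCT of a single impulse. For an impulse at sample n0 the DCT is cos(pi (2 n0 + 1) k / (2N)), so the number of sign changes along k grows with n0.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] cos(pi (2n + 1) k / (2N))."""
    n = np.arange(len(x))
    k = n[:, None]
    return np.cos(np.pi * (2 * n + 1) * k / (2 * len(x))) @ x

def sign_changes(v):
    """Number of sign changes between consecutive entries of v."""
    return int(np.sum(np.signbit(v[1:]) != np.signbit(v[:-1])))

N = 256
early = np.zeros(N)
early[10] = 1.0    # impulse near the start of the frame
late = np.zeros(N)
late[200] = 1.0    # impulse near the end of the frame

print(sign_changes(dct2(early)))  # 10: a slow oscillation along the DCT axis
print(sign_changes(dct2(late)))   # 200: a fast oscillation along the DCT axis
```

A sinusoidal tracker run along the DCT axis would therefore report the "frequency" of each component, which maps back to the position of the transient inside the frame.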
Comments:
Can we combine the DCT with the DFT? What kind of transform do we get? It goes back to "time", so it's a similar concept to the cepstral transform but without the logarithmic non-linearity. Maybe as a first step we can detect the transients, transform them with the DCT and put them back in the original signal. Then we can pass the whole signal (the original plus the DCT-transformed parts) through the Sines+Noise model to get a continuous representation. I still think that the additional flexibility of linear prediction in the spectral domain that the TNS paper introduced is more promising. Try the standard time scale modification techniques on the TNS.
Summary:
A more recent version of the original Verma paper. It also includes how to perform pitch shifting and time stretching.
Comments:
It is higher level but it still doesn't really explain the math.
Summary:
Chapter 5 includes three transient detectors based on energy distribution, attack envelope and spectral dissimilarity. The peaks of those detectors are used for the STFT frame synchronization. For more info also read the other Masri paper.
Summary:
This paper introduces a method to improve the representation of transients in the sines+noise model. The main principle is that the spectra before and after an attack should be treated as different and should not be included in the same analysis frame, since that leads to spectral averaging / smearing. It first detects the transients using two spectrum-derived metrics from an STFT with a very short window (length 2.9 ms, hop 1.5 ms). This allows very fine localization of the transients (preprocessing step). The analysis and synthesis then happen in sync with the transient locations. It still fails to capture transients such as a booming bass drum or drum rolls.
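To get a feel for the window sizes: at an assumed sampling rate of 44.1 kHz, 2.9 ms and 1.5 ms round to 128 and 66 samples. The sketch below runs such a short-window STFT and uses a half-wave rectified spectral difference as the detection function. That function is a common stand-in chosen by me, not one of the two metrics defined in the paper.

```python
import numpy as np

def spectral_flux(x, sr, win_ms=2.9, hop_ms=1.5):
    """Positive change in STFT magnitude between consecutive short frames."""
    win = int(round(win_ms * 1e-3 * sr))
    hop = int(round(hop_ms * 1e-3 * sr))
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    mags = np.array([np.abs(np.fft.rfft(window * x[i * hop:i * hop + win]))
                     for i in range(n_frames)])
    # flux[j] compares frame j + 1 with frame j; only increases are counted.
    flux = np.sum(np.maximum(mags[1:] - mags[:-1], 0.0), axis=1)
    return flux, win, hop

# Toy signal: faint noise, then a 5 kHz tone switched on at t = 0.5 s.
sr = 44100
rng = np.random.default_rng(0)
x = 1e-3 * rng.standard_normal(sr)
onset = sr // 2
x[onset:] += np.sin(2 * np.pi * 5000 * np.arange(sr - onset) / sr)

flux, win, hop = spectral_flux(x, sr)
onset_estimate = (np.argmax(flux) + 1) * hop / sr  # start time of the peak frame, in seconds
```

With a hop of 1.5 ms the onset is located to within a couple of milliseconds, which is the kind of localization needed to keep pre-attack and post-attack spectra in separate analysis frames.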
Comments:
Following the paper's suggestion for future directions, we can work on modeling the time domain envelope of the transient and impose it on the synthesized output. This is very similar to Temporal Noise Shaping (TNS) as found in the MPEG-AAC coding scheme. Another future direction is the use of higher order spectra (Wigner-Ville, bispectrum) which might offer better representation for transients.
Summary:
Non-stationary intra-frame signals can be modeled as linear FM or exponential AM. One can extract second order phase information from the main and side lobes of the frame FFT. The frequency and log-amplitude derivatives that can be calculated this way allow cubic interpolation synthesis. Therefore some of the dynamic information that was lost by using long FFT windows can now be regained.
Comments:
It's not clear how to apply this on very fast changing transient sounds.
Summary:
This paper introduces the concept of Gain Modification for better coding of transients. During encoding the temporal envelope is extracted. The transient is then multiplied by the inverse of the envelope and thus smoothed. The amplitude modulation that is introduced helps code the coefficients more efficiently. The steepness of the extracted envelope is limited to 0.5 ms to shorten the resulting AM side lobes. At the decoder the frame is multiplied by the envelope and most of the transient shape is restored. Pre-echo artifacts are also minimized.
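A minimal sketch of the encode/decode round trip. The envelope estimator (a 0.5 ms moving average of the absolute value, at an assumed 44.1 kHz sampling rate) and the additive white noise that stands in for coding noise are simplifications of mine; the paper's envelope extraction and the actual transform coder are not reproduced.

```python
import numpy as np

def temporal_envelope(x, smooth_len, floor=1e-4):
    """Moving average of |x|, floored so that the division below stays safe."""
    kernel = np.ones(smooth_len) / smooth_len
    return np.maximum(np.convolve(np.abs(x), kernel, mode="same"), floor)

sr = 44100
rng = np.random.default_rng(0)
half = 2048
x = rng.standard_normal(2 * half)
x[:half] *= 0.01                                   # quiet lead-in, then a loud attack

env = temporal_envelope(x, smooth_len=int(0.5e-3 * sr))
flat = x / env                                     # encoder: multiply by the inverse envelope
coding_noise = 0.05 * rng.standard_normal(len(x))  # stand-in for coding noise
decoded = (flat + coding_noise) * env              # decoder: multiply by the envelope

# Noise power that lands before the attack, with and without gain modification.
pre_echo_with_gain = np.mean((decoded - x)[:half] ** 2)
pre_echo_without = np.mean(coding_noise[:half] ** 2)
```

Because the decoder scales everything by the envelope, noise added to the flattened signal is attenuated wherever the original was quiet, so `pre_echo_with_gain` comes out far smaller than `pre_echo_without`. The envelope itself has to be transmitted as side information.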
Comments:
Very simple and powerful method. In practice Temporal Noise Shaping gives better results because it operates in the frequency domain, so it can modify frequency-specific micro-transients.
Summary:
This paper models transients as sums of exponentially decaying sinusoids. It derives the transformation matrix for the ideal case, that is, a signal that really is a sum of such sinusoids. It then shows that for real-life audio signals the model gives a good approximation. The algorithm is very sensitive to where the beginning of the transient is defined. If there is silence at the beginning of the frame, the expansion needs too many coefficients.
Comments:
Vafin's paper uses matching pursuit with an exponential decay dictionary. Better approach.
Summary:
Modifies transient locations using a DCT.
G.H. Wakefield, L.M. Heller, L.H. Carney, M. Mellody, "On the Perception of Transients: Applying Psychophysical Constraints to Improve Audio Analysis and Synthesis", Proc. ICMC2000, Berlin, August, 2000
Summary:
High-quality coding of pseudo-stationary and transient signals places different requirements on the analysis filterbank. A high-resolution uniform filterbank is appropriate for the first case and a critical-band structure for the second. This paper introduces the concept of Temporal Noise Shaping, which adaptively matches the analysis filterbank to the signal's characteristics. In the same way that time-domain prediction efficiently encodes spiky spectra, frequency-domain prediction encodes spiky time envelopes, meaning transients. A good intuitive explanation is given through Hilbert envelopes. Another way to see it is to think of the LPC filtering of the spectral coefficients as a convolution, which corresponds to multiplication in the time domain, meaning time-domain envelope shaping. This method has been adopted in MPEG-2 AAC.
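The claim that frequency-domain prediction pays off for transients can be checked with a toy experiment. The sketch below fits a linear predictor along the frequency axis of the DCT coefficients and compares prediction gains. The DCT-II (instead of the codec's filterbank), the order of 8 and the autocorrelation-method solver are illustrative choices of mine; quantization and the decoder-side synthesis filter are left out.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II, written out with NumPy."""
    n = np.arange(len(x))
    return np.cos(np.pi * (2 * n + 1) * n[:, None] / (2 * len(x))) @ x

def lpc(c, order):
    """Autocorrelation-method predictor a, so that c[k] ~ sum_i a[i-1] * c[k-i]."""
    r = np.array([np.dot(c[:len(c) - i], c[i:]) for i in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # tiny regularization for numerical safety
    return np.linalg.solve(R, r[1:])

def prediction_gain(c, order):
    """Energy of the coefficients divided by energy of the prediction residual."""
    a = lpc(c, order)
    pred = np.zeros_like(c)
    for i, ai in enumerate(a, start=1):
        pred[i:] += ai * c[:-i]       # predict each coefficient from lower-frequency ones
    return np.sum(c ** 2) / np.sum((c - pred) ** 2)

N, order = 256, 8
rng = np.random.default_rng(0)
transient = 1e-3 * rng.standard_normal(N)
transient[100:110] += rng.standard_normal(10)  # short burst in an otherwise quiet frame
stationary = rng.standard_normal(N)            # no temporal structure at all

gain_transient = prediction_gain(dct2(transient), order)
gain_stationary = prediction_gain(dct2(stationary), order)
```

The gain for the burst comes out well above one, while for stationary noise it stays close to one: a spiky time envelope makes the spectral coefficients predictable from their neighbours, just as a spiky spectrum makes time samples predictable.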
Comments:
Use Hilbert envelopes in different parts of the spectrum. They can be computed with the formula that gives the squared Hilbert envelope as the inverse Fourier transform of the autocorrelation of the spectral coefficients. Gain modification (time envelope preprocessing) is not as flexible as this, since it operates across all frequencies.
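The envelope formula can be verified numerically. The sketch below computes the squared Hilbert envelope twice: directly from the analytic signal, and as the inverse DFT of the circular autocorrelation of the one-sided spectrum. The optional `band` argument (a range of DFT bins) is my own addition to show the per-band use suggested above; the function names are hypothetical.

```python
import numpy as np

def analytic_spectrum(x, band=None):
    """One-sided DFT of a real, even-length signal, optionally limited to bins [lo, hi)."""
    N = len(x)
    weights = np.zeros(N)
    weights[0] = 1.0
    weights[N // 2] = 1.0
    weights[1:N // 2] = 2.0  # double positive frequencies, zero negative ones
    Z = np.fft.fft(x) * weights
    if band is not None:
        mask = np.zeros(N)
        mask[band[0]:band[1]] = 1.0
        Z = Z * mask
    return Z

def envelope_direct(x, band=None):
    """Squared magnitude of the (band-limited) analytic signal."""
    return np.abs(np.fft.ifft(analytic_spectrum(x, band))) ** 2

def envelope_from_autocorrelation(x, band=None):
    """Same envelope, obtained from the autocorrelation of the spectral coefficients."""
    Z = analytic_spectrum(x, band)
    N = len(Z)
    # Circular autocorrelation along frequency: R[m] = (1/N) sum_k Z[k] conj(Z[k - m]).
    R = np.array([np.sum(Z * np.conj(np.roll(Z, m))) for m in range(N)]) / N
    return np.fft.ifft(R).real

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
full_a = envelope_direct(x)
full_b = envelope_from_autocorrelation(x)
band_a = envelope_direct(x, band=(8, 24))
band_b = envelope_from_autocorrelation(x, band=(8, 24))
```

Both routes agree to numerical precision, for the full spectrum and for the selected band. The explicit loop is O(N^2) and is only meant to make the identity visible.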
Summary:
This paper builds on the original TNS paper. It describes TNS as a continuously adaptive filterbank.
Comments:
It's basically a more condensed version of the original TNS paper.
Summary:
The low delay codec in MPEG-4 v2 does not rely on window switching for pre-echo control during transients. Switching decisions introduce additional delay, which is unacceptable in this case. Instead it uses a fixed-length window and incorporates TNS for pre-echo control. Also, a new low-overlap TDAC window is introduced, which gives better results on transient signals.
Comments:
I found the low overlap window equation in the MPEG-4 v2 specifications and it's available in Matlab.
Comments:
This is a very promising application for parametric modeling tools. I am planning on applying my model to this.
Summary:
This is a very comprehensive review of perceptual coding algorithms and standards, and is itself a summary. For the purpose of this project one can read section E, Pre-Echo Control Strategies. We care mostly about the Window Switching, Gain Modification and Temporal Noise Shaping methods.