When I was a young and fresh graduate student in Barry Vercoe's group at MIT, I worked on some short-time Fourier transform analysis/synthesis tools, implementing a sound modification scheme known as the Phase Vocoder (PVOC). This code was incorporated into Csound, Barry's sound synthesis system, and has become quite widely distributed. For more information about Csound and its PVOC operator, see the Online Csound Manual, particularly the pages on pvoc (the Csound signal generator), pvanal (the pre-analysis program), and pvlook (a program to print out the values in a pvoc analysis file).

I regularly get questions from people wanting to use the PVOC analysis files for other purposes, so this page describes the file format. Please let me know if there's other information you'd like on this page.

Here is the format of the PVOC files, as written by pvanal. This is based on the contents of pvoc.h.

It starts off with a header. 'long' is a 32-bit integer in natural machine format, i.e. the byte order of the machine that wrote the file.

byte index | field | description |
---|---|---|
0-3 | long magic | = 517730 decimal, identifying code |
4-7 | long headBsize | = 56, the # of bytes in this header |
8-11 | long dataBsize | # of data bytes after the header |
12-15 | long dataFormat | = 36, i.e. data is 4 byte floats |
16-19 | float samplingRate | 32-bit float, natural format |
20-23 | long channels | = 1; multi-channel not supported |
24-27 | long frameSize | FFT size, always a power of 2 |
28-31 | long frameIncr | hop size, new samples per frame |
32-35 | long frameBsize | number of bytes in each block of data corresponding to one frame. = 8 * (frameSize/2+1) (see below) |
36-39 | long frameFormat | = 7 for PVOC files (polar, phi-dot) |
40-43 | float minFreq | = 0.0, centre freq of lowest bin |
44-47 | float maxFreq | = samplingRate/2.0, cf of top bin+1 |
48-51 | long freqFormat | = 1, linear frequency axis |
52-55 | char info[4] | empty header extension field |
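
As a rough illustration of the header layout above, here is a minimal Python sketch that unpacks the 56 bytes with the standard `struct` module. The names `parse_pvoc_header` and `PvocHeader` are invented for this example. Trying both byte orders against the magic number is an assumption about how to cope with files written on a different machine; the format itself only promises natural machine format.

```python
import struct
from collections import namedtuple

PVOC_MAGIC = 517730
# 4 longs, 1 float, 5 longs, 2 floats, 1 long, 4 chars = 56 bytes
_HEADER_FIELDS = "4if5i2fi4s"

# Hypothetical container; field names follow the table above.
PvocHeader = namedtuple(
    "PvocHeader",
    "magic headBsize dataBsize dataFormat samplingRate channels "
    "frameSize frameIncr frameBsize frameFormat minFreq maxFreq "
    "freqFormat info",
)


def parse_pvoc_header(raw):
    """Return (header, byte_order) for the first 56 bytes of a PVOC file.

    byte_order is '<' (little-endian) or '>' (big-endian), chosen by
    whichever one makes the magic number come out right (an assumption,
    since the file does not record its own byte order).
    """
    size = struct.calcsize("<" + _HEADER_FIELDS)  # 56
    if len(raw) < size:
        raise ValueError("need at least %d bytes of header" % size)
    for order in ("<", ">"):
        hdr = PvocHeader._make(struct.unpack(order + _HEADER_FIELDS, raw[:size]))
        if hdr.magic == PVOC_MAGIC:
            return hdr, order
    raise ValueError("magic number not found; not a PVOC analysis file?")
```

Given the first 56 bytes of a file in `raw`, `hdr, order = parse_pvoc_header(raw)` gives access to `hdr.frameSize`, `hdr.frameIncr` and so on, and `order` can be reused when unpacking the data that follows the header.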

There then follow some number of frames, each of frameSize/2+1 pairs of (32-bit) floats (hence frameBsize is 2 x 4 x (frameSize/2+1)). The first of each pair is the MAGNITUDE of the corresponding FFT bin (for bins 0..frameSize/2 INCLUSIVE, i.e. frameSize/2+1 bins). The second is the effective phase advance for that FFT bin over the preceding frame, expressed in cycles per second. To explain...

Consider an analysis with 256 point analysis window and 75% overlap between frames (frameSize = 256, frameIncr = 64).

If we denote the (complex) FFT output as X[k, m], where k is the frequency index (0 to 255, though we only look at 0 to 128, i.e. 129 bins) and m is the frame index (so m+1 indicates 64 samples further on), then the phase advance for X[k, m] is

Dp[k,m] = phase{X[k,m]} - phase{X[k,m-1]}

(phase{X[k,-1]} is taken as zero) and to express this number in cycles per second,

Dpc[k,m] = (Dp[k,m] / 2pi) / (frameIncr / samplingRate)

where Dp[k,m]/2pi is the portion of a cycle represented by Dp, and frameIncr/samplingRate is the time between frames.

Now because of the 2pi ambiguity in Dp, this number will be a positive or negative frequency smaller than half the bin width (samplingRate/frameSize). We assume the 'true' phase advance is approximately that at the band center, so we add that on:

Dpa[k,m] = Dpc[k,m] + (k/frameSize)*samplingRate

It is this number that is written to file, or something like that. The actual range of ambiguity depends on the size of frameIncr, of course. In the code it is calculated a little differently: the expected phase advance for a component at band center is calculated mod 2pi, the difference from the measured phase advance is also reduced modulo 2pi, this is then scaled up to cycles per second, and the bin mid frequency is added back.
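
The calculation just described can be sketched in a few lines of Python. This is an illustration built from the definitions on this page, not the actual Csound code. The function name `phase_advance_hz` and its arguments are invented for the example, and it wraps once (on the difference) rather than reducing the expected and measured advances separately, which gives the same result.

```python
import math


def phase_advance_hz(phase_prev, phase_cur, k, frame_size, frame_incr,
                     sampling_rate):
    """Phase advance of bin k between two frames, in cycles per second.

    phase_prev and phase_cur are phase{X[k,m-1]} and phase{X[k,m]} in
    radians. All parameter names are hypothetical, chosen for this sketch.
    """
    two_pi = 2.0 * math.pi
    # Advance expected over one hop for a component exactly at bin centre.
    expected = two_pi * k * frame_incr / frame_size
    # Deviation of the measured advance from the expected one,
    # wrapped into [-pi, pi] to resolve the 2pi ambiguity.
    deviation = math.remainder((phase_cur - phase_prev) - expected, two_pi)
    # Radians per hop -> cycles per second.
    deviation_hz = (deviation / two_pi) / (frame_incr / sampling_rate)
    # Add the bin centre frequency back on.
    return deviation_hz + (k / frame_size) * sampling_rate
```

For the 256/64 analysis above and an assumed sampling rate of 8192 Hz, feeding in the phases a 200 Hz sinusoid would produce in bin 6 (centre 192 Hz) returns 200 Hz to within rounding error; the same holds for the neighbouring bin 7.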

The motivation behind all this is that the numbers that come out have some kind of meaningful interpretation: if the signal has a sinusoidal component, the phase advance so expressed is the frequency of the sinusoid.

Having said all that, I wouldn't do it that way again! I think it's easier to leave application-dependent reformulation to the application, and make the data files as vanilla as possible.

(I just checked that on a 200 Hz sine tone: the dphase values around the magnitude peak are indeed +200.0. The magnitude values are normally a little smaller than 1.0, i.e. the input file has been scaled so that 32768 -> 1.0, and I *think* the FFT output has been scaled by 1/window length, but I'm not sure.)
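
To repeat that kind of check on your own file, the frame data can be unpacked and the peak bin located roughly as follows. This is a minimal sketch assuming the layout described on this page (frameSize/2+1 magnitude/frequency pairs of 32-bit floats per frame); the function names `unpack_frames` and `peak` are made up for the example, and the byte order has to be supplied by the caller.

```python
import struct


def unpack_frames(data, frame_size, byte_order="<"):
    """Split the bytes after the header into frames of (magnitude, freq) pairs.

    data is the raw frame data, frame_size is the header's frameSize, and
    byte_order is '<' or '>' (an assumption the caller must get right).
    A trailing partial frame, if any, is ignored.
    """
    n_bins = frame_size // 2 + 1
    frame_bytes = 8 * n_bins  # should equal the header's frameBsize
    fmt = byte_order + "%df" % (2 * n_bins)
    frames = []
    for start in range(0, len(data) - frame_bytes + 1, frame_bytes):
        values = struct.unpack_from(fmt, data, start)
        # Even positions are magnitudes, odd positions are frequencies in Hz.
        frames.append(list(zip(values[0::2], values[1::2])))
    return frames


def peak(frame):
    """Return (bin index, magnitude, frequency) of the largest-magnitude bin."""
    k = max(range(len(frame)), key=lambda i: frame[i][0])
    return k, frame[k][0], frame[k][1]
```

With a real analysis file, `data` would be everything after the first headBsize (56) bytes. For a steady sine tone, the frequency returned by `peak` for each frame should sit close to the tone's frequency.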

*(originally written 1994mar09)*

Last updated: $Date: 2001/09/28 18:53:00 $

Dan Ellis <dpwe@ee.columbia.edu>