
5.12 Viewing Speech with HLIST

 

  As mentioned in section 5.1, the tool HLIST serves a dual rôle in HTK. Firstly, it can be used for examining the contents of speech data files. In general, HLIST displays three types of information:

  1. source header: requested using the -h option
  2. target header: requested using the -t option
  3. target data: printed by default. The begin and end samples of the displayed data can be specified using the -s and -e options.
When the default configuration parameters are used, no conversions are applied and the target data is identical to the contents of the file.

  As an example, suppose that the file called timit.wav holds speech waveform data using the TIMIT format. The command

    HList -h -e 49 -F TIMIT timit.wav
would display the source header information and the first 50 samples of the file. The output would look something like the following

The source information confirms that the file contains WAVEFORM data with 2 byte samples and 31437 samples in total. The sample period is 62.5 µs, which corresponds to a 16 kHz sample rate. The displayed data is numerically small because it corresponds to leading silence. Any part of the file could be viewed by suitable choice of the begin and end sample indices. For example,
    HList -s 5000 -e 5049 -F TIMIT timit.wav
would display samples 5000 through to 5049. The output might look like the following

The second use of HLIST is to check that input conversions are being performed properly. Suppose that the above TIMIT format file is part of a database to be used for training a recogniser and that mel-frequency cepstra are to be used along with energy and the first differential coefficients. Suitable configuration parameters needed to achieve this might be as follows

    # Wave -> MFCC config file
    SOURCEFORMAT = TIMIT    # same as -F TIMIT
    TARGETKIND   = MFCC_E_D # MFCC + Energy + Deltas
    TARGETRATE   = 100000   # 10ms frame rate
    WINDOWSIZE   = 200000   # 20ms window
    NUMCHANS     = 24       # num filterbank chans
    NUMCEPS      = 8        # compute c1 to c8
HLIST can be used to check this. For example, typing
    HList -C config -o -h -t -s 100 -e 104 -i 9  timit.wav
will cause the waveform file to be converted, then the source header, the target header and parameter vectors 100 through to 104 to be listed. A typical output would be as follows
------------------------------ Source: timit.wav ---------------------------
  Sample Bytes:  2        Sample Kind:   WAVEFORM
  Num Comps:     1        Sample Period: 62.5 us
  Num Samples:   31437    File Format:   TIMIT
------------------------------------ Target --------------------------------
  Sample Bytes:  72       Sample Kind:   MFCC_E_D
  Num Comps:     18       Sample Period: 10000.0 us
  Num Samples:   195      File Format:   HTK
-------------------------- Observation Structure ---------------------------
x:    MFCC-1  MFCC-2  MFCC-3  MFCC-4  MFCC-5  MFCC-6  MFCC-7  MFCC-8       E
       Del-1   Del-2   Del-3   Del-4   Del-5   Del-6   Del-7   Del-8    DelE
------------------------------ Samples: 100->104 ---------------------------
100:   3.573 -19.729  -1.256  -6.646  -8.293 -15.601 -23.404  10.988   0.834
       3.161  -1.913   0.573  -0.069  -4.935   2.309  -5.336   2.460   0.080
101:   3.372 -16.278  -4.683  -3.600 -11.030  -8.481 -21.210  10.472   0.777
       0.608  -1.850  -0.903  -0.665  -2.603  -0.194  -2.331   2.180   0.069
102:   2.823 -15.624  -5.367  -4.450 -12.045 -15.939 -22.082  14.794   0.830
      -0.051   0.633  -0.881  -0.067  -1.281  -0.410   1.312   1.021   0.005
103:   3.752 -17.135  -5.656  -6.114 -12.336 -15.115 -17.091  11.640   0.825
      -0.002  -0.204   0.015  -0.525  -1.237  -1.039   1.515   1.007   0.015
104:   3.127 -16.135  -5.176  -5.727 -14.044 -14.333 -18.905  15.506   0.833
      -0.034  -0.247   0.103  -0.223  -1.575   0.513   1.507   0.754   0.006
------------------------------------- END ----------------------------------

The target header information shows that the converted data consists of 195 parameter vectors, each vector having 18 components and being 72 bytes in size. The structure of each parameter vector is displayed as a simple sequence of floating-point numbers. The layout information described in section 5.7 can be used to interpret the data. However, including the -o option, as in the example, causes HLIST to output a schematic of the observation structure. Thus, it can be seen that the first row of each sample contains the static coefficients and the second contains the delta coefficients. The energy is in the final column. The command line option -i 9 controls the number of values displayed per line and can be used to aid in the visual interpretation of the data. Notice finally that the command line option -F TIMIT was not required in this case because the source format was specified in the configuration file.
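
The header values in this example can be cross-checked with a little arithmetic. The following Python sketch is not part of HTK; it assumes that HTK times are expressed in units of 100 ns (which is what the 10ms and 20ms comments in the configuration file imply), that parameters are stored as 4 byte floats, and that only complete analysis windows produce a frame. The variable names are illustrative.

```python
# Cross-check of the source and target header values listed above.
# Assumptions: times are in 100 ns units, parameters are 4 byte floats,
# and a frame is produced only for each complete analysis window.
HTK_UNITS_PER_SECOND = 10_000_000

source_period = 625      # 62.5 us sample period, in 100 ns units
target_rate = 100_000    # TARGETRATE from the config file
window_size = 200_000    # WINDOWSIZE from the config file
num_samples = 31_437     # Num Samples in the source header
num_ceps = 8             # NUMCEPS from the config file

sample_rate_hz = HTK_UNITS_PER_SECOND // source_period
window_samples = window_size // source_period
shift_samples = target_rate // source_period

# Number of complete windows that fit into the waveform.
num_frames = (num_samples - window_samples) // shift_samples + 1

# MFCC_E_D: cepstra plus energy, each with a delta coefficient.
num_comps = (num_ceps + 1) * 2
vector_bytes = num_comps * 4

print("sample rate (Hz):", sample_rate_hz)
print("window / shift (samples):", window_samples, shift_samples)
print("frames:", num_frames)
print("components / bytes per vector:", num_comps, vector_bytes)
```

Under these assumptions the computed values agree with the listing: a 16 kHz sample rate, 195 target vectors, and 18 components occupying 72 bytes.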

It should be stressed that when HLIST displays parameterised data, it does so in exactly the form in which observations are passed to an HTK tool. So, for example, if the above data were input to a system built using 3 data streams, this can be simulated by using the command line option -n to set the number of streams. For example, typing

    HList -C config -n 3 -o -s 100 -e 101 -i 9  timit.wav
would result in the following output
------------------------ Observation Structure -----------------------
nTotal=18 nStatic=8 nDel=16  eSep=T
x.1:    MFCC-1  MFCC-2  MFCC-3  MFCC-4  MFCC-5  MFCC-6  MFCC-7  MFCC-8
x.2:     Del-1   Del-2   Del-3   Del-4   Del-5   Del-6   Del-7   Del-8
x.3:         E    DelE
-------------------------- Samples: 100->101 -------------------------
100.1:   3.573 -19.729  -1.256  -6.646  -8.293 -15.601 -23.404  10.988
100.2:   3.161  -1.913   0.573  -0.069  -4.935   2.309  -5.336   2.460
100.3:   0.834   0.080
101.1:   3.372 -16.278  -4.683  -3.600 -11.030  -8.481 -21.210  10.472
101.2:   0.608  -1.850  -0.903  -0.665  -2.603  -0.194  -2.331   2.180
101.3:   0.777   0.069
--------------------------------- END --------------------------------
Notice that the data is identical to the previous case, but it has been re-organised into separate streams. 
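
The re-organisation can be reproduced by hand. The Python sketch below is only an illustration of the layout shown in the two listings, not HTK's own code; the function name split_streams is hypothetical. It takes the 18 components of sample 100 in single-stream order (statics, energy, deltas, delta energy) and regroups them into the three streams printed above.

```python
# Illustrative regrouping of one MFCC_E_D vector into 3 streams,
# following the layout printed by HList in the listings above.
# split_streams is a hypothetical helper, not an HTK function.

def split_streams(vector, num_ceps):
    """Return [statics, deltas, [energy, delta energy]]."""
    statics = vector[:num_ceps]
    energy = vector[num_ceps]
    deltas = vector[num_ceps + 1:2 * num_ceps + 1]
    delta_energy = vector[2 * num_ceps + 1]
    return [statics, deltas, [energy, delta_energy]]

# Sample 100 from the single-stream listing.
sample_100 = [
    3.573, -19.729, -1.256, -6.646, -8.293, -15.601, -23.404, 10.988, 0.834,
    3.161, -1.913, 0.573, -0.069, -4.935, 2.309, -5.336, 2.460, 0.080,
]

streams = split_streams(sample_100, num_ceps=8)
for index, stream in enumerate(streams, start=1):
    print(f"100.{index}:", " ".join(f"{value:8.3f}" for value in stream))
```

The third stream holds the energy and its delta, which is consistent with the eSep=T flag in the observation structure.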



ECRL HTK_V2.1: email [email protected]