Next: 12.7 N-Best Lists and Lattices Up: 12 Decoding Previous: 12.5 Generating Forced Alignments

12.6 Recognition using Direct Audio Input

In all of the preceding discussion, it has been assumed that input was from speech files stored on disk. These files would normally have been stored in parameterised form so that little or no conversion of the source speech data was required. When HVITE is invoked with no files listed on the command line, it assumes that input is to be taken directly from the audio input. In this case, configuration variables must be used to specify firstly how the speech waveform is to be captured and secondly, how the captured waveform is to be converted to parameterised form.

Dealing with waveform capture first: as described in section 5.9, HTK provides two main forms of control over speech capture, namely signals or key-presses, and an automatic speech/silence detector. To use the speech/silence detector alone, the configuration file would contain the following

    # Waveform capture
    SOURCERATE=625.0
    SOURCEKIND=WAVEFORM
    SOURCEFORMAT=HAUDIO
    USESILDET=T
    MEASURESIL=T
    OUTSILWARN=T
    ENORMALISE=F

where the source sample period of 625.0 (expressed in units of 100 ns) corresponds to a sampling rate of 16 kHz. Notice that the SOURCEKIND must be set to WAVEFORM and the SOURCEFORMAT must be set to HAUDIO. Setting the Boolean variable USESILDET causes the speech/silence detector to be used, and the MEASURESIL and OUTSILWARN variables result in a measurement being taken of the background silence level prior to capturing the first utterance. To make sure that each input utterance is being captured properly, the HVITE option -g can be set to cause the captured wave to be output after each recognition attempt. Note that for recognition of live audio input the configuration variable ENORMALISE should be set to false.
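
The relationship between SOURCERATE and the sampling frequency can be checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of HTK; the function names are hypothetical. It relies on HTK expressing sample periods in units of 100 ns.

```python
# HTK expresses time-valued configuration variables in units of 100 ns,
# so one second corresponds to 10,000,000 units.
HTK_UNITS_PER_SECOND = 10_000_000


def source_rate_to_hz(source_rate: float) -> float:
    """Convert a SOURCERATE sample period (100 ns units) to a frequency in Hz."""
    return HTK_UNITS_PER_SECOND / source_rate


def hz_to_source_rate(freq_hz: float) -> float:
    """Convert a sampling frequency in Hz to a SOURCERATE sample period."""
    return HTK_UNITS_PER_SECOND / freq_hz


if __name__ == "__main__":
    print(source_rate_to_hz(625.0))   # 16000.0
    print(hz_to_source_rate(8000.0))  # 1250.0
```

Under the same convention, an 8 kHz source would be configured with SOURCERATE=1250.0.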

As an alternative to using the speech/silence detector, a signal can be used to start and stop recording. For example,

    # Waveform capture
    SOURCERATE=625.0
    SOURCEKIND=WAVEFORM
    SOURCEFORMAT=HAUDIO
    AUDIOSIG=2

would result in the Unix interrupt signal (usually the Control-C key) being used as a start and stop control. Key-press control of the audio input can be obtained by setting AUDIOSIG to a negative number.

Both of the above can be used together. In this case, audio capture is disabled until the specified signal is received; from then on, control is in the hands of the speech/silence detector.

The captured waveform must be converted to the required target parameter kind, so the configuration file must define all of the parameters needed to control this conversion. The process is described in detail in Chapter 5. As an example, the following parameters would allow conversion to Mel-frequency cepstral coefficients with delta and acceleration parameters.

    # Waveform to MFCC parameters
    TARGETKIND=MFCC_0_D_A
    TARGETRATE=100000.0
    WINDOWSIZE=250000.0
    ZMEANSOURCE=T
    USEHAMMING=T
    PREEMCOEF=0.97
    USEPOWER=T
    NUMCHANS=26
    CEPLIFTER=22
    NUMCEPS=12

Many of these variable settings are the defaults and could be omitted; they are included explicitly here as a reminder of the main configuration options available.
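
To see what these settings imply for a 16 kHz source, the frame geometry and the size of each parameter vector can be worked out by hand. The Python sketch below is illustrative only and is not HTK code; the function names are hypothetical, and the qualifier handling covers only _0, _E, _D and _A.

```python
def samples_per_interval(interval_htk: float, source_rate: float) -> int:
    """Number of source samples spanned by an interval given in 100 ns units."""
    return round(interval_htk / source_rate)


def vector_size(target_kind: str, num_ceps: int) -> int:
    """Components per parameter vector for a kind such as MFCC_0_D_A.

    Simplified: only the _0, _E, _D and _A qualifiers are considered.
    """
    qualifiers = target_kind.split("_")[1:]
    # Static part: the cepstra, plus C0 (_0) and/or log energy (_E) if requested.
    static = num_ceps + ("0" in qualifiers) + ("E" in qualifiers)
    # Deltas (_D) and accelerations (_A) each append a copy of the static part.
    blocks = 1 + ("D" in qualifiers) + ("A" in qualifiers)
    return static * blocks


if __name__ == "__main__":
    source_rate = 625.0  # 16 kHz sample period in 100 ns units
    print(samples_per_interval(100000.0, source_rate))  # 160 samples per 10 ms frame shift
    print(samples_per_interval(250000.0, source_rate))  # 400 samples per 25 ms window
    print(vector_size("MFCC_0_D_A", 12))                # 39 components per vector
```

In other words, each 25 ms analysis window covers 400 samples, successive windows are 160 samples apart, and every frame yields a 39-component vector (12 cepstra plus C0, each with delta and acceleration coefficients).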

When HVITE is executed in direct audio input mode, it issues a prompt prior to each input and it is normal to enable basic tracing so that the recognition results can be seen. A typical terminal output might be

    READY[1]>
    Please speak sentence - measuring levels
    Level measurement completed
    DIAL ONE FOUR SEVEN  
         ==  [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4] (Act=22.3)

    READY[2]>
    CALL NINE TWO EIGHT  
         ==  [233 frames] -97.0850 [Ac=-22402.5 LM=-218.4] (Act=21.8)

    etc
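
The figures on each trace line can be cross-checked. The Python sketch below is illustrative only; it assumes that the score after the frame count is the per-frame average of the acoustic (Ac) and language model (LM) totals, an interpretation that the numbers shown above are consistent with. The regular expression and function names are hypothetical and tailored to this sample output.

```python
import re

# One trace line copied from the sample terminal output.
TRACE_LINE = "==  [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4] (Act=22.3)"

# Hypothetical pattern matching the layout of the sample line only.
PATTERN = re.compile(
    r"\[(?P<frames>\d+) frames\]\s+(?P<avg>-?\d+\.\d+)\s+"
    r"\[Ac=(?P<ac>-?\d+\.\d+)\s+LM=(?P<lm>-?\d+\.\d+)\]\s+"
    r"\(Act=(?P<act>\d+\.\d+)\)"
)


def parse_trace(line: str) -> dict:
    """Extract the numeric fields from one trace line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognised trace line: {line!r}")
    return {
        "frames": int(match["frames"]),
        "avg": float(match["avg"]),
        "ac": float(match["ac"]),
        "lm": float(match["lm"]),
        "act": float(match["act"]),
    }


def per_frame_score(fields: dict) -> float:
    """Assumed relation: (acoustic total + language model total) / frames."""
    return (fields["ac"] + fields["lm"]) / fields["frames"]


if __name__ == "__main__":
    fields = parse_trace(TRACE_LINE)
    # The recomputed value agrees with the printed average to about 3 decimals.
    print(fields["avg"], round(per_frame_score(fields), 3))
```

The small residual difference is what would be expected from the totals being printed to only one decimal place.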

If required, a transcription of each spoken input can be output to a label file or an MLF in the usual way by setting the -e option. However, to do this a file name must be synthesised for each utterance. This is done by using a counter prefixed by the value of the HVITE configuration variable RECOUTPREFIX and suffixed by the value of RECOUTSUFFIX. For example, with the settings

    RECOUTPREFIX = sjy
    RECOUTSUFFIX = .rec

the output transcriptions would be stored as sjy0001.rec, sjy0002.rec, etc.
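
The naming scheme can be mimicked as follows. This Python sketch is illustrative only and is not HVITE's implementation; the function name is hypothetical, and the four-digit zero-padded counter starting at 1 is an assumption inferred from the example names above.

```python
def rec_output_name(prefix: str, counter: int, suffix: str, width: int = 4) -> str:
    """Build a transcription file name as prefix + zero-padded counter + suffix.

    The default width of 4 is an assumption taken from names such as sjy0001.rec.
    """
    return f"{prefix}{counter:0{width}d}{suffix}"


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(rec_output_name("sjy", n, ".rec"))
```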


ECRL HTK_V2.1: email [email protected]