
3.5 Running the Recogniser Live

Finally, the recogniser can be run with live input. To do this, it is only necessary to set the configuration variables needed to convert the input audio to the correct form of parameterisation. Specifically, the following need to be set:

    # Waveform capture
    SOURCERATE=625.0
    SOURCEKIND=WAVEFORM
    SOURCEFORMAT=HAUDIO
    ENORMALISE=F
    USESILDET=T
    MEASURESIL=T
    OUTSILWARN=T

These indicate that the source is direct audio with a sample period of 62.5 μs (SOURCERATE is specified in units of 100 ns). The silence detector is enabled and a measurement of the background speech/silence levels should be made at start-up. The final line makes sure that a warning is printed when this silence measurement is being made.
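
As a quick check on the units: HTK expresses SOURCERATE as a sample period in units of 100 ns, so the value 625.0 corresponds to 62.5 μs, i.e. 16 kHz sampling. A minimal sketch of the conversion follows; the function and constant names are illustrative and are not part of HTK.

```python
# HTK time values such as SOURCERATE are given in units of 100 ns.
HTK_TIME_UNIT_S = 100e-9


def sourcerate_to_hz(sourcerate):
    """Convert an HTK SOURCERATE value (sample period, 100 ns units) to Hz.

    Illustrative helper only; HTK itself provides no such function.
    """
    period_s = sourcerate * HTK_TIME_UNIT_S
    return 1.0 / period_s


def sourcerate_to_microseconds(sourcerate):
    """Return the sample period in microseconds."""
    return sourcerate * HTK_TIME_UNIT_S * 1e6


if __name__ == "__main__":
    print(f"sample period: {sourcerate_to_microseconds(625.0):.1f} us")
    print(f"sampling rate: {sourcerate_to_hz(625.0):.0f} Hz")
```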

Once the configuration file has been set up for direct audio input, HVite can be run as in the previous step except that no files need be given as arguments. On start-up, HVite will prompt the user to speak an arbitrary sentence (approx. 4 secs) in order to measure the speech and background silence levels. It will then repeatedly recognise and, if trace level bit 1 is set, it will output each utterance to the terminal. A typical session is as follows:

   Read 1648 physical / 4131 logical HMMs
   Read lattice with 26 nodes / 52 arcs
   Created network with 123 nodes / 151 links

   READY[1]>
   Please speak sentence - measuring levels
   Level measurement completed
   DIAL FOUR SIX FOUR TWO FOUR OH  
        == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)
   
   READY[2]>
    DIAL ZERO EIGHT SIX TWO 
        == [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)
   
   READY[3]>
    etc

During loading, information will be printed out regarding the different recogniser components. The physical models are the distinct HMMs used by the system, while the logical models include all model names. The number of logical models is higher than the number of physical models because many logically distinct models have been determined to be physically identical and have been merged during the previous model building steps. The lattice information refers to the number of links and nodes in the recognition syntax. The network information refers to the actual recognition network built by expanding the lattice using the current HMM set, dictionary and any context expansion rules specified. After each utterance, the numerical information gives the total number of frames, the average log likelihood per frame, the total acoustic score, the total language model score and the average number of models active.
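
The per-utterance figures can be cross-checked against each other. In the session shown, the average log likelihood per frame is consistent with the sum of the acoustic and language model scores divided by the number of frames. The sketch below parses a result line and recomputes that average; the regular expression and function names are illustrative assumptions based only on the output format shown above, not on any HTK API.

```python
import re

# Pattern for result lines of the form shown in the session, e.g.
#   == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)
NUM = r"-?\d+(?:\.\d+)?"
RESULT_RE = re.compile(
    rf"\[(?P<frames>\d+) frames\]\s+(?P<avg>{NUM})\s+"
    rf"\[Ac=(?P<ac>{NUM})\s+LM=(?P<lm>{NUM})\]\s+"
    rf"\(Act=(?P<act>{NUM})\)"
)


def parse_result(line):
    """Extract the numeric fields from one result line."""
    m = RESULT_RE.search(line)
    if m is None:
        raise ValueError(f"not a result line: {line!r}")
    return {
        "frames": int(m["frames"]),   # total number of frames
        "avg": float(m["avg"]),       # average log likelihood per frame
        "ac": float(m["ac"]),         # total acoustic score
        "lm": float(m["lm"]),         # total language model score
        "act": float(m["act"]),       # average number of active models
    }


def recomputed_average(result):
    """Recompute the per-frame average from the two total scores."""
    return (result["ac"] + result["lm"]) / result["frames"]


if __name__ == "__main__":
    line = "== [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)"
    r = parse_result(line)
    print(f"reported {r['avg']:.4f}, recomputed {recomputed_average(r):.4f}")
```

The small difference between the reported and recomputed values comes from the totals being printed to one decimal place.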

Finally, note that if it was required to recognise a new name, then the following two changes would be needed:

1. the grammar would be altered to include the new name
2. a pronunciation for the new name would be added to the dictionary

If the new name required triphones which did not exist, then they could be created by loading the existing triphone set into HHEd, loading the decision trees using the LT command and then using the AU command to generate a new complete triphone set.


ECRL HTK_V2.1: email [email protected]