Finally, the recogniser can be run with live input. To do this it is only necessary to set the configuration variables needed to convert the input audio to the correct form of parameterisation. Specifically, the following need to be set:
    # Waveform capture
    SOURCERATE=625.0
    SOURCEKIND=WAVEFORM
    SOURCEFORMAT=HAUDIO
    ENORMALISE=F
    USESILDET=T
    MEASURESIL=T
    OUTSILWARN=T

These indicate that the source is direct audio with a sample period of 62.5 μsecs. The silence detector is enabled and a measurement of the background speech/silence levels should be made at start-up. The final line makes sure that a warning is printed when this silence measurement is being made.
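As a quick check on the SOURCERATE value, the following Python sketch converts it to a sample period and a sampling frequency. It assumes the usual HTK convention that times are given in units of 100 ns; the function names are illustrative only and are not part of HTK.

```python
# HTK time values are assumed here to be expressed in units of 100 ns,
# i.e. 0.1 microseconds, which is 1e-7 seconds.

def sample_period_us(source_rate: float) -> float:
    """Convert a SOURCERATE value to a sample period in microseconds."""
    return source_rate / 10.0


def sample_rate_hz(source_rate: float) -> float:
    """Convert a SOURCERATE value to a sampling frequency in Hz."""
    return 1e7 / source_rate


print(sample_period_us(625.0))  # 62.5
print(sample_rate_hz(625.0))    # 16000.0
```

Under that assumption, SOURCERATE=625.0 corresponds to audio sampled at 16 kHz.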
Once the configuration file has been set up for direct audio input, HVITE can be run as in the previous step except that no files need be given as arguments. On start-up, HVITE will prompt the user to speak an arbitrary sentence (approx. 4 secs) in order to measure the speech and background silence levels. It will then repeatedly recognise and, if trace level bit 1 is set, it will output each utterance to the terminal. A typical session is as follows:
    Read 1648 physical / 4131 logical HMMs
    Read lattice with 26 nodes / 52 arcs
    Created network with 123 nodes / 151 links

    READY[1]>
    Please speak sentence - measuring levels
    Level measurement completed
    DIAL FOUR SIX FOUR TWO FOUR OH
         == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)

    READY[2]>
    DIAL ZERO EIGHT SIX TWO
         == [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)

    READY[3]>
    etc

During loading, information will be printed out regarding the different recogniser components. The physical models are the distinct HMMs used by the system, while the logical models include all model names. The number of logical models is higher than the number of physical models because many logically distinct models have been determined to be physically identical and have been merged during the previous model building steps. The lattice information refers to the number of links and nodes in the recognition syntax. The network information refers to the actual recognition network built by expanding the lattice using the current HMM set, dictionary and any context expansion rules specified. After each utterance, the numerical information gives the total number of frames, the average log likelihood per frame, the total acoustic score, the total language model score and the average number of models active.
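To make the per-utterance figures concrete, the following Python sketch parses one result line from the session above and compares the reported average with (Ac + LM) / frames. The regular expression and the field names are illustrative assumptions based only on the sample output shown here. The relationship between the fields is inferred from the sample figures, which agree to within rounding; it is not taken from a description of HVITE's internals.

```python
import re

# Pattern inferred from the sample session output; hypothetical, not an
# official description of the output format.
NUM = r"(-?\d+(?:\.\d+)?)"
RESULT_RE = re.compile(
    r"==\s*\[(\d+) frames\]\s*" + NUM
    + r"\s*\[Ac=" + NUM + r"\s+LM=" + NUM + r"\]\s*\(Act=" + NUM + r"\)"
)


def parse_result(line: str) -> dict:
    """Extract the numeric fields from one '== [...]' result line."""
    m = RESULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised result line: " + repr(line))
    return {
        "frames": int(m.group(1)),          # total number of frames
        "avg_log_lik": float(m.group(2)),   # average log likelihood per frame
        "acoustic": float(m.group(3)),      # total acoustic score
        "lm": float(m.group(4)),            # total language model score
        "active": float(m.group(5)),        # average number of active models
    }


line = "== [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)"
result = parse_result(line)

# Assumed relationship: average = (acoustic + language model) / frames.
per_frame = (result["acoustic"] + result["lm"]) / result["frames"]
print(result["avg_log_lik"], round(per_frame, 2))
```

For the first utterance the recomputed value differs from the printed average by less than 0.001, which is consistent with the scores having been rounded for display.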
Finally, note that if it was required to recognise a new name, then the following two changes would be needed