Finally, the recogniser can be run with live input. To do this it is only necessary to set the configuration variables needed to convert the input audio to the correct form of parameterisation. Specifically, the following need to be set:
# Waveform capture
SOURCERATE=625.0
SOURCEKIND=WAVEFORM
SOURCEFORMAT=HAUDIO
ENORMALISE=F
USESILDET=T
MEASURESIL=T
OUTSILWARN=T
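These indicate that the source is direct audio with a sample period of 62.5 μs (SOURCERATE is expressed in units of 100 ns). The silence detector is enabled, and the speech and background silence levels are measured at start-up, with a warning printed while the measurement is being made.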
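The relationship between SOURCERATE and the sampling rate can be checked with a short calculation. The sketch below is illustrative only and is not part of HTK; the function names are hypothetical. It relies on the fact that HTK expresses sample periods in units of 100 ns.

```python
# Illustrative helpers (not part of HTK): convert a SOURCERATE value,
# given in units of 100 ns, into more familiar quantities.

def sample_period_us(sourcerate: float) -> float:
    """Sample period in microseconds (100 ns = 0.1 microseconds)."""
    return sourcerate / 10.0


def sample_rate_hz(sourcerate: float) -> float:
    """Sampling rate in Hz (one second contains 1e7 units of 100 ns)."""
    return 1.0e7 / sourcerate


if __name__ == "__main__":
    print(sample_period_us(625.0))  # 62.5
    print(sample_rate_hz(625.0))    # 16000.0
```

A SOURCERATE of 625.0 therefore corresponds to audio sampled at 16 kHz.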
Once the configuration file has been set up for direct audio input, HVITE can be run as in the previous step except that no files need be given as arguments. On start-up, HVITE will prompt the user to speak an arbitrary sentence (approx. 4 secs) in order to measure the speech and background silence levels. It will then repeatedly recognise and, if trace level bit 1 is set, it will output each utterance to the terminal. A typical session is as follows:
Read 1648 physical / 4131 logical HMMs
Read lattice with 26 nodes / 52 arcs
Created network with 123 nodes / 151 links
READY[1]>
Please speak sentence - measuring levels
Level measurement completed
DIAL FOUR SIX FOUR TWO FOUR OH
== [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)
READY[2]>
DIAL ZERO EIGHT SIX TWO
== [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)
READY[3]>
etc
During loading, information will be printed out regarding the different
recogniser components. The physical models are the distinct HMMs used by
the system, while the logical models include all model names. The number
of logical models is higher than the number of physical models because many
logically distinct models have been determined to be physically identical
and have been merged during the previous model building steps. The lattice
information refers to the number of links and nodes in the recognition syntax.
The network information refers to the actual recognition network built by
expanding the lattice using the current HMM set, dictionary and any context
expansion rules specified.
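The distinction between physical and logical models can be illustrated with a small sketch. It assumes a model list in which each line holds a logical model name, optionally followed by the name of the physical model it has been tied to. The triphone names and the function name below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: count physical and logical models in a model list.
# Assumed format: "<logical>" or "<logical> <physical>" on each line.

def count_models(hmm_list_lines):
    logical = set()
    physical = set()
    for raw in hmm_list_lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        logical_name = fields[0]
        # A name without a second field is its own physical model.
        physical_name = fields[1] if len(fields) > 1 else fields[0]
        logical.add(logical_name)
        physical.add(physical_name)
    return len(physical), len(logical)


if __name__ == "__main__":
    # Hypothetical entries: two logical models share the physical model s-ih+k.
    example = ["s-ih+k", "z-ih+k s-ih+k", "t-ih+k s-ih+k", "ax-b+aw"]
    n_phys, n_log = count_models(example)
    print(f"Read {n_phys} physical / {n_log} logical HMMs")
    # prints: Read 2 physical / 4 logical HMMs
```

Every logical name maps to exactly one physical model, so the logical count can never be lower than the physical count.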
After each utterance, the numerical information gives the total number
of frames, the average log likelihood per frame, the total acoustic score,
the total language model score and the average number of models active.
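The figures in the session above can be cross-checked. In both sample utterances, the reported average per frame matches the sum of the acoustic and language model scores divided by the number of frames. The sketch below parses a result line and recomputes that average. The regular expression and function names are hypothetical, and the formula is inferred from the sample output rather than taken from the HVITE documentation.

```python
import re

# Pattern for result lines such as:
# == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)
NUM = r"-?\d+(?:\.\d+)?"
RESULT_LINE = re.compile(
    rf"==\s*\[(?P<frames>\d+) frames\]\s*(?P<avg>{NUM})\s*"
    rf"\[Ac=(?P<ac>{NUM})\s+LM=(?P<lm>{NUM})\]\s*\(Act=(?P<act>{NUM})\)"
)


def parse_result(line):
    """Extract the numeric fields from one result line."""
    m = RESULT_LINE.search(line)
    if m is None:
        raise ValueError(f"not a result line: {line!r}")
    return {
        "frames": int(m["frames"]),
        "avg": float(m["avg"]),
        "ac": float(m["ac"]),
        "lm": float(m["lm"]),
        "act": float(m["act"]),
    }


def recomputed_average(result):
    """Assumed relationship: (acoustic + language model score) / frames."""
    return (result["ac"] + result["lm"]) / result["frames"]


if __name__ == "__main__":
    line = "== [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)"
    r = parse_result(line)
    print(r["avg"], round(recomputed_average(r), 4))
```

For the first utterance, (-28630.2 - 329.8) / 303 is approximately -95.5776, which agrees with the reported -95.5773 to within the rounding of the printed totals.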
Finally, note that if it was required to recognise a new name, then the following two changes would be needed