
11.1 How Networks are Used

 

Before delving into the details of word networks and dictionaries, it will be helpful to understand their rôle in building a speech recogniser using HTK. Fig. 11.1 illustrates the overall recognition process. A word network is defined using the HTK Standard Lattice Format (SLF). An SLF word network is just a text file, so it can either be written directly with a text editor or built with a tool. HTK provides two such tools, HBUILD and HPARSE. Both take a textual description as input and output an SLF file. Other methods of generating SLF files might in the future include graphics-based systems which allow the required networks to be drawn on the screen. Whatever method is chosen, the SLF word network is generated off-line as part of the system build process.

An SLF file contains a list of nodes representing words and a list of arcs representing the transitions between words. The transitions can have probabilities attached to them, and these can be used to indicate preferences in a grammar network. They can also be used to represent bigram probabilities in a back-off bigram network, and HBUILD can generate such a bigram network automatically. In addition to an SLF file, an HTK recogniser requires a dictionary to supply pronunciations for each word in the network and a set of acoustic HMM phone models. Dictionaries are input via the HTK interface module HDICT.
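As a rough illustration of the node-and-arc structure just described, the following Python sketch writes a tiny word network as text. The field names (VERSION, N, L, I, W, J, S, E, l) follow the SLF conventions covered in the next section. The helper `make_slf`, the example words, the probabilities and the use of natural logarithms are assumptions made for this illustration only; they are not part of HTK.

```python
import math

def make_slf(words, arcs):
    """Return SLF-style text for a small word network.

    words: list of word labels, one per node (list index = node number).
    arcs:  list of (start_node, end_node, probability or None).
    """
    # Header: format version, then node and link counts.
    lines = ["VERSION=1.0", f"N={len(words)} L={len(arcs)}"]
    # One line per node, giving its index and the word it represents.
    for i, w in enumerate(words):
        lines.append(f"I={i} W={w}")
    # One line per arc; the optional l= field carries a log probability
    # (natural log is assumed here).
    for j, (s, e, p) in enumerate(arcs):
        line = f"J={j} S={s} E={e}"
        if p is not None:
            line += f" l={math.log(p):.2f}"
        lines.append(line)
    return "\n".join(lines) + "\n"

# Hypothetical two-word grammar: start -> (yes | no) -> end.
words = ["!NULL", "yes", "no", "!NULL"]
arcs = [(0, 1, 0.5), (0, 2, 0.5), (1, 3, None), (2, 3, None)]
slf_text = make_slf(words, arcs)
print(slf_text)
```

The two arcs leaving node 0 carry equal probabilities, so neither word is preferred; changing them is how a grammar network would express a preference.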

The dictionary, HMM set and word network are input to the HTK library module HNET, whose function is to generate an equivalent network of HMMs. Each word in the dictionary may have several pronunciations, in which case there will be one branch in the network for each alternative pronunciation. Each pronunciation may consist either of a list of phones or a list of HMM names. In the former case, HNET can optionally expand the HMM network to use either word-internal triphones or cross-word triphones. Once the HMM network has been constructed, it can be input to the decoder module HREC and used to recognise speech input. Note that HMM network construction is performed on-line at recognition time as part of the initialisation process.
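The expansion step can be pictured with a small Python sketch. It is not HNET's implementation: the functions `word_internal_triphones` and `expand_word`, the toy dictionary and its phone symbols are all hypothetical, and the sketch ignores details such as silence models. It shows one branch being produced per pronunciation, with each phone renamed to a word-internal triphone in the l-p+r style, where context is never taken from a neighbouring word.

```python
def word_internal_triphones(phones):
    """Rename each phone to l-p+r form using only context inside the word."""
    names = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:                      # left context exists within the word
            name = f"{phones[i - 1]}-{name}"
        if i < len(phones) - 1:        # right context exists within the word
            name = f"{name}+{phones[i + 1]}"
        names.append(name)
    return names

def expand_word(word, dictionary):
    """Return one branch (a list of model names) per pronunciation."""
    return [word_internal_triphones(pron) for pron in dictionary[word]]

# Hypothetical dictionary: "the" has two alternative pronunciations.
dictionary = {
    "the": [["dh", "ax"], ["dh", "iy"]],
    "cat": [["k", "ae", "t"]],
}

the_branches = expand_word("the", dictionary)
cat_branches = expand_word("cat", dictionary)
print(the_branches)
print(cat_branches)
```

Cross-word expansion differs in that the first and last phones of each word would also take context from the adjacent words, so the models at word boundaries depend on which words are connected in the network.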

  [Fig. 11.1: overview of the recognition process]

For convenience, HTK provides a recognition tool called HVITE to allow the functions provided by HNET and HREC to be invoked from the command line. HVITE is particularly useful for running experimental evaluations on test speech stored in disk files and for basic testing using live audio input. However, application developers should note that HVITE is just a shell containing calls to load the word network, dictionary and models, to generate the recognition network, and then to recognise each input utterance in turn. For embedded applications, it may well be appropriate to dispense with HVITE and call the functions in HNET and HREC directly from the application. The use of HVITE is explained in the next chapter.
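The shell structure described above amounts to "load once, build once, then loop over utterances". The Python sketch below shows only that control flow. Every name in it (`run_recogniser`, `load_resources`, `build_network`, `recognise`, the file names) is a hypothetical stand-in and does not correspond to the actual HNET or HREC function interfaces, which an embedded application would call instead.

```python
def run_recogniser(load_resources, build_network, recognise, utterances):
    """Load resources, build the recognition network, then decode each input."""
    word_network, dictionary, hmms = load_resources()
    # Network construction happens once, as part of initialisation.
    recognition_network = build_network(word_network, dictionary, hmms)
    # The same network is then reused for every utterance.
    return [recognise(recognition_network, u) for u in utterances]

# Stand-in callables so the sketch runs; they only record the call order.
calls = []

def load_resources():
    calls.append("load")
    return "wordnet", "dict", "hmms"

def build_network(word_network, dictionary, hmms):
    calls.append("build")
    return (word_network, dictionary, hmms)

def recognise(recognition_network, utterance):
    calls.append("recognise")
    return f"transcript of {utterance}"

results = run_recogniser(load_resources, build_network, recognise,
                         ["utt1.wav", "utt2.wav"])
print(calls)
print(results)
```

Passing the three stages in as callables keeps the loop independent of how each stage is implemented, which mirrors the point that an application can replace the shell while reusing the underlying modules.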



ECRL HTK_V2.1: email [email protected]