HTK provides a single recognition tool called HVITE which uses the token passing algorithm described in the previous chapter to perform Viterbi-based speech recognition. HVITE takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance. Recognition can then be performed on either a list of stored speech files or on direct audio input. As noted at the end of the last chapter, HVITE can support cross-word triphones and it can run with multiple tokens to generate lattices containing multiple hypotheses. It can also be configured to rescore lattices and perform forced alignments.
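The conversion from a word network to a phone network can be illustrated with a minimal Python sketch. It is not HVITE's implementation: the arc-labelled graph, the toy dictionary and the function name `expand_to_phone_network` are all assumptions made for illustration, only one pronunciation per word is handled, and no cross-word context expansion is attempted.

```python
from typing import Dict, List, Tuple

WordArc = Tuple[int, int, str]   # (from_node, to_node, word)
PhoneArc = Tuple[int, int, str]  # (from_node, to_node, phone)


def expand_to_phone_network(
    word_arcs: List[WordArc],
    dictionary: Dict[str, List[str]],
    num_nodes: int,
) -> List[PhoneArc]:
    """Replace every word arc by a chain of phone arcs.

    New intermediate nodes are numbered from num_nodes upwards.
    """
    phone_arcs: List[PhoneArc] = []
    next_node = num_nodes
    for start, end, word in word_arcs:
        phones = dictionary[word]
        if not phones:
            raise ValueError(f"empty pronunciation for {word!r}")
        prev = start
        for i, phone in enumerate(phones):
            if i == len(phones) - 1:
                target = end           # last phone rejoins the word network
            else:
                target = next_node     # fresh node between two phones
                next_node += 1
            phone_arcs.append((prev, target, phone))
            prev = target
    return phone_arcs


# Hypothetical two-word task: node 0 is the start, node 1 the end.
TOY_DICT = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
TOY_ARCS = [(0, 1, "yes"), (0, 1, "no")]
PHONE_ARCS = expand_to_phone_network(TOY_ARCS, TOY_DICT, num_nodes=2)
```

In a real recognizer each phone instance would then be bound to its HMM definition; here the phone label simply stands in for that model.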
The word networks needed to drive HVITE are usually either simple word loops, in which any word can follow any other word, or directed graphs representing a finite-state task grammar. In the former case, bigram probabilities are normally attached to the word transitions. Word networks are stored using the HTK standard lattice format. This is a text-based format, so word networks can be created directly using a text editor. However, this is rather tedious, and HTK therefore provides two tools to assist in creating word networks. Firstly, HBUILD allows sub-networks to be created and used within higher level networks. Hence, although the same low level notation is used, much duplication is avoided. HBUILD can also generate word loops, and it can read in a backed-off bigram language model and modify the word loop transitions to incorporate the bigram probabilities. Note that the label statistics tool HLSTATS mentioned earlier can be used to generate a backed-off bigram language model.
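As a rough illustration of attaching backed-off bigram probabilities to a word loop, the following Python sketch assigns a log probability to every word-to-word transition. It assumes the usual back-off rule (use the bigram probability when one is stored, otherwise the back-off weight of the previous word times the unigram probability of the next word). The function name and the toy probabilities are hypothetical, and the sketch does not read or write the HTK lattice format.

```python
import math
from typing import Dict, Tuple


def word_loop_transitions(
    unigram: Dict[str, float],
    bigram: Dict[Tuple[str, str], float],
    backoff: Dict[str, float],
) -> Dict[Tuple[str, str], float]:
    """Return log P(next | prev) for every transition in a word loop."""
    arcs: Dict[Tuple[str, str], float] = {}
    for prev in unigram:
        for nxt in unigram:
            if (prev, nxt) in bigram:
                prob = bigram[(prev, nxt)]            # explicit bigram
            else:
                prob = backoff[prev] * unigram[nxt]   # back off to unigram
            arcs[(prev, nxt)] = math.log(prob)
    return arcs


# Toy model (assumed values, chosen so each row sums to one).
UNIGRAM = {"yes": 0.5, "no": 0.5}
BIGRAM = {("yes", "no"): 0.8}
BACKOFF = {"yes": 0.4, "no": 1.0}
LOOP = word_loop_transitions(UNIGRAM, BIGRAM, BACKOFF)
```

Every word can follow every other word, so a vocabulary of V words yields V squared transitions; only the probabilities distinguish likely from unlikely sequences.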
As an alternative to specifying a word network directly, a higher level grammar notation can be used. This notation is based on the Extended Backus Naur Form (EBNF) used in compiler specification and it is compatible with the grammar specification language used in earlier versions of HTK. The tool HPARSE is supplied to convert this notation into the equivalent word network.
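The idea of compiling a grammar into an equivalent word network can be sketched in Python as follows. This is not HPARSE and does not parse its notation: the grammar is given as a hypothetical nested-tuple tree with sequence, alternative and one-or-more repetition operators, and the output is a plain arc list in which a label of `None` marks a word-free linking arc.

```python
from itertools import count
from typing import Iterator, List, Optional, Tuple

Arc = Tuple[int, int, Optional[str]]  # label None marks a null arc


def compile_grammar(expr, arcs: List[Arc], ids: Iterator[int]) -> Tuple[int, int]:
    """Append arcs for expr and return its (start_node, end_node)."""
    kind, body = expr
    if kind == "word":
        start, end = next(ids), next(ids)
        arcs.append((start, end, body))
        return start, end
    if kind == "seq":                      # items one after another
        start, end = compile_grammar(body[0], arcs, ids)
        for sub in body[1:]:
            sub_start, sub_end = compile_grammar(sub, arcs, ids)
            arcs.append((end, sub_start, None))
            end = sub_end
        return start, end
    if kind == "alt":                      # exactly one of the items
        start, end = next(ids), next(ids)
        for sub in body:
            sub_start, sub_end = compile_grammar(sub, arcs, ids)
            arcs.append((start, sub_start, None))
            arcs.append((sub_end, end, None))
        return start, end
    if kind == "rep":                      # one or more repetitions
        start, end = compile_grammar(body, arcs, ids)
        arcs.append((end, start, None))
        return start, end
    raise ValueError(f"unknown grammar node {kind!r}")


# Toy task: the word CALL followed by either ALICE or BOB.
GRAMMAR = ("seq", [("word", "CALL"),
                   ("alt", [("word", "ALICE"), ("word", "BOB")])])
ARCS: List[Arc] = []
START, END = compile_grammar(GRAMMAR, ARCS, count())
```

The construction keeps each operator local, which is why a compact grammar can stand in for a much longer hand-written network.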
Whichever method is chosen to generate a word network, it is useful to be able to see examples of the language that it defines. The tool HSGEN is provided to do this. It takes a network as input and randomly traverses it, outputting word strings. These strings can then be inspected to ensure that they correspond to what is required. HSGEN can also compute the empirical perplexity of the task.
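Random traversal and an empirical perplexity estimate can be sketched in Python as below. The network representation, the function names and the toy task are assumptions, and the estimate shown (two raised to the average negative log2 probability per generated word) is one common definition; the exact quantity reported by HSGEN may differ in detail, for example in how sentence ends are counted.

```python
import math
import random
from typing import Dict, List, Tuple

# node -> list of (word, next_node, probability); a node with no arcs is final
Network = Dict[str, List[Tuple[str, str, float]]]


def random_sentence(network: Network, start: str,
                    rng: random.Random) -> Tuple[List[str], float]:
    """Walk the network at random; return the words and log2 path probability."""
    node, words, log2p = start, [], 0.0
    while network.get(node):
        arcs = network[node]
        word, nxt, prob = rng.choices(arcs, weights=[a[2] for a in arcs])[0]
        words.append(word)
        log2p += math.log2(prob)
        node = nxt
    return words, log2p


def empirical_perplexity(network: Network, start: str,
                         num_sentences: int, seed: int = 0) -> float:
    """Estimate perplexity from randomly generated sentences."""
    rng = random.Random(seed)
    total_log2p, total_words = 0.0, 0
    for _ in range(num_sentences):
        words, log2p = random_sentence(network, start, rng)
        total_log2p += log2p
        total_words += len(words)
    return 2.0 ** (-total_log2p / total_words)


# Toy task: two equally likely choices at each of two positions.
TOY_NET: Network = {
    "start": [("call", "name", 0.5), ("dial", "name", 0.5)],
    "name": [("alice", "end", 0.5), ("bob", "end", 0.5)],
    "end": [],
}
SAMPLE, _ = random_sentence(TOY_NET, "start", random.Random(1))
PERPLEXITY = empirical_perplexity(TOY_NET, "start", num_sentences=100)
```

Because every choice in the toy network is between two equally likely words, the estimate equals the branching factor of two whatever paths are sampled.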
Finally, the construction of large dictionaries can involve merging several sources and performing a variety of transformations on each source. The dictionary management tool HDMAN is supplied to assist with this process.
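The merge-and-transform process can be illustrated with a short Python sketch. It does not use HDMAN or its edit scripts: the in-memory dictionary layout, the two example transformations and the name `merge_dictionaries` are all hypothetical, and earlier sources simply take priority in pronunciation order.

```python
from typing import Callable, Dict, List

Pron = List[str]
Dictionary = Dict[str, List[Pron]]
Transform = Callable[[Pron], Pron]


def strip_stress_and_lower(pron: Pron) -> Pron:
    """Drop trailing stress digits (e.g. EH1 -> eh) and lower-case phones."""
    return [phone.rstrip("012").lower() for phone in pron]


def identity(pron: Pron) -> Pron:
    return list(pron)


def merge_dictionaries(sources: List[Dictionary],
                       transforms: List[Transform]) -> Dictionary:
    """Apply one transform per source, then merge, skipping duplicates."""
    merged: Dictionary = {}
    for source, transform in zip(sources, transforms):
        for word, prons in source.items():
            entry = merged.setdefault(word.upper(), [])
            for pron in prons:
                new_pron = transform(pron)
                if new_pron not in entry:   # keep each pronunciation once
                    entry.append(new_pron)
    return dict(sorted(merged.items()))     # alphabetical word order


# Two toy sources using different phone conventions.
SOURCE_A: Dictionary = {"yes": [["Y", "EH1", "S"]]}
SOURCE_B: Dictionary = {"YES": [["y", "eh", "s"]], "no": [["n", "ow"]]}
MERGED = merge_dictionaries([SOURCE_A, SOURCE_B],
                            [strip_stress_and_lower, identity])
```

Normalizing each source before merging is what allows the duplicate pronunciation of YES to be detected and dropped.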