As indicated in the introduction above, the basic operation of the HTK training tools involves reading in a set of one or more HMM definitions, and then using speech data to estimate the parameters of these definitions. The speech data files are normally stored in parameterised form such as LPC or MFCC parameters. However, additional parameters such as delta coefficients are normally computed on-the-fly whilst loading each file.
In fact, it is also possible to use waveform data directly by performing the full parameter conversion on-the-fly. Which approach is preferred depends on the available computing resources. The advantages of storing the data already encoded are that the data is more compact in parameterised form and pre-encoding avoids wasting compute time converting the data each time that it is read in. However, if the training data is derived from CD-ROMS and they can be accessed automatically on-line, then the extra compute may be worth the saving in magnetic disk storage.
The methods for configuring speech data input to HTK tools were described in detail in chapter 5. All of the various input mechanisms are supported by the HTK training tools except direct audio input.
The precise way in which the training tools are used depends on the type of HMM system to be built and the form of the available training data. Furthermore, HTK tools are designed to interface cleanly to each other, so a large number of configurations are possible. In practice, however, HMM-based speech recognisers are either whole-word or sub-word.
As the name suggests, whole word modelling refers to a technique whereby each individual word in the system vocabulary is modelled by a single HMM. As shown in Fig. 8.1, whole word HMMs are most commonly trained on examples of each word spoken in isolation. If these training examples, which are often called tokens, have had leading and trailing silence removed, then they can be input directly into the training tools without the need for any label information. The most common method of building whole word HMMs is to firstly use HINIT to calculate initial parameters for the model and then use HREST to refine the parameters using Baum-Welch re-estimation. Where there is limited training data and recognition in adverse noise environments is needed, so-called fixed variance models can offer improved robustness. These are models in which all the variances are set equal to the global speech variance and never subsequently re-estimated. The tool HCOMPV can be used to compute this global variance.
Although HTK gives full support for building whole-word HMM systems, the bulk of its facilities are focussed on building sub-word systems in which the basic units are the individual sounds of the language called phones. One HMM is constructed for each such phone and continuous speech is recognised by joining the phones together to make any required vocabulary using a pronunciation dictionary.
The basic procedures involved in training a set of subword models are shown in Fig. 8.2. The core process involves the embedded training tool HEREST . HEREST uses continuously spoken utterances as its source of training data and simultaneously re-estimates the complete set of subword HMMs. For each input utterance, HEREST needs a transcription i.e. a list of the phones in that utterance. HEREST then joins together all of the subword HMMs corresponding to this phone list to make a single composite HMM. This composite HMM is used to collect the necessary statistics for the re-estimation. When all of the training utterances have been processed, the total set of accumulated statistics are used to re-estimate the parameters of all of the phone HMMs. It is important to emphasise that in the above process, the transcriptions are only needed to identify the sequence of phones in each utterance. No phone boundary information is needed.
The initialisation of a set of phone HMMs prior to embedded re-estimation using HEREST can be achieved in two different ways. As shown on the left of Fig. 8.2, a small set of hand-labelled bootstrap training data can be used along with the isolated training tools HINIT and HREST to initialise each phone HMM individually. When used in this way, both HINIT and HREST use the label information to extract all the segments of speech corresponding to the current phone HMM in order to perform isolated word training.
A simpler initialisation procedure uses HCOMPV to assign the global speech mean and variance to every Gaussian distribution in every phone HMM. This so-called flat start procedure implies that during the first cycle of embedded re-estimation, each training utterance will be uniformly segmented. The hope then is that enough of the phone models align with actual realisations of that phone so that on the second and subsequent iterations, the models align as intended.
One of the major problems to be faced in building any HMM-based system is that the amount of training data for each model will be variable and is rarely sufficient. To overcome this, HTK allows a variety of sharing mechanisms to be implemented whereby HMM parameters are tied together so that the training data is pooled and more robust estimates result. These tyings, along with a variety of other manipulations, are performed using the HTK HMM editor HHED. The use of HHED is described in a later chapter. Here it is sufficient to note that a phone-based HMM set typically goes through several refinement cycles of editing using HHED followed by parameter re-estimation using HEREST before the final model set is obtained.
Having described in outline the main training strategies, each of the above procedures will be described in more detail.