Let each spoken word be represented by a sequence of speech vectors or observations $O$, defined as
\[
  O = o_1, o_2, \ldots, o_T \tag{1.1}
\]
where $o_t$ is the speech vector observed at time $t$. The isolated word recognition problem can then be regarded as that of computing
\[
  \arg\max_i \left\{ P(w_i \mid O) \right\} \tag{1.2}
\]
where $w_i$ is the $i$'th vocabulary word. This probability is not computable directly but using Bayes' Rule gives
\[
  P(w_i \mid O) = \frac{P(O \mid w_i)\, P(w_i)}{P(O)} \tag{1.3}
\]
Thus, for a given set of prior probabilities $P(w_i)$, the most probable spoken word depends only on the likelihood $P(O \mid w_i)$. Given the dimensionality of the observation sequence $O$, the direct estimation of the joint conditional probability $P(o_1, o_2, \ldots \mid w_i)$ from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities $P(O \mid w_i)$ is replaced by the much simpler problem of estimating the Markov model parameters.
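As a concrete illustration of this decision rule, the sketch below picks the word maximising $P(O \mid w_i)\,P(w_i)$; since $P(O)$ is common to all words it can be ignored. The likelihood and prior values are invented purely for illustration, not taken from any real system.

```python
# Hypothetical likelihoods P(O | w_i) for a 3-word vocabulary, e.g. as
# obtained from per-word models, and hypothetical priors P(w_i).
likelihoods = [1e-42, 5e-40, 2e-41]   # P(O | w_i)  (made-up values)
priors = [0.5, 0.3, 0.2]              # P(w_i)      (made-up values)

# P(O) is the same for every word, so arg max_i P(w_i | O) reduces to
# arg max_i P(O | w_i) * P(w_i).
posteriors = [l * p for l, p in zip(likelihoods, priors)]
best_word = posteriors.index(max(posteriors))
```

Here `best_word` is 1: the second word wins because its much larger likelihood outweighs its smaller prior.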
In HMM based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model is a finite state machine which changes state once every time unit, and each time $t$ that a state $j$ is entered, a speech vector $o_t$ is generated from the probability density $b_j(o_t)$. Furthermore, the transition from state $i$ to state $j$ is also probabilistic and is governed by the discrete probability $a_{ij}$. Fig. 1.3 shows an example of this process where the six-state model moves through the state sequence $X = 1,2,2,3,4,4,5,6$ in order to generate the sequence $o_1$ to $o_6$. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.
The joint probability that $O$ is generated by the model $M$ moving through the state sequence $X$ is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence $X$ in Fig. 1.3
\[
  P(O, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \ldots \tag{1.4}
\]
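The product in equation 1.4 can be written out directly. The sketch below does this for a toy model with four states (a non-emitting entry state, two emitting states, and a non-emitting exit state) and invented probabilities; for simplicity the outputs are discrete symbols rather than continuous densities $b_j(o_t)$, but the bookkeeping is identical.

```python
def joint_prob(a, b, states, obs):
    """P(O, X | M) for a known state sequence (the product in eq. 1.4).
    a[i][j] = transition probability a_ij; b[j][o] = output probability of
    discrete symbol o in emitting state j. State 0 is the non-emitting
    entry state and state len(a)-1 the non-emitting exit state."""
    p = a[0][states[0]]                      # entry transition a_{x(0)x(1)}
    for t, (j, o) in enumerate(zip(states, obs)):
        p *= b[j][o]                         # output term b_{x(t)}(o_t)
        nxt = states[t + 1] if t + 1 < len(states) else len(a) - 1
        p *= a[j][nxt]                       # transition a_{x(t)x(t+1)}
    return p

# Toy model: state 0 = entry, states 1-2 emit, state 3 = exit.
# All probabilities are made up for illustration.
a = [[0.0, 0.6, 0.4, 0.0],
     [0.0, 0.5, 0.3, 0.2],
     [0.0, 0.0, 0.6, 0.4],
     [0.0, 0.0, 0.0, 0.0]]
b = {1: [0.8, 0.2], 2: [0.3, 0.7]}   # two discrete output symbols: 0 and 1
```

For example, `joint_prob(a, b, [1, 2], [0, 1])` multiplies out $a_{01}\, b_1(o_1)\, a_{12}\, b_2(o_2)\, a_{23}$.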
However, in practice, only the observation sequence $O$ is known and the underlying state sequence $X$ is hidden. This is why it is called a Hidden Markov Model.
Given that $X$ is unknown, the required likelihood is computed by summing over all possible state sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is
\[
  P(O \mid M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \tag{1.5}
\]
where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is constrained to be the model exit state.
As an alternative to equation 1.5, the likelihood can be approximated by only considering the most likely state sequence, that is
\[
  \hat{P}(O \mid M) = \max_{X} \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \right\} \tag{1.6}
\]
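Equations 1.5 and 1.6 differ only in whether the per-path products are summed or maximised. The brute-force sketch below makes that explicit on a toy discrete-output model (all probabilities invented); it enumerates every state sequence, which is exponential in $T$ and only viable for tiny examples.

```python
from itertools import product

def path_likelihood(a, b, obs, agg=sum):
    """Brute-force evaluation of eq. 1.5 (agg=sum) or eq. 1.6 (agg=max).
    State 0 is the non-emitting entry state, state len(a)-1 the exit state;
    b[j][o] is the output probability of discrete symbol o in state j."""
    N = len(a)
    scores = []
    for path in product(range(1, N - 1), repeat=len(obs)):
        p = a[0][path[0]]                    # a_{x(0)x(1)}
        for t, (j, o) in enumerate(zip(path, obs)):
            nxt = path[t + 1] if t + 1 < len(path) else N - 1
            p *= b[j][o] * a[j][nxt]         # b_{x(t)}(o_t) a_{x(t)x(t+1)}
        scores.append(p)
    return agg(scores)

# Toy model: entry state 0, emitting states 1-2, exit state 3 (made-up values).
a = [[0.0, 0.6, 0.4, 0.0],
     [0.0, 0.5, 0.3, 0.2],
     [0.0, 0.0, 0.6, 0.4],
     [0.0, 0.0, 0.0, 0.0]]
b = {1: [0.8, 0.2], 2: [0.3, 0.7]}
```

For the observation sequence `[0, 1]`, summing over the four possible paths gives the full likelihood, while `agg=max` keeps only the single best path's contribution, which is why equation 1.6 is a lower bound on equation 1.5.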
Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive procedures exist which allow both quantities to be calculated very efficiently. Before going any further, however, notice that if equation 1.2 is computable then the recognition problem is solved. Given a set of models $M_i$ corresponding to words $w_i$, equation 1.2 is solved by using 1.3 and assuming that
\[
  P(O \mid w_i) = P(O \mid M_i) \tag{1.7}
\]
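One such recursive procedure is the forward recursion, sketched below for the same kind of toy discrete-output model (probabilities invented for illustration). It computes equation 1.5 in $O(TN^2)$ operations by propagating partial path probabilities $\alpha$ forwards in time, instead of enumerating the exponentially many state sequences; the analogous recursion for equation 1.6 simply replaces the sum with a max.

```python
def forward_likelihood(a, b, obs):
    """P(O | M) via the forward recursion (computes eq. 1.5 efficiently).
    State 0 = non-emitting entry, state len(a)-1 = non-emitting exit;
    b[j][o] is the output probability of discrete symbol o in state j."""
    N = len(a)
    # alpha[j] = P(o_1..o_t, x(t) = j | M); only emitting states are nonzero
    alpha = [a[0][j] * b[j][obs[0]] if 1 <= j <= N - 2 else 0.0
             for j in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(N)) * b[j][o]
                 if 1 <= j <= N - 2 else 0.0
                 for j in range(N)]
    # absorb the final transitions into the exit state: a_{x(T)x(T+1)}
    return sum(alpha[j] * a[j][N - 1] for j in range(N))

# Toy model: entry state 0, emitting states 1-2, exit state 3 (made-up values).
a = [[0.0, 0.6, 0.4, 0.0],
     [0.0, 0.5, 0.3, 0.2],
     [0.0, 0.0, 0.6, 0.4],
     [0.0, 0.0, 0.0, 0.0]]
b = {1: [0.8, 0.2], 2: [0.3, 0.7]}
```

For this model `forward_likelihood(a, b, [0, 1])` agrees exactly with the brute-force sum over all four paths, but the work grows only linearly with the length of the observation sequence.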
All this, of course, assumes that the parameters $\{a_{ij}\}$ and $\{b_j(o_t)\}$ are known for each model $M_i$. Herein lies the elegance and power of the HMM framework. Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each word can be collected, a HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. Fig. 1.4 summarises the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word using a number of examples of that word. In this case, the vocabulary consists of just three words: ``one'', ``two'' and ``three''. Secondly, to recognise some unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.
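The recognition step just described then reduces to a max over per-model scores. In the sketch below the likelihoods $P(O \mid M_i)$ are invented placeholders; a real system would compute them with the forward or Viterbi recursion, and would work in the log domain because likelihoods of long observation sequences underflow floating point.

```python
import math

# Hypothetical likelihoods P(O | M_i) for models trained on the words
# "one", "two" and "three"; the values are invented for illustration.
scores = {"one": 3.1e-45, "two": 8.7e-43, "three": 2.4e-44}

# Work with log likelihoods to avoid underflow; the max is unchanged
# because log is monotonic. With equal priors this is eq. 1.2 via eq. 1.7.
log_scores = {w: math.log(p) for w, p in scores.items()}
recognised = max(log_scores, key=log_scores.get)
```

Here `recognised` is `"two"`, the word whose model is most likely to have generated the unknown observation sequence.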