
1.2 Isolated Word Recognition

 

Let each spoken word be represented by a sequence of speech vectors or observations $\mathbf{O}$, defined as

\[ \mathbf{O} = \mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_T \tag{1.1} \]

where $\mathbf{o}_t$ is the speech vector observed at time $t$. The isolated word recognition problem can then be regarded as that of computing

\[ \arg\max_i \left\{ P(w_i \mid \mathbf{O}) \right\} \tag{1.2} \]

where $w_i$ is the $i$'th vocabulary word. This probability is not computable directly but using Bayes' Rule gives

\[ P(w_i \mid \mathbf{O}) = \frac{P(\mathbf{O} \mid w_i)\, P(w_i)}{P(\mathbf{O})} \tag{1.3} \]

Thus, for a given set of prior probabilities $P(w_i)$, the most probable spoken word depends only on the likelihood $P(\mathbf{O} \mid w_i)$. Given the dimensionality of the observation sequence $\mathbf{O}$, the direct estimation of the joint conditional probability $P(\mathbf{o}_1, \mathbf{o}_2, \ldots \mid w_i)$ from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities $P(\mathbf{O} \mid w_i)$ is replaced by the much simpler problem of estimating the Markov model parameters.
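To make this decision rule concrete, the short Python sketch below picks the word that maximises the numerator of equation 1.3, $P(\mathbf{O} \mid w_i)\, P(w_i)$. The priors and likelihoods are invented for illustration; in practice the likelihoods are produced by the HMMs described next.

# Bayes decision rule of equation 1.3: choose the word w_i that maximises
# P(O|w_i) P(w_i).  All numbers here are invented for illustration.
priors      = {"one": 0.4, "two": 0.35, "three": 0.25}      # P(w_i)
likelihoods = {"one": 1e-42, "two": 3e-41, "three": 5e-43}  # P(O|w_i) (assumed given)

best_word = max(priors, key=lambda w: likelihoods[w] * priors[w])
print(best_word)                                            # -> two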

[Figure 1.2: Isolated Word Problem]

In HMM based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model is a finite state machine which changes state once every time unit, and each time $t$ that a state $j$ is entered, a speech vector $\mathbf{o}_t$ is generated from the probability density $b_j(\mathbf{o}_t)$. Furthermore, the transition from state $i$ to state $j$ is also probabilistic and is governed by the discrete probability $a_{ij}$. Fig. 1.3 shows an example of this process where the six state model moves through the state sequence $X = 1, 2, 2, 3, 4, 4, 5, 6$ in order to generate the sequence $\mathbf{o}_1$ to $\mathbf{o}_6$. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.

The joint probability that $\mathbf{O}$ is generated by the model $M$ moving through the state sequence $X$ is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence $X$ in Fig. 1.3

\[ P(\mathbf{O}, X \mid M) = a_{12}\, b_2(\mathbf{o}_1)\, a_{22}\, b_2(\mathbf{o}_2)\, a_{23}\, b_3(\mathbf{o}_3) \ldots \tag{1.4} \]
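As a concrete illustration of equation 1.4, the following Python sketch evaluates the product of transition and output probabilities along the state sequence of Fig. 1.3 for a toy six-state model. The transition probabilities, the one-dimensional Gaussian output densities and the observation values are all invented; only the structure (non-emitting entry and exit states, one transition and one output term per emitted vector) follows the figure.

import math

# Toy six-state model in the style of Fig. 1.3: states 1 and 6 are
# non-emitting, states 2-5 emit.  All parameter values are invented.
a = {(1, 2): 1.0,
     (2, 2): 0.6, (2, 3): 0.4,
     (3, 3): 0.5, (3, 4): 0.5,
     (4, 4): 0.7, (4, 5): 0.3,
     (5, 5): 0.5, (5, 6): 0.5}                 # transition probabilities a_ij
means = {2: 0.0, 3: 1.0, 4: 2.0, 5: 3.0}       # one Gaussian mean per emitting state

def b(j, o, var=1.0):
    """Scalar Gaussian output density b_j(o) for emitting state j."""
    return math.exp(-0.5 * (o - means[j]) ** 2 / var) / math.sqrt(2 * math.pi * var)

O = [0.1, -0.2, 0.9, 1.8, 2.1, 3.0]            # observations o_1 ... o_6 (scalars)
X = [1, 2, 2, 3, 4, 4, 5, 6]                   # state sequence x(0) ... x(7) of Fig. 1.3

# Equation 1.4: one transition and one output probability per observation,
# plus the final transition into the exit state.
p = 1.0
for t, o_t in enumerate(O, start=1):
    p *= a[(X[t - 1], X[t])] * b(X[t], o_t)
p *= a[(X[len(O)], X[len(O) + 1])]
print(p)                                       # P(O, X | M)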

However, in practice, only the observation sequence $\mathbf{O}$ is known and the underlying state sequence $X$ is hidden. This is why it is called a Hidden Markov Model.

[Figure 1.3: The Markov Generation Model]

Given that $X$ is unknown, the required likelihood is computed by summing over all possible state sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is

\[ P(\mathbf{O} \mid M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(\mathbf{o}_t)\, a_{x(t)x(t+1)} \tag{1.5} \]

where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is constrained to be the model exit state.
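For a toy model, equation 1.5 can be evaluated literally by enumerating every possible state sequence and summing the corresponding path probabilities, as in the Python sketch below (all parameters invented). This brute-force approach is feasible only for very small models and short observation sequences.

import itertools
import math

# Literal evaluation of equation 1.5: enumerate every state sequence
# x(1)...x(T) over the emitting states and sum the path probabilities.
ENTRY, EXIT = 0, 3                     # non-emitting entry and exit states
EMITTING = [1, 2]                      # emitting states

a = {(0, 1): 1.0,
     (1, 1): 0.6, (1, 2): 0.4,
     (2, 2): 0.5, (2, 3): 0.5}         # transition probabilities a_ij (invented)
means = {1: 0.0, 2: 1.0}               # one Gaussian mean per emitting state

def b(j, o, var=1.0):
    """Scalar Gaussian output density b_j(o)."""
    return math.exp(-0.5 * (o - means[j]) ** 2 / var) / math.sqrt(2 * math.pi * var)

O = [0.2, 0.1, 0.9, 1.1]               # observation sequence o_1 ... o_T

total = 0.0
for X in itertools.product(EMITTING, repeat=len(O)):
    path = (ENTRY,) + X + (EXIT,)      # x(0), x(1), ..., x(T), x(T+1)
    p = 1.0
    for t, o_t in enumerate(O, start=1):
        p *= a.get((path[t - 1], path[t]), 0.0) * b(path[t], o_t)
    p *= a.get((path[len(O)], path[len(O) + 1]), 0.0)
    total += p
print(total)                           # P(O | M) of equation 1.5

Replacing the sum over the enumerated paths by a maximum gives the approximation introduced next.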

As an alternative to equation 1.5, the likelihood can be approximated by only considering the most likely state sequence, that is

\[ \hat{P}(\mathbf{O} \mid M) = \max_X \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(\mathbf{o}_t)\, a_{x(t)x(t+1)} \right\} \tag{1.6} \]
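The maximisation in equation 1.6 does not require enumerating every path. The Python sketch below evaluates it for the same toy model as above using the standard dynamic-programming (Viterbi-style) recursion, in which phi[j] holds the probability of the best partial path ending in emitting state j.

import math

# Viterbi-style evaluation of equation 1.6 for the same toy model as above
# (entry state 0, emitting states 1 and 2, exit state 3; parameters invented).
a = {(0, 1): 1.0,
     (1, 1): 0.6, (1, 2): 0.4,
     (2, 2): 0.5, (2, 3): 0.5}
EMITTING = [1, 2]
means = {1: 0.0, 2: 1.0}

def b(j, o, var=1.0):
    return math.exp(-0.5 * (o - means[j]) ** 2 / var) / math.sqrt(2 * math.pi * var)

O = [0.2, 0.1, 0.9, 1.1]

# Initialise with the transition out of the entry state.
phi = {j: a.get((0, j), 0.0) * b(j, O[0]) for j in EMITTING}
# Recurse: extend the best partial path into each state by one observation.
for o_t in O[1:]:
    phi = {j: max(phi[i] * a.get((i, j), 0.0) for i in EMITTING) * b(j, o_t)
           for j in EMITTING}
# Terminate with the transition into the exit state.
p_hat = max(phi[i] * a.get((i, 3), 0.0) for i in EMITTING)
print(p_hat)                           # approximate likelihood of equation 1.6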

Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive procedures exist which allow both quantities to be calculated very efficiently. Before going any further, however, notice that if equation 1.2 is computable then the recognition problem is solved. Given a set of models $M_i$ corresponding to words $w_i$, equation 1.2 is solved by using 1.3 and assuming that

\[ P(\mathbf{O} \mid w_i) = P(\mathbf{O} \mid M_i) \tag{1.7} \]

All this, of course, assumes that the parameters $\{a_{ij}\}$ and $\{b_j(\mathbf{o}_t)\}$ are known for each model $M_i$. Herein lies the elegance and power of the HMM framework. Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each word can be collected, a HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. Fig. 1.4 summarises the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word using a number of examples of that word. In this case, the vocabulary consists of just three words: ``one'', ``two'' and ``three''. Secondly, to recognise some unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.

[Figure 1.4: Using HMMs for Isolated Word Recognition]
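A compact Python sketch of the recognition half of Fig. 1.4 is given below. Each word is represented by a toy model with invented parameters (in practice they would come from the re-estimation procedure mentioned above), the likelihood $P(\mathbf{O} \mid M_i)$ is computed with the standard forward recursion (an efficient way of evaluating equation 1.5), and the word whose model gives the largest value, weighted by its prior as in equation 1.3, is chosen.

import math

# Sketch of the recognition procedure of Fig. 1.4: one toy HMM per word,
# the forward recursion for P(O|M_i), and a Bayes decision over the words.
# All model parameters and the test utterance are invented for illustration.

def gaussian(o, mean, var=1.0):
    return math.exp(-0.5 * (o - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def forward_likelihood(model, O):
    """P(O|M) by the forward recursion: alpha[j] is the probability of the
    observations so far joint with the path ending in emitting state j."""
    a, means = model["a"], model["means"]
    emitting, exit_state = model["emitting"], model["exit"]
    alpha = {j: a.get((0, j), 0.0) * gaussian(O[0], means[j]) for j in emitting}
    for o_t in O[1:]:
        alpha = {j: sum(alpha[i] * a.get((i, j), 0.0) for i in emitting)
                    * gaussian(o_t, means[j])
                 for j in emitting}
    return sum(alpha[i] * a.get((i, exit_state), 0.0) for i in emitting)

def make_model(m1, m2):
    """Toy word model: entry state 0, emitting states 1-2, exit state 3."""
    return {"a": {(0, 1): 1.0, (1, 1): 0.5, (1, 2): 0.5, (2, 2): 0.5, (2, 3): 0.5},
            "means": {1: m1, 2: m2}, "emitting": [1, 2], "exit": 3}

models = {"one": make_model(0.0, 1.0),
          "two": make_model(2.0, 3.0),
          "three": make_model(4.0, 5.0)}
priors = {"one": 1 / 3, "two": 1 / 3, "three": 1 / 3}

O = [1.9, 2.1, 2.8, 3.2]               # unknown utterance (invented)
best = max(models, key=lambda w: forward_likelihood(models[w], O) * priors[w])
print(best)                            # -> two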

