Next: 11.9 Other Kinds of Recognition System Up: 11 Networks, Dictionaries and Language Models Previous: 11.7 Constructing a Dictionary

11.8 Word Network Expansion

Now that word networks and dictionaries have been explained, this section describes the conversion of word-level networks to model-based recognition networks. Referring again to Fig. 11.1, this expansion is performed automatically by the module HNET. By default, HNET attempts to infer the required expansion from the contents of the dictionary and the associated list of HMMs. However, five configuration parameters are supplied to apply more precise control where required: ALLOWCXTEXP, ALLOWXWRDEXP, FORCECXTEXP, FORCELEFTBI and FORCERIGHTBI.

The expansion proceeds in four stages.

Context definition
The first step is to determine how model names are constructed from the dictionary entries and whether cross-word context expansion should be performed. The dictionary is scanned and each distinct phone is classified as one of the following:
1. Context Free
  In this case, the phone is skipped when determining context. An example is a model (sp) for short pauses. This will typically be inserted at the end of every word pronunciation but since it tends to cover a very short segment of speech it should not block context-dependent effects in a cross-word triphone system.
2. Context Independent
  The phone only exists in context-independent form. A typical example would be a silence model (sil). HNET distinguishes sil from sp as follows: both appear in the HMM set only in context-independent form, but sil appears in the contexts of other phones whereas sp does not.
3. Context Dependent
  This classification depends on whether a phone appears in the context part of a model name and whether any context-dependent versions of the phone exist in the HMM set. Context-dependent phones are subject to model name expansion.
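The classification can be sketched in Python. This is an illustration only, not HNET's actual code. It assumes logical model names of the form l-p+r, and the helper names (split_name, classify_phones) and the exact ordering of the tests are hypothetical, inferred from the description above. The sp entry in the sample model list is added for illustration.

```python
def split_name(name):
    """Split a model name of the form 'l-p+r' into (left, base, right)."""
    left, base = name.split("-", 1) if "-" in name else (None, name)
    base, right = base.split("+", 1) if "+" in base else (base, None)
    return left, base, right


def classify_phones(dict_phones, hmm_names):
    """Label each dictionary phone using an assumed rule (see the lead-in)."""
    in_context = set()   # phones seen as a left or right context
    has_cd_form = set()  # phones with at least one context-dependent model
    for name in hmm_names:
        left, base, right = split_name(name)
        in_context.update(c for c in (left, right) if c is not None)
        if left is not None or right is not None:
            has_cd_form.add(base)

    labels = {}
    for phone in dict_phones:
        if phone in has_cd_form:
            labels[phone] = "context dependent"
        elif phone in in_context:
            labels[phone] = "context independent"   # e.g. sil
        else:
            labels[phone] = "context free"          # e.g. sp
    return labels


# Hypothetical model list: triphones for "bit" plus sil and sp.
hmm_names = ["sil-b+i", "b-i+t", "i-t+sil", "sil", "sp"]
labels = classify_phones(["b", "i", "t", "sil", "sp"], hmm_names)
```

Under these assumptions sil is labelled context independent because it occurs in the contexts of other models, while sp is labelled context free because it never does.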
Determination of network type
The default behaviour is to produce the simplest network possible. If the dictionary is closed (every phone name appears in the HMM list), then no expansion of phone names is performed, and the network is generated by straightforward substitution of each dictionary pronunciation for each word in the word network. If the dictionary is not closed but word-internal context expansion would find every required model in the HMM set, then word-internal context expansion is used. Otherwise, full cross-word context expansion is applied.
The determination of the network type can be modified by using the configuration parameters mentioned earlier. By default ALLOWCXTEXP is set true. If ALLOWCXTEXP is set false, then no expansion of phone names is performed and each phone corresponds to the model of the same name. The default value of ALLOWXWRDEXP is false thus preventing context expansion across word boundaries. This also limits the expansion of the phone labels in the dictionary to word internal contexts only. If FORCECXTEXP is set true, then context expansion will be performed. For example, if the HMM set contained all monophones, all biphones and all triphones, then given a monophone dictionary, the default behaviour of HNET would be to generate a monophone recognition network since the dictionary would be closed. However, if FORCECXTEXP is set true and ALLOWXWRDEXP is set false then word internal context expansion will be performed. If FORCECXTEXP is set true and ALLOWXWRDEXP is set true then full cross-word context expansion will be performed.
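The decision rules above can be summarised as a small Python function. This is a sketch under stated assumptions, not HNET's implementation: the function name and its two boolean inputs (whether the dictionary is closed, and whether word-internal expansion would find every model) are hypothetical. The text does not say which parameter wins if ALLOWCXTEXP is false while FORCECXTEXP is true, nor how the default value of ALLOWXWRDEXP interacts with the final default case, so the sketch simply follows the rules in the order they are stated.

```python
def choose_expansion(dict_closed, word_internal_ok,
                     allow_cxt_exp=True, allow_xwrd_exp=False,
                     force_cxt_exp=False):
    """Return 'none', 'word-internal' or 'cross-word' (illustrative only)."""
    if not allow_cxt_exp:
        # ALLOWCXTEXP false: each phone maps to the model of the same name.
        # Assumption: this takes priority over FORCECXTEXP.
        return "none"
    if force_cxt_exp:
        # FORCECXTEXP true: expansion is always performed.
        return "cross-word" if allow_xwrd_exp else "word-internal"
    # Default behaviour: the simplest network that works.
    if dict_closed:
        return "none"
    if word_internal_ok:
        return "word-internal"
    # Assumption: ALLOWXWRDEXP is not consulted here, following the
    # default rule exactly as stated in the text.
    return "cross-word"


# Monophone dictionary with an HMM set holding monophones, biphones
# and triphones: the dictionary is closed.
default_case = choose_expansion(dict_closed=True, word_internal_ok=True)
forced_internal = choose_expansion(True, True, force_cxt_exp=True)
forced_cross = choose_expansion(True, True, force_cxt_exp=True,
                                allow_xwrd_exp=True)
```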
Network expansion
Each word in the word network is transformed into a word-end node preceded by the sequence of model nodes corresponding to the word's pronunciation. For cross-word context expansion, the initial and final context-dependent phones (and any preceding or following context-independent ones) are duplicated as many times as necessary to cater for each different cross-word context. Each duplicated word-final phone is followed by a similarly duplicated word-end node. Null words are simply transformed into word-end nodes with no preceding model nodes.
Linking of models to network nodes
Each model node is linked to the corresponding HMM definition. In each case, the required HMM name is determined from the phone name and the surrounding context names. The algorithm used for this is:
1. Construct the context-dependent name and see if the corresponding model exists.
2. Construct the context-independent name and see if the corresponding model exists.
If the configuration variable ALLOWCXTEXP is false, step 1 is skipped, and if the configuration variable FORCECXTEXP is true, step 2 is skipped. If no matching model is found, an error is generated.
When the right context is a boundary or FORCELEFTBI is true, the context-dependent name takes the form of a left biphone, that is, the phone p with left context l becomes l-p. When the left context is a boundary or FORCERIGHTBI is true, the context-dependent name takes the form of a right biphone, that is, the phone p with right context r becomes p+r. Otherwise, the context-dependent name is a full triphone, that is, l-p+r. Context-free phones are skipped in this process so
```
	   sil aa r sp y uw sp sil
```
would be expanded as
```
	   sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil
```
assuming that sil is context-independent and sp is context-free. For word-internal systems, the context expansion can be further controlled via the configuration variable CFWORDBOUNDARY. When set true (the default), context-free phones are treated as word boundaries, so
```
	   aa r sp y uw sp
```
would be expanded to
```
	   aa+r aa-r sp y+uw y-uw sp
```
Setting CFWORDBOUNDARY false would produce
```
	   aa+r aa-r+y sp r-y+uw y-uw sp
```
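The naming rules and the two-step lookup can be sketched in Python as follows. This is a minimal illustration rather than HNET's code: the function names, argument names and the use of LookupError are hypothetical, the FORCELEFTBI and FORCERIGHTBI options are left out for brevity, and phones at either end of the sequence are treated as having a boundary context.

```python
def expand_phones(phones, context_free=(), context_independent=(),
                  cf_word_boundary=False):
    """Build context-dependent names for a phone sequence (illustrative)."""

    def neighbour(i, step):
        # Walk left (step=-1) or right (step=+1) to find the context phone.
        j = i + step
        while 0 <= j < len(phones):
            if phones[j] not in context_free:
                return phones[j]
            if cf_word_boundary:
                return None          # context-free phone acts as a boundary
            j += step                # otherwise skip it
        return None                  # ran off the end: boundary

    names = []
    for i, phone in enumerate(phones):
        if phone in context_free or phone in context_independent:
            names.append(phone)      # never expanded
            continue
        left, right = neighbour(i, -1), neighbour(i, +1)
        name = phone
        if left is not None:
            name = f"{left}-{name}"  # l-p
        if right is not None:
            name = f"{name}+{right}" # p+r or l-p+r
        names.append(name)
    return names


def resolve_model(cd_name, ci_name, hmm_set,
                  allow_cxt_exp=True, force_cxt_exp=False):
    """Step 1: try the context-dependent name. Step 2: try the plain name."""
    if allow_cxt_exp and cd_name in hmm_set:
        return cd_name
    if not force_cxt_exp and ci_name in hmm_set:
        return ci_name
    raise LookupError(f"no model for {cd_name!r} or {ci_name!r}")


cross_word = expand_phones("sil aa r sp y uw sp sil".split(),
                           context_free={"sp"}, context_independent={"sil"})
# ['sil', 'sil-aa+r', 'aa-r+y', 'sp', 'r-y+uw', 'y-uw+sil', 'sp', 'sil']

internal_cf = expand_phones("aa r sp y uw sp".split(),
                            context_free={"sp"}, cf_word_boundary=True)
# ['aa+r', 'aa-r', 'sp', 'y+uw', 'y-uw', 'sp']
```

Calling expand_phones on the second sequence with cf_word_boundary=False reproduces the last listing above, since sp is then skipped rather than treated as a boundary.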

Note that in practice the last two stages above (network expansion and linking of models to network nodes) actually proceed concurrently, so that for the first and last phone of context-dependent models, logical models which have the same underlying physical model can be merged.


Now that the expansion process has been described in some detail, some simple examples will help clarify it. All of these are based on the Bit-But word network illustrated in Fig. 11.2. Firstly, assume that the dictionary contains simple monophone pronunciations, that is

    bit        b  i  t 
    but        b  u  t
    start      sil
    end        sil

and the HMM set consists of just monophones

    b  i  t  u  sil

In this case, HNET will find a closed dictionary. There will be no expansion and it will directly generate the network shown in Fig. 11.8. In this figure, the rounded boxes represent model nodes and the square boxes represent word-end nodes.

Similarly, if the dictionary contained word-internal triphone pronunciations such as

    bit        b+i  b-i+t  i-t 
    but        b+u  b-u+t  u-t
    start      sil
    end        sil

and the HMM set contains all the required models

    b+i  b-i+t  i-t b+u  b-u+t  u-t  sil

then again HNET will find a closed dictionary and the network shown in Fig. 11.9 would be generated.
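The closed-dictionary test used in both examples amounts to checking that every phone name in every pronunciation appears in the HMM list. A minimal Python sketch follows. The function name and the representation of the dictionary as a mapping from words to phone lists are assumptions made for illustration.

```python
def is_closed(dictionary, hmm_list):
    """True if every phone name used in the dictionary is in the HMM list."""
    models = set(hmm_list)
    return all(phone in models
               for pron in dictionary.values()
               for phone in pron)


mono_dict = {
    "bit": ["b", "i", "t"],
    "but": ["b", "u", "t"],
    "start": ["sil"],
    "end": ["sil"],
}
mono_hmms = ["b", "i", "t", "u", "sil"]

tri_dict = {
    "bit": ["b+i", "b-i+t", "i-t"],
    "but": ["b+u", "b-u+t", "u-t"],
    "start": ["sil"],
    "end": ["sil"],
}
tri_hmms = ["b+i", "b-i+t", "i-t", "b+u", "b-u+t", "u-t", "sil"]

mono_closed = is_closed(mono_dict, mono_hmms)   # True
tri_closed = is_closed(tri_dict, tri_hmms)      # True
mixed_closed = is_closed(mono_dict, tri_hmms)   # False: "b" is not a model
```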


If, however, the dictionary contained just the simple monophone pronunciations, as in the first case above, but the HMM set contained just triphones, that is

    sil-b+i  t-b+i  b-i+t  i-t+sil  i-t+b  
    sil-b+u  t-b+u  b-u+t  u-t+sil  u-t+b  sil

then HNET would perform full cross-word expansion and generate the network shown in Fig. 11.10.
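The duplication of word-initial and word-final phones in this example can be sketched in Python. This is an illustration only. The function name is hypothetical, and the sets of possible left contexts (sil, t) and right contexts (sil, b) are assumptions read off from the triphone list above rather than computed from the word network. Pronunciations are assumed to contain at least two phones.

```python
def cross_word_variants(pron, left_contexts, right_contexts):
    """List the models needed for one word under cross-word expansion."""
    first, last = pron[0], pron[-1]
    # One copy of the word-initial phone per possible left context.
    initial = sorted(f"{l}-{first}+{pron[1]}" for l in left_contexts)
    # Word-internal phones need no duplication.
    internal = [f"{pron[i - 1]}-{pron[i]}+{pron[i + 1]}"
                for i in range(1, len(pron) - 1)]
    # One copy of the word-final phone per possible right context.
    final = sorted(f"{pron[-2]}-{last}+{r}" for r in right_contexts)
    return initial, internal, final


left = {"sil", "t"}    # assumed: start word, or the final t of bit/but
right = {"sil", "b"}   # assumed: end word, or the initial b of bit/but

bit = cross_word_variants(["b", "i", "t"], left, right)
# (['sil-b+i', 't-b+i'], ['b-i+t'], ['i-t+b', 'i-t+sil'])
but = cross_word_variants(["b", "u", "t"], left, right)
# (['sil-b+u', 't-b+u'], ['b-u+t'], ['u-t+b', 'u-t+sil'])
```

Together with sil, the names produced for bit and but are exactly the models in the HMM set listed above.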


Now suppose that, still using the simple monophone pronunciations, the HMM set contained all monophones, biphones and triphones. In this case, the default would be to generate the monophone network of Fig. 11.8. If FORCECXTEXP is true but ALLOWXWRDEXP is false, then the word-internal network of Fig. 11.9 would be generated. Finally, if both FORCECXTEXP and ALLOWXWRDEXP are true, then the cross-word network of Fig. 11.10 would be generated.


ECRL HTK_V2.1