Next: 11.9 Other Kinds of Recognition System
Up: 11 Networks, Dictionaries and Language Models
Previous: 11.7 Constructing a Dictionary
Now that word networks and dictionaries have been explained,
the conversion of word level networks
to model-based recognition networks will be described. Referring
again to Fig 11.1, this expansion
is performed automatically by the module HNET. By default,
HNET attempts to infer the required expansion from the
contents of the dictionary and the associated list of HMMs.
However, five configuration parameters are supplied to apply
more precise control where required:
ALLOWCXTEXP, ALLOWXWRDEXP, FORCECXTEXP, FORCELEFTBI and FORCERIGHTBI.
The expansion proceeds in four stages.
- Context definition
The first step is to determine how model
names are constructed from the dictionary entries and whether
cross-word context expansion should be performed.
The dictionary is scanned and each distinct phone is
classified as either
- Context Free
In this case, the phone is skipped when determining context.
An example is a model (sp) for short pauses.
This will typically be inserted at the end of every word
pronunciation but since it tends to cover a very short
segment of speech it should not block context-dependent
effects in a cross-word triphone system.
- Context Independent
The phone only exists in context-independent form. A typical
example would be a silence model (sil).
Note that the distinction that would be made by HNET between
sil and sp is that whilst both would
only appear in the HMM set
in context-independent form, sil would appear in the contexts
of other phones whereas sp would not.
- Context Dependent
This classification depends on whether the phone appears in the context
part of any model name and whether
any context-dependent versions of the phone exist in the HMM set.
Context-dependent phones will be subject to model name expansion.
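The three classes can be illustrated with a short Python sketch. It is one reading of the description above, not HNET's actual code: the function name classify_phones is hypothetical, and the decision rule (context-dependent versions exist, else appears in some context, else neither) is an assumption drawn from the sil/sp distinction in the text.

```python
def split_name(name):
    """Split a model name of the form l-p+r into (left, base, right).

    Missing contexts are returned as None.
    """
    left, rest = name.split("-", 1) if "-" in name else (None, name)
    base, right = rest.split("+", 1) if "+" in rest else (rest, None)
    return left, base, right


def classify_phones(dictionary_phones, hmm_names):
    """Label each dictionary phone using only the names in the HMM list."""
    in_context = set()   # phones seen as the l or r part of some model name
    has_cd = set()       # phones that have at least one context-dependent model
    for name in hmm_names:
        left, base, right = split_name(name)
        if left is not None or right is not None:
            has_cd.add(base)
        in_context.update(c for c in (left, right) if c is not None)

    classes = {}
    for p in dictionary_phones:
        if p in has_cd:
            classes[p] = "context-dependent"
        elif p in in_context:
            classes[p] = "context-independent"   # e.g. sil
        else:
            classes[p] = "context-free"          # e.g. sp
    return classes
```

Given an HMM list such as sil, sp, sil-aa+r, aa-r+y, r-y+uw and y-uw+sil, this sketch labels sil as context-independent, sp as context-free and the remaining phones as context-dependent.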
- Determination of network type
The default behaviour is to produce the simplest network
possible. If the dictionary is closed (every phone name appears
in the HMM list), then no expansion of phone names is performed.
The resulting network is generated by straightforward
substitution of each dictionary pronunciation for each
word in the word network. If the dictionary is not closed,
HNET checks whether word-internal context expansion
would find every required model in the HMM set. If it would,
word-internal context expansion is used.
Otherwise, full cross-word
context expansion is applied.
The determination of the network type can be modified by
using the configuration parameters mentioned earlier. By default
ALLOWCXTEXP is set true. If ALLOWCXTEXP is set false, then
no expansion of phone names is performed and each phone corresponds to the
model of the same name. The default value of ALLOWXWRDEXP is false, which
prevents context expansion across word boundaries and so limits the
expansion of the phone labels in the dictionary to word-internal contexts
only. If FORCECXTEXP is set true, then context expansion will be
performed. For example, if the HMM set contained all monophones, all biphones
and all triphones, then given a monophone dictionary, the default behaviour of
HNET would be to generate a monophone recognition network since the
dictionary would be closed. However, if FORCECXTEXP is set true and
ALLOWXWRDEXP is set false then word internal context expansion will
be performed. If FORCECXTEXP is set true and ALLOWXWRDEXP is
set true then full cross-word context expansion will be performed.
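The decision rules above can be summarised in a small Python sketch. It is illustrative only: choose_expansion and word_internal_names are hypothetical names, the dictionary is simplified to one pronunciation per word, and context-free phones are ignored. The text does not state how the inferred case behaves when cross-word expansion is needed while ALLOWXWRDEXP is false, so the sketch assumes it falls back to cross-word expansion, in line with the triphone-only Bit-But example given later in this section.

```python
def word_internal_names(phones):
    """Return (context-dependent name, monophone) pairs for one pronunciation.

    Contexts stop at the word edges, so the first phone has no left context
    and the last phone has no right context.
    """
    pairs = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:
            name = f"{phones[i - 1]}-{name}"
        if i < len(phones) - 1:
            name = f"{name}+{phones[i + 1]}"
        pairs.append((name, p))
    return pairs


def choose_expansion(dictionary, hmm_names, allow_cxt_exp=True,
                     allow_xwrd_exp=False, force_cxt_exp=False):
    """Return 'none', 'word-internal' or 'cross-word'."""
    if not allow_cxt_exp:
        return "none"                      # phone name == model name
    if force_cxt_exp:
        return "cross-word" if allow_xwrd_exp else "word-internal"
    prons = list(dictionary.values())
    # Closed dictionary: every phone name is itself in the HMM list.
    if all(p in hmm_names for pron in prons for p in pron):
        return "none"
    # Would word-internal expansion find a model for every phone?
    if all(cd in hmm_names or ci in hmm_names
           for pron in prons for cd, ci in word_internal_names(pron)):
        return "word-internal"
    # Assumption: fall back to cross-word expansion (see lead-in).
    return "cross-word"
```

For example, a monophone dictionary with an HMM list holding only word-internal triphones plus sil is not closed, but every word-internal name is found, so the sketch returns 'word-internal'.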
- Network expansion
Each word in the word network is transformed into a word-end
node preceded by the sequence of model nodes corresponding to
the word's pronunciation.
For cross-word context expansion, the initial and final
context-dependent phones (and any preceding/following context-independent
ones) are duplicated as many times as is necessary
to cater for each different cross-word
context. Each duplicated word-final phone is followed by
a similarly duplicated word-end node.
Null words are simply transformed into word-end nodes with
no preceding model nodes.
- Linking of models to network nodes
Each model node is linked to the corresponding HMM definition.
In each case, the required HMM model name is
determined from the phone name and the surrounding
context names. The algorithm used for this is:
- (a) Construct the context-dependent name and see if the
corresponding model exists.
- (b) Construct the context-independent name and see if the
corresponding model exists.
If the configuration variable ALLOWCXTEXP is false, step (a)
is skipped, and if the configuration variable FORCECXTEXP is true,
step (b) is skipped. If no matching model is found, an error is
generated. When the right context
is a boundary or FORCELEFTBI is true, then the
context-dependent name takes the form of a left biphone, that is,
the phone p with left context l becomes l-p.
When the left context
is a boundary or FORCERIGHTBI is true, then the
context-dependent name takes the form of a right biphone, that is,
the phone p with right context r becomes p+r.
Otherwise, the context-dependent name is a full triphone, that is,
l-p+r.
Context-free phones are skipped in this process so
sil aa r sp y uw sp sil
would be expanded as
sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil
assuming that sil is context-independent and sp is
context-free.
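The naming rules above can be sketched in Python as follows. This is a minimal illustration rather than HNET's implementation: expand_cross_word is a hypothetical name, the phone classes are passed in explicitly, and the lookup of each name in the HMM set (steps (a) and (b)) is omitted. Checking the left-biphone condition before the right-biphone condition is an assumption based on the order in which the text presents them.

```python
def expand_cross_word(phones, context_free, context_independent,
                      force_left_bi=False, force_right_bi=False):
    """Turn a phone sequence into context-dependent model names."""

    def neighbour(i, step):
        # Nearest phone in the given direction, skipping context-free
        # phones; None means a boundary was reached.
        j = i + step
        while 0 <= j < len(phones):
            if phones[j] not in context_free:
                return phones[j]
            j += step
        return None

    names = []
    for i, p in enumerate(phones):
        if p in context_free or p in context_independent:
            names.append(p)                 # name is left unchanged
            continue
        left, right = neighbour(i, -1), neighbour(i, +1)
        if right is None or force_left_bi:
            # Left biphone l-p (plain p if there is no left context either).
            names.append(f"{left}-{p}" if left is not None else p)
        elif left is None or force_right_bi:
            names.append(f"{p}+{right}")    # right biphone p+r
        else:
            names.append(f"{left}-{p}+{right}")  # full triphone l-p+r
    return names


phones = "sil aa r sp y uw sp sil".split()
print(" ".join(expand_cross_word(phones, {"sp"}, {"sil"})))
# prints: sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil
```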
For word-internal systems,
the context expansion can be further controlled via the configuration variable
CFWORDBOUNDARY. When set true (the default setting), context-free phones
will be treated as word boundaries, so
aa r sp y uw sp
would be expanded to
aa+r aa-r sp y+uw y-uw sp
Setting CFWORDBOUNDARY false would produce
aa+r aa-r+y sp r-y+uw y-uw sp
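The effect of CFWORDBOUNDARY can be reproduced with a short Python sketch. The function name expand_word_internal is hypothetical, and the two ends of the phone sequence are assumed to be word boundaries. Model lookup is again left out.

```python
def expand_word_internal(phones, context_free, cf_word_boundary=True):
    """Word-internal naming, with context-free phones optionally
    acting as word boundaries."""

    def neighbour(i, step):
        j = i + step
        while 0 <= j < len(phones):
            if phones[j] not in context_free:
                return phones[j]
            if cf_word_boundary:
                return None      # context-free phone blocks the context
            j += step            # otherwise skip over it
        return None              # ran off the end: a real boundary

    names = []
    for i, p in enumerate(phones):
        if p in context_free:
            names.append(p)
            continue
        left, right = neighbour(i, -1), neighbour(i, +1)
        name = p
        if left is not None:
            name = f"{left}-{name}"
        if right is not None:
            name = f"{name}+{right}"
        names.append(name)
    return names


phones = "aa r sp y uw sp".split()
print(" ".join(expand_word_internal(phones, {"sp"}, cf_word_boundary=True)))
# prints: aa+r aa-r sp y+uw y-uw sp
print(" ".join(expand_word_internal(phones, {"sp"}, cf_word_boundary=False)))
# prints: aa+r aa-r+y sp r-y+uw y-uw sp
```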
Note that in practice, stages (3) and (4) above (network expansion and
the linking of models to network nodes) actually proceed concurrently,
so that for the first and last phone of context-dependent models, logical
models which have the same underlying physical model can be merged.
Now that the expansion process has been described in some detail, some simple
examples will help clarify it. All of these are based
on the Bit-But word network illustrated in Fig. 11.2.
Firstly, assume that the dictionary contains simple monophone
pronunciations, that is
bit b i t
but b u t
start sil
end sil
and the HMM set consists of just monophones
b i t u sil
In this case, HNET will find a closed dictionary. There will
be no expansion and it will directly generate the network
shown in Fig 11.8. In this figure, the rounded boxes
represent model nodes and the square boxes represent word-end nodes.
Similarly, if the dictionary
contained word-internal triphone pronunciations such as
bit b+i b-i+t i-t
but b+u b-u+t u-t
start sil
end sil
and the HMM set contains all the required models
b+i b-i+t i-t b+u b-u+t u-t sil
then again HNET will find a closed dictionary
and the network shown in Fig. 11.9 would be generated.
If however the dictionary contained just the simple monophone pronunciations
as in the first case above, but the HMM set contained just triphones,
that is
sil-b+i t-b+i b-i+t i-t+sil i-t+b
sil-b+u t-b+u b-u+t u-t+sil u-t+b sil
then HNET would perform full cross-word expansion and
generate the network shown in Fig. 11.10.
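To see why exactly these triphones are needed, the following Python sketch enumerates the distinct model names that cross-word expansion would require. It is an illustration under stated assumptions: required_models is a hypothetical name, the Bit-But network is assumed to be start, followed by a loop over bit and but, followed by end (Fig. 11.2 is not reproduced here), and context-free phones are not handled.

```python
from itertools import product


def required_models(prons, successors, context_independent):
    """Collect the model names needed for full cross-word expansion.

    prons maps each word to its phone list; successors maps each word
    to the words that may follow it in the word network.
    """
    predecessors = {w: set() for w in prons}
    for w, following in successors.items():
        for nxt in following:
            predecessors[nxt].add(w)

    names = set()
    for w, phones in prons.items():
        # Possible cross-word contexts; {None} stands for "no neighbour".
        lefts = {prons[p][-1] for p in predecessors[w]} or {None}
        rights = {prons[n][0] for n in successors[w]} or {None}
        for l, r in product(lefts, rights):
            seq = [l] + phones + [r]
            for i in range(1, len(seq) - 1):
                p = seq[i]
                if p in context_independent:
                    names.add(p)
                    continue
                name = p
                if seq[i - 1] is not None:
                    name = f"{seq[i - 1]}-{name}"
                if seq[i + 1] is not None:
                    name = f"{name}+{seq[i + 1]}"
                names.add(name)
    return names


prons = {"start": ["sil"], "bit": ["b", "i", "t"],
         "but": ["b", "u", "t"], "end": ["sil"]}
successors = {"start": ["bit", "but"], "bit": ["bit", "but", "end"],
              "but": ["bit", "but", "end"], "end": []}
print(sorted(required_models(prons, successors, {"sil"})))
```

Under these assumptions the result is the same set of eleven names as the HMM set listed above, with each word-initial b and word-final t appearing once per distinct cross-word context.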
Now suppose that, still using the simple monophone pronunciations,
the HMM set contained all monophones, biphones and triphones. In this
case, the default would be to generate the monophone network of
Fig 11.8. If FORCECXTEXP is true but ALLOWXWRDEXP is set false, then
the word-internal network of Fig. 11.9 would be generated. Finally, if
both FORCECXTEXP and ALLOWXWRDEXP are set true, then the cross-word
network of Fig. 11.10 would be generated.