Before continuing with the description of network generation and, in particular, the use of HBUILD , the use of bigram language models needs to be described. Support for statistical language models in HTK is provided by the library module HLM. Although the interface to HLM can support general N-grams , the facilities for constructing and using N-grams are limited to bigrams.
A bigram language model can be built using HLSTATS invoked as follows where it is a assumed that all of the label files used for training are stored in an MLF called labs
HLStats -b bigfn -o -I labs wordlistAll words used in the label files must be listed in the wordlist. This command will read all of the transcriptions in labs, build a table of bigram counts in memory, and then output a back-off bigram to the file bigfn. The formulae used for this are given in the reference entry for HLSTATS. However, the basic idea is encapsulated in the following formula
where N(i,j) is the number of times word j follows word i and N(i) is the number of times that word i appears. Essentially, a small part of the available probability mass is deducted from the higher bigram counts and distributed amongst the infrequent bigrams. This process is called discounting. The default value for the discount constant D is 0.5 but this can be altered using the configuration variable DISCOUNT . When a bigram count falls below the threshold t, the bigram is backed-off to the unigram probability suitably scaled by a back-off weight in order to ensure that all bigram probabilities for a given history sum to one.
Backed-off bigrams are stored in a text file using the standard ARPA MIT-LL format which as used in HTK is as follows
\data\
ngram 1=<num 1-grams>
ngram 2=<num 2-ngrams>
\1-grams:
P(!ENTER) !ENTER B(!ENTER)
P(W1) W1 B(W1)
P(W2) W2 B(W2)
...
P(!EXIT) !EXIT B(!EXIT)
\2-grams:
P(W1 | !ENTER) !ENTER W1
P(W2 | !ENTER) !ENTER W2
P(W1 | W1) W1 W1
P(W2 | W1) W1 W2
P(W1 | W2) W2 W1
....
P(!EXIT | W1) W1 !EXIT
P(!EXIT | W2) W2 !EXIT
\end\
where all probabilities are stored as base-10 logs. The default
start and end words, !ENTER and !EXIT can be changed
using the HLSTATS -s option.
For some applications, a simple matrix style of bigram representation may be more appropriate. If the -o option is omitted in the above invocation of HLSTATS, then a simple full bigram matrix will be output using the format
!ENTER 0 P(W1 | !ENTER) P(W2 | !ENTER) .....
W1 0 P(W1 | W1) P(W2 | W1) .....
W2 0 P(W1 | W2) P(W2 | W2) .....
...
!EXIT 0 PN PN .....
where the probability