
11.4 Bigram Language Models

 

Before continuing with the description of network generation and, in particular, the use of HBUILD, the use of bigram language models needs to be described. Support for statistical language models in HTK is provided by the library module HLM. Although the interface to HLM can support general N-grams, the facilities for constructing and using N-grams are limited to bigrams.

A bigram language model can be built using HLSTATS invoked as follows, where it is assumed that all of the label files used for training are stored in an MLF called labs

    HLStats -b bigfn -o -I labs wordlist
All words used in the label files must be listed in the wordlist. This command will read all of the transcriptions in labs, build a table of bigram counts in memory, and then output a back-off bigram to the file bigfn. The formulae used for this are given in the reference entry for HLSTATS. However, the basic idea is encapsulated in the following formula

\[
  p(i,j) = \left\{ \begin{array}{ll}
     (N(i,j) - D)\,/\,N(i) & \mbox{if $N(i,j) > t$} \\
     b(i)\,p(j)            & \mbox{otherwise}
  \end{array} \right.
\]

where N(i,j) is the number of times word j follows word i, N(i) is the number of times that word i appears, D is a discount constant, t is a count threshold, p(j) is the unigram probability of word j and b(i) is the back-off weight for word i. Essentially, a small part of the available probability mass is deducted from the higher bigram counts and distributed amongst the infrequent bigrams. This process is called discounting. The default value for the discount constant D is 0.5 but this can be altered using the configuration variable DISCOUNT. When a bigram count falls below the threshold t, the bigram is backed off to the unigram probability suitably scaled by the back-off weight b(i), which is chosen to ensure that all bigram probabilities for a given history sum to one.
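The following fragment is a minimal sketch, not HTK code, of how such a back-off bigram can be estimated from raw counts using the formula above. The function name backoff_bigram, the default values chosen for D and t, and the use of plain Python word lists as training data are all assumptions made purely for illustration.

    # Illustrative sketch only (not HTK code): estimating a back-off bigram
    # from raw counts following the formula above. D, t and all names here
    # are assumptions for illustration.
    from collections import Counter

    def backoff_bigram(sentences, D=0.5, t=1):
        uni = Counter()                      # N(i): unigram counts
        bi = Counter()                       # N(i,j): bigram counts
        for words in sentences:
            uni.update(words)
            bi.update(zip(words, words[1:]))
        total = sum(uni.values())
        p_uni = {w: n / total for w, n in uni.items()}   # unigram probabilities p(j)

        p_bi, boff = {}, {}
        for i in uni:
            # discounted probabilities for bigrams whose count exceeds the threshold
            kept = {j: (bi[(i, j)] - D) / uni[i]
                    for j in uni if bi[(i, j)] > t}
            # back-off weight b(i) chosen so that probabilities for history i sum to one
            resid = sum(p_uni[j] for j in uni if j not in kept)
            boff[i] = (1.0 - sum(kept.values())) / resid if resid > 0 else 0.0
            p_bi[i] = kept

        def p(i, j):
            # discounted bigram if retained, otherwise scaled unigram probability
            return p_bi[i][j] if j in p_bi[i] else boff[i] * p_uni[j]
        return p

    # Example: p = backoff_bigram([["!ENTER", "a", "b", "a", "!EXIT"]]); p("a", "b")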

Backed-off bigrams are stored in a text file using the standard ARPA MIT-LL format which, as used in HTK, is as follows

    \data\
    ngram 1=<num 1-grams>
    ngram 2=<num 2-grams>

    \1-grams:
    P(!ENTER)      !ENTER  B(!ENTER)
    P(W1)          W1      B(W1)
    P(W2)          W2      B(W2)
    ...
    P(!EXIT)       !EXIT   B(!EXIT)

    \2-grams:
    P(W1 | !ENTER)  !ENTER W1
    P(W2 | !ENTER)  !ENTER W2
    P(W1 | W1)      W1     W1
    P(W2 | W1)      W1     W2
    P(W1 | W2)      W2     W1
    ....
    P(!EXIT | W1)   W1     !EXIT
    P(!EXIT | W2)   W2     !EXIT
    \end\
where all probabilities are stored as base-10 logs. The default start and end words, !ENTER and !EXIT, can be changed using the HLSTATS -s option.
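As an illustration of this format, the following sketch (an illustration only, not part of HTK or HLM) reads such a back-off bigram file and evaluates log probabilities; the helper names read_arpa_bigram and log_prob are invented for this example.

    # Hypothetical reader for the ARPA MIT-LL back-off bigram format shown above.
    # Probabilities and back-off weights are stored as base-10 logs.
    def read_arpa_bigram(fname):
        unigrams, bigrams = {}, {}
        section = None
        with open(fname) as f:
            for line in f:
                line = line.strip()
                if not line or line == "\\data\\" or line.startswith("ngram"):
                    continue
                if line == "\\1-grams:":
                    section = 1
                    continue
                if line == "\\2-grams:":
                    section = 2
                    continue
                if line == "\\end\\":
                    break
                fields = line.split()
                if section == 1:                 # log10 P(w)  w  log10 B(w)
                    logp, w, logb = fields
                    unigrams[w] = (float(logp), float(logb))
                elif section == 2:               # log10 P(w2|w1)  w1  w2
                    logp, w1, w2 = fields
                    bigrams[(w1, w2)] = float(logp)
        return unigrams, bigrams

    def log_prob(w1, w2, unigrams, bigrams):
        # use the stored bigram if present, otherwise back off: B(w1) + P(w2)
        if (w1, w2) in bigrams:
            return bigrams[(w1, w2)]
        return unigrams[w1][1] + unigrams[w2][0]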

For some applications, a simple matrix style of bigram representation may be more appropriate. If the -o option is omitted in the above invocation of HLSTATS, then a simple full bigram matrix will be output using the format

    !ENTER    0   P(W1 | !ENTER) P(W2 | !ENTER) .....
    W1        0   P(W1 | W1)     P(W2 | W1)     .....
    W2        0   P(W1 | W2)     P(W2 | W2)     .....
    ...
    !EXIT     0   PN             PN             .....
where the probability P(Wj | Wi) is given by the entry in row i, column j of the matrix. If there are a total of N words in the vocabulary then PN in the above is set to 1/(N+1); this ensures that the last row sums to one. As a very crude form of smoothing, a floor can be set using the -f minp option to prevent any entry falling below minp. Note, however, that this does not affect the bigram entries in the first column, which are zero by definition. Finally, as with the storage of tied-mixture and discrete probabilities, a run-length encoding scheme is used whereby any value can be followed by an asterisk and a repeat count (see section 7.5).
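For illustration, a hypothetical reader for this matrix format, including expansion of the run-length encoding, might look as follows; the function name read_bigram_matrix and the treatment of each row as a plain list of floats are assumptions for this sketch only.

    # Hypothetical sketch of reading the matrix-style bigram output produced when
    # the -o option is omitted. A value may be followed by '*' and a repeat
    # count, e.g. "3.333e-2*4", which is expanded here.
    def read_bigram_matrix(fname):
        words, rows = [], []
        with open(fname) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue
                words.append(fields[0])          # row label (the history word)
                row = []
                for tok in fields[1:]:
                    if "*" in tok:               # expand "value*count"
                        val, count = tok.split("*")
                        row.extend([float(val)] * int(count))
                    else:
                        row.append(float(tok))
                rows.append(row)
        return words, rows

    # words[i] gives the history; rows[i][j] is P(words[j] | words[i]),
    # with the first column (the probability of !ENTER) zero by definition.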

