Next: 13.10.3 Use Up: 13.10 HLStats Previous: 13.10.1 Function

13.10.2 Bigram Generation

When using the bigram generating options, each transcription is assumed to have a unique entry and exit label which by default are !ENTER and !EXIT. If these labels are not present they are inserted. In addition, any label occurring in a transcription which is not listed in the HMM list is mapped to a unique label called !NULL.

HLStats processes all input transcriptions and maps all labels to a set of unique integers in the range 1 to L, where L is the number of distinct labels. For each adjacent pair of labels i and j, it counts the total number of occurrences N(i,j). Let the total number of occurrences of label i be $N(i) = \sum_{j=1}^{L} N(i,j)$.
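The counting step can be sketched in Python as follows. This is an illustrative sketch only, not the HLStats source: the function name count_bigrams and its arguments are hypothetical, the entry and exit labels are assumed to be treated as known labels even if the HMM list omits them, and N(i) is taken to be the number of label pairs that start with label i.

```python
from collections import Counter

def count_bigrams(transcriptions, hmm_list,
                  enter="!ENTER", exit_="!EXIT", null="!NULL"):
    """Return (label_ids, unigram_counts, bigram_counts).

    transcriptions: list of label sequences (lists of strings).
    hmm_list: labels considered known; anything else maps to `null`.
    """
    # Assumption: the entry/exit labels are never mapped to !NULL.
    known = set(hmm_list) | {enter, exit_}
    label_ids = {}          # label -> integer in 1..L, in order of first use
    bigrams = Counter()     # (i, j) -> N(i,j)
    for labels in transcriptions:
        seq = [lab if lab in known else null for lab in labels]
        # Insert the entry and exit labels when they are missing.
        if not seq or seq[0] != enter:
            seq.insert(0, enter)
        if seq[-1] != exit_:
            seq.append(exit_)
        ids = [label_ids.setdefault(lab, len(label_ids) + 1) for lab in seq]
        bigrams.update(zip(ids, ids[1:]))
    # N(i) is accumulated as the sum over j of N(i,j).
    unigrams = Counter()
    for (i, _j), n in bigrams.items():
        unigrams[i] += n
    return label_ids, unigrams, bigrams
```

Note that under this convention the exit label never starts a pair, so its count N(i) is zero.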

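For matrix bigrams, the bigram probability p(i,j) is given by

\[
p(i,j) = \begin{cases}
\alpha \, N(i,j)/N(i) & \text{if } N(i,j) > 0 \\
1/L & \text{if } N(i) = 0 \\
f & \text{otherwise}
\end{cases}
\]

where f is a floor probability set by the -f option and $\alpha$ is chosen to ensure that $\sum_{j=1}^{L} p(i,j) = 1$.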
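One row of a matrix bigram can be sketched as below. The sketch assumes that pairs seen at least once receive the scaled relative frequency, unseen pairs receive the floor f, and a label that never starts a pair gets a uniform row; the actual flooring rule inside HLStats may differ in detail. The function name matrix_bigram_row is hypothetical.

```python
def matrix_bigram_row(row_counts, floor):
    """Return [p(i,1), ..., p(i,L)] for one predecessor label i.

    row_counts: list of N(i,j) for j = 1..L (0-based list positions).
    floor: floor probability f given to unseen pairs.
    """
    L = len(row_counts)
    total = sum(row_counts)              # N(i)
    if total == 0:
        return [1.0 / L] * L             # no evidence for i: uniform row
    n_floored = sum(1 for n in row_counts if n == 0)
    # The seen pairs' relative frequencies sum to 1, so scaling them by
    # alpha = 1 - f * (number of floored entries) makes the row sum to 1.
    alpha = 1.0 - floor * n_floored
    if alpha <= 0.0:
        raise ValueError("floor too large for this row")
    return [alpha * n / total if n > 0 else floor for n in row_counts]
```

For example, counts of 3, 1, 0, 0 with f = 0.1 give alpha = 0.8, so the two seen pairs share 0.8 of the probability mass and each unseen pair receives 0.1.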

For back-off bigrams, the unigram probabilities p(i) are given by

\[
p(i) = \begin{cases}
N(i)/N & \text{if } N(i) > u \\
u/N & \text{otherwise}
\end{cases}
\]

where u is a unigram floor count set by the -u option and $N = \sum_{i=1}^{L} \max[N(i), u]$.
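The floored unigram estimate amounts to replacing every count below u by u before normalising. A minimal sketch, assuming the normaliser is the sum of the floored counts (the function name floored_unigrams is hypothetical):

```python
def floored_unigrams(unigram_counts, u):
    """Return [p(1), ..., p(L)] from counts N(i) with unigram floor count u."""
    # Each count is raised to at least u; N is the sum of the floored counts,
    # so the resulting probabilities sum to 1.
    floored = [max(n, u) for n in unigram_counts]
    total = sum(floored)
    return [n / total for n in floored]
```

With counts 5, 3, 0 and u = 1, the floored counts are 5, 3, 1 and N = 9, so the unseen label still receives probability 1/9.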

The backed-off bigram probabilities are given by

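\[
p(i,j) = \begin{cases}
(N(i,j) - D)/N(i) & \text{if } N(i,j) > t \\
b(i) \, p(j) & \text{otherwise}
\end{cases}
\]

where D is a discount and t is a bigram count threshold set by the -t option. The discount D is 0.5 by default but can be changed via the configuration variable DISCOUNT. The back-off weight b(i) is calculated to ensure that $\sum_{j=1}^{L} p(i,j) = 1$, i.e.

\[
b(i) = \frac{1 - \sum_{j \in B} p(i,j)}{1 - \sum_{j \in B} p(j)}
\]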

where B is the set of all labels j for which an explicit bigram probability p(i,j) is stored, i.e. those whose count N(i,j) exceeds the threshold t.
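The back-off computation for a single predecessor label can be sketched as follows. This is a sketch under stated assumptions rather than the HLStats implementation: pairs whose count exceeds t receive the discounted estimate, all other pairs receive b(i) times the unigram probability, and b(i) is the left-over bigram mass divided by the unigram mass of the backed-off labels. The function name backoff_bigram_row is hypothetical.

```python
def backoff_bigram_row(row_counts, unigram_probs, threshold, discount=0.5):
    """Return ([p(i,1), ..., p(i,L)], b(i)) for one predecessor label i.

    row_counts: list of N(i,j) for j = 1..L (0-based list positions).
    unigram_probs: list of p(j), assumed to sum to 1.
    threshold: bigram count threshold t; discount: D.
    """
    L = len(row_counts)
    total = sum(row_counts)                                  # N(i)
    # B: successors whose bigram is kept explicitly.
    kept = {j for j, n in enumerate(row_counts) if n > threshold}
    explicit = {j: (row_counts[j] - discount) / total for j in kept}
    # Back-off weight: probability mass left after the explicit bigrams,
    # spread over the unigram mass of the labels not in B.
    numerator = 1.0 - sum(explicit.values())
    denominator = 1.0 - sum(unigram_probs[j] for j in kept)
    b = numerator / denominator if denominator > 0.0 else 0.0
    row = [explicit[j] if j in kept else b * unigram_probs[j]
           for j in range(L)]
    return row, b
```

As a check, counts 4, 1, 0 with unigram probabilities 0.5, 0.3, 0.2, t = 1 and D = 0.5 keep only the first pair: p(i,1) = 3.5/5 = 0.7, b(i) = 0.3/0.5 = 0.6, and the remaining entries are 0.18 and 0.12, so the row sums to 1.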

The formats of matrix and ARPA/MIT-LL format bigram files are described in Chapter 11.



ECRL HTK_V2.1: email [email protected]