
12.7 N-Best Lists and Lattices

As noted in section 12.1, HVITE can generate lattices and N-best outputs. To generate an N-best list, the -n option is used to specify the number of N-best tokens to store per state and the number of N-best hypotheses to generate. As a result, several alternative transcriptions are generated for each input utterance, each terminated by a line containing ///. For example, setting -n 4 20 (4 tokens per state, up to 20 hypotheses) with a digit recogniser would generate an output of the form

    "testf1.rec"
    FOUR
    SEVEN
    NINE
    OH
    /// 
    FOUR
    SEVEN
    NINE
    OH
    OH
    /// 

    etc
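The layout above is easy to split back into separate hypotheses. The following is a minimal Python sketch, not part of HTK: the function name parse_nbest is hypothetical, and it assumes each line holds only a word, as in the example (no times or scores), and that every alternative is terminated by a /// line.

```python
def parse_nbest(text):
    """Split one N-best block into (label file name, list of hypotheses).

    Assumes the layout shown above: a quoted file name, then one word
    per line, with each alternative terminated by a '///' line.
    """
    name = None
    hypotheses = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('"') and line.endswith('"'):
            name = line.strip('"')  # e.g. "testf1.rec"
        elif line == "///":
            hypotheses.append(current)  # end of one alternative
            current = []
        else:
            current.append(line)
    if current:
        hypotheses.append(current)  # alternative without a terminator
    return name, hypotheses


SAMPLE = '''"testf1.rec"
FOUR
SEVEN
NINE
OH
///
FOUR
SEVEN
NINE
OH
OH
///
'''

name, hyps = parse_nbest(SAMPLE)
for rank, words in enumerate(hyps, start=1):
    print(name, rank, " ".join(words))
```

Keeping each hypothesis as a list of words makes it straightforward to compare alternatives or to score them against a reference transcription.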

The lattices from which the N-best lists are generated can be output by setting the option -z ext. In this case, a lattice called testf.ext will be generated for each input test file testf.xxx. By default, these lattices will be stored in the same directory as the test files, but they can be redirected to another directory using the -l option.

The lattices generated by HVITE have the following general form

    VERSION=1.0
    UTTERANCE=testf1.mfc
    lmname=wdnet
    lmscale=20.00  wdpenalty=-30.00
    vocab=dict
    N=31   L=56   
    I=0    t=0.00  
    I=1    t=0.36  
    I=2    t=0.75  
    I=3    t=0.81
    ... etc
    I=30   t=2.48  
    J=0     S=0    E=1    W=SILENCE   v=0  a=-3239.01  l=0.00    
    J=1     S=1    E=2    W=FOUR      v=0  a=-3820.77  l=0.00    
    ... etc
    J=55    S=29   E=30   W=SILENCE   v=0  a=-246.99   l=-1.20

The first 5 lines comprise a header which records the names of the files used to generate the lattice along with the settings of the language model scale and penalty factors. The next line gives the number of nodes (N) and the number of arcs (L); in the example, the 31 nodes are numbered I=0 to I=30 and the 56 arcs J=0 to J=55. Each node in the lattice represents a point in time t measured in seconds, and each arc represents a word W spanning the segment of the input starting at the time of its start node S and ending at the time of its end node E. For each such span, v gives the number of the pronunciation used, a gives the acoustic score and l gives the language model score.
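To make the field layout concrete, here is a small Python sketch that reads a lattice of this form into a header dictionary, a table of node times and a list of arcs. It is an illustration only: the names parse_fields and parse_lattice are hypothetical, the sample is a shortened version of the example above (3 nodes, 2 arcs), and the parser assumes every field is a whitespace-separated key=value pair.

```python
def parse_fields(line):
    """Turn 'J=0 S=0 E=1 W=SILENCE' into {'J': '0', 'S': '0', ...}."""
    return dict(f.split("=", 1) for f in line.split() if "=" in f)


def parse_lattice(text):
    """Return (header, nodes, arcs) for a lattice laid out as shown above."""
    header, nodes, arcs = {}, {}, []
    for line in text.splitlines():
        fields = parse_fields(line)
        if not fields:
            continue  # blank line or elision such as '... etc'
        if "I" in fields:
            # Node line: node number and its time in seconds.
            nodes[int(fields["I"])] = float(fields["t"])
        elif "J" in fields:
            # Arc line: a word spanning from node S to node E.
            arcs.append({
                "id": int(fields["J"]),
                "start": int(fields["S"]),
                "end": int(fields["E"]),
                "word": fields["W"],
                "pron": int(fields["v"]),
                "acoustic": float(fields["a"]),
                "lm": float(fields["l"]),
            })
        else:
            # Anything else is header or size information.
            header.update(fields)
    return header, nodes, arcs


# Shortened version of the example lattice (illustrative only).
SAMPLE = """
    VERSION=1.0
    UTTERANCE=testf1.mfc
    lmname=wdnet
    lmscale=20.00  wdpenalty=-30.00
    vocab=dict
    N=3    L=2
    I=0    t=0.00
    I=1    t=0.36
    I=2    t=0.75
    J=0     S=0    E=1    W=SILENCE   v=0  a=-3239.01  l=0.00
    J=1     S=1    E=2    W=FOUR      v=0  a=-3820.77  l=0.00
"""

header, nodes, arcs = parse_lattice(SAMPLE)
# Each arc's time span comes from the times of its start and end nodes.
spans = [(a["word"], nodes[a["start"]], nodes[a["end"]]) for a in arcs]
print(header["lmscale"], header["wdpenalty"])
print(spans)
```

Storing node times separately from arcs mirrors the file itself: an arc carries no times of its own, so its span is always looked up through S and E.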

The language model scores in output lattices do not include the scale factors and penalties. These are removed so that the lattice can be used as a constraint network for subsequent recogniser testing. When using HVITE normally, the word level network file is specified using the -w option. When the -w option is given without a file name, HVITE constructs the name of a lattice file from the name of the test file and loads that lattice as the recognition network. Hence, a new recognition network is created for each input file and, because each network is restricted to the paths in that file's lattice, recognition is very fast. This is, for example, an efficient way of experimentally determining optimum values for the language model scale and penalty factors.
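The effect of the scale and penalty factors on a lattice can be illustrated with a short Python sketch. This is not how HVITE is implemented. It assumes that the total score of an arc is a + lmscale * l + wdpenalty, that node numbers are in topological order (every arc runs from a lower to a higher node number) and that node 0 is the start node. The function best_path and the toy scores are hypothetical.

```python
def best_path(arcs, final_node, lmscale, wdpenalty):
    """Return (score, words) of the best path from node 0 to final_node.

    arcs is a list of (start, end, word, acoustic, lm) tuples.
    Assumes every arc goes from a lower to a higher node number.
    """
    best = {0: (0.0, [])}  # node -> (best score so far, words on that path)
    for start, end, word, acoustic, lm in sorted(arcs, key=lambda a: a[0]):
        if start not in best:
            continue  # start node not reachable from node 0
        # Assumed combination: unscaled scores from the lattice plus
        # the scale factor and per-word penalty chosen for this run.
        score = best[start][0] + acoustic + lmscale * lm + wdpenalty
        if end not in best or score > best[end][0]:
            best[end] = (score, best[start][1] + [word])
    return best[final_node]


# Toy lattice with made-up scores: one long FOUR versus FOUR followed by OH.
ARCS = [
    (0, 1, "SIL", -100.0, 0.0),
    (1, 2, "FOUR", -300.0, -1.0),
    (2, 3, "OH", -150.0, -2.0),
    (1, 3, "FOUR", -445.0, -1.5),
]

# Try several (lmscale, wdpenalty) settings on the same lattice.
for lmscale, wdpenalty in [(0.0, 0.0), (0.0, 10.0), (20.0, -30.0)]:
    score, words = best_path(ARCS, 3, lmscale, wdpenalty)
    print(lmscale, wdpenalty, " ".join(words), score)
```

Because the lattice stores unscaled scores, the same arcs can be re-evaluated under any setting, which is why a sweep over scale and penalty values is cheap compared with repeating a full recognition pass.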


ECRL HTK_V2.1: email [email protected]