Most transcriptions are single-alternative and single-level, that is to say, the associated speech file is described by a single sequence of labelled segments. Most standard label formats are of this kind. Sometimes, however, it is useful to have several levels of labels associated with the same basic segment sequence. For example, in training an HMM system it is useful to have both the word level transcriptions and the phone level transcriptions side-by-side.
Orthogonal to the requirement for multiple levels of description, a transcription may also need to include multiple alternative descriptions of the same speech file. For example, the output of a speech recogniser may be in the form of an N-best list where each word sequence in the list represents one possible interpretation of the input.
As an example, Fig. 6.1 shows a speech file and three different ways in which it might be labelled. In part (a), just a simple orthography is given and this single-level single-alternative type of transcription is the most common case. Part (b) shows a 2-level transcription where the basic level consists of a sequence of phones but a higher level of word labels is also provided. Notice that there is a distinction between the basic level and the higher levels, since only the basic level has explicit boundary locations marked for every segment. The higher levels do not have explicit boundary information since this can always be inferred from the basic level boundaries. Finally, part (c) shows the case where knowledge of the contents of the speech file is uncertain and three possible word sequences are given.
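The way higher-level boundaries follow from the basic level can be illustrated with a minimal Python sketch. It is not HTK code: the in-memory layout (a word label attached to the first phone of each word), the function name word_boundaries, the phone labels and the integer time values are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory layout: (start, end, phone, word).
# The word label is present only on the first phone of each word;
# times are arbitrary integer units chosen for this example.
Segment = Tuple[int, int, str, Optional[str]]


def word_boundaries(segments: List[Segment]) -> List[Tuple[int, int, str]]:
    """Infer word-level start/end times from phone-level boundaries."""
    words: List[list] = []
    for start, end, _phone, word in segments:
        if word is not None:
            # A new word begins at the start of this phone.
            words.append([start, end, word])
        elif words:
            # Phone continues the current word, so extend its end time.
            words[-1][1] = end
    return [(s, e, w) for s, e, w in words]


segments: List[Segment] = [
    (0, 20, "ay", "ice"),
    (20, 50, "s", None),
    (50, 60, "k", "cream"),
    (60, 80, "r", None),
    (80, 110, "iy", None),
    (110, 130, "m", None),
]

print(word_boundaries(segments))
# [(0, 50, 'ice'), (50, 130, 'cream')]
```

Each word spans from the start of its first phone to the end of its last phone, which is why only the basic level needs explicit boundaries.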
HTK label files support multiple-alternative and multiple-level transcriptions. In addition to start and end times on the basic level, a label at any level may also have a score associated with it. When a transcription is loaded, all but one specific alternative can be discarded by setting the configuration variable TRANSALT to the required alternative N, where the first (i.e. normal) alternative is numbered 1. Similarly, all but a specified level can be discarded by setting the configuration variable TRANSLEV to the required level number where again the first (i.e. normal) level is numbered 1.
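The effect of discarding all but one alternative or level can be sketched in Python as follows. This only mimics the behaviour described for TRANSALT and TRANSLEV and is not how HTK implements it: the nested-list layout, the function name select and the example labels are assumptions for illustration.

```python
from typing import List, Optional

# Hypothetical layout: transcription[alternative][level] is a list of labels.
# Index 0 holds the first (normal) alternative and the first (basic) level.
Transcription = List[List[List[str]]]


def select(trans: Transcription,
           alt: Optional[int] = None,
           level: Optional[int] = None) -> Transcription:
    """Keep only the given 1-based alternative and/or level.

    Passing None for a parameter keeps everything along that axis.
    """
    alts = trans if alt is None else [trans[alt - 1]]
    if level is not None:
        alts = [[levels[level - 1]] for levels in alts]
    return alts


trans: Transcription = [
    # alternative 1: phone level, then word level
    [["ay", "s", "k", "r", "iy", "m"], ["ice", "cream"]],
    # alternative 2: same phones, different word sequence
    [["ay", "s", "k", "r", "iy", "m"], ["I", "scream"]],
]

# Analogous to setting the alternative to 2 and the level to 2.
print(select(trans, alt=2, level=2))
# [[['I', 'scream']]]
```

The numbering is 1-based to match the convention stated above, so the value 1 for either parameter selects the normal alternative or level.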
All non-HTK formats are limited to single-level single-alternative transcriptions.