Title: A training procedure for a segment-based-network approach to isolated word recognition
Author: F. Soong
Venue: ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing
Publication date: 1987-04-06
DOI: 10.1109/ICASSP.1987.1169579
URL: https://doi.org/10.1109/ICASSP.1987.1169579
Citation count: 2
Abstract
In this paper, we propose a complete training procedure for creating a subword-based network and test it in an isolated word recognition experiment. We first hand-segment one training token per word into contiguous subword segments with the aid of an interactive program that can display and play back various acoustic features of an utterance. The subword segmental units adopted in this paper comprise four sound classes: stationary sounds, fast transitional sounds, slow transitional sounds plus consonant clusters and others. The hand-segmented token is used to initialize a subword-based word network, which is then refined using more training tokens. The refinement is carried out with a two-level dynamic programming (DP) procedure. At the first level, the word level, an endpoint-relaxed DP algorithm is used to remove possible endpointing errors and to mark tentative segment boundaries. Between the marked segment boundaries, another endpoint-relaxed DP algorithm is employed at the segment level to refine the segments extracted at the word level. This training procedure generates a segment-based word network consisting of serial and parallel branches. Serial branches are generated from acoustically similar segments aligned at the segment level, whereas parallel branches are created to accommodate different acoustic manifestations of the same sound class in different phonetic contexts or different pronunciations. A speaker-dependent, isolated-word recognition experiment was carried out. On a four-speaker (2 male and 2 female) English alphabet database, the segment-based network gives improved performance compared with a conventional word-template-based approach. The word error rate is reduced from 11.2% for the word-based recognizer to 7.7% for the network-based recognizer; correspondingly, the number of misrecognized words is reduced from 116 to 80 out of 1040 recognition trials.
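
The abstract does not spell out the endpoint-relaxed DP algorithm, so the following is only a minimal sketch of the general idea, not the paper's procedure. It assumes a plain dynamic-time-warping recursion with three symmetric local moves, Euclidean frame distances, and a hypothetical `relax` parameter giving the number of frames by which the path may begin late or end early in the test token. The function names and the toy one-dimensional features are illustrative assumptions.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def endpoint_relaxed_dtw(ref, test, relax=2):
    """Align `ref` against `test` with relaxed endpoints on the test side.

    The path may start at any of the first `relax + 1` test frames and
    end at any of the last `relax + 1` test frames.
    Returns (cost, start, end): the accumulated distance and the test-frame
    indices where the best path begins and ends.
    """
    n, m = len(ref), len(test)
    inf = float("inf")
    # cost[i][j]: best accumulated distance of a path ending at (ref i, test j)
    cost = [[inf] * m for _ in range(n)]
    # origin[i][j]: test frame at which that best path started
    origin = [[-1] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            d = frame_dist(ref[i], test[j])
            if i == 0 and j <= relax:
                # Relaxed start: the path may begin here at no extra cost.
                cost[i][j], origin[i][j] = d, j
                continue
            best, start = inf, -1
            # Assumed local moves: vertical, horizontal, diagonal.
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, start = cost[pi][pj], origin[pi][pj]
            cost[i][j], origin[i][j] = best + d, start

    # Relaxed end: pick the cheapest cell among the last `relax + 1` test frames.
    end = min(range(max(0, m - 1 - relax), m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], origin[n - 1][end], end

# Toy example: the test token has one spurious frame before and after the word,
# imitating an endpointing error.
ref = [[0.0], [1.0], [2.0]]
test = [[9.0], [0.0], [1.0], [2.0], [9.0]]
print(endpoint_relaxed_dtw(ref, test, relax=1))  # → (0.0, 1, 3)
print(endpoint_relaxed_dtw(ref, test, relax=0))  # strict endpoints, higher cost
```

In this toy case the relaxed alignment skips the spurious first and last frames, which is the sense in which such a pass can remove endpointing errors. The start and end indices it returns could then serve as tentative boundaries for a second, segment-level pass.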