{"title":"基于深度神经网络/HMM的普通话语音识别中发音知识与语音特征的整合","authors":"Ying-Wei Tan, Wenju Liu, Wei Jiang, Hao Zheng","doi":"10.1109/IJCNN.2015.7280396","DOIUrl":null,"url":null,"abstract":"Speech production knowledge has been used to enhance the phonetic representation and the performance of automatic speech recognition (ASR) systems successfully. Representations of speech production make simple explanations for many phenomena observed in speech. These phenomena can not be easily analyzed from either acoustic signal or phonetic transcription alone. One of the most important aspects of speech production knowledge is the use of articulatory knowledge, which describes the smooth and continuous movements in the vocal tract. In this paper, we present a new articulatory model to provide available information for rescoring the speech recognition lattice hypothesis. The articulatory model consists of a feature front-end, which computes a voicing feature based on a spectral harmonics correlation (SHC) function, and a back-end based on the combination of deep neural networks (DNNs) and hidden Markov models (HMMs). The voicing features are incorporated with standard Mel frequency cepstral coefficients (MFCCs) using heteroscedastic linear discriminant analysis (HLDA) to compensate the speech recognition accuracy rates. Moreover, the advantages of two different models are taken into account by the algorithm, which retains deep learning properties of DNNs, while modeling the articulatory context powerfully through HMMs. Mandarin speech recognition experiments show the proposed method achieves significant improvements in speech recognition performance over the system using MFCCs alone.","PeriodicalId":6539,"journal":{"name":"2015 International Joint Conference on Neural Networks (IJCNN)","volume":"50 1","pages":"1-8"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Integration of articulatory knowledge and voicing features based on DNN/HMM for Mandarin speech recognition\",\"authors\":\"Ying-Wei Tan, Wenju Liu, Wei Jiang, Hao Zheng\",\"doi\":\"10.1109/IJCNN.2015.7280396\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech production knowledge has been used to enhance the phonetic representation and the performance of automatic speech recognition (ASR) systems successfully. Representations of speech production make simple explanations for many phenomena observed in speech. These phenomena can not be easily analyzed from either acoustic signal or phonetic transcription alone. One of the most important aspects of speech production knowledge is the use of articulatory knowledge, which describes the smooth and continuous movements in the vocal tract. In this paper, we present a new articulatory model to provide available information for rescoring the speech recognition lattice hypothesis. The articulatory model consists of a feature front-end, which computes a voicing feature based on a spectral harmonics correlation (SHC) function, and a back-end based on the combination of deep neural networks (DNNs) and hidden Markov models (HMMs). The voicing features are incorporated with standard Mel frequency cepstral coefficients (MFCCs) using heteroscedastic linear discriminant analysis (HLDA) to compensate the speech recognition accuracy rates. 
Moreover, the advantages of two different models are taken into account by the algorithm, which retains deep learning properties of DNNs, while modeling the articulatory context powerfully through HMMs. Mandarin speech recognition experiments show the proposed method achieves significant improvements in speech recognition performance over the system using MFCCs alone.\",\"PeriodicalId\":6539,\"journal\":{\"name\":\"2015 International Joint Conference on Neural Networks (IJCNN)\",\"volume\":\"50 1\",\"pages\":\"1-8\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 International Joint Conference on Neural Networks (IJCNN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IJCNN.2015.7280396\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Joint Conference on Neural Networks (IJCNN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJCNN.2015.7280396","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Integration of articulatory knowledge and voicing features based on DNN/HMM for Mandarin speech recognition
Abstract: Speech production knowledge has been used successfully to enhance the phonetic representation and the performance of automatic speech recognition (ASR) systems. Representations of speech production provide simple explanations for many phenomena observed in speech that cannot be easily analyzed from either the acoustic signal or the phonetic transcription alone. One of the most important aspects of speech production knowledge is articulatory knowledge, which describes the smooth, continuous movements of the vocal tract. In this paper, we present a new articulatory model that provides additional information for rescoring speech recognition lattice hypotheses. The articulatory model consists of a feature front-end, which computes a voicing feature based on a spectral harmonics correlation (SHC) function, and a back-end based on the combination of deep neural networks (DNNs) and hidden Markov models (HMMs). The voicing features are combined with standard Mel-frequency cepstral coefficients (MFCCs) using heteroscedastic linear discriminant analysis (HLDA) to improve speech recognition accuracy. Moreover, the algorithm exploits the complementary strengths of the two models, retaining the deep learning capability of DNNs while modeling articulatory context through HMMs. Mandarin speech recognition experiments show that the proposed method achieves significant improvements in recognition performance over a system using MFCCs alone.
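
The abstract only outlines the front-end, so the following Python sketch is an interpretation rather than the authors' implementation: it derives a frame-level voicing score from how well the magnitude spectrum is explained by a harmonic series (an approximation of the SHC idea), appends it to standard MFCCs, and projects the joint vector with scikit-learn's LDA as a simple, widely available stand-in for HLDA. All function names, parameter values, and the exact SHC formulation are assumptions.

```python
# Minimal sketch, assuming a harmonic-summation voicing measure and LDA in
# place of HLDA; this is NOT the paper's exact method.
import numpy as np
import librosa
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def harmonic_voicing_score(frame, sr, f0_min=60.0, f0_max=400.0, n_harmonics=5):
    """Voicing measure: spectral energy captured by the best-fitting harmonic
    series, normalised by total spectral energy. Approximates the idea of a
    spectral-harmonics-correlation (SHC) feature; the paper's definition may differ."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = spec.sum() + 1e-10
    best = 0.0
    for f0 in np.arange(f0_min, f0_max, 5.0):            # candidate pitch values
        bins = [np.argmin(np.abs(freqs - k * f0))         # nearest bin per harmonic
                for k in range(1, n_harmonics + 1)]
        best = max(best, spec[bins].sum() / total)
    return best                                           # high for voiced frames


def extract_features(wav_path, n_fft=400, hop=160, n_mfcc=13):
    """Per-frame [MFCC | voicing] vectors for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop).T        # (T, 13)
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop).T
    voicing = np.array([harmonic_voicing_score(f, sr) for f in frames])
    n = min(len(mfcc), len(voicing))                       # align frame counts
    return np.hstack([mfcc[:n], voicing[:n, None]])        # (T, 14)


# Fusion step: project the concatenated features. LDA is used purely as an
# illustrative substitute for HLDA; `state_labels` are assumed to come from a
# forced alignment against the HMM states.
# feats = np.vstack([extract_features(p) for p in wav_paths])
# lda = LinearDiscriminantAnalysis(n_components=13).fit(feats, state_labels)
# projected = lda.transform(feats)    # input to the DNN/HMM back-end
```

The projected features would then feed a hybrid DNN/HMM acoustic model, with the DNN estimating state posteriors and the HMMs supplying the sequential (articulatory-context) structure used during lattice rescoring, as described in the abstract.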