{"title":"基于多任务学习的端到端维吾尔语语音识别","authors":"Chuang Liu, Qiong Li, Zhiwei You","doi":"10.1109/ICARCE55724.2022.10046434","DOIUrl":null,"url":null,"abstract":"With the emergence of deep learning, speech recognition research for languages with a wide speaker base, such as English and Chinese, has become reasonably mature. However, Uyghur speech recognition research has only developed slowly, because the modeling is subpar since Uighur is a low-resource language. To solve the aforementioned problem, we propose an end-to-end Uyghur speech recognition model based on multi-task learning with Branchformer. Branchformer can capture both global and local contexts, allowing it to learn richer features from low volumes of data to improve model performance in resource-constrained situations. The multi-task learning model can fully utilize the data through multiple related tasks, enhancing the model’s generalizability under low resources. In this paper, the multi-task learning model is trained and decoded using modeling units including phoneme, sub-word, and word. The performance of the model is then evaluated using various datasets. The results show that the proposed model outperforms the mainstream models in terms of recognition, in the self-built database and open-source corpus Thuyg-20, respectively, the word accuracy of the proposed model reaches 94.48% and 91.16%, which is a significant improvement compared with each benchmark model.","PeriodicalId":416305,"journal":{"name":"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"An End-to-End Uyghur Speech Recognition Based on Multi-task Learning\",\"authors\":\"Chuang Liu, Qiong Li, Zhiwei You\",\"doi\":\"10.1109/ICARCE55724.2022.10046434\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the emergence of deep learning, speech recognition research for languages with a wide speaker base, such as English and Chinese, has become reasonably mature. However, Uyghur speech recognition research has only developed slowly, because the modeling is subpar since Uighur is a low-resource language. To solve the aforementioned problem, we propose an end-to-end Uyghur speech recognition model based on multi-task learning with Branchformer. Branchformer can capture both global and local contexts, allowing it to learn richer features from low volumes of data to improve model performance in resource-constrained situations. The multi-task learning model can fully utilize the data through multiple related tasks, enhancing the model’s generalizability under low resources. In this paper, the multi-task learning model is trained and decoded using modeling units including phoneme, sub-word, and word. The performance of the model is then evaluated using various datasets. 
The results show that the proposed model outperforms the mainstream models in terms of recognition, in the self-built database and open-source corpus Thuyg-20, respectively, the word accuracy of the proposed model reaches 94.48% and 91.16%, which is a significant improvement compared with each benchmark model.\",\"PeriodicalId\":416305,\"journal\":{\"name\":\"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICARCE55724.2022.10046434\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICARCE55724.2022.10046434","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An End-to-End Uyghur Speech Recognition Based on Multi-task Learning
With the emergence of deep learning, speech recognition research for widely spoken languages such as English and Chinese has matured considerably. Uyghur speech recognition, however, has developed slowly: Uyghur is a low-resource language, and the scarcity of data makes modeling difficult. To address this problem, we propose an end-to-end Uyghur speech recognition model based on multi-task learning with Branchformer. Branchformer captures both global and local contexts, allowing it to learn richer features from small amounts of data and thus improve performance in resource-constrained settings. The multi-task learning model makes full use of the data through multiple related tasks, enhancing generalizability under low-resource conditions. In this paper, the multi-task learning model is trained and decoded with phoneme, sub-word, and word modeling units, and its performance is evaluated on several datasets. The results show that the proposed model outperforms mainstream models in recognition accuracy: on the self-built database and the open-source Thuyg-20 corpus, it reaches word accuracies of 94.48% and 91.16%, respectively, a significant improvement over each benchmark model.
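The abstract does not specify the exact training objective, but a common way to realize multi-task learning over several modeling units is to attach one decoding head per unit to a shared encoder and combine their losses. The sketch below illustrates this idea in PyTorch with CTC heads; the generic Transformer encoder is only a stand-in for Branchformer, and the module names, vocabulary sizes, and equal loss weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal multi-task ASR sketch: one shared encoder, one CTC head per
# modeling unit (phoneme, sub-word, word). Assumptions: no subsampling in
# the encoder, equal task weights, illustrative vocabulary sizes.
import torch
import torch.nn as nn


class MultiUnitASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256,
                 n_phonemes=64, n_subwords=2000, n_words=8000):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=1024, batch_first=True)
        # Generic Transformer encoder standing in for Branchformer.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        # One CTC output head per modeling unit (+1 for the CTC blank).
        self.phoneme_head = nn.Linear(d_model, n_phonemes + 1)
        self.subword_head = nn.Linear(d_model, n_subwords + 1)
        self.word_head = nn.Linear(d_model, n_words + 1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens):
        # feats: (batch, time, feat_dim); targets / target_lens are dicts
        # keyed by "phoneme", "subword", "word".
        enc = self.encoder(self.frontend(feats))  # (B, T, d_model)
        losses = {}
        for name, head in [("phoneme", self.phoneme_head),
                           ("subword", self.subword_head),
                           ("word", self.word_head)]:
            log_probs = head(enc).log_softmax(-1).transpose(0, 1)  # (T, B, V)
            losses[name] = self.ctc(log_probs, targets[name],
                                    feat_lens, target_lens[name])
        # Equal weighting of the three tasks is an assumption; in practice
        # the weights are usually tuned on a development set.
        return sum(losses.values()) / len(losses), losses
```

In such a setup, the auxiliary phoneme and word heads act as regularizers during training, while decoding would typically use only the sub-word head (or a combination of heads), which is one plausible reading of how multiple modeling units can help under low-resource conditions.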