Disordered speech recognition considering low resources and abnormal articulation
Yuqin Lin, Jianwu Dang, Longbiao Wang, Sheng Li, Chenchen Ding
Speech Communication, vol. 155, Article 103002, published 2023-11-01 (Journal Article; impact factor 2.4; JCR Q2, Acoustics)
DOI: 10.1016/j.specom.2023.103002
https://www.sciencedirect.com/science/article/pii/S016763932300136X
Citations: 0
Abstract
The success of automatic speech recognition (ASR) benefits a great number of healthy people, but not people with speech disorders. People with speech disorders may truly need support from this technology, yet they gain little from it. The difficulties of ASR for disordered speech arise from the limited availability of data and the abnormal nature of the speech, e.g., unclear, unstable, and incorrect pronunciations. To realize ASR for disordered speech, this study addresses the problem in two respects: low resources and articulatory abnormality. To address low resources, we propose staged knowledge distillation (KD), which provides different references to the student models according to their mastery of knowledge, so as to avoid feature overfitting. To tackle the articulatory abnormalities in dysarthria, we propose an intended phonological perception method (IPPM) that applies the motor theory of speech perception to ASR: intended phonological features are estimated and provided to the ASR system. We further address the challenges of disordered-speech ASR by combining the staged KD and the IPPM. The TORGO database and the UASPEECH corpus are two commonly used datasets of dysarthria, which is the main cause of speech disorders. Experiments on the two datasets validated the effectiveness of the proposed methods. Compared with the baseline, the proposed method achieves relative phoneme error rate reductions (PERRs) of 35.14% to 38.12% for speakers with varying degrees of dysarthria on the TORGO database, and relative PERRs of 8.17% to 13.00% on the UASPEECH corpus. The experiments demonstrated that addressing both low resources and speech abnormality is an effective approach, and that the proposed methods significantly improved ASR performance for disordered speech.
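The abstract reports its results as relative phoneme error rate reductions. As a rough illustration only, the sketch below assumes the usual definitions: phoneme error rate (PER) as the edit distance between reference and hypothesis phoneme sequences divided by the reference length, and relative reduction as the drop in PER divided by the baseline PER. The function names and the example phoneme sequences are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER in percent, assuming the standard edit-distance definition."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_per: float, system_per: float) -> float:
    """Relative PER reduction in percent, measured against the baseline."""
    if baseline_per <= 0:
        raise ValueError("baseline PER must be positive")
    return 100.0 * (baseline_per - system_per) / baseline_per


# Hypothetical example: one substitution and one deletion in four phonemes.
per = phoneme_error_rate(["s", "p", "iy", "ch"], ["s", "b", "iy"])
```

Under these assumptions, a system that lowers PER from 50% to 32% would have a relative reduction of 36%, which is how the ranges quoted above can be read.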
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.