{"title":"解耦结构可提高端到端模型的适应性","authors":"Keqi Deng, Philip C. Woodland","doi":"10.1016/j.specom.2024.103109","DOIUrl":null,"url":null,"abstract":"<div><p>Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103109"},"PeriodicalIF":2.4000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000803/pdfft?md5=7e35ebdc40ecd26754dcc103e392268c&pid=1-s2.0-S0167639324000803-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Decoupled structure for improved adaptability of end-to-end models\",\"authors\":\"Keqi Deng, Philip C. Woodland\",\"doi\":\"10.1016/j.specom.2024.103109\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"163 \",\"pages\":\"Article 103109\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000803/pdfft?md5=7e35ebdc40ecd26754dcc103e392268c&pid=1-s2.0-S0167639324000803-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000803\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000803","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Decoupled structure for improved adaptability of end-to-end models
Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.
期刊介绍:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal''s primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.