Language fusion via adapters for low-resource speech recognition
Qing Hu, Yan Zhang, Xianlei Zhang, Zongyu Han, Xiuxia Liang
Journal: Speech Communication (impact factor 2.4; JCR Q2, Acoustics)
Publication date: 2024-03-01
Publication type: Journal Article
DOI: 10.1016/j.specom.2024.103037
URL: https://www.sciencedirect.com/science/article/pii/S0167639324000098
Citations: 0
Abstract
Data scarcity causes low-resource speech recognition systems to overfit severely. Fine-tuning addresses this issue to some extent, but it is parameter-inefficient. This paper proposes LanFusion, a novel language knowledge fusion method. It is built on the recently popular adapter-tuning technique and is therefore more parameter-efficient than conventional fine-tuning. LanFusion is a two-stage method. First, multiple adapters are trained on several source languages to extract language-specific and language-invariant knowledge. Then, the trained adapters are re-trained on the target low-resource language to fuse the learned knowledge. Compared with Vanilla-adapter, LanFusion obtains a relative average word error rate (WER) reduction of 9.8% and 8.6% on the Common Voice and FLEURS corpora, respectively. Extensive experiments demonstrate that the proposed method is simple, effective, and parameter-efficient. In addition, using source languages that are geographically similar to the target language yields better results on both datasets.
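The abstract reports its gains as relative WER reductions. As a reading aid, the following minimal sketch shows how WER and a relative reduction are commonly computed. It is not the authors' evaluation code: the function names, the whitespace tokenization, and the sample WER values are illustrative assumptions, and the numbers are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, chosen only to illustrate the arithmetic.
baseline = 0.50   # assumed WER of a baseline system
improved = 0.45   # assumed WER of an improved system
print(f"relative reduction: {relative_reduction(baseline, improved):.1%}")
```

Under this definition a relative reduction is measured against the baseline's own WER, so the same relative figure corresponds to a larger absolute drop when the baseline WER is higher.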
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.