PLDE: A lightweight pooling layer for spoken language recognition

Speech Communication (IF 2.4; JCR Q2, Acoustics; Zone 3, Computer Science). Pub Date: 2024-02-23. DOI: 10.1016/j.specom.2024.103055
Zimu Li, Yanyan Xu, Dengfeng Ke, Kaile Su
Full text: https://www.sciencedirect.com/science/article/pii/S016763932400027X
Citations: 0

PLDE: A lightweight pooling layer for spoken language recognition

In recent years, the transfer learning method of replacing acoustic features with phonetic features has become a new paradigm for end-to-end spoken language recognition. However, these larger transfer learning models always encode too much redundant information. In this paper, we propose a lightweight language recognition decoder based on a phonetic learnable dictionary encoding (PLDE) layer, which is better suited to phonetic features and achieves better recognition performance while significantly reducing the number of parameters. The lightweight decoder consists of three main parts: (1) a phonetic learnable dictionary with ghost clusters, which improves on the traditional LDE pooling layer and uses the ghost clusters to enhance the model's ability to model noise; (2) coarse-grained chunk-level pooling, which highlights the phone sequence and suppresses noise around ghost clusters, thereby reducing their influence on the subsequent network; (3) fine-grained chunk-level projection, which enables the discriminative network to obtain more linguistic information and thereby improves the model's modelling ability. These three parts simplify the language recognition decoder into a PLDE pooling layer, reducing the parameter size of the decoder by at least one order of magnitude while achieving better recognition performance. In experiments on the OLR2020 dataset, the $C_{avg}$ of the proposed method is better than that of the current state-of-the-art language recognition system, with improvements of 24.68% and 42.24% on the cross-channel test set and the unknown noise test set, respectively. Furthermore, experimental results on the OLR2021 dataset also demonstrate the effectiveness of PLDE.
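
The abstract does not give the PLDE equations, so the following is only a minimal sketch of the general idea behind part (1). It assumes a common formulation of LDE pooling (soft assignment of each frame to dictionary components by scaled squared distance, then a weighted mean residual per component) combined with GhostVLAD-style ghost clusters, which take part in the soft assignment but are dropped from the output. The function name `lde_pool_with_ghosts` and all parameter names are hypothetical, and the chunk-level pooling and projection of parts (2) and (3) are not modelled here.

```python
import numpy as np


def lde_pool_with_ghosts(x, centers, scales, num_ghost):
    """Pool frame-level features with a learnable dictionary plus ghost clusters.

    x:         (T, D) frame-level features
    centers:   (C + G, D) dictionary components; the last G rows are ghosts
    scales:    (C + G,) positive per-component scale factors
    num_ghost: G, the number of ghost clusters to discard after assignment
    Returns a (C, D) array of pooled residuals for the non-ghost components.
    """
    # Residual of every frame to every component: (T, C + G, D)
    resid = x[:, None, :] - centers[None, :, :]
    # Scaled squared distance of every frame to every component: (T, C + G)
    dist = scales[None, :] * np.sum(resid ** 2, axis=-1)
    # Softmax over components (ghosts included), stabilised by max subtraction
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Assignment-weighted mean residual per component: (C + G, D)
    numer = np.einsum("tc,tcd->cd", weights, resid)
    denom = weights.sum(axis=0)[:, None] + 1e-8
    pooled = numer / denom
    # Ghost clusters absorb weight during assignment but are not emitted
    keep = centers.shape[0] - num_ghost
    return pooled[:keep]


# Illustrative usage with random data (all shapes are assumptions)
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))        # 50 frames, 8-dim features
dictionary = rng.normal(size=(6, 8))     # 4 real components + 2 ghosts
out = lde_pool_with_ghosts(frames, dictionary, np.ones(6), num_ghost=2)
print(out.shape)
```

In this sketch, frames that sit close to a ghost component give most of their assignment weight to it, so they contribute little to the components that are actually passed on. That is one plausible reading of how ghost clusters can absorb noise.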

Source journal
Speech Communication (Engineering & Technology; Computer Science, Interdisciplinary Applications)
CiteScore: 6.80
Self-citation rate: 6.20%
Articles published: 94
Review time: 19.2 weeks
About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.