PLDE: A lightweight pooling layer for spoken language recognition

Speech Communication (IF 3.0, CAS Zone 3 Computer Science, JCR Q2 Acoustics) Pub Date: 2024-03-01 Epub Date: 2024-02-23 DOI: 10.1016/j.specom.2024.103055

Zimu Li, Yanyan Xu, Dengfeng Ke, Kaile Su

Speech Communication, Volume 158, Article 103055. https://www.sciencedirect.com/science/article/pii/S016763932400027X

Citations: 0

Abstract




In recent years, transfer learning that replaces acoustic features with phonetic features has become a new paradigm for end-to-end spoken language recognition. However, these larger transfer learning models always encode too much redundant information. In this paper, we propose a lightweight language recognition decoder based on a phonetic learnable dictionary encoding (PLDE) layer, which is better suited to phonetic features and achieves better recognition performance while significantly reducing the number of parameters. The lightweight decoder consists of three main parts: (1) a phonetic learnable dictionary with ghost clusters, which improves on the traditional LDE pooling layer and uses the ghost clusters to enhance the model’s ability to model noise; (2) coarse-grained chunk-level pooling, which highlights the phone sequence and suppresses noise around ghost clusters, thereby reducing their influence on the subsequent network; (3) fine-grained chunk-level projection, which lets the discriminative network obtain more linguistic information and thus improves the model’s modelling ability. Together, these three parts simplify the language recognition decoder into a PLDE pooling layer, reducing the decoder's parameter size by at least one order of magnitude while achieving better recognition performance. In experiments on the OLR2020 dataset, the $C_{avg}$ of the proposed method surpasses that of the current state-of-the-art language recognition system, with improvements of 24.68% and 42.24% on the cross-channel test set and the unknown noise test set, respectively. Experimental results on the OLR2021 dataset further demonstrate the effectiveness of PLDE.
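The abstract gives no formulas, so the following is only a minimal sketch of what a learnable-dictionary pooling step with ghost clusters could look like. It assumes the commonly used LDE soft assignment (a softmax over scaled squared distances to the dictionary centers) and assumes that ghost clusters take part in the assignment while their pooled outputs are discarded. The function name `lde_pool_with_ghosts`, its parameters, and the toy data are hypothetical; chunk-level pooling and projection are not modelled, and this is not the authors' implementation.

```python
import math

def lde_pool_with_ghosts(frames, centers, scales, num_ghost):
    """Pool T frame vectors into one residual vector per real cluster.

    frames:    list of T vectors, each of length D
    centers:   list of C dictionary centers; the last num_ghost are ghosts
    scales:    list of C positive smoothing factors, one per center
    num_ghost: number of trailing centers treated as ghost clusters (assumed)
    """
    num_centers = len(centers)
    dim = len(frames[0])
    weight_sum = [0.0] * num_centers
    resid_sum = [[0.0] * dim for _ in range(num_centers)]

    for x in frames:
        # Soft assignment: softmax over -s_c * ||x - mu_c||^2, ghosts included.
        logits = [-s * sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
                  for mu, s in zip(centers, scales)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for c, (mu, e) in enumerate(zip(centers, exps)):
            w = e / total
            weight_sum[c] += w
            for d in range(dim):
                resid_sum[c][d] += w * (x[d] - mu[d])

    # Keep only the real clusters: ghosts absorbed weight but are dropped here.
    num_real = num_centers - num_ghost
    eps = 1e-8
    return [[r / (weight_sum[c] + eps) for r in resid_sum[c]]
            for c in range(num_real)]

# Toy data: two frames near the real center and one far-away "noise" frame.
frames = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
centers = [[0.1, 0.0], [5.0, 5.0]]  # first is real, second is a ghost

pooled = lde_pool_with_ghosts(frames, centers, [1.0, 1.0], num_ghost=1)
pooled_no_ghost = lde_pool_with_ghosts(frames, centers[:1], [1.0], num_ghost=0)
```

In this toy setup the ghost center absorbs almost all of the outlier frame's assignment weight, so the real cluster's pooled residual stays close to zero. Without the ghost center, the outlier is forced onto the only available cluster and pulls the pooled residual away from it.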


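Source journal

Speech Communication (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Articles published:

Review time: 19.2 weeks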

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.