Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition

IF 3.2 | Zone 2 (Engineering & Technology) | JCR Q2: ENGINEERING, ELECTRICAL & ELECTRONIC | IEEE Signal Processing Letters | Pub Date: 2024-10-29 | DOI: 10.1109/LSP.2024.3487795
Lin Zheng;Han Zhu;Sanli Tian;Qingwei Zhao;Ta Li
IEEE Signal Processing Letters, vol. 31, pp. 3119-3123, published 2024-10-29. https://ieeexplore.ieee.org/document/10737652/
Citations: 0

Abstract

Serialized Output Training (SOT) has emerged as the mainstream approach to multi-talker overlapped speech recognition because of its simplicity. However, SOT suffers from cross-domain performance degradation, which hinders its application. Meanwhile, traditional domain adaptation methods may harm the accuracy of speaker change point prediction, which is evaluated by UD-CER, an important metric in SOT. To solve these issues, we propose Pseudo-Labeling based SOT (PL-SOT) for domain adaptation, which treats the speaker change token (<sc>) specially during training to increase the accuracy of speaker change point prediction. Firstly, we improve the CTC loss by proposing the Weakening and Enhancing CTC (WE-CTC) loss, which weakens the learning of error-prone labels surrounding <sc> while enhancing the emission probability of <sc> by modifying the posteriors of the pseudo-labels. Secondly, we introduce a Weighted Confidence Filter (WCF) that assigns higher scores to <sc>, so that low-quality pseudo-labels are excluded without hurting <sc> prediction. Experimental results show that PL-SOT achieves average relative reductions of 17.7% in CER and 12.8% in UD-CER, with AliMeeting as the source domain and AISHELL-4 and MagicData-RAMC as the target domains.
Source journal
IEEE Signal Processing Letters (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 7.40
Self-citation rate: 12.80%
Articles published: 339
Review time: 2.8 months
About the journal: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.