{"title":"Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition","authors":"Lin Zheng;Han Zhu;Sanli Tian;Qingwei Zhao;Ta Li","doi":"10.1109/LSP.2024.3487795","DOIUrl":null,"url":null,"abstract":"Serialized Output Training (SOT) has emerged as the mainstream approach for addressing the multi-talker overlapped speech recognition challenge due to its simplicity. However, SOT encounters cross-domain performance degradation which hinders its application. Meanwhile, traditional domain adaption methods may harm the accuracy of speaker change point prediction evaluated by UD-CER, which is an important metric in SOT. To solve these issues, we propose Pseudo-Labeling based SOT (PL-SOT) for domain adaptation by treating speaker change token (\n<inline-formula><tex-math>$< $</tex-math></inline-formula>\nsc\n<inline-formula><tex-math>$>$</tex-math></inline-formula>\n) specially during training to increase the accuracy of speaker change point prediction. Firstly, we improve CTC loss by proposing \n<italic>Weakening and Enhancing CTC</i>\n (WE-CTC) loss to weaken the learning of error-prone labels surrounding \n<inline-formula><tex-math>$<$</tex-math></inline-formula>\nsc\n<inline-formula><tex-math>$>$</tex-math></inline-formula>\n while enhance the emission probability of \n<inline-formula><tex-math>$< $</tex-math></inline-formula>\nsc\n<inline-formula><tex-math>$>$</tex-math></inline-formula>\n through modifying posteriors of the pseudo-labels. Secondly, we introduce \n<italic>Weighted Confidence Filter</i>\n (WCF) that assigns higher scores of \n<inline-formula><tex-math>$<$</tex-math></inline-formula>\nsc\n<inline-formula><tex-math>$>$</tex-math></inline-formula>\n to exclude low-quality pseudo-labels without hurting the \n<inline-formula><tex-math>$< $</tex-math></inline-formula>\nsc\n<inline-formula><tex-math>$>$</tex-math></inline-formula>\n prediction. 
Experimental results show that PL-SOT achieves 17.7%/12.8% average relative reduction of CER/UD-CER, with AliMeeting as source domain and AISHELL-4 along with MagicData-RAMC as target domain.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3119-3123"},"PeriodicalIF":3.2000,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10737652/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citation count: 0
Abstract
Serialized Output Training (SOT) has emerged as the mainstream approach to multi-talker overlapped speech recognition due to its simplicity. However, SOT suffers cross-domain performance degradation, which hinders its application. Meanwhile, traditional domain adaptation methods may harm the accuracy of speaker change point prediction, evaluated by UD-CER, an important metric in SOT. To solve these issues, we propose Pseudo-Labeling based SOT (PL-SOT) for domain adaptation, which treats the speaker change token (<sc>) specially during training to increase the accuracy of speaker change point prediction. Firstly, we improve the CTC loss by proposing the Weakening and Enhancing CTC (WE-CTC) loss, which weakens the learning of error-prone labels surrounding <sc> while enhancing the emission probability of <sc> by modifying the posteriors of the pseudo-labels. Secondly, we introduce a Weighted Confidence Filter (WCF) that assigns higher scores to <sc> in order to exclude low-quality pseudo-labels without hurting <sc> prediction. Experimental results show that PL-SOT achieves a 17.7%/12.8% average relative reduction in CER/UD-CER, with AliMeeting as the source domain and AISHELL-4 along with MagicData-RAMC as the target domains.
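To make the WE-CTC idea concrete, the following is a minimal sketch of the posterior-modification step described above: given frame-level CTC posteriors and a pseudo-label sequence, the emission probability of <sc> is enhanced while the (error-prone) labels immediately surrounding <sc> are weakened. The token index `SC` and the factors `ALPHA`/`BETA` are hypothetical placeholders, not values from the paper, and this sketch does not reproduce the full WE-CTC loss.

```python
import numpy as np

SC = 0          # hypothetical vocabulary index of the <sc> token
ALPHA = 0.5     # assumed weakening factor for labels adjacent to <sc>
BETA = 2.0      # assumed enhancement factor for <sc> emissions

def modify_posteriors(posteriors, pseudo_labels):
    """Sketch of a WE-CTC-style posterior modification.

    posteriors: (T, V) frame-level token posteriors from the CTC head.
    pseudo_labels: token ids decoded by the source-domain model.
    Returns a renormalized copy in which <sc> emissions are enhanced
    and the labels immediately surrounding <sc> are weakened.
    """
    post = posteriors.copy()
    # Collect tokens adjacent to a speaker-change token; their
    # pseudo-labels are treated as error-prone.
    weak = set()
    for i, tok in enumerate(pseudo_labels):
        if tok == SC:
            if i > 0:
                weak.add(pseudo_labels[i - 1])
            if i + 1 < len(pseudo_labels):
                weak.add(pseudo_labels[i + 1])
    weak.discard(SC)
    # Enhance <sc>, weaken its error-prone neighbours.
    post[:, SC] *= BETA
    for tok in weak:
        post[:, tok] *= ALPHA
    # Renormalize each frame back to a probability distribution.
    post /= post.sum(axis=1, keepdims=True)
    return post
```

The modified posteriors would then serve as soft targets when training the target-domain model, shifting probability mass toward <sc> at speaker changes.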
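The WCF step can likewise be sketched as a weighted confidence score: per-token confidences are averaged with extra weight on <sc> tokens, so utterances whose <sc> predictions are uncertain are filtered out. `SC`, `SC_WEIGHT`, and `THRESHOLD` are assumed placeholders, not values from the paper.

```python
import numpy as np

SC = 0            # hypothetical vocabulary index of the <sc> token
SC_WEIGHT = 2.0   # assumed extra weight on <sc> confidences
THRESHOLD = 0.6   # assumed utterance-level acceptance threshold

def weighted_confidence(token_ids, token_confidences):
    """Sketch of a WCF-style utterance score: a weighted mean of
    per-token confidences that counts <sc> tokens more heavily."""
    weights = np.where(np.asarray(token_ids) == SC, SC_WEIGHT, 1.0)
    conf = np.asarray(token_confidences, dtype=float)
    return float((weights * conf).sum() / weights.sum())

def filter_pseudo_labels(utterances):
    """Keep only pseudo-labeled utterances whose weighted score passes
    the threshold; each utterance is (token_ids, token_confidences)."""
    return [u for u in utterances if weighted_confidence(*u) >= THRESHOLD]
```

Because <sc> confidences are upweighted, an utterance with an unreliable speaker-change prediction is rejected even when its other tokens are confident, which matches the stated goal of filtering without hurting <sc> prediction.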
Journal overview:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP and ICIP, as well as at several workshops organized by the Signal Processing Society.