Unsupervised Adaptive Speaker Recognition by Coupling-Regularized Optimal Transport

IF 4.1 | Zone 2 (Computer Science) | JCR Q1 (Acoustics) | IEEE/ACM Transactions on Audio, Speech, and Language Processing | Pub Date: 2024-07-12 | DOI: 10.1109/TASLP.2024.3426934
Ruiteng Zhang;Jianguo Wei;Xugang Lu;Wenhuan Lu;Di Jin;Lin Zhang;Junhai Xu
Volume 32, pp. 3603-3617. Journal article. Source: https://ieeexplore.ieee.org/document/10596689/
Citations: 0

Abstract

Unsupervised domain adaptation (UDA) algorithms can improve cross-domain speaker recognition (SR), but they often reduce domain mismatch at the cost of less discriminative speaker features. Optimal transport (OT), in contrast, has the potential to align domains while preserving speaker discrimination in UDA applications. However, naively applying OT to measure global probability distribution discrepancies between the source and target domains may induce negative transport, in which samples belonging to different speakers are coupled. Negative transport reduces the SR model's discriminative power and degrades SR performance. This paper proposes a coupling-regularized optimal transport (CROT) algorithm for cross-domain SR that reduces negative transport during UDA. In CROT, two consecutive processing modules regularize the coupling paths of the OT solution: a progressive inter-speaker constraint (PISC) module and a coupling-smoothed regularization (CSR) module. The PISC module, designed as a pseudo-label memory bank with curriculum learning, is applied first to select valid samples so that coupled samples come from the same speaker. The CSR module, designed to control the information entropy of the coupling paths, further reduces the effect of negative transport in UDA. To evaluate the proposed algorithm, cross-domain SR experiments were conducted under different target domains, speaker encoders, corpora, and acoustic features. Experimental results showed that CROT achieved a 50% relative reduction in equal error rate (EER) compared with conventional OT-based UDA methods and outperformed state-of-the-art UDA methods.
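The notion of negative transport in the abstract can be made concrete with a small toy sketch. The code below is not the authors' CROT implementation: it uses standard entropy-regularized OT (Sinkhorn iterations) on synthetic 2-D "embeddings", and it stands in for the pseudo-label constraint with a simple cost penalty on source/target pairs whose labels disagree. All names (`cosine_cost`, `sinkhorn`, `cross_speaker_mass`, `coupling_entropy`), the penalty value, and the data are assumptions made for illustration only.

```python
import numpy as np

def cosine_cost(xs, xt):
    """Pairwise cosine distance between source and target embeddings."""
    xs = xs / np.linalg.norm(xs, axis=1, keepdims=True)
    xt = xt / np.linalg.norm(xt, axis=1, keepdims=True)
    return 1.0 - xs @ xt.T

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan via plain Sinkhorn scaling."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match target marginal
        u = a / (K @ v)              # match source marginal
    return u[:, None] * K * v[None, :]

def cross_speaker_mass(plan, ys, yt):
    """Total mass coupled between samples with different labels."""
    return float(plan[ys[:, None] != yt[None, :]].sum())

def coupling_entropy(plan):
    """Shannon entropy of the coupling, ignoring exact zeros."""
    p = plan[plan > 0]
    return float(-(p * np.log(p)).sum())

# Synthetic data: two speakers, target domain shifted by a constant offset.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
ys = np.array([0, 0, 0, 1, 1, 1])            # source speaker labels
yt_pseudo = np.array([0, 0, 1, 1])           # hypothetical target pseudo-labels
xs = centers[ys] + 0.1 * rng.standard_normal((6, 2))
xt = centers[yt_pseudo] + 0.6 + 0.1 * rng.standard_normal((4, 2))

a = np.full(6, 1 / 6)                        # uniform source weights
b = np.full(4, 1 / 4)                        # uniform target weights
cost = cosine_cost(xs, xt)

# Plain entropic OT: some mass leaks across speakers.
plain = sinkhorn(cost, a, b, eps=0.1)

# Illustrative constraint: penalize pairs whose labels disagree.
penalty = 10.0 * (ys[:, None] != yt_pseudo[None, :])
constrained = sinkhorn(cost + penalty, a, b, eps=0.1)

# Larger eps gives a smoother (higher-entropy) coupling.
smooth = sinkhorn(cost, a, b, eps=1.0)

print("cross-speaker mass, plain:      ", cross_speaker_mass(plain, ys, yt_pseudo))
print("cross-speaker mass, constrained:", cross_speaker_mass(constrained, ys, yt_pseudo))
print("entropy eps=0.1 vs eps=1.0:     ",
      coupling_entropy(plain), coupling_entropy(smooth))
```

In this sketch the penalty plays the role of restricting couplings to same-speaker pairs, and `eps` is the usual entropic regularization weight that trades sharpness of the coupling against smoothness. How PISC selects samples and how CSR shapes the coupling entropy in the actual paper is not specified in the abstract, so neither is modeled here.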
Source journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)
CiteScore: 11.30
Self-citation rate: 11.10%
Articles published: 217
About the journal: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.
Latest articles from this journal
CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction
Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features
FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection
ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling
Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models