Unsupervised Adaptive Speaker Recognition by Coupling-Regularized Optimal Transport

IF 5.1 | Zone 2 (Computer Science) | Q1 ACOUSTICS | IEEE/ACM Transactions on Audio, Speech, and Language Processing | Pub Date: 2024-07-12 | DOI: 10.1109/TASLP.2024.3426934

Ruiteng Zhang;Jianguo Wei;Xugang Lu;Wenhuan Lu;Di Jin;Lin Zhang;Junhai Xu

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3603-3617, 2024. https://ieeexplore.ieee.org/document/10596689/

Citations: 0

Abstract

Cross-domain speaker recognition (SR) can be improved by unsupervised domain adaptation (UDA) algorithms. UDA algorithms often reduce domain mismatch at the cost of decreasing the discrimination of speaker features. In contrast, optimal transport (OT) has the potential to achieve domain alignment while preserving the speaker discrimination capability in UDA applications; however, naively applying OT to measure global probability distribution discrepancies between the source and target domains may induce negative transports where samples belonging to different speakers are coupled in transportation. These negative transports reduce the SR model's discriminative power, degrading the SR performance. This paper proposes a coupling-regularized optimal transport (CROT) algorithm for cross-domain SR to reduce the negative transport during UDA. In the proposed CROT, two consecutive processing modules regularize the coupling paths for the OT solution: a progressive inter-speaker constraint (PISC) module and a coupling-smoothed regularization (CSR) module. The PISC, designed as a pseudo-label memory bank with curriculum learning, is first applied to select valid samples to guarantee that coupling samples are from the same speaker. The CSR, designed to control the information entropy of the coupling paths further, reduces the effect of negative transport in UDA. To evaluate the effectiveness of the proposed algorithm, cross-domain SR experiments were conducted under different target domains, speaker encoders, corpora, and acoustic features. Experimental results showed that CROT achieved a 50% relative reduction in equal error rates compared to conventional OT-based UDAs, outperforming the state-of-the-art UDAs.
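The abstract's central idea, that unconstrained OT can couple samples from different speakers and that regularizing the coupling suppresses this, can be illustrated with a toy example. The sketch below is not the CROT algorithm: the abstract does not specify how PISC or CSR are computed, so the code uses standard entropy-regularized OT (Sinkhorn iterations) and a hypothetical label-mismatch penalty on the cost matrix as a stand-in for a same-speaker constraint. The function names, the `penalty` and `eps` values, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations.

    cost: (n, m) pairwise costs; a: (n,) source weights; b: (m,) target weights.
    A larger eps yields a smoother (higher-entropy) coupling.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale rows toward marginal a
        v = b / (K.T @ u)                # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]   # coupling matrix

def pairwise_cost(src, tgt):
    """Squared Euclidean distance between every source/target pair."""
    diff = src[:, None, :] - tgt[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Toy embeddings: two speakers, target domain shifted along the first axis.
src = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0]])
src_labels = np.array([0, 0, 1, 1])
tgt = src + np.array([0.6, 0.0])
tgt_pseudo = np.array([0, 0, 1, 1])      # assumed (correct) pseudo-labels

a = np.full(len(src), 1.0 / len(src))    # uniform source weights
b = np.full(len(tgt), 1.0 / len(tgt))    # uniform target weights

# True where a source sample and a target sample belong to different speakers.
mismatch = src_labels[:, None] != tgt_pseudo[None, :]

cost = pairwise_cost(src, tgt)
P_plain = sinkhorn(cost, a, b)

# Hypothetical constraint: make cross-speaker couplings expensive.
penalty = 10.0
P_masked = sinkhorn(cost + penalty * mismatch, a, b)

# Total mass transported between different speakers ("negative transport").
leak_plain = P_plain[mismatch].sum()
leak_masked = P_masked[mismatch].sum()
print(f"cross-speaker mass, plain OT:  {leak_plain:.3e}")
print(f"cross-speaker mass, masked OT: {leak_masked:.3e}")
```

In this toy setting the domain shift places some target samples of speaker 0 closer to source samples of speaker 1, so the plain coupling leaks mass across speakers, while the penalized cost drives that mass toward zero. In practice the pseudo-labels would be noisy, which is presumably why the paper selects samples progressively rather than trusting all of them at once.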

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

Latest articles from this journal

List of Reviewers
IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization
MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation
Online Neural Speaker Diarization With Target Speaker Tracking
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach