Investigating Long-Term and Short-Term Time-Varying Speaker Verification

IF 4.1 | CAS Zone 2 (Computer Science) | JCR Q1 (Acoustics) | IEEE/ACM Transactions on Audio, Speech, and Language Processing | Pub Date: 2024-07-16 | DOI: 10.1109/TASLP.2024.3428910

Xiaoyi Qin;Na Li;Shufei Duan;Ming Li

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3408-3423, 2024. Journal article. https://ieeexplore.ieee.org/document/10599875/

Citations: 0

Abstract

The performance of speaker verification systems can be adversely affected by time-domain variations. However, limited research has been conducted on time-varying speaker verification because appropriate datasets are lacking. This paper investigates the impact of long-term and short-term time variation on speaker verification and proposes solutions to mitigate these effects. For long-term speaker verification (i.e., cross-age speaker verification), we introduce an age-decoupling adversarial learning method that learns age-invariant speaker representations by mining age information from the VoxCeleb dataset. For short-term speaker verification, we collect the SMIIP-TimeVarying (SMIIP-TV) dataset, which includes recordings made at multiple time slots every day from 373 speakers over 90 consecutive days, along with other relevant meta information. Using this dataset, we analyze how speaker embeddings vary over time and propose a novel but realistic time-varying speaker verification task, termed incremental sequence-pair speaker verification. This task involves continuous interaction between the enrollment audio and a sequence of test audio recordings, with the aim of improving performance over time. We introduce a template updating method to counter the negative effects that accumulate over time, then formulate the template updating process as a Markov Decision Process and propose a template updating method based on deep reinforcement learning (DRL). The policy network of the DRL model is treated as an agent that determines whether, and by how much, the template should be updated. In summary, this paper releases our collected database, investigates both the long-term and short-term time-varying scenarios, and provides insights into and solutions for time-varying speaker verification.
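The template-updating idea in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the paper learns the update decision with a DRL policy network, whereas the sketch below substitutes a fixed score threshold (whether to update) and a fixed blending weight `alpha` (how much to update). The function names, the cosine-similarity scoring, and the parameter values are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def update_template(template: Sequence[float], test_emb: Sequence[float],
                    alpha: float) -> List[float]:
    """Blend the test embedding into the template with weight alpha."""
    test_n = l2_normalize(test_emb)
    blended = [(1.0 - alpha) * t + alpha * e for t, e in zip(template, test_n)]
    return l2_normalize(blended)


def run_sequence(enroll: Sequence[float], tests: Sequence[Sequence[float]],
                 threshold: float = 0.5,
                 alpha: float = 0.2) -> Tuple[List[float], List[float]]:
    """Score a sequence of test embeddings against an evolving template.

    Assumption: a fixed rule stands in for the learned policy. The template
    is updated only when the trial is accepted (score >= threshold), and
    always by the same amount alpha.
    """
    template = l2_normalize(enroll)
    scores = []
    for emb in tests:
        score = cosine(template, emb)
        scores.append(score)
        if score >= threshold:
            template = update_template(template, emb, alpha)
    return scores, template


if __name__ == "__main__":
    # Hypothetical 2-D embeddings that drift 10 degrees per session.
    enroll = [1.0, 0.0]
    tests = [[math.cos(math.radians(d)), math.sin(math.radians(d))]
             for d in (10, 20, 30)]
    scores, template = run_sequence(enroll, tests, alpha=0.5)
    print(scores, template)
```

In this toy setting the updated template follows the drift, so later test embeddings score higher than they would against the original enrollment alone. A learned policy would replace the two fixed constants with decisions conditioned on the current state.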

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 11.30

Self-citation rate: 11.10%

Publication volume: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.