Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation

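IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3795-3810 | IF 5.1 | Zone 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-08 | DOI: 10.1109/TASLP.2024.3440638 | https://ieeexplore.ieee.org/document/10631297/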

Xue Yang; Changchun Bao; Xianhong Chen


Citations: 0

Abstract


Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation

To address the cocktail party problem, target speaker extraction (TSE) has received increasing attention in recent years. TSE is typically studied in two scenarios. In the specific scenario, the target speaker is present and the signal received by the microphone contains at least two speakers. In the universal scenario, the target speaker may be present or absent, and the received signal may contain one or multiple speakers. Many TSE studies use the target speaker's embedding to guide the extraction. However, relying solely on this embedding may not fully exploit the contextual information in the enrollment. To address this limitation, an approach that directly exploits the contextual information in the time-frequency (T-F) domain has been proposed. This paper improves that approach by integrating our previously proposed coarse-to-fine framework. For the specific scenario, an interaction block enables direct interaction between the T-F representations of the enrollment and the received signal. This interaction yields a consistent representation of the enrollment, which guides the coarse extraction. The T-F representation of the coarsely extracted signal then guides the refining extraction, and the residual representation obtained during refinement increases the extraction precision. In addition, this paper explores an undisturbed universal scenario in which noise and reverberation are not considered. A two-level decision-making scheme is devised to generalize the proposed method to this scenario. The proposed method achieves high performance and is shown to be effective in both scenarios.
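
The coarse-to-fine data flow described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' network: the abstract does not specify the interaction block or the extractor, so the sketch assumes a plain cross-attention as the interaction step and a simple ratio mask in place of a trained model. All function names (`interact`, `estimate_mask`, `coarse_to_fine`) and the input shapes are hypothetical, chosen only to show how the enrollment guides the coarse pass and how the coarse estimate guides a residual refinement.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interact(query, guide):
    """Cross-attention stand-in for the interaction block (an assumption).

    query: (T_q, F) magnitude frames of the received mixture
    guide: (T_g, F) magnitude frames of the guiding signal
    returns: (T_q, F) guidance aligned frame by frame with the query
    """
    scores = query @ guide.T / np.sqrt(query.shape[1])  # (T_q, T_g)
    return softmax(scores, axis=1) @ guide

def estimate_mask(mixture, guidance, eps=1e-8):
    # Toy ratio mask in [0, 1); a trained extractor would sit here.
    return guidance / (guidance + mixture + eps)

def coarse_to_fine(mixture, enrollment):
    # Coarse stage: the enrollment guides the first extraction.
    aligned_enroll = interact(mixture, enrollment)
    coarse = estimate_mask(mixture, aligned_enroll) * mixture
    # Refining stage: the coarse estimate guides a residual correction
    # computed on what the coarse pass left behind.
    aligned_coarse = interact(mixture, coarse)
    residual = estimate_mask(mixture, aligned_coarse) * (mixture - coarse)
    return coarse, coarse + residual

# Usage with random non-negative "spectrograms" (hypothetical sizes).
rng = np.random.default_rng(0)
mixture = rng.random((50, 129))      # 50 frames, 129 frequency bins
enrollment = rng.random((30, 129))   # enrollment may have a different length
coarse, fine = coarse_to_fine(mixture, enrollment)
```

The attention step lets an enrollment of any length produce guidance with the same number of frames as the mixture, which is one way to read the "consistent representation of the enrollment" mentioned above.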


Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 11.30
Self-citation rate: 11.10%
Articles published: 217

Journal scope: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.
