CLAPSep：利用对比预训练模型进行多模态查询条件下的目标声音提取

IF 4.1 2区计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-13 DOI:10.1109/TASLP.2024.3497586

Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu

{"title":"CLAPSep：利用对比预训练模型进行多模态查询条件下的目标声音提取","authors":"Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu","doi":"10.1109/TASLP.2024.3497586","DOIUrl":null,"url":null,"abstract":"Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4945-4960"},"PeriodicalIF":4.1000,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction\",\"authors\":\"Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu\",\"doi\":\"10.1109/TASLP.2024.3497586\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"4945-4960\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2024-11-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10752098/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10752098/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

摘要

通用声音分离（USS）旨在从真实世界的录音中提取任意类型的声音。这可以通过语言查询目标声音提取（TSE）来实现，TSE 通常由两个部分组成：一个是将用户查询转换为条件嵌入的查询网络，另一个是相应提取目标声音的分离网络。现有方法通常从头开始训练模型。因此，要使随机初始化的模型理解声音事件并进行相应的分离，需要大量的数据和计算资源。在本文中，我们建议将预先训练好的模型集成到 TSE 模型中，以解决上述问题。具体来说，我们为 USS 定制并调整了强大的对比语言音频预训练模型（CLAP），称为 CLAPSep；CLAPSep 还接受灵活的用户输入，可接受单模态和/或多模态的正反两方面用户提示来提取目标声音。CLAPSep 的这些关键功能不仅能提高提取性能，还能改善其应用的多样性。我们在 5 个不同的数据集上进行了广泛的实验，证明了我们所提出的 CLAPSep 性能优越，具有零次和少次通用性，训练收敛速度快，大大超过了以前的方法。我们还发布了完整的代码和一些音频示例，以供复制和评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

期刊最新文献

List of Reviewers IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation Online Neural Speaker Diarization With Target Speaker Tracking Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach