
arXiv - CS - Sound: Latest Publications

A Benchmark for Multi-speaker Anonymization
Pub Date : 2024-07-08 DOI: arxiv-2407.05608
Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang
Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus on single-speaker scenarios. However, they lack practicality for real-world applications, i.e., multi-speaker scenarios. In this paper, we present an initial attempt to provide a multi-speaker anonymization benchmark by defining the task and evaluation protocol, proposing benchmarking solutions, and discussing the privacy leakage of overlapping conversations. Specifically, ideal multi-speaker anonymization should preserve the number of speakers and the turn-taking structure of the conversation, ensuring accurate context conveyance while maintaining privacy. To achieve that, a cascaded system uses speaker diarization to aggregate the speech of each speaker and speaker anonymization to conceal speaker privacy and preserve speech content. Additionally, we propose two conversation-level speaker vector anonymization methods to improve the utility further. Both methods aim to make the original and corresponding pseudo-speaker identities of each speaker unlinkable while preserving or even improving the distinguishability among pseudo-speakers in a conversation. The first method minimizes the differential similarity across speaker pairs in the original and anonymized conversations to maintain original speaker relationships in the anonymized version. The other method minimizes the aggregated similarity across anonymized speakers to achieve better differentiation between speakers. Experiments conducted on both non-overlap simulated and real-world datasets demonstrate the effectiveness of the multi-speaker anonymization system with the proposed speaker anonymizers. Additionally, we analyzed overlapping speech regarding privacy leakage and provide potential solutions.
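
The two conversation-level objectives are simple enough to sketch directly. Below is a minimal NumPy illustration of how they could be scored, assuming each speaker in a conversation is summarized by a single speaker embedding (e.g., an x-vector); the function names, dimensions, and mean-squared formulation are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the two conversation-level anonymization objectives,
# assuming one unit-normalisable embedding per speaker. Illustrative only.
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarities between row vectors of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def differential_similarity_loss(orig_vecs, anon_vecs):
    """Method 1: keep the original pairwise speaker relationships by penalising
    differences between original and anonymized pairwise similarities."""
    S_orig, S_anon = cosine_matrix(orig_vecs), cosine_matrix(anon_vecs)
    mask = ~np.eye(len(orig_vecs), dtype=bool)          # off-diagonal pairs only
    return np.mean((S_orig[mask] - S_anon[mask]) ** 2)

def aggregated_similarity_loss(anon_vecs):
    """Method 2: push anonymized speakers apart by minimising their mean
    pairwise similarity, improving distinguishability."""
    S_anon = cosine_matrix(anon_vecs)
    mask = ~np.eye(len(anon_vecs), dtype=bool)
    return np.mean(S_anon[mask])

# Toy usage: a 3-speaker conversation with 192-dimensional speaker vectors.
rng = np.random.default_rng(0)
orig = rng.normal(size=(3, 192))
anon = rng.normal(size=(3, 192))
print(differential_similarity_loss(orig, anon), aggregated_similarity_loss(anon))
```
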
Citations: 0
MERGE -- A Bimodal Dataset for Static Music Emotion Recognition
Pub Date : 2024-07-08 DOI: arxiv-2407.06060
Pedro Lima Louro, Hugo Redinho, Ricardo Santos, Ricardo Malheiro, Renato Panda, Rui Pedro Paiva
The Music Emotion Recognition (MER) field has seen steady developments in recent years, with contributions from feature engineering, machine learning, and deep learning. The landscape has also shifted from audio-centric systems to bimodal ensembles that combine audio and lyrics. However, a severe lack of public and sizeable bimodal databases has hampered the development and improvement of bimodal audio-lyrics systems. This article proposes three new audio, lyrics, and bimodal MER research datasets, collectively called MERGE, created using a semi-automatic approach. To comprehensively assess the proposed datasets and establish a baseline for benchmarking, we conducted several experiments for each modality, using feature engineering, machine learning, and deep learning methodologies. In addition, we propose and validate fixed train-validate-test splits. The obtained results confirm the viability of the proposed datasets, achieving the best overall result of 79.21% F1-score for bimodal classification using a deep neural network.
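
The fixed-split, F1-based evaluation protocol can be pictured with a tiny stand-in pipeline. In the sketch below, the feature dimensionality, the four-quadrant label set, the logistic-regression classifier, and the macro averaging of F1 are all placeholder assumptions rather than MERGE specifics; only the idea of a fixed train/test split scored with an F1 metric mirrors the article.

```python
# Stand-in evaluation over a fixed split, scored with macro F1. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
quadrants = ["Q1", "Q2", "Q3", "Q4"]  # assumed valence/arousal quadrant labels

# Placeholders for features and labels loaded according to the fixed split.
X_train, y_train = rng.normal(size=(900, 64)), rng.choice(quadrants, 900)
X_test, y_test = rng.normal(size=(200, 64)), rng.choice(quadrants, 200)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("macro F1:", round(f1_score(y_test, clf.predict(X_test), average="macro"), 4))
```
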
Citations: 0
Music Era Recognition Using Supervised Contrastive Learning and Artist Information
Pub Date : 2024-07-07 DOI: arxiv-2407.05368
Qiqi He, Xuchen Song, Weituo Hao, Ju-Chiang Wang, Wei-Tsung Lu, Wei Li
Does popular music from the 60s sound different than that of the 90s? Prior study has shown that there would exist some variations of patterns and regularities related to instrumentation changes and growing loudness across multi-decadal trends. This indicates that perceiving the era of a song from musical features such as audio and artist information is possible. Music era information can be an important feature for playlist generation and recommendation. However, the release year of a song can be inaccessible in many circumstances. This paper addresses a novel task of music era recognition. We formulate the task as a music classification problem and propose solutions based on supervised contrastive learning. An audio-based model is developed to predict the era from audio. For the case where the artist information is available, we extend the audio-based model to take multimodal inputs and develop a framework, called MultiModal Contrastive (MMC) learning, to enhance the training. Experimental results on the Million Song Dataset demonstrate that the audio-based model achieves 54% accuracy with a tolerance of a 3-year range; incorporating the artist information with the MMC framework for training leads to a further 9% improvement.
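
The reported "accuracy with a tolerance of a 3-year range" suggests a simple tolerance-based metric. The sketch below assumes the model predicts a release year and counts a prediction as correct when it lies within ±3 years of the ground truth; the exact label granularity used in the paper (years versus era bins) is an assumption here.

```python
# Tolerance-based accuracy for era/year prediction. Illustrative assumption of the metric.
import numpy as np

def tolerance_accuracy(y_true, y_pred, tol=3):
    """Fraction of predictions within `tol` years of the true release year."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= tol))

print(tolerance_accuracy([1969, 1984, 1992], [1971, 1990, 1993]))  # 2 of 3 within range
```
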
Citations: 0
Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments
Pub Date : 2024-07-07 DOI: arxiv-2407.14525
Ritabrata Roy Choudhury
The proposed model aims to develop a speech recognition technology for hearing, speech, or cognitively disabled people. None of the available technology in the field of speech recognition comes with an interface for communication for people with hearing, speech, or cognitive disabilities. In the proposed model, the user's speech is transmitted to the speech recognition layer, where it is converted into text; that text is then passed to the Morse code conversion layer, which outputs the Morse code of the corresponding speech. The accuracy of the model depends entirely on speech recognition, as the Morse code conversion is a deterministic rule-based process. The model is tested with recorded audio files with different parameters. The proposed model's WER and accuracy are determined to be 10.18% and 89.82%, respectively.
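
Since the Morse code conversion layer is a deterministic mapping, it can be sketched directly. The snippet below illustrates that layer only; the ASR front end is assumed to supply the recognized text, and the word-separator convention (" / ") is an assumption rather than something stated in the paper.

```python
# Rule-based text-to-Morse conversion layer: ASR output text in, dots and dashes out.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..", "0": "-----", "1": ".----", "2": "..---",
    "3": "...--", "4": "....-", "5": ".....", "6": "-....", "7": "--...",
    "8": "---..", "9": "----.",
}

def text_to_morse(text: str) -> str:
    """Convert recognized text to Morse code, word by word."""
    words = text.upper().split()
    return " / ".join(" ".join(MORSE[c] for c in word if c in MORSE) for word in words)

print(text_to_morse("sos help"))  # "... --- ... / .... . .-.. .--."
```
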
Citations: 0
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
Pub Date : 2024-07-07 DOI: arxiv-2407.05407
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhijie Yan
Recent years have witnessed large language model (LLM) based text-to-speech (TTS) emerge into the mainstream due to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder into waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to incorporate supervised speech tokens into TTS models.
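
The core idea behind supervised semantic tokens, discretizing ASR encoder features with a vector quantizer, can be illustrated compactly. In the sketch below, the codebook size, feature dimension, and plain nearest-neighbour quantizer are assumptions made for illustration; CosyVoice's actual tokenizer may differ in detail.

```python
# Continuous frame-level encoder features mapped to discrete token ids via
# nearest-codeword lookup. Codebook size and dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4096, 256))        # 4096 codewords, 256-dim features
encoder_frames = rng.normal(size=(120, 256))   # 120 frames from the ASR encoder

def quantize(frames, codebook):
    """Return the nearest-codeword index for each frame (the speech tokens)."""
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2, minimised over codewords per frame.
    d2 = (frames**2).sum(1, keepdims=True) - 2 * frames @ codebook.T + (codebook**2).sum(1)
    return d2.argmin(axis=1)

tokens = quantize(encoder_frames, codebook)
print(tokens.shape, tokens[:10])  # (120,) token sequence, first ten ids
```
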
Citations: 0
A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition
Pub Date : 2024-07-06 DOI: arxiv-2407.04966
Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee
Cross-lingual speech emotion recognition (SER) is important for a wide range of everyday applications. While recent SER research relies heavily on large pretrained models for emotion training, existing studies often concentrate solely on the final transformer layer of these models. However, given the task-specific nature and hierarchical architecture of these models, each transformer layer encapsulates different levels of information. Leveraging this hierarchical structure, our study focuses on the information embedded across different layers. Through an examination of layer feature similarity across different languages, we propose a novel strategy called a layer-anchoring mechanism to facilitate emotion transfer in cross-lingual SER tasks. Our approach is evaluated using two distinct language affective corpora (MSP-Podcast and BIIC-Podcast), achieving a best UAR performance of 60.21% on the BIIC-Podcast corpus. The analysis uncovers interesting insights into the behavior of popular pretrained models.
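
A layer-wise cross-language similarity analysis of the kind described can be sketched as follows. The abstract does not specify the similarity measure, so as a simple stand-in each layer is scored by the cosine similarity between the centroid embeddings of the two corpora, and the highest-scoring layer is treated as the anchor; the layer count and dimensions are placeholders.

```python
# Per-layer cross-language similarity as a proxy for choosing an anchor layer.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
n_layers, dim = 12, 768
# Stand-ins for utterance-level features per transformer layer and language.
feats_lang_a = [rng.normal(size=(300, dim)) for _ in range(n_layers)]
feats_lang_b = [rng.normal(size=(250, dim)) for _ in range(n_layers)]

scores = [cosine(a.mean(axis=0), b.mean(axis=0))
          for a, b in zip(feats_lang_a, feats_lang_b)]
anchor_layer = int(np.argmax(scores))  # layer whose features align best across languages
print(anchor_layer, round(scores[anchor_layer], 3))
```
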
Citations: 0
Few-Shot Keyword Spotting from Mixed Speech
Pub Date : 2024-07-05 DOI: arxiv-2407.06078
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla
Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting -- simultaneously detecting multiple keywords blended in an utterance, which is crucial in real-world applications. Previous research has proposed a Mix-Training (MT) approach to solve the problem; however, it has never been tested in the few-shot scenario. In this paper, we investigate the possibility of using MT and other relevant methods to solve the two practical challenges together: few-shot and mixed speech. Experiments conducted on the LibriSpeech and Google Speech Command corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining SSL-based large-scale pre-training (HuBert) and MT fine-tuning yields very strong results in all the test conditions.
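
The mixing step at the heart of Mix-Training can be sketched as a data-augmentation routine: two keyword utterances are overlapped and the target becomes a multi-hot label covering both keywords. The padding strategy, gain range, and label encoding below are illustrative assumptions, not the paper's exact recipe.

```python
# Construct a mixed-keyword training example from two single-keyword clips.
import numpy as np

rng = np.random.default_rng(0)
NUM_KEYWORDS = 10  # assumed size of the keyword vocabulary

def mix_examples(wav_a, label_a, wav_b, label_b, min_gain=0.5, max_gain=1.0):
    """Overlap two utterances (zero-padded to equal length) and merge their labels."""
    n = max(len(wav_a), len(wav_b))
    a = np.pad(wav_a, (0, n - len(wav_a)))
    b = np.pad(wav_b, (0, n - len(wav_b)))
    mixed = a + rng.uniform(min_gain, max_gain) * b
    target = np.zeros(NUM_KEYWORDS)
    target[[label_a, label_b]] = 1.0          # multi-hot target for the mixture
    return mixed, target

wav1, wav2 = rng.normal(size=16000), rng.normal(size=12000)  # fake 1 s / 0.75 s clips
x, y = mix_examples(wav1, 3, wav2, 7)
print(x.shape, y)
```
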
Citations: 0
MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study
Pub Date : 2024-07-05 DOI: arxiv-2407.04849
Rajat Bhattacharjya, Arnab Sarkar, Biswadip Maity, Nikil Dutt
Multiple Signal Classification (MUSIC) is a widely used Direction of Arrival (DoA)/Angle of Arrival (AoA) estimation algorithm applied to various application domains such as autonomous driving, medical imaging, and astronomy. However, MUSIC is computationally expensive and challenging to implement in low-power hardware, requiring exploration of trade-offs between accuracy, cost, and power. We present MUSIC-lite, which exploits approximate computing to generate a design space exploring accuracy-area-power trade-offs. This is specifically applied to the computationally intensive singular value decomposition (SVD) component of the MUSIC algorithm in an orthogonal frequency-division multiplexing (OFDM) radar use case. MUSIC-lite incorporates approximate adders into the iterative CORDIC algorithm that is used for hardware implementation of MUSIC, generating interesting accuracy-area-power trade-offs. Our experiments demonstrate MUSIC-lite's ability to save an average of 17.25% on-chip area and 19.4% power with a minimal 0.14% error for efficient MUSIC implementations.
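
For context, the classic MUSIC pseudospectrum that MUSIC-lite accelerates can be sketched in a few lines for a uniform linear array. The eigendecomposition below stands in for the SVD/CORDIC stage that the paper approximates in hardware; the approximate adders themselves are not modeled in this floating-point sketch, and the array geometry and scene are illustrative.

```python
# Classic MUSIC pseudospectrum for a half-wavelength uniform linear array.
import numpy as np

def music_spectrum(X, num_sources, d=0.5, grid=np.linspace(-90, 90, 361)):
    """X: (num_antennas, num_snapshots) complex samples; d is spacing in wavelengths."""
    m = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                  # sample covariance
    eigvals, eigvecs = np.linalg.eigh(R)             # eigenvalues in ascending order
    En = eigvecs[:, : m - num_sources]               # noise subspace
    spectrum = []
    for theta in np.deg2rad(grid):
        a = np.exp(-2j * np.pi * d * np.arange(m) * np.sin(theta))  # steering vector
        spectrum.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))
    return grid, np.array(spectrum)

# Toy scene: two sources at -20 and 35 degrees, 8-element array, 200 snapshots.
rng = np.random.default_rng(0)
m, n, angles = 8, 200, np.deg2rad([-20, 35])
A = np.exp(-2j * np.pi * 0.5 * np.outer(np.arange(m), np.sin(angles)))
S = (rng.normal(size=(2, n)) + 1j * rng.normal(size=(2, n))) / np.sqrt(2)
X = A @ S + 0.1 * (rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n)))
grid, P = music_spectrum(X, num_sources=2)
print(grid[int(np.argmax(P))])   # strongest peak, close to one of the true DoAs
```
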
Citations: 0
Semantic Grouping Network for Audio Source Separation
Pub Date : 2024-07-04 DOI: arxiv-2407.03736
Shentong Mo, Yapeng Tian
Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extracted high-level semantics from visual inputs as the guidance to help disentangle sound representation for individual sources. Can we directly learn to disentangle the individual semantics from the sound itself? The dilemma is that multiple sound sources are mixed together in the original space. To tackle the difficulty, in this paper, we present a novel Semantic Grouping Network, termed as SGN, that can directly disentangle sound representations and extract high-level semantic information for each source from the input audio mixture. Specifically, SGN aggregates category-wise source features through learnable class tokens of sounds. Then, the aggregated semantic features can be used as the guidance to separate the corresponding audio sources from the mixture. We conducted extensive experiments on music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and VGG-Sound. The results demonstrate that our SGN significantly outperforms previous audio-only methods and audio-visual models without utilizing additional visual cues.
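
One way to picture the grouping mechanism is as cross-attention from learnable class tokens onto the mixture's encoder features, with each token's attention-weighted aggregate serving as the semantic guidance for its source. The formulation and dimensions below are assumptions made for illustration, not SGN's exact architecture.

```python
# Learnable class tokens aggregating category-wise features from a mixture via attention.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_classes, dim, num_frames = 20, 256, 100

class_tokens = rng.normal(size=(num_classes, dim))   # learnable in the real model
mixture_feats = rng.normal(size=(num_frames, dim))   # encoder output of the mixture

# Cross-attention: each class token queries the mixture frames.
attn = softmax(class_tokens @ mixture_feats.T / np.sqrt(dim), axis=-1)  # (C, T)
semantic_feats = attn @ mixture_feats                                   # (C, dim)

# The per-class semantic features would then condition a separation decoder.
print(semantic_feats.shape)
```
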
Citations: 0
What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark
Pub Date : 2024-06-14 DOI: arxiv-2406.09933
Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed
Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and illustrate a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for adeptness in the generalization of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.
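
The over-sampling step for the combined corpora can be sketched as simple random over-sampling up to the size of the largest emotion class. The class composition and counts below are placeholders, not the paper's statistics, and the specific over-sampling scheme is an assumption.

```python
# Random over-sampling of pooled emotion corpora so every class matches the largest one.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Pooled (utterance_id, emotion) pairs from several corpora -- placeholder data.
pooled = [(f"utt{i}", e) for i, e in enumerate(
    ["neutral"] * 500 + ["happy"] * 200 + ["sad"] * 120 + ["angry"] * 80)]

def oversample(samples):
    """Re-sample minority classes (with replacement) to the size of the largest class."""
    by_class = {}
    for utt, emo in samples:
        by_class.setdefault(emo, []).append((utt, emo))
    target = max(len(v) for v in by_class.values())
    balanced = []
    for emo, items in by_class.items():
        extra = [items[i] for i in rng.integers(0, len(items), target - len(items))]
        balanced.extend(items + extra)
    return balanced

balanced = oversample(pooled)
print(Counter(e for _, e in balanced))  # every class now has 500 samples
```
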
Citations: 0