A Benchmark for Multi-speaker Anonymization
Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang (arXiv:2407.05608, 2024-07-08)

Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus on single-speaker scenarios, which limits their practicality for real-world, multi-speaker applications. In this paper, we present an initial attempt to provide a multi-speaker anonymization benchmark by defining the task and evaluation protocol, proposing benchmarking solutions, and discussing the privacy leakage of overlapping conversations. Specifically, ideal multi-speaker anonymization should preserve the number of speakers and the turn-taking structure of the conversation, ensuring accurate context conveyance while maintaining privacy. To achieve this, a cascaded system uses speaker diarization to aggregate each speaker's speech and speaker anonymization to conceal speaker identity while preserving speech content. Additionally, we propose two conversation-level speaker vector anonymization methods to further improve utility. Both methods aim to make the original and corresponding pseudo-speaker identities of each speaker unlinkable while preserving, or even improving, the distinguishability among pseudo-speakers in a conversation. The first method minimizes the differential similarity across speaker pairs between the original and anonymized conversations, thereby maintaining the original speaker relationships in the anonymized version. The second method minimizes the aggregated similarity across anonymized speakers to better differentiate the speakers. Experiments conducted on both non-overlapping simulated and real-world datasets demonstrate the effectiveness of the multi-speaker anonymization system with the proposed speaker anonymizers. Finally, we analyze overlapping speech with respect to privacy leakage and discuss potential solutions.
{"title":"A Benchmark for Multi-speaker Anonymization","authors":"Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang","doi":"arxiv-2407.05608","DOIUrl":"https://doi.org/arxiv-2407.05608","url":null,"abstract":"Privacy-preserving voice protection approaches primarily suppress\u0000privacy-related information derived from paralinguistic attributes while\u0000preserving the linguistic content. Existing solutions focus on single-speaker\u0000scenarios. However, they lack practicality for real-world applications, i.e.,\u0000multi-speaker scenarios. In this paper, we present an initial attempt to\u0000provide a multi-speaker anonymization benchmark by defining the task and\u0000evaluation protocol, proposing benchmarking solutions, and discussing the\u0000privacy leakage of overlapping conversations. Specifically, ideal multi-speaker\u0000anonymization should preserve the number of speakers and the turn-taking\u0000structure of the conversation, ensuring accurate context conveyance while\u0000maintaining privacy. To achieve that, a cascaded system uses speaker\u0000diarization to aggregate the speech of each speaker and speaker anonymization\u0000to conceal speaker privacy and preserve speech content. Additionally, we\u0000propose two conversation-level speaker vector anonymization methods to improve\u0000the utility further. Both methods aim to make the original and corresponding\u0000pseudo-speaker identities of each speaker unlinkable while preserving or even\u0000improving the distinguishability among pseudo-speakers in a conversation. The\u0000first method minimizes the differential similarity across speaker pairs in the\u0000original and anonymized conversations to maintain original speaker\u0000relationships in the anonymized version. The other method minimizes the\u0000aggregated similarity across anonymized speakers to achieve better\u0000differentiation between speakers. Experiments conducted on both non-overlap\u0000simulated and real-world datasets demonstrate the effectiveness of the\u0000multi-speaker anonymization system with the proposed speaker anonymizers.\u0000Additionally, we analyzed overlapping speech regarding privacy leakage and\u0000provide potential solutions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141575856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MERGE -- A Bimodal Dataset for Static Music Emotion Recognition
Pedro Lima Louro, Hugo Redinho, Ricardo Santos, Ricardo Malheiro, Renato Panda, Rui Pedro Paiva (arXiv:2407.06060, 2024-07-08)

The Music Emotion Recognition (MER) field has seen steady developments in recent years, with contributions from feature engineering, machine learning, and deep learning. The landscape has also shifted from audio-centric systems to bimodal ensembles that combine audio and lyrics. However, a severe lack of public and sizeable bimodal databases has hampered the development and improvement of bimodal audio-lyrics systems. This article proposes three new audio, lyrics, and bimodal MER research datasets, collectively called MERGE, created using a semi-automatic approach. To comprehensively assess the proposed datasets and establish a baseline for benchmarking, we conducted several experiments for each modality, using feature engineering, machine learning, and deep learning methodologies. In addition, we propose and validate fixed train-validate-test splits. The obtained results confirm the viability of the proposed datasets, achieving the best overall result of 79.21% F1-score for bimodal classification using a deep neural network.
{"title":"MERGE -- A Bimodal Dataset for Static Music Emotion Recognition","authors":"Pedro Lima Louro, Hugo Redinho, Ricardo Santos, Ricardo Malheiro, Renato Panda, Rui Pedro Paiva","doi":"arxiv-2407.06060","DOIUrl":"https://doi.org/arxiv-2407.06060","url":null,"abstract":"The Music Emotion Recognition (MER) field has seen steady developments in\u0000recent years, with contributions from feature engineering, machine learning,\u0000and deep learning. The landscape has also shifted from audio-centric systems to\u0000bimodal ensembles that combine audio and lyrics. However, a severe lack of\u0000public and sizeable bimodal databases has hampered the development and\u0000improvement of bimodal audio-lyrics systems. This article proposes three new\u0000audio, lyrics, and bimodal MER research datasets, collectively called MERGE,\u0000created using a semi-automatic approach. To comprehensively assess the proposed\u0000datasets and establish a baseline for benchmarking, we conducted several\u0000experiments for each modality, using feature engineering, machine learning, and\u0000deep learning methodologies. In addition, we propose and validate fixed\u0000train-validate-test splits. The obtained results confirm the viability of the\u0000proposed datasets, achieving the best overall result of 79.21% F1-score for\u0000bimodal classification using a deep neural network.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music Era Recognition Using Supervised Contrastive Learning and Artist Information
Qiqi He, Xuchen Song, Weituo Hao, Ju-Chiang Wang, Wei-Tsung Lu, Wei Li (arXiv:2407.05368, 2024-07-07)

Does popular music from the 60s sound different from that of the 90s? Prior studies have shown that patterns and regularities related to instrumentation changes and growing loudness vary across multi-decadal trends. This indicates that it is possible to perceive the era of a song from musical features such as audio and artist information. Music era information can be an important feature for playlist generation and recommendation; however, the release year of a song can be inaccessible in many circumstances. This paper addresses the novel task of music era recognition. We formulate the task as a music classification problem and propose solutions based on supervised contrastive learning. An audio-based model is developed to predict the era from audio. For the case where artist information is available, we extend the audio-based model to take multimodal inputs and develop a framework, called MultiModal Contrastive (MMC) learning, to enhance training. Experimental results on the Million Song Dataset demonstrate that the audio-based model achieves 54% accuracy with a 3-year tolerance; incorporating artist information through the MMC framework during training yields a further 9% improvement.
{"title":"Music Era Recognition Using Supervised Contrastive Learning and Artist Information","authors":"Qiqi He, Xuchen Song, Weituo Hao, Ju-Chiang Wang, Wei-Tsung Lu, Wei Li","doi":"arxiv-2407.05368","DOIUrl":"https://doi.org/arxiv-2407.05368","url":null,"abstract":"Does popular music from the 60s sound different than that of the 90s? Prior\u0000study has shown that there would exist some variations of patterns and\u0000regularities related to instrumentation changes and growing loudness across\u0000multi-decadal trends. This indicates that perceiving the era of a song from\u0000musical features such as audio and artist information is possible. Music era\u0000information can be an important feature for playlist generation and\u0000recommendation. However, the release year of a song can be inaccessible in many\u0000circumstances. This paper addresses a novel task of music era recognition. We\u0000formulate the task as a music classification problem and propose solutions\u0000based on supervised contrastive learning. An audio-based model is developed to\u0000predict the era from audio. For the case where the artist information is\u0000available, we extend the audio-based model to take multimodal inputs and\u0000develop a framework, called MultiModal Contrastive (MMC) learning, to enhance\u0000the training. Experimental result on Million Song Dataset demonstrates that the\u0000audio-based model achieves 54% in accuracy with a tolerance of 3-years range;\u0000incorporating the artist information with the MMC framework for training leads\u0000to 9% improvement further.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141575859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments
Ritabrata Roy Choudhury (arXiv:2407.14525, 2024-07-07)

The proposed model aims to provide a speech recognition technology for people with hearing, speech, or cognitive disabilities. Available speech recognition technology does not offer a communication interface for people with such disabilities. In the proposed model, the user's speech is transmitted to a speech recognition layer, where it is converted into text; the text is then passed to a Morse code conversion layer, which outputs the Morse code corresponding to the recognized speech. The accuracy of the model depends entirely on the speech recognition stage, since the Morse code conversion is a fixed, rule-based mapping. The model is tested with recorded audio files with different parameters. The proposed model achieves a WER of 10.18% and an accuracy of 89.82%.
{"title":"Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments","authors":"Ritabrata Roy Choudhury","doi":"arxiv-2407.14525","DOIUrl":"https://doi.org/arxiv-2407.14525","url":null,"abstract":"The proposed model aims to develop a speech recognition technology for\u0000hearing, speech, or cognitively disabled people. All the available technology\u0000in the field of speech recognition doesn't come with an interface for\u0000communication for people with hearing, speech, or cognitive disabilities. The\u0000proposed model proposes the speech from the user, is transmitted to the speech\u0000recognition layer where it is converted into text and then that text is then\u0000transmitted to the morse code conversion layer where the morse code of the\u0000corresponding speech is given as the output. The accuracy of the model is\u0000completely dependent on speech recognition, as the morse code conversion is a\u0000process. The model is tested with recorded audio files with different\u0000parameters. The proposed model's WER and accuracy are both determined to be\u000010.18% and 89.82%, respectively.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141773475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhijie Yan (arXiv:2407.05407, 2024-07-07)

Recent years have witnessed large language model (LLM) based text-to-speech (TTS) moving into the mainstream due to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed into waveforms by a token-based vocoder. Clearly, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, lacking explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on these tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to incorporate supervised speech tokens into TTS models.
{"title":"CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens","authors":"Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhijie Yan","doi":"arxiv-2407.05407","DOIUrl":"https://doi.org/arxiv-2407.05407","url":null,"abstract":"Recent years have witnessed a trend that large language model (LLM) based\u0000text-to-speech (TTS) emerges into the mainstream due to their high naturalness\u0000and zero-shot capacity. In this paradigm, speech signals are discretized into\u0000token sequences, which are modeled by an LLM with text as prompts and\u0000reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens\u0000play a critical role in LLM-based TTS models. Current speech tokens are learned\u0000in an unsupervised manner, which lacks explicit semantic information and\u0000alignment to the text. In this paper, we propose to represent speech with\u0000supervised semantic tokens, which are derived from a multilingual speech\u0000recognition model by inserting vector quantization into the encoder. Based on\u0000the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice,\u0000which consists of an LLM for text-to-token generation and a conditional flow\u0000matching model for token-to-speech synthesis. Experimental results show that\u0000supervised semantic tokens significantly outperform existing unsupervised\u0000tokens in terms of content consistency and speaker similarity for zero-shot\u0000voice cloning. Moreover, we find that utilizing large-scale data further\u0000improves the synthesis performance, indicating the scalable capacity of\u0000CosyVoice. To the best of our knowledge, this is the first attempt to involve\u0000supervised speech tokens into TTS models.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141575855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition
Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee (arXiv:2407.04966, 2024-07-06)

Cross-lingual speech emotion recognition (SER) is important for a wide range of everyday applications. While recent SER research relies heavily on large pretrained models for emotion training, existing studies often concentrate solely on the final transformer layer of these models. However, given the task-specific nature and hierarchical architecture of these models, each transformer layer encapsulates a different level of information. Leveraging this hierarchical structure, our study focuses on the information embedded across different layers. Through an examination of layer feature similarity across different languages, we propose a novel layer-anchoring mechanism to facilitate emotion transfer in cross-lingual SER tasks. Our approach is evaluated using two distinct language affective corpora (MSP-Podcast and BIIC-Podcast), achieving a best UAR of 60.21% on the BIIC-Podcast corpus. The analysis uncovers interesting insights into the behavior of popular pretrained models.
{"title":"A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition","authors":"Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee","doi":"arxiv-2407.04966","DOIUrl":"https://doi.org/arxiv-2407.04966","url":null,"abstract":"Cross-lingual speech emotion recognition (SER) is important for a wide range\u0000of everyday applications. While recent SER research relies heavily on large\u0000pretrained models for emotion training, existing studies often concentrate\u0000solely on the final transformer layer of these models. However, given the\u0000task-specific nature and hierarchical architecture of these models, each\u0000transformer layer encapsulates different levels of information. Leveraging this\u0000hierarchical structure, our study focuses on the information embedded across\u0000different layers. Through an examination of layer feature similarity across\u0000different languages, we propose a novel strategy called a layer-anchoring\u0000mechanism to facilitate emotion transfer in cross-lingual SER tasks. Our\u0000approach is evaluated using two distinct language affective corpora\u0000(MSP-Podcast and BIIC-Podcast), achieving a best UAR performance of 60.21% on\u0000the BIIC-podcast corpus. The analysis uncovers interesting insights into the\u0000behavior of popular pretrained models.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141575857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-Shot Keyword Spotting from Mixed Speech
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla (arXiv:2407.06078, 2024-07-05)

Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting, i.e., simultaneously detecting multiple keywords blended in a single utterance, which is crucial in real-world applications. Previous research has proposed a Mix-Training (MT) approach to solve the problem; however, it has never been tested in the few-shot scenario. In this paper, we investigate the possibility of using MT and other relevant methods to address the two practical challenges together: few-shot and mixed speech. Experiments conducted on the LibriSpeech and Google Speech Commands corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining SSL-based large-scale pre-training (HuBERT) with MT fine-tuning yields very strong results in all test conditions.
{"title":"Few-Shot Keyword Spotting from Mixed Speech","authors":"Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla","doi":"arxiv-2407.06078","DOIUrl":"https://doi.org/arxiv-2407.06078","url":null,"abstract":"Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited\u0000training samples. A commonly used approach is the pre-training and fine-tuning\u0000framework. While effective in clean conditions, this approach struggles with\u0000mixed keyword spotting -- simultaneously detecting multiple keywords blended in\u0000an utterance, which is crucial in real-world applications. Previous research\u0000has proposed a Mix-Training (MT) approach to solve the problem, however, it has\u0000never been tested in the few-shot scenario. In this paper, we investigate the\u0000possibility of using MT and other relevant methods to solve the two practical\u0000challenges together: few-shot and mixed speech. Experiments conducted on the\u0000LibriSpeech and Google Speech Command corpora demonstrate that MT is highly\u0000effective on this task when employed in either the pre-training phase or the\u0000fine-tuning phase. Moreover, combining SSL-based large-scale pre-training\u0000(HuBert) and MT fine-tuning yields very strong results in all the test\u0000conditions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study
Rajat Bhattacharjya, Arnab Sarkar, Biswadip Maity, Nikil Dutt (arXiv:2407.04849, 2024-07-05)

Multiple Signal Classification (MUSIC) is a widely used Direction of Arrival (DoA)/Angle of Arrival (AoA) estimation algorithm applied to various application domains such as autonomous driving, medical imaging, and astronomy. However, MUSIC is computationally expensive and challenging to implement in low-power hardware, requiring exploration of trade-offs between accuracy, cost, and power. We present MUSIC-lite, which exploits approximate computing to generate a design space exploring accuracy-area-power trade-offs. This is specifically applied to the computationally intensive singular value decomposition (SVD) component of the MUSIC algorithm in an orthogonal frequency-division multiplexing (OFDM) radar use case. MUSIC-lite incorporates approximate adders into the iterative CORDIC algorithm that is used for hardware implementation of MUSIC, generating interesting accuracy-area-power trade-offs. Our experiments demonstrate MUSIC-lite's ability to save an average of 17.25% on-chip area and 19.4% power with a minimal 0.14% error for efficient MUSIC implementations.
{"title":"MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study","authors":"Rajat Bhattacharjya, Arnab Sarkar, Biswadip Maity, Nikil Dutt","doi":"arxiv-2407.04849","DOIUrl":"https://doi.org/arxiv-2407.04849","url":null,"abstract":"Multiple Signal Classification (MUSIC) is a widely used Direction of Arrival\u0000(DoA)/Angle of Arrival (AoA) estimation algorithm applied to various\u0000application domains such as autonomous driving, medical imaging, and astronomy.\u0000However, MUSIC is computationally expensive and challenging to implement in\u0000low-power hardware, requiring exploration of trade-offs between accuracy, cost,\u0000and power. We present MUSIC-lite, which exploits approximate computing to\u0000generate a design space exploring accuracy-area-power trade-offs. This is\u0000specifically applied to the computationally intensive singular value\u0000decomposition (SVD) component of the MUSIC algorithm in an orthogonal\u0000frequency-division multiplexing (OFDM) radar use case. MUSIC-lite incorporates\u0000approximate adders into the iterative CORDIC algorithm that is used for\u0000hardware implementation of MUSIC, generating interesting accuracy-area-power\u0000trade-offs. Our experiments demonstrate MUSIC-lite's ability to save an average\u0000of 17.25% on-chip area and 19.4% power with a minimal 0.14% error for efficient\u0000MUSIC implementations.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141575858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic Grouping Network for Audio Source Separation
Shentong Mo, Yapeng Tian (arXiv:2407.03736, 2024-07-04)

Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extract high-level semantics from visual inputs as guidance to help disentangle the sound representations of individual sources. Can we directly learn to disentangle the individual semantics from the sound itself? The difficulty is that multiple sound sources are mixed together in the original space. To tackle this, we present a novel Semantic Grouping Network, termed SGN, that can directly disentangle sound representations and extract high-level semantic information for each source from the input audio mixture. Specifically, SGN aggregates category-wise source features through learnable class tokens of sounds. The aggregated semantic features are then used as guidance to separate the corresponding audio sources from the mixture. We conducted extensive experiments on music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and VGG-Sound. The results demonstrate that our SGN significantly outperforms previous audio-only methods and audio-visual models without utilizing additional visual cues.
{"title":"Semantic Grouping Network for Audio Source Separation","authors":"Shentong Mo, Yapeng Tian","doi":"arxiv-2407.03736","DOIUrl":"https://doi.org/arxiv-2407.03736","url":null,"abstract":"Recently, audio-visual separation approaches have taken advantage of the\u0000natural synchronization between the two modalities to boost audio source\u0000separation performance. They extracted high-level semantics from visual inputs\u0000as the guidance to help disentangle sound representation for individual\u0000sources. Can we directly learn to disentangle the individual semantics from the\u0000sound itself? The dilemma is that multiple sound sources are mixed together in\u0000the original space. To tackle the difficulty, in this paper, we present a novel\u0000Semantic Grouping Network, termed as SGN, that can directly disentangle sound\u0000representations and extract high-level semantic information for each source\u0000from input audio mixture. Specifically, SGN aggregates category-wise source\u0000features through learnable class tokens of sounds. Then, the aggregated\u0000semantic features can be used as the guidance to separate the corresponding\u0000audio sources from the mixture. We conducted extensive experiments on\u0000music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and\u0000VGG-Sound. The results demonstrate that our SGN significantly outperforms\u0000previous audio-only methods and audio-visual models without utilizing\u0000additional visual cues.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"2018 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141578116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark
Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed (arXiv:2406.09933, 2024-06-14)

Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements on specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize SER systems across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and present a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution by using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for assessing the generalization ability of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.
{"title":"What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark","authors":"Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed","doi":"arxiv-2406.09933","DOIUrl":"https://doi.org/arxiv-2406.09933","url":null,"abstract":"Speech emotion recognition (SER) is essential for enhancing human-computer\u0000interaction in speech-based applications. Despite improvements in specific\u0000emotional datasets, there is still a research gap in SER's capability to\u0000generalize across real-world situations. In this paper, we investigate\u0000approaches to generalize the SER system across different emotion datasets. In\u0000particular, we incorporate 11 emotional speech datasets and illustrate a\u0000comprehensive benchmark on the SER task. We also address the challenge of\u0000imbalanced data distribution using over-sampling methods when combining SER\u0000datasets for training. Furthermore, we explore various evaluation protocols for\u0000adeptness in the generalization of SER. Building on this, we explore the\u0000potential of Whisper for SER, emphasizing the importance of thorough\u0000evaluation. Our approach is designed to advance SER technology by integrating\u0000speaker-independent methods.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"171 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}