Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS through a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) Accent Identification (AID) performance, with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity in inherent and cross-accent generation, and enables unseen-accent generation.
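The second stage can be pictured as follows: a frozen, pretrained AID encoder yields a speaker-agnostic accent embedding, which is fused with the speaker conditioning inside the ZS-TTS model. A minimal PyTorch sketch of that conditioning step is shown below; all module names, dimensions, and the fusion-by-addition choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AccentConditionedTTS(nn.Module):
    """Illustrative sketch: condition a ZS-TTS backbone on a pretrained,
    frozen accent-ID (AID) encoder. Dimensions and modules are assumptions."""

    def __init__(self, aid_encoder: nn.Module, accent_dim=256, speaker_dim=256, model_dim=512):
        super().__init__()
        self.aid_encoder = aid_encoder.eval()        # pretrained AID model, kept frozen
        for p in self.aid_encoder.parameters():
            p.requires_grad = False
        self.accent_proj = nn.Linear(accent_dim, model_dim)
        self.speaker_proj = nn.Linear(speaker_dim, model_dim)
        self.backbone = nn.TransformerEncoder(       # stand-in for the ZS-TTS acoustic model
            nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, text_emb, speaker_emb, reference_wav_feats):
        # Speaker-agnostic accent embedding from the frozen AID encoder.
        with torch.no_grad():
            accent_emb = self.aid_encoder(reference_wav_feats)            # (B, accent_dim)
        cond = self.accent_proj(accent_emb) + self.speaker_proj(speaker_emb)  # (B, model_dim)
        # Add the conditioning vector to every text frame before the backbone.
        h = text_emb + cond.unsqueeze(1)             # text_emb: (B, T, model_dim)
        return self.backbone(h)                      # downstream decoder/vocoder omitted
```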
{"title":"AccentBox: Towards High-Fidelity Zero-Shot Accent Generation","authors":"Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun","doi":"arxiv-2409.09098","DOIUrl":"https://doi.org/arxiv-2409.09098","url":null,"abstract":"While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high\u0000naturalness and speaker similarity, they fall short in accent fidelity and\u0000control. To address this issue, we propose zero-shot accent generation that\u0000unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel\u0000two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) on\u0000Accent Identification (AID) with 0.56 f1 score on unseen speakers. In the\u0000second stage, we condition ZS-TTS system on the pretrained speaker-agnostic\u0000accent embeddings extracted by the AID model. The proposed system achieves\u0000higher accent fidelity on inherent/cross accent generation, and enables unseen\u0000accent generation.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music .
{"title":"Seed-Music: A Unified Framework for High Quality and Controlled Music Generation","authors":"Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou","doi":"arxiv-2409.09214","DOIUrl":"https://doi.org/arxiv-2409.09214","url":null,"abstract":"We introduce Seed-Music, a suite of music generation systems capable of\u0000producing high-quality music with fine-grained style control. Our unified\u0000framework leverages both auto-regressive language modeling and diffusion\u0000approaches to support two key music creation workflows: textit{controlled\u0000music generation} and textit{post-production editing}. For controlled music\u0000generation, our system enables vocal music generation with performance controls\u0000from multi-modal inputs, including style descriptions, audio references,\u0000musical scores, and voice prompts. For post-production editing, it offers\u0000interactive tools for editing lyrics and vocal melodies directly in the\u0000generated audio. We encourage readers to listen to demo audio examples at\u0000https://team.doubao.com/seed-music .","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods for classifying mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. Our dataset extends the number of sub-genres to cover the most recent mainstage live sets by top DJs at music festivals worldwide. A continuous soft-labeling approach is employed to account for tracks that span multiple sub-genres, preserving their inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark supports application scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is available at https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/.
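The continuous soft-labeling idea maps naturally onto training with probabilistic targets: a track that is, say, 70% big-room and 30% progressive house keeps both proportions as its label. A minimal PyTorch sketch under that assumption follows; the sub-genre vocabulary and label values are invented for illustration and are not taken from the dataset.

```python
import torch
import torch.nn.functional as F

# Hypothetical sub-genre vocabulary and one track's soft label (values invented).
SUB_GENRES = ["big_room", "progressive_house", "future_house", "techno"]
soft_label = torch.tensor([0.7, 0.3, 0.0, 0.0])        # proportions sum to 1

def soft_label_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft (probabilistic) target distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy forward pass: a batch of 2 tracks with random logits.
logits = torch.randn(2, len(SUB_GENRES))
targets = torch.stack([soft_label, soft_label])
print(soft_label_loss(logits, targets))
```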
{"title":"Benchmarking Sub-Genre Classification For Mainstage Dance Music","authors":"Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li","doi":"arxiv-2409.06690","DOIUrl":"https://doi.org/arxiv-2409.06690","url":null,"abstract":"Music classification, with a wide range of applications, is one of the most\u0000prominent tasks in music information retrieval. To address the absence of\u0000comprehensive datasets and high-performing methods in the classification of\u0000mainstage dance music, this work introduces a novel benchmark comprising a new\u0000dataset and a baseline. Our dataset extends the number of sub-genres to cover\u0000most recent mainstage live sets by top DJs worldwide in music festivals. A\u0000continuous soft labeling approach is employed to account for tracks that span\u0000multiple sub-genres, preserving the inherent sophistication. For the baseline,\u0000we developed deep learning models that outperform current state-of-the-art\u0000multimodel language models, which struggle to identify house music sub-genres,\u0000emphasizing the need for specialized models trained on fine-grained datasets.\u0000Our benchmark is applicable to serve for application scenarios such as music\u0000recommendation, DJ set curation, and interactive multimedia, where we also\u0000provide video demos. Our code is on\u0000url{https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/}.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki
Early detection of factory machinery malfunctions is crucial in industrial applications. In machine anomalous sound detection (ASD), different machines exhibit unique vibration-frequency ranges based on their physical properties. Meanwhile, the human auditory system is adept at tracking both the temporal and spectral dynamics of machine sounds. Consequently, integrating computational models of the human auditory system with machine-specific properties can be an effective approach to machine ASD. We first quantified the frequency importance of four types of machines using the Fisher ratio (F-ratio). The quantified frequency importance was then used to design machine-specific non-uniform filterbanks (NUFBs), which extract the log non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower bandwidth and a higher filter distribution density in frequency regions with relatively high F-ratios. Finally, spectral and temporal modulation representations derived from the LNS feature were proposed. The proposed LNS feature and modulation representations are fed into an autoencoder neural-network-based detector for ASD. Quantification results on the training set of the Malfunctioning Industrial Machine Investigation and Inspection dataset, at a signal-to-noise ratio (SNR) of 6 dB, reveal that the information distinguishing normal from anomalous sounds of different machines is encoded non-uniformly in the frequency domain. By highlighting these important frequency regions with NUFBs, the LNS feature significantly improves performance in terms of AUC (area under the receiver operating characteristic curve) under various SNR conditions. Furthermore, the modulation representations further improve performance. Specifically, temporal modulation is effective for fans, pumps, and sliders, while spectral modulation is particularly effective for valves.
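The frequency-importance quantification can be illustrated with a per-bin Fisher ratio: the variance of the group means divided by the mean within-group variance, computed independently for each frequency bin. The sketch below uses synthetic spectra and an assumed two-group split; the paper's exact grouping and normalisation are not reproduced here. Bins with a high F-ratio are the ones that would receive narrower, more densely spaced filters in the NUFB.

```python
import numpy as np

def fisher_ratio_per_bin(spectra_by_group):
    """Per-frequency-bin Fisher ratio: between-group variance of the bin means
    divided by the mean within-group variance. `spectra_by_group` is a list of
    arrays shaped (n_frames_g, n_bins); the grouping scheme is an assumption."""
    means = np.stack([g.mean(axis=0) for g in spectra_by_group])      # (G, n_bins)
    variances = np.stack([g.var(axis=0) for g in spectra_by_group])   # (G, n_bins)
    between = means.var(axis=0)                                       # (n_bins,)
    within = variances.mean(axis=0) + 1e-12
    return between / within

# Toy example: two operating conditions, 100 frames of 64-bin log spectra each.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(100, 64))
group_b = rng.normal(0.5, 1.0, size=(100, 64))
importance = fisher_ratio_per_bin([group_a, group_b])
print(importance.argsort()[::-1][:5])   # indices of the most discriminative bins
```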
{"title":"Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks","authors":"Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki","doi":"arxiv-2409.05319","DOIUrl":"https://doi.org/arxiv-2409.05319","url":null,"abstract":"Early detection of factory machinery malfunctions is crucial in industrial\u0000applications. In machine anomalous sound detection (ASD), different machines\u0000exhibit unique vibration-frequency ranges based on their physical properties.\u0000Meanwhile, the human auditory system is adept at tracking both temporal and\u0000spectral dynamics of machine sounds. Consequently, integrating the\u0000computational auditory models of the human auditory system with\u0000machine-specific properties can be an effective approach to machine ASD. We\u0000first quantified the frequency importances of four types of machines using the\u0000Fisher ratio (F-ratio). The quantified frequency importances were then used to\u0000design machine-specific non-uniform filterbanks (NUFBs), which extract the log\u0000non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower\u0000bandwidth and higher filter distribution density in frequency regions with\u0000relatively high F-ratios. Finally, spectral and temporal modulation\u0000representations derived from the LNS feature were proposed. These proposed LNS\u0000feature and modulation representations are input into an autoencoder\u0000neural-network-based detector for ASD. The quantification results from the\u0000training set of the Malfunctioning Industrial Machine Investigation and\u0000Inspection dataset with a signal-to-noise (SNR) of 6 dB reveal that the\u0000distinguishing information between normal and anomalous sounds of different\u0000machines is encoded non-uniformly in the frequency domain. By highlighting\u0000these important frequency regions using NUFBs, the LNS feature can\u0000significantly enhance performance using the metric of AUC (area under the\u0000receiver operating characteristic curve) under various SNR conditions.\u0000Furthermore, modulation representations can further improve performance.\u0000Specifically, temporal modulation is effective for fans, pumps, and sliders,\u0000while spectral modulation is particularly effective for valves.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anna Kruspe
Large Language Models (LLMs) are becoming very popular and are used for many different purposes, including creative tasks in the arts. However, these models sometimes have trouble with specific reasoning tasks, especially those involving logical thinking and counting. This paper examines how well LLMs understand and reason about musical tasks such as working out notes from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o to see how they handle these tasks. Our results show that while LLMs do well with note intervals, they struggle with more complicated tasks such as recognizing chords and scales. This highlights clear limits in current LLM abilities and shows where they need to improve, which could help enhance how they reason and operate in both artistic and other complex domains. We also provide an automatically generated benchmark dataset for the described tasks.
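Benchmark items of the kind described (e.g. "which note lies a major third above C?") can be generated programmatically. A minimal sketch follows; it uses sharps-only pitch spelling and ignores enharmonics, which is a simplification rather than the paper's generation procedure.

```python
# Minimal generator for interval questions; enharmonic spelling is ignored
# (sharps only), which is a simplification of real music notation.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
INTERVALS = {"minor third": 3, "major third": 4, "perfect fifth": 7, "octave": 12}

def note_from_interval(root: str, interval: str) -> str:
    return NOTES[(NOTES.index(root) + INTERVALS[interval]) % 12]

def make_question(root: str, interval: str) -> dict:
    return {
        "prompt": f"Which note lies a {interval} above {root}?",
        "answer": note_from_interval(root, interval),
    }

print(make_question("C", "major third"))    # answer: 'E'
print(make_question("A", "perfect fifth"))  # answer: 'E'
```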
{"title":"Harmonic Reasoning in Large Language Models","authors":"Anna Kruspe","doi":"arxiv-2409.05521","DOIUrl":"https://doi.org/arxiv-2409.05521","url":null,"abstract":"Large Language Models (LLMs) are becoming very popular and are used for many\u0000different purposes, including creative tasks in the arts. However, these models\u0000sometimes have trouble with specific reasoning tasks, especially those that\u0000involve logical thinking and counting. This paper looks at how well LLMs\u0000understand and reason when dealing with musical tasks like figuring out notes\u0000from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o\u0000to see how they handle these tasks. Our results show that while LLMs do well\u0000with note intervals, they struggle with more complicated tasks like recognizing\u0000chords and scales. This points out clear limits in current LLM abilities and\u0000shows where we need to make them better, which could help improve how they\u0000think and work in both artistic and other complex areas. We also provide an\u0000automatically generated benchmark data set for the described tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has greatly evolved in the last few years. Traditional pipeline-based architectures have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at arbitrary points, as dividing an utterance into two separate fragments will produce an incorrect transcription. Also, shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay. The algorithms are fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same models without audio fragmentation to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay relative to VAD splitting.
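The first two splitting strategies are easy to sketch. Below, fixed-interval fragmentation simply slices the stream every few seconds, while a simple RMS-energy gate stands in for a real VAD and cuts only at low-energy frames; the thresholds and frame sizes are illustrative assumptions, not the values used in the evaluation.

```python
import numpy as np

def split_fixed(audio: np.ndarray, sr: int, chunk_s: float = 5.0):
    """Fragmentation at fixed intervals: lowest latency, may cut words in half."""
    step = int(chunk_s * sr)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def split_on_silence(audio: np.ndarray, sr: int, frame_s: float = 0.03,
                     energy_thresh: float = 1e-4, min_chunk_s: float = 1.0):
    """VAD-style fragmentation: cut only at low-energy frames so utterances
    stay whole. An RMS-energy gate stands in for a real VAD here."""
    frame = int(frame_s * sr)
    min_len = int(min_chunk_s * sr)
    chunks, start = [], 0
    for i in range(0, len(audio) - frame, frame):
        rms = np.sqrt(np.mean(audio[i:i + frame] ** 2))
        if rms < energy_thresh and (i - start) >= min_len:
            chunks.append(audio[start:i])
            start = i
    chunks.append(audio[start:])
    return chunks
```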
{"title":"Evaluation of real-time transcriptions using end-to-end ASR models","authors":"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso","doi":"arxiv-2409.05674","DOIUrl":"https://doi.org/arxiv-2409.05674","url":null,"abstract":"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\u0000evolved in the last few years. Traditional architectures based on pipelines\u0000have been replaced by joint end-to-end (E2E) architectures that simplify and\u0000streamline the model training process. In addition, new AI training methods,\u0000such as weak-supervised learning have reduced the need for high-quality audio\u0000datasets for model training. However, despite all these advancements, little to\u0000no research has been done on real-time transcription. In real-time scenarios,\u0000the audio is not pre-recorded, and the input audio must be fragmented to be\u0000processed by the ASR systems. To achieve real-time requirements, these\u0000fragments must be as short as possible to reduce latency. However, audio cannot\u0000be split at any point as dividing an utterance into two separate fragments will\u0000generate an incorrect transcription. Also, shorter fragments provide less\u0000context for the ASR model. For this reason, it is necessary to design and test\u0000different splitting algorithms to optimize the quality and delay of the\u0000resulting transcription. In this paper, three audio splitting algorithms are\u0000evaluated with different ASR models to determine their impact on both the\u0000quality of the transcription and the end-to-end delay. The algorithms are\u0000fragmentation at fixed intervals, voice activity detection (VAD), and\u0000fragmentation with feedback. The results are compared to the performance of the\u0000same model, without audio fragmentation, to determine the effects of this\u0000division. The results show that VAD fragmentation provides the best quality\u0000with the highest delay, whereas fragmentation at fixed intervals provides the\u0000lowest quality and the lowest delay. The newly proposed feedback algorithm\u0000exchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\u0000to the VAD splitting.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems focus on comparing feature vectors, overlooking the speech's content. This paper challenges that practice by highlighting the importance of phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speaker verification. A novel Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with existing attention frameworks to mitigate biases caused by phonetic dominance. PDAF adjusts the weighting for each phoneme and influences feature extraction, allowing for a more nuanced analysis of speech. This approach paves the way for more accurate and reliable identity authentication through voice. Furthermore, by employing various weighting strategies, we evaluate the influence of phonetic features on the efficacy of the speaker verification system.
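One way to picture the debiasing mechanism is an attentive pooling layer whose frame-level scores are shifted by a learnable per-phoneme term, so over- or under-represented phonemes can be re-weighted before the utterance embedding is formed. The sketch below is an assumption about the general idea, not the published PDAF layer.

```python
import torch
import torch.nn as nn

class PhonemeWeightedPooling(nn.Module):
    """Illustrative sketch of phoneme-aware attentive pooling: a learnable term
    per phoneme class rescales frame-level attention scores before pooling.
    This is an assumption about the mechanism, not the published PDAF layer."""

    def __init__(self, feat_dim: int, n_phonemes: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)
        self.phoneme_logit = nn.Parameter(torch.zeros(n_phonemes))  # per-phoneme bias

    def forward(self, frames: torch.Tensor, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim); phoneme_ids: (B, T) frame-aligned phoneme labels
        att = self.score(frames).squeeze(-1)                 # (B, T)
        att = att + self.phoneme_logit[phoneme_ids]          # debias by phoneme class
        weights = torch.softmax(att, dim=-1).unsqueeze(-1)   # (B, T, 1)
        return (weights * frames).sum(dim=1)                 # (B, feat_dim) utterance embedding

# Toy usage: 2 utterances, 50 frames, 192-dim features, 40 phoneme classes.
pool = PhonemeWeightedPooling(192, 40)
emb = pool(torch.randn(2, 50, 192), torch.randint(0, 40, (2, 50)))
print(emb.shape)   # torch.Size([2, 192])
```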
{"title":"PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification","authors":"Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj","doi":"arxiv-2409.05799","DOIUrl":"https://doi.org/arxiv-2409.05799","url":null,"abstract":"Speaker verification systems are crucial for authenticating identity through\u0000voice. Traditionally, these systems focus on comparing feature vectors,\u0000overlooking the speech's content. However, this paper challenges this by\u0000highlighting the importance of phonetic dominance, a measure of the frequency\u0000or duration of phonemes, as a crucial cue in speaker verification. A novel\u0000Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with\u0000existing attention frameworks to mitigate biases caused by phonetic dominance.\u0000PDAF adjusts the weighting for each phoneme and influences feature extraction,\u0000allowing for a more nuanced analysis of speech. This approach paves the way for\u0000more accurate and reliable identity authentication through voice. Furthermore,\u0000by employing various weighting strategies, we evaluate the influence of\u0000phonetic features on the efficacy of the speaker verification system.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font
Reverberation is a key element in spatial audio perception, historically achieved with analogue devices such as plate and spring reverbs, and in recent decades with digital signal processing techniques that have enabled different approaches to Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation.
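For context, a common objective and evaluation metric in neural modelling of audio effects is the error-to-signal ratio (ESR), i.e. the residual energy normalised by the target energy. The sketch below shows how candidate model outputs might be scored against a recorded spring response; whether the paper uses exactly this metric is an assumption.

```python
import torch

def error_to_signal_ratio(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Error-to-signal ratio (ESR): energy of the residual divided by the energy
    of the target. Common in virtual-analogue modelling; whether the paper uses
    exactly this metric is an assumption."""
    return ((pred - target) ** 2).sum() / ((target ** 2).sum() + eps)

# Toy comparison of two hypothetical model outputs against a reference response.
target = torch.sin(torch.linspace(0, 100, 16000))
pred_good = target + 0.01 * torch.randn_like(target)
pred_bad = target + 0.3 * torch.randn_like(target)
print(error_to_signal_ratio(pred_good, target).item())  # small
print(error_to_signal_ratio(pred_bad, target).item())   # larger
```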
{"title":"Evaluating Neural Networks Architectures for Spring Reverb Modelling","authors":"Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font","doi":"arxiv-2409.04953","DOIUrl":"https://doi.org/arxiv-2409.04953","url":null,"abstract":"Reverberation is a key element in spatial audio perception, historically\u0000achieved with the use of analogue devices, such as plate and spring reverb, and\u0000in the last decades with digital signal processing techniques that have allowed\u0000different approaches for Virtual Analogue Modelling (VAM). The\u0000electromechanical functioning of the spring reverb makes it a nonlinear system\u0000that is difficult to fully emulate in the digital domain with white-box\u0000modelling techniques. In this study, we compare five different neural network\u0000architectures, including convolutional and recurrent models, to assess their\u0000effectiveness in replicating the characteristics of this audio effect. The\u0000evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz.\u0000This paper specifically focuses on neural audio architectures that offer\u0000parametric control, aiming to advance the boundaries of current black-box\u0000modelling techniques in the domain of spring reverberation.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constance Douwes, Romain Serizel
The massive use of machine learning models, particularly neural networks, has raised serious concerns about their environmental impact. Indeed, over the last few years we have seen an explosion in the computing costs associated with training and deploying these systems. It is therefore crucial to understand their energy requirements in order to better integrate them into the evaluation of models, which has so far focused mainly on performance. In this paper, we study several neural network architectures that are key components of sound event detection systems, using an audio tagging task as an example. We measure the energy consumption for training and testing small to large architectures and establish complex relationships between the energy consumption, the number of floating-point operations, the number of parameters, and the GPU/memory utilization.
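The paper's exact measurement tooling is not described in the abstract; one practical way to obtain per-run energy figures on NVIDIA hardware is to poll the GPU power draw via NVML in a background thread and integrate it over the wrapped code region, as in the sketch below (the training call is hypothetical).

```python
import time
import threading
import pynvml  # NVIDIA management library bindings (pip install nvidia-ml-py)

class GpuEnergyMeter:
    """Integrate GPU power draw over a code region by periodic polling.
    One possible measurement setup; not necessarily the one used in the paper."""

    def __init__(self, device_index: int = 0, interval_s: float = 0.1):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.interval_s = interval_s
        self.energy_j = 0.0
        self._stop = threading.Event()

    def _poll(self):
        last = time.time()
        while not self._stop.is_set():
            time.sleep(self.interval_s)
            now = time.time()
            watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
            self.energy_j += watts * (now - last)
            last = now

    def __enter__(self):
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        pynvml.nvmlShutdown()

# Usage: wrap a training or inference loop.
# with GpuEnergyMeter() as meter:
#     train_one_epoch(model, loader)   # hypothetical call
# print(f"{meter.energy_j / 3600:.3f} Wh")
```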
{"title":"From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems","authors":"Constance Douwes, Romain Serizel","doi":"arxiv-2409.05080","DOIUrl":"https://doi.org/arxiv-2409.05080","url":null,"abstract":"The massive use of machine learning models, particularly neural networks, has\u0000raised serious concerns about their environmental impact. Indeed, over the last\u0000few years we have seen an explosion in the computing costs associated with\u0000training and deploying these systems. It is, therefore, crucial to understand\u0000their energy requirements in order to better integrate them into the evaluation\u0000of models, which has so far focused mainly on performance. In this paper, we\u0000study several neural network architectures that are key components of sound\u0000event detection systems, using an audio tagging task as an example. We measure\u0000the energy consumption for training and testing small to large architectures\u0000and establish complex relationships between the energy consumption, the number\u0000of floating-point operations, the number of parameters, and the GPU/memory\u0000utilization.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simon Linke, Gerrit Wendt, Rolf Bader
Indonesian and Western gamelan ensembles are investigated with respect to performance differences. The often exotistic history of this music in the West might thereby be reflected in differences in contemporary tonal systems, articulation, or large-scale form. Analyzing recordings of four Western and five Indonesian orchestras with respect to tonal systems and timbre features, and using a self-organizing Kohonen map (SOM) as the machine learning algorithm, a clear clustering of Indonesian versus Western ensembles emerges for certain psychoacoustic features. These point to reduced articulation and large-scale form variability in the Western ensembles compared to the Indonesian ones. The SOM also clusters the ensembles with respect to their tonal systems, but no separation between Indonesian and Western ensembles is found in this respect. A clear analogy therefore appears between the lower articulatory and large-scale form variability and a more exotistic, meditative, and calm performance expectation and reception of gamelan in the West.
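An SOM-based clustering of per-ensemble feature vectors can be sketched with the MiniSom library; the feature dimensionality and the synthetic data below are placeholders, since the actual psychoacoustic features are not reproduced here.

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

# Toy stand-in for per-recording psychoacoustic feature vectors (the real
# features and their dimensionality are assumptions here).
rng = np.random.default_rng(1)
western = rng.normal(0.0, 0.3, size=(4, 8))     # 4 Western ensembles, 8 features
indonesian = rng.normal(1.0, 0.3, size=(5, 8))  # 5 Indonesian ensembles
data = np.vstack([western, indonesian])
labels = ["W"] * 4 + ["I"] * 5

som = MiniSom(4, 4, input_len=8, sigma=1.0, learning_rate=0.5, random_seed=1)
som.train_random(data, 500)

# Inspect which map node each ensemble lands on; separated clusters show up
# as disjoint groups of winning nodes for the two traditions.
for label, vec in zip(labels, data):
    print(label, som.winner(vec))
```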
{"title":"Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters","authors":"Simon Linke, Gerrit Wendt, Rolf Bader","doi":"arxiv-2409.03713","DOIUrl":"https://doi.org/arxiv-2409.03713","url":null,"abstract":"Indonesian and Western gamelan ensembles are investigated with respect to\u0000performance differences. Thereby, the often exotistic history of this music in\u0000the West might be reflected in contemporary tonal system, articulation, or\u0000large-scale form differences. Analyzing recordings of four Western and five\u0000Indonesian orchestras with respect to tonal systems and timbre features and\u0000using self-organizing Kohonen map (SOM) as a machine learning algorithm, a\u0000clear clustering between Indonesian and Western ensembles appears using certain\u0000psychoacoustic features. These point to a reduced articulation and large-scale\u0000form variability of Western ensembles compared to Indonesian ones. The SOM also\u0000clusters the ensembles with respect to their tonal systems, but no clusters\u0000between Indonesian and Western ensembles can be found in this respect.\u0000Therefore, a clear analogy between lower articulatory variability and\u0000large-scale form variation and a more exostistic, mediative and calm\u0000performance expectation and reception of gamelan in the West therefore appears.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}