
arXiv - CS - Sound: Latest Publications

AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Pub Date: 2024-09-13 | DOI: arxiv-2409.09098
Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) results on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition the ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross-accent generation and enables unseen-accent generation.
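To make the two-stage idea concrete, here is a minimal PyTorch sketch of the second stage: a frozen, speaker-agnostic accent embedding (as would come from the AID model) conditions a toy TTS decoder. All module names and dimensions here are hypothetical, not the paper's architecture.

```python
# Hypothetical stage-2 conditioning: an accent embedding extracted by a
# pretrained AID model is projected and added to the text-encoder states.
import torch
import torch.nn as nn

class AccentConditionedTTS(nn.Module):
    def __init__(self, text_dim=256, accent_dim=64, hidden=512, n_mels=80):
        super().__init__()
        self.text_encoder = nn.GRU(text_dim, hidden, batch_first=True)
        self.accent_proj = nn.Linear(accent_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, text_emb, accent_emb):
        enc, _ = self.text_encoder(text_emb)              # (B, T, H)
        cond = self.accent_proj(accent_emb).unsqueeze(1)  # (B, 1, H)
        dec, _ = self.decoder(enc + cond)                 # broadcast over time
        return self.mel_head(dec)                         # (B, T, n_mels)

tts = AccentConditionedTTS()
mel = tts(torch.randn(2, 50, 256), torch.randn(2, 64))   # toy batch
print(mel.shape)  # torch.Size([2, 50, 80])
```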
Citations: 0
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Pub Date: 2024-09-13 | DOI: arxiv-2409.09214
Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music .
Citations: 0
Benchmarking Sub-Genre Classification For Mainstage Dance Music
Pub Date: 2024-09-10 | DOI: arxiv-2409.06690
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods for the classification of mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. Our dataset extends the number of sub-genres to cover the most recent mainstage live sets by top DJs worldwide at music festivals. A continuous soft labeling approach is employed to account for tracks that span multiple sub-genres, preserving the inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark is applicable to scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is at https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/ .
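The continuous soft labeling approach can be illustrated in a few lines: each track's target is a probability distribution over sub-genres rather than a one-hot class, trained with a soft-target cross-entropy. This is a generic sketch; the sub-genre shares below are invented for illustration.

```python
# Soft-target cross-entropy over sub-genre distributions (generic sketch).
import torch
import torch.nn.functional as F

num_subgenres = 8
logits = torch.randn(4, num_subgenres, requires_grad=True)  # model outputs

soft_targets = torch.zeros(4, num_subgenres)
soft_targets[0, 0], soft_targets[0, 1] = 0.7, 0.3  # track spanning two sub-genres
soft_targets[1:, 2] = 1.0                          # single-sub-genre tracks

# Cross-entropy against a distribution instead of a hard class index.
loss = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss.backward()
print(float(loss))
```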
Citations: 0
Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks
Pub Date: 2024-09-09 | DOI: arxiv-2409.05319
Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki
Early detection of factory machinery malfunctions is crucial in industrial applications. In machine anomalous sound detection (ASD), different machines exhibit unique vibration-frequency ranges based on their physical properties. Meanwhile, the human auditory system is adept at tracking both the temporal and spectral dynamics of machine sounds. Consequently, integrating computational auditory models of the human auditory system with machine-specific properties can be an effective approach to machine ASD. We first quantified the frequency importances of four types of machines using the Fisher ratio (F-ratio). The quantified frequency importances were then used to design machine-specific non-uniform filterbanks (NUFBs), which extract the log non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower bandwidth and higher filter distribution density in frequency regions with relatively high F-ratios. Finally, spectral and temporal modulation representations derived from the LNS feature were proposed. The proposed LNS feature and modulation representations are input into an autoencoder-neural-network-based detector for ASD. Quantification results on the training set of the Malfunctioning Industrial Machine Investigation and Inspection dataset, with a signal-to-noise ratio (SNR) of 6 dB, reveal that the information distinguishing the normal and anomalous sounds of different machines is encoded non-uniformly in the frequency domain. By highlighting these important frequency regions with NUFBs, the LNS feature can significantly enhance performance in terms of AUC (area under the receiver operating characteristic curve) under various SNR conditions. Furthermore, the modulation representations can further improve performance. Specifically, temporal modulation is effective for fans, pumps, and sliders, while spectral modulation is particularly effective for valves.
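The F-ratio step is essentially a per-frequency-bin Fisher discriminant: between-class variance of the bin means over the mean within-class variance. Below is a NumPy sketch under one plausible reading (normal vs. anomalous log-spectra of a single machine type), with random data standing in for real STFT frames.

```python
# Per-bin Fisher ratio between two classes of log-spectra
# (e.g. normal vs. anomalous recordings of one machine type).
import numpy as np

rng = np.random.default_rng(0)
n_bins = 257  # e.g. one-sided 512-point STFT
normal = rng.normal(loc=0.0, size=(200, n_bins))
anomalous = rng.normal(loc=0.8, size=(200, n_bins))

grand_mean = np.mean(np.vstack([normal, anomalous]), axis=0)
between = np.mean([(c.mean(0) - grand_mean) ** 2 for c in (normal, anomalous)], axis=0)
within = np.mean([c.var(0) for c in (normal, anomalous)], axis=0)
f_ratio = between / within  # shape (n_bins,): high values = discriminative bins

# Bins with high F-ratio would receive denser, narrower filters in a
# machine-specific non-uniform filterbank (NUFB).
print(f_ratio.argmax(), f_ratio.max())
```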
Citations: 0
Harmonic Reasoning in Large Language Models
Pub Date: 2024-09-09 | DOI: arxiv-2409.05521
Anna Kruspe
Large Language Models (LLMs) are becoming very popular and are used for many different purposes, including creative tasks in the arts. However, these models sometimes have trouble with specific reasoning tasks, especially those that involve logical thinking and counting. This paper looks at how well LLMs understand and reason when dealing with musical tasks like figuring out notes from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o to see how they handle these tasks. Our results show that while LLMs do well with note intervals, they struggle with more complicated tasks like recognizing chords and scales. This points out clear limits in current LLM abilities and shows where we need to make them better, which could help improve how they think and work in both artistic and other complex areas. We also provide an automatically generated benchmark dataset for the described tasks.
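Tasks like "note from interval" have an exact programmatic ground truth, which is presumably what makes automatic benchmark generation possible. A small sketch of that ground truth over pitch classes; the interval table is an illustrative subset, not the paper's task set.

```python
# Semitone arithmetic over pitch classes: ground truth for interval tasks.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
INTERVALS = {"m3": 3, "M3": 4, "P5": 7, "m7": 10, "M7": 11, "P8": 12}

def note_from_interval(root: str, interval: str) -> str:
    return NOTES[(NOTES.index(root) + INTERVALS[interval]) % 12]

def major_triad(root: str) -> list[str]:
    return [root, note_from_interval(root, "M3"), note_from_interval(root, "P5")]

print(note_from_interval("C", "P5"))  # G
print(major_triad("E"))               # ['E', 'G#', 'B']
```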
Citations: 0
Evaluation of real-time transcriptions using end-to-end ASR models
Pub Date: 2024-09-09 | DOI: arxiv-2409.05674
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has greatly evolved in the last few years. Traditional architectures based on pipelines have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weakly-supervised learning, have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at an arbitrary point, as dividing an utterance into two separate fragments will generate an incorrect transcription. Also, shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay. The algorithms are fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same model without audio fragmentation to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a reduction of 1.5-2 s in delay relative to VAD splitting.
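A sketch of two of the three strategies, assuming raw waveforms: fixed-interval chunking, and a simple energy threshold standing in for a real VAD. The paper's actual VAD and the feedback algorithm are not reproduced here; frame sizes and the threshold are illustrative choices.

```python
# Fixed-interval chunking vs. a naive energy-based voice-activity split.
import numpy as np

def split_fixed(audio: np.ndarray, sr: int, seconds: float = 2.0):
    step = int(sr * seconds)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def split_vad(audio: np.ndarray, sr: int, frame_ms: int = 30, thresh: float = 1e-3):
    frame = int(sr * frame_ms / 1000)
    frames = [audio[i:i + frame] for i in range(0, len(audio), frame)]
    chunks, current = [], []
    for f in frames:
        if np.mean(f ** 2) > thresh:       # speech-like frame: keep buffering
            current.append(f)
        elif current:                      # silence after speech: cut here
            chunks.append(np.concatenate(current))
            current = []
    if current:
        chunks.append(np.concatenate(current))
    return chunks

sr = 16000
audio = np.concatenate([np.random.randn(sr), np.zeros(sr // 2), np.random.randn(sr)])
print(len(split_fixed(audio, sr)), len(split_vad(audio, sr)))
```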
Citations: 0
PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification
Pub Date: 2024-09-09 | DOI: arxiv-2409.05799
Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems focus on comparing feature vectors, overlooking the speech's content. However, this paper challenges this by highlighting the importance of phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speaker verification. A novel Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with existing attention frameworks to mitigate biases caused by phonetic dominance. PDAF adjusts the weighting for each phoneme and influences feature extraction, allowing for a more nuanced analysis of speech. This approach paves the way for more accurate and reliable identity authentication through voice. Furthermore, by employing various weighting strategies, we evaluate the influence of phonetic features on the efficacy of the speaker verification system.
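The abstract does not spell out the weighting scheme, but one plausible reading of phoneme debiasing is down-weighting frames from over-represented phonemes before attention pooling. The sketch below implements that inverse-frequency assumption; it is not PDAF's published formulation.

```python
# Assumed inverse-frequency phoneme reweighting before attention pooling.
import torch

def phoneme_debiased_pooling(frames: torch.Tensor, phoneme_ids: torch.Tensor):
    """frames: (T, D) frame embeddings; phoneme_ids: (T,) aligned phoneme labels."""
    counts = torch.bincount(phoneme_ids).float()
    weights = 1.0 / counts[phoneme_ids]          # rarer phoneme -> larger weight
    attn = torch.softmax(weights.log(), dim=0)   # normalize to an attention map
    return (attn.unsqueeze(-1) * frames).sum(0)  # (D,) utterance embedding

emb = phoneme_debiased_pooling(torch.randn(6, 4), torch.tensor([0, 0, 0, 1, 1, 2]))
print(emb.shape)  # torch.Size([4])
```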
Citations: 0
Evaluating Neural Networks Architectures for Spring Reverb Modelling
Pub Date: 2024-09-08 | DOI: arxiv-2409.04953
Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font
Reverberation is a key element in spatial audio perception, historically achieved with the use of analogue devices, such as plate and spring reverbs, and in the last decades with digital signal processing techniques that have allowed different approaches to Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation.
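For orientation, a minimal sketch of the two architecture families under comparison: a causal 1-D convolutional model and a recurrent model, each mapping dry audio to a (learned) reverberated output. Layer sizes are illustrative, not the paper's configurations, and the parametric-control conditioning is omitted.

```python
# Two black-box audio-effect model families: causal conv vs. recurrent.
import torch
import torch.nn as nn

class ConvReverb(nn.Module):
    def __init__(self, channels=16, kernel=64):
        super().__init__()
        self.pad = kernel - 1                      # left-pad only -> causal
        self.conv = nn.Conv1d(1, channels, kernel)
        self.out = nn.Conv1d(channels, 1, 1)

    def forward(self, x):                          # x: (B, 1, T)
        y = torch.relu(self.conv(nn.functional.pad(x, (self.pad, 0))))
        return self.out(y)

class RNNReverb(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (B, 1, T)
        y, _ = self.rnn(x.transpose(1, 2))         # (B, T, H)
        return self.out(y).transpose(1, 2)         # back to (B, 1, T)

x = torch.randn(1, 1, 16000)                       # 1 s of audio at 16 kHz
print(ConvReverb()(x).shape, RNNReverb()(x).shape)
```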
Citations: 0
From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems
Pub Date: 2024-09-08 | DOI: arxiv-2409.05080
Constance Douwes, Romain Serizel
The massive use of machine learning models, particularly neural networks, has raised serious concerns about their environmental impact. Indeed, over the last few years we have seen an explosion in the computing costs associated with training and deploying these systems. It is, therefore, crucial to understand their energy requirements in order to better integrate them into the evaluation of models, which has so far focused mainly on performance. In this paper, we study several neural network architectures that are key components of sound event detection systems, using an audio tagging task as an example. We measure the energy consumption for training and testing small to large architectures and establish complex relationships between the energy consumption, the number of floating-point operations, the number of parameters, and the GPU/memory utilization.
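One common way to obtain such measurements is to sample GPU power draw with NVML while training and integrate it over time. The sketch below uses the pynvml bindings with a dummy training step; it covers a single GPU only and ignores CPU and DRAM draw, so it is an assumption about the measurement setup, not the paper's exact protocol.

```python
# Energy estimate by integrating sampled GPU power draw (NVML via pynvml).
import time
import pynvml

def train_one_step():
    time.sleep(0.01)  # stand-in for a real forward/backward pass

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

energy_joules, last = 0.0, time.time()
for step in range(100):
    train_one_step()
    now = time.time()
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    energy_joules += watts * (now - last)                    # P * dt
    last = now

pynvml.nvmlShutdown()
print(f"{energy_joules / 3600:.4f} Wh over 100 steps")
```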
Citations: 0
Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters
Pub Date: 2024-09-03 | DOI: arxiv-2409.03713
Simon Linke, Gerrit Wendt, Rolf Bader
Indonesian and Western gamelan ensembles are investigated with respect to performance differences. The often exotistic history of this music in the West might thereby be reflected in differences in contemporary tonal systems, articulation, or large-scale form. Analyzing recordings of four Western and five Indonesian orchestras with respect to tonal systems and timbre features, and using a self-organizing Kohonen map (SOM) as a machine learning algorithm, a clear clustering between Indonesian and Western ensembles appears for certain psychoacoustic features. These point to reduced articulation and large-scale form variability in Western ensembles compared to Indonesian ones. The SOM also clusters the ensembles with respect to their tonal systems, but no clusters between Indonesian and Western ensembles can be found in this respect. A clear analogy therefore appears between lower articulatory and large-scale form variability and a more exotistic, meditative, and calm performance expectation and reception of gamelan in the West.
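A small sketch of the clustering step, assuming one psychoacoustic feature vector per recording, using the MiniSom package; the feature vectors below are random stand-ins for the paper's timbre and tonal-system descriptors.

```python
# SOM clustering of per-recording feature vectors (pip install minisom).
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(1)
western = rng.normal(0.0, 1.0, size=(20, 16))     # 16-dim feature stand-ins
indonesian = rng.normal(1.5, 1.0, size=(20, 16))
data = np.vstack([western, indonesian])

som = MiniSom(6, 6, 16, sigma=1.0, learning_rate=0.5, random_seed=1)
som.train_random(data, num_iteration=2000)

# Recordings mapping to nearby best-matching units form clusters on the map.
for label, feats in (("western", western), ("indonesian", indonesian)):
    cells = {som.winner(x) for x in feats}
    print(label, sorted(cells))
```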
Citations: 0