
arXiv - CS - Sound: Latest Publications

Sample-Efficient Diffusion for Text-To-Speech Synthesis
Pub Date : 2024-09-01 DOI: arxiv-2409.03717
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, which we call the U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech - far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.
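As a rough illustration of the conditional latent-diffusion recipe the abstract describes (not the authors' U-AT implementation), the sketch below trains a stand-in denoiser to predict the noise added to audio-autoencoder latents, conditioned on text-derived embeddings. All module names, shapes, and the noise schedule are illustrative assumptions.

```python
# Minimal sketch of a conditional latent-diffusion training step (illustrative only;
# SESD itself uses a U-Audio Transformer over latents of a pre-trained audio autoencoder).
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in backbone: predicts noise from a noisy latent, a timestep, and a text condition."""
    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, cond):
        t = t.float().unsqueeze(-1) / 1000.0               # crude timestep embedding
        return self.net(torch.cat([z_t, t, cond], dim=-1))

def diffusion_loss(denoiser, z0, cond, alphas_cumprod):
    """Standard epsilon-prediction objective on audio latents z0."""
    b = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise   # forward (noising) process
    return nn.functional.mse_loss(denoiser(z_t, t, cond), noise)

# Toy usage: latents would come from the audio autoencoder and cond from a
# character-aware language model; here both are random placeholders.
denoiser = ToyDenoiser()
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
loss = diffusion_loss(denoiser, torch.randn(4, 64), torch.randn(4, 128), alphas_cumprod)
loss.backward()
```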
Citations: 0
Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation
Pub Date : 2024-08-27 DOI: arxiv-2408.15002
Elona Shatri, George Fazekas
Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the costs and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method achieving a mean Average Precision (mAP) of up to 59.70% in dense symbol environments, comparable to object detection. Furthermore, using traditional computer vision techniques, we add a parallel step for staff detection to infer the pitch for the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition, contributing to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.
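For readers unfamiliar with instance segmentation, a minimal sketch of adapting torchvision's Mask R-CNN to musical-symbol classes follows; the class count, input resolution, and data pipeline are assumptions rather than the paper's actual setup.

```python
# Sketch: fine-tuning torchvision's Mask R-CNN for musical-symbol instance segmentation.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_omr_maskrcnn(num_symbol_classes: int):
    model = maskrcnn_resnet50_fpn()                     # optionally start from COCO weights
    # Replace the box-classification head with one sized for symbol classes.
    in_feat = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_symbol_classes)
    # Replace the mask head so per-pixel masks are predicted per symbol class.
    in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_symbol_classes)
    return model

# In training mode the model consumes images plus targets with boxes, labels, and masks.
model = build_omr_maskrcnn(num_symbol_classes=72)       # illustrative: 71 symbol types + background
model.train()
images = [torch.rand(3, 512, 512)]
targets = [{
    "boxes": torch.tensor([[10.0, 20.0, 60.0, 90.0]]),  # toy annotation
    "labels": torch.tensor([1]),
    "masks": torch.zeros(1, 512, 512, dtype=torch.uint8),
}]
loss_dict = model(images, targets)                      # detection + segmentation losses
total_loss = sum(loss_dict.values())
```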
Citations: 0
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
Pub Date : 2024-08-18 DOI: arxiv-2408.09438
Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li
To address the limitation in multimodal emotion recognition (MER) performance arising from inter-modal information fusion, we propose a novel MER framework based on multitask learning where fusion occurs after alignment, called Foal-Net. The framework is designed to enhance the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL achieves alignment of emotional information in audio-video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, providing assistance for modal information fusion and guiding the model to focus more on emotional information. The experimental results conducted on the IEMOCAP corpus show that Foal-Net outperforms the state-of-the-art methods and emotion alignment is necessary before modal fusion.
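A minimal sketch of the "align before fuse" idea is shown below: a symmetric InfoNCE-style loss pulls paired audio and video emotion embeddings together before fusion. The embedding dimensions, temperature, and loss weighting are assumptions, not Foal-Net's exact formulation.

```python
# Contrastive audio-video alignment: matched pairs sit on the diagonal of the
# similarity matrix and are treated as the correct class in both directions.
import torch
import torch.nn.functional as F

def audio_video_alignment_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (B, D) embeddings of the same B utterances."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                   # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

audio_emb, video_emb = torch.randn(8, 256), torch.randn(8, 256)
loss = audio_video_alignment_loss(audio_emb, video_emb)
```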
Citations: 0
A New Dataset, Notation Software, and Representation for Computational Schenkerian Analysis
Pub Date : 2024-08-13 DOI: arxiv-2408.07184
Stephen Ni-Hahn, Weihan Xu, Jerry Yin, Rico Zhu, Simon Mak, Yue Jiang, Cynthia Rudin
Schenkerian Analysis (SchA) is a uniquely expressive method of music analysis, combining elements of melody, harmony, counterpoint, and form to describe the hierarchical structure supporting a work of music. However, despite its powerful analytical utility and potential to improve music understanding and generation, SchA has rarely been utilized by the computer music community. This is in large part due to the paucity of available high-quality data in a computer-readable format. With a larger corpus of Schenkerian data, it may be possible to infuse machine learning models with a deeper understanding of musical structure, thus leading to more "human" results. To encourage further research in Schenkerian analysis and its potential benefits for music informatics and generation, this paper presents three main contributions: 1) a new and growing dataset of SchAs, the largest in human- and computer-readable formats to date (>140 excerpts), 2) a novel software for visualization and collection of SchA data, and 3) a novel, flexible representation of SchA as a heterogeneous-edge graph data structure.
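One plausible (hypothetical) way to realize a heterogeneous-edge graph for an analysis is sketched below with networkx: notes become nodes and each edge carries a type attribute. The edge kinds and attributes are illustrative, not the paper's schema.

```python
# Toy heterogeneous-edge graph for a three-note fragment.
import networkx as nx

def build_scha_graph():
    g = nx.MultiDiGraph()
    # Nodes: surface notes with basic attributes.
    g.add_node("n1", pitch="E5", onset=0.0)
    g.add_node("n2", pitch="D5", onset=1.0)
    g.add_node("n3", pitch="C5", onset=2.0)
    # Heterogeneous edges: different edge kinds encode different analytical relations.
    g.add_edge("n1", "n2", kind="voice_leading")
    g.add_edge("n2", "n3", kind="voice_leading")
    g.add_edge("n1", "n3", kind="prolongation", level=1)   # deeper structural level
    return g

g = build_scha_graph()
prolongations = [(u, v) for u, v, d in g.edges(data=True) if d["kind"] == "prolongation"]
```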
Citations: 0
MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling
Pub Date : 2024-08-09 DOI: arxiv-2408.05024
Drew Edwards, Xavier Riley, Pedro Sarmento, Simon Dixon
Guitar tablatures enrich the structure of traditional music notation by assigning each note to a string and fret of a guitar in a particular tuning, indicating precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the guitar, multiple string-fret assignments are possible for most pitches, which leads to a large combinatorial space that prevents exhaustive search approaches. Most modern methods use constraint-based dynamic programming to minimize some cost function (e.g. hand position movement). In this work, we introduce a novel deep learning solution to symbolic guitar tablature estimation. We train an encoder-decoder Transformer model in a masked language modeling paradigm to assign notes to strings. The model is first pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on a curated set of professionally transcribed guitar performances. Given the subjective nature of assessing tablature quality, we conduct a user study amongst guitarists, wherein we ask participants to rate the playability of multiple versions of tablature for the same four-bar excerpt. The results indicate our system significantly outperforms competing algorithms.
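The masked-language-modeling setup for string assignment can be sketched roughly as follows: some string labels are replaced by a mask token and a small Transformer encoder predicts them from the surrounding note context. The vocabulary layout, model sizes, and masking rate are assumptions, not the paper's configuration.

```python
# Masked string-assignment objective: cross-entropy is computed only on masked positions.
import torch
import torch.nn as nn

NUM_STRINGS, MASK_ID, VOCAB = 6, 6, 7        # string ids 0-5 plus a [MASK] token

class StringAssigner(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.pitch_emb = nn.Embedding(128, d_model)        # MIDI pitch 0-127
        self.string_emb = nn.Embedding(VOCAB, d_model)     # known / masked string ids
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, NUM_STRINGS)

    def forward(self, pitches, strings):
        x = self.pitch_emb(pitches) + self.string_emb(strings)
        return self.head(self.encoder(x))                  # (B, T, NUM_STRINGS) logits

model = StringAssigner()
pitches = torch.randint(40, 80, (2, 16))                   # toy note sequences
true_strings = torch.randint(0, NUM_STRINGS, (2, 16))
mask = torch.rand(2, 16) < 0.3                             # mask ~30% of string labels
inputs = true_strings.masked_fill(mask, MASK_ID)
logits = model(pitches, inputs)
loss = nn.functional.cross_entropy(logits[mask], true_strings[mask])
```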
Citations: 0
DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework
Pub Date : 2024-08-01 DOI: arxiv-2408.00370
Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma
Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly utilize Transformer-based architectures that necessitate extensive memory and are characterized by slow inference speeds. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. This model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba framework and a WavLM pre-trained model, autonomously derives implicit, continuous fuzzy features, which are then unified into a singular latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and the resultant gesture sequence. This innovative approach guarantees high fidelity in gesture-speech synchronization while maintaining the naturalness of the gestures. Employing a diffusion model for training and inference, our framework has undergone extensive subjective and objective evaluations on the ZEGGS and BEAT datasets. These assessments substantiate our model's enhanced performance relative to contemporary state-of-the-art methods, demonstrating competitive outcomes with the DiTs architecture (Persona-Gestors) while optimizing memory usage and accelerating inference speed.
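The adaptive layer normalization (AdaLN) conditioning mentioned above can be illustrated in a few lines: the scale and shift applied after a LayerNorm are generated from a conditioning vector, here standing in for the speech-derived fuzzy feature. Dimensions are illustrative, and this is only the conditioning mechanism, not the Mamba-2 block itself.

```python
# AdaLN: a non-affine LayerNorm whose scale/shift come from a conditioning vector.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (B, T, dim) gesture tokens, cond: (B, cond_dim) speech-derived condition.
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

ada = AdaLN(dim=256, cond_dim=512)
out = ada(torch.randn(2, 100, 256), torch.randn(2, 512))   # same condition applied to all tokens
```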
Citations: 0
Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation
Pub Date : 2024-07-27 DOI: arxiv-2407.19265
Riyansha Singh (IIT Kanpur, India), Parinita Nema (IISER Bhopal, India), Vinod K Kurmi (IISER Bhopal, India)
In machine learning applications, gradual data ingress is common, especially in audio processing where incremental learning is vital for real-time analytics. Few-shot class-incremental learning addresses challenges arising from limited incoming data. Existing methods often integrate additional trainable components or rely on a fixed embedding extractor post-training on base sessions to mitigate concerns related to catastrophic forgetting and the dangers of model overfitting. However, using cross-entropy loss alone during base session training is suboptimal for audio data. To address this, we propose incorporating supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization since it facilitates seamless integration of incremental classes, upon arrival. Experimental results on NSynth and LibriSpeech datasets with 100 classes, as well as ESC dataset with 50 and 10 classes, demonstrate state-of-the-art performance.
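A compact sketch of a supervised-contrastive (SupCon-style) loss over L2-normalized embeddings is given below; the temperature and how the loss is combined with cross-entropy during base-session training are assumptions, not the paper's exact recipe.

```python
# Supervised contrastive loss: same-label samples in the batch act as positives.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings, labels: (N,) integer class labels."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature
    # Numerical stability: subtract the per-row max before exponentiating.
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    # Average log-probability over each anchor's positives, then over anchors with >= 1 positive.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

feats, labels = torch.randn(16, 128), torch.randint(0, 4, (16,))
loss = supervised_contrastive_loss(feats, labels)
```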
Citations: 0
Implementation and Applications of WakeWords Integrated with Speaker Recognition: A Case Study
Pub Date : 2024-07-25 DOI: arxiv-2407.18985
Alexandre Costa Ferro Filho, Elisa Ayumi Masasi de Oliveira, Iago Alves Brito, Pedro Martins Bittencourt
This paper explores the application of artificial intelligence techniques in audio and voice processing, focusing on the integration of wake words and speaker recognition for secure access in embedded systems. With the growing prevalence of voice-activated devices such as Amazon Alexa, ensuring secure and user-specific interactions has become paramount. Our study aims to enhance the security framework of these systems by leveraging wake words for initial activation and speaker recognition to validate user permissions. By incorporating these AI-driven methodologies, we propose a robust solution that restricts system usage to authorized individuals, thereby mitigating unauthorized access risks. This research delves into the algorithms and technologies underpinning wake word detection and speaker recognition, evaluates their effectiveness in real-world applications, and discusses the potential for their implementation in various embedded systems, emphasizing security and user convenience. The findings underscore the feasibility and advantages of employing these AI techniques to create secure, user-friendly voice-activated systems.
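The two-stage gate described above (wake word first, then speaker verification against enrolled embeddings) might look roughly like the following; the detector, speaker encoder, and threshold are stand-ins, not the paper's implementation.

```python
# Pipeline sketch: a command is accepted only if the wake word fires AND the speaker
# embedding matches an enrolled user above a similarity threshold.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class SecureVoiceGate:
    def __init__(self, wakeword_detector, speaker_encoder, threshold=0.75):
        self.detect_wakeword = wakeword_detector        # audio -> bool
        self.encode_speaker = speaker_encoder           # audio -> np.ndarray embedding
        self.threshold = threshold
        self.enrolled = {}                              # user name -> averaged embedding

    def enroll(self, user, audio_clips):
        self.enrolled[user] = np.mean([self.encode_speaker(a) for a in audio_clips], axis=0)

    def authorize(self, audio):
        if not self.detect_wakeword(audio):
            return None                                 # stay idle: no wake word heard
        scores = {u: cosine(self.encode_speaker(audio), e) for u, e in self.enrolled.items()}
        user, score = max(scores.items(), key=lambda kv: kv[1]) if scores else (None, -1.0)
        return user if score >= self.threshold else None   # reject unknown speakers

# Toy stand-ins so the sketch runs end to end.
gate = SecureVoiceGate(lambda a: True, lambda a: np.asarray(a, dtype=float))
gate.enroll("alice", [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3]])
print(gate.authorize([0.95, 0.05, 0.25]))               # -> "alice" if similarity clears threshold
```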
Citations: 0
Towards Enhanced Classification of Abnormal Lung sound in Multi-breath: A Light Weight Multi-label and Multi-head Attention Classification Method
Pub Date : 2024-07-15 DOI: arxiv-2407.10828
Yi-Wei Chua, Yun-Chien Cheng
This study aims to develop an auxiliary diagnostic system for classifying abnormal lung respiratory sounds, enhancing the accuracy of automatic abnormal breath sound classification through an innovative multi-label learning approach and multi-head attention mechanism. Addressing the issue of class imbalance and lack of diversity in existing respiratory sound datasets, our study employs a lightweight and highly accurate model, using a two-dimensional label set to represent multiple respiratory sound characteristics. Our method achieved a 59.2% ICBHI score in the four-category task on the ICBHI2017 dataset, demonstrating its advantages in terms of lightweight design and high accuracy. This study not only improves the accuracy of automatic diagnosis of lung respiratory sound abnormalities but also opens new possibilities for clinical applications.
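A lightweight multi-label classifier with multi-head attention pooling, in the spirit of the description above, is sketched below; the feature front-end, label set, and layer sizes are assumptions rather than the paper's architecture.

```python
# Multi-label breath-sound classifier: attention pools time frames, BCE handles multiple labels.
import torch
import torch.nn as nn

class MultiLabelLungSoundNet(nn.Module):
    def __init__(self, n_features=64, d_model=128, n_labels=4):   # e.g. normal/crackle/wheeze/both
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))     # learned pooling query
        self.head = nn.Linear(d_model, n_labels)

    def forward(self, x):                        # x: (B, T, n_features) spectrogram frames
        h = self.proj(x)
        q = self.query.expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h)           # attend over time, one pooled vector per clip
        return self.head(pooled.squeeze(1))      # (B, n_labels) logits

model = MultiLabelLungSoundNet()
logits = model(torch.randn(8, 200, 64))
targets = torch.randint(0, 2, (8, 4)).float()    # multi-label 0/1 targets per recording
loss = nn.BCEWithLogitsLoss()(logits, targets)   # multi-label objective
```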
Citations: 0
Towards zero-shot amplifier modeling: One-to-many amplifier modeling via tone embedding control
Pub Date : 2024-07-15 DOI: arxiv-2407.10646
Yu-Hua Chen, Yen-Tung Yeh, Yuan-Chiao Cheng, Jui-Te Wu, Yu-Hsiang Ho, Jyh-Shing Roger Jang, Yi-Hsuan Yang
Replicating analog device circuits through neural audio effect modeling has garnered increasing interest in recent years. Existing work has predominantly focused on a one-to-one emulation strategy, modeling specific devices individually. In this paper, we tackle the less-explored scenario of one-to-many emulation, utilizing conditioning mechanisms to emulate multiple guitar amplifiers through a single neural model. For condition representation, we use contrastive learning to build a tone embedding encoder that extracts style-related features of various amplifiers, leveraging a dataset of comprehensive amplifier settings. Targeting zero-shot application scenarios, we also examine various strategies for tone embedding representation, evaluating referenced tone embedding against two retrieval-based embedding methods for amplifiers unseen in the training time. Our findings showcase the efficacy and potential of the proposed methods in achieving versatile one-to-many amplifier modeling, contributing a foundational step towards zero-shot audio modeling applications.
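One common way to condition a single amp model on a tone embedding is FiLM-style modulation; the sketch below uses that mechanism as a stand-in for the paper's conditioning scheme, with layer sizes and the embedding source assumed (the paper derives tone embeddings with a contrastive encoder).

```python
# One network, many amps: dilated conv blocks whose activations are scaled/shifted
# by a per-amplifier tone embedding.
import torch
import torch.nn as nn

class FiLMConv(nn.Module):
    """Dilated 1-D conv block modulated by the tone embedding."""
    def __init__(self, channels, tone_dim, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=dilation, dilation=dilation)
        self.film = nn.Linear(tone_dim, 2 * channels)

    def forward(self, x, tone):                  # x: (B, C, T), tone: (B, tone_dim)
        gamma, beta = self.film(tone).chunk(2, dim=-1)
        h = torch.tanh(self.conv(x))
        return x + h * gamma.unsqueeze(-1) + beta.unsqueeze(-1)   # residual, FiLM-modulated

class OneToManyAmp(nn.Module):
    def __init__(self, channels=16, tone_dim=128, n_blocks=4):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, 1)
        self.blocks = nn.ModuleList(FiLMConv(channels, tone_dim, 2 ** i) for i in range(n_blocks))
        self.out = nn.Conv1d(channels, 1, 1)

    def forward(self, audio, tone):              # audio: (B, 1, T) dry guitar signal
        h = self.inp(audio)
        for block in self.blocks:
            h = block(h, tone)
        return self.out(h)                       # (B, 1, T) emulated amp output

model = OneToManyAmp()
wet = model(torch.randn(2, 1, 4096), torch.randn(2, 128))   # one model, many tones
```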
Citations: 0