arXiv - CS - Sound: Latest Publications

A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Pub Date : 2024-05-21 DOI: arxiv-2405.17206
Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu, Ekram Hossain, Sooyong Park, MD Saiful Islam, Ehsan Hoque
We present a framework to recognize Parkinson's disease (PD) through English pangram utterances collected using a web application from diverse recording settings and environments, including participants' homes. Our dataset includes a global cohort of 1306 participants, including 392 diagnosed with PD. Leveraging the diversity of the dataset, spanning various demographic properties (such as age, sex, and ethnicity), we used deep learning embeddings derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind representing the speech dynamics associated with PD. Our novel fusion model for PD classification, which aligns different speech embeddings into a cohesive feature space, demonstrated superior performance over standard concatenation-based fusion models and other baselines (including models built on traditional acoustic features). In a randomized data split configuration, the model achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis confirmed that our model performs equitably across various demographic subgroups in terms of sex, ethnicity, and age, and remains robust regardless of disease duration. Furthermore, our model, when tested on two entirely unseen test datasets collected from clinical settings and from a PD care center, maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the model's robustness and its potential to enhance accessibility and health equity in real-world applications.
Citations: 0
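The fusion idea above — aligning heterogeneous speech embeddings (Wav2Vec 2.0, WavLM, ImageBind) into one feature space rather than simply concatenating them — can be illustrated with a minimal PyTorch sketch. The projection dimensions, the attention-style pooling, and the classifier head below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AlignedFusionClassifier(nn.Module):
    """Hedged sketch: project per-model embeddings into a shared space,
    pool them with learned attention weights, then classify PD vs. control.
    Embedding sizes are assumptions (Wav2Vec 2.0 / WavLM base: 768, ImageBind: 1024)."""
    def __init__(self, dims=(768, 768, 1024), shared_dim=256):
        super().__init__()
        # One linear projection per embedding source aligns them to a common space.
        self.projections = nn.ModuleList([nn.Linear(d, shared_dim) for d in dims])
        self.attention = nn.Linear(shared_dim, 1)      # scores each aligned embedding
        self.classifier = nn.Sequential(
            nn.Linear(shared_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, embeddings):
        # embeddings: list of tensors, one per source, each of shape (batch, dim_i)
        aligned = torch.stack(
            [proj(e) for proj, e in zip(self.projections, embeddings)], dim=1)
        weights = torch.softmax(self.attention(aligned), dim=1)   # (batch, n_sources, 1)
        fused = (weights * aligned).sum(dim=1)                    # (batch, shared_dim)
        return self.classifier(fused)                             # PD logit

# A concatenation baseline, by contrast, would simply do:
#   fused = torch.cat(embeddings, dim=-1)   # (batch, 768 + 768 + 1024)
model = AlignedFusionClassifier()
logit = model([torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 1024)])
```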
Enhancing DMI Interactions by Integrating Haptic Feedback for Intricate Vibrato Technique
Pub Date : 2024-05-17 DOI: arxiv-2405.10502
Ziyue Piao, Christian Frisson, Bavo Van Kerrebroeck, Marcelo M. Wanderley
This paper investigates the integration of force feedback in Digital Musical Instruments (DMI), specifically evaluating the reproduction of intricate vibrato techniques using haptic feedback controllers. We introduce our system for vibrato modulation using force feedback, composed of Bend-aid (a web-based sequencer platform using pre-designed haptic feedback models) and TorqueTuner (an open-source 1 Degree-of-Freedom (DoF) rotary haptic device for generating programmable haptic effects). We designed a formal user study to assess the impact of each haptic mode on user experience in a vibrato mimicry task. Twenty musically trained participants rated their user experience for the three haptic modes (Smooth, Detent, and Spring) using four Likert-scale scores: comfort, flexibility, ease of control, and helpfulness for the task. Finally, we asked participants to share their reflections. Our research indicates that while the Spring mode can help with light vibrato, preferences for haptic modes vary based on musical training background. This emphasizes the need for adaptable task interfaces and flexible haptic feedback in DMI design.
Citations: 0
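The abstract does not specify how the three haptic modes (Smooth, Detent, Spring) are rendered on the 1-DoF rotary device; the sketch below shows one plausible set of torque profiles commonly used in haptic rendering (viscous damping, sinusoidal detents, a linear restoring spring). The constants and formulas are illustrative assumptions, not TorqueTuner's actual behavior.

```python
import math

def smooth_torque(omega, damping=0.002):
    """Smooth mode (assumed): light viscous damping against angular velocity omega (rad/s)."""
    return -damping * omega

def detent_torque(theta, n_detents=24, strength=0.01):
    """Detent mode (assumed): sinusoidal attractors every 360 / n_detents degrees."""
    return -strength * math.sin(n_detents * theta)

def spring_torque(theta, theta_rest=0.0, stiffness=0.05):
    """Spring mode (assumed): linear restoring torque toward a rest angle."""
    return -stiffness * (theta - theta_rest)

# Example: torque commands (N*m) for a knob displaced 0.3 rad while turning at 1 rad/s.
print(smooth_torque(1.0), detent_torque(0.3), spring_torque(0.3))
```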
Comparative Study of Recurrent Neural Networks for Virtual Analog Audio Effects Modeling
Pub Date : 2024-05-07 DOI: arxiv-2405.04124
Riccardo Simionato, Stefano Fasciani
Analog electronic circuits are at the core of an important category of musical devices. The nonlinear features of their electronic components give analog musical devices a distinctive timbre and sound quality, making them highly desirable. Artificial neural networks have rapidly gained popularity for the emulation of analog audio effects circuits, particularly recurrent networks. While neural approaches have been successful in accurately modeling distortion circuits, they require architectural improvements that account for parameter conditioning and low-latency response. In this article, we explore the application of recent machine learning advancements for virtual analog modeling. We compare State Space models and Linear Recurrent Units against the more common Long Short Term Memory networks. These have shown promising ability in sequence-to-sequence modeling tasks, showing a notable improvement in signal history encoding. Our comparative study uses these black-box neural modeling techniques with a variety of audio effects. We evaluate the performance and limitations using multiple metrics aiming to assess the models' ability to accurately replicate energy envelopes, frequency contents, and transients in the audio signal. To incorporate control parameters we employ the Feature-wise Linear Modulation method. Long Short Term Memory networks exhibit better accuracy in emulating distortions and equalizers, while the State Space model, followed by Long Short Term Memory networks when integrated in an encoder-decoder structure, outperforms others in emulating saturation and compression. When considering long time-variant characteristics, the State Space model demonstrates the greatest accuracy. The Long Short Term Memory and, in particular, Linear Recurrent Unit networks present more tendency to introduce audio artifacts.
Citations: 0
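The Feature-wise Linear Modulation (FiLM) conditioning mentioned above — injecting control parameters such as knob positions into the recurrent model — amounts to predicting a per-feature scale and shift from the conditioning vector. The sketch below is a generic FiLM-conditioned LSTM effect model; the layer sizes and overall wiring are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: y = gamma(c) * x + beta(c)."""
    def __init__(self, cond_dim, feature_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feature_dim)

    def forward(self, features, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)

class ConditionedLSTM(nn.Module):
    """Hedged sketch of an LSTM audio-effect model whose hidden features are
    modulated by control parameters (e.g., normalized drive/tone knobs)."""
    def __init__(self, hidden=32, cond_dim=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.film = FiLM(cond_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, audio, knobs):
        # audio: (batch, samples, 1), knobs: (batch, cond_dim)
        h, _ = self.lstm(audio)
        return self.out(self.film(h, knobs))   # (batch, samples, 1) processed signal

wet = ConditionedLSTM()(torch.randn(2, 4096, 1), torch.rand(2, 2))
```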
POPDG: Popular 3D Dance Generation with PopDanceSet
Pub Date : 2024-05-06 DOI: arxiv-2405.03178
Zhenye Luo, Min Ren, Xuecai Hu, Yongzhen Huang, Li Yao
Generating dances that are both lifelike and well-aligned with music continues to be a challenging task in the cross-modal domain. This paper introduces PopDanceSet, the first dataset tailored to the preferences of young audiences, enabling the generation of aesthetically oriented dances. It surpasses the AIST++ dataset in music genre diversity and in the intricacy and depth of dance movements. Moreover, the proposed POPDG model within the iDDPM framework enhances dance diversity and, through the Space Augmentation Algorithm, strengthens spatial physical connections between human body joints, ensuring that increased diversity does not compromise generation quality. A streamlined Alignment Module is also designed to improve the temporal alignment between dance and music. Extensive experiments show that POPDG achieves SOTA results on two datasets. Furthermore, the paper also expands on current evaluation metrics. The dataset and code are available at https://github.com/Luke-Luo1/POPDG.
Citations: 0
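POPDG builds on the iDDPM (improved denoising diffusion) framework; the abstract does not detail the Space Augmentation Algorithm or the Alignment Module, but the conditional-diffusion training step such a model relies on can be sketched as below. The pose and music-feature dimensionalities, the noise schedule, and the placeholder denoiser are all assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # standard DDPM noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, poses, music_feats):
    """One conditional DDPM training step: add noise at a random timestep t,
    then ask the denoiser to predict that noise given the music condition."""
    b = poses.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(poses)
    noisy = a_bar.sqrt() * poses + (1 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, music_feats)        # predicts the added noise
    return nn.functional.mse_loss(pred, noise)

class DummyDenoiser(nn.Module):
    """Placeholder only; a real model would be a transformer over the motion sequence."""
    def __init__(self, pose_dim=147, music_dim=35):
        super().__init__()
        self.net = nn.Linear(pose_dim + music_dim + 1, pose_dim)
    def forward(self, noisy, t, music):
        t_feat = t.float().view(-1, 1, 1).expand(-1, noisy.shape[1], 1) / T
        music = music.unsqueeze(1).expand(-1, noisy.shape[1], -1)
        return self.net(torch.cat([noisy, music, t_feat], dim=-1))

# Assumed shapes: 120 motion frames of a 147-D pose vector, a 35-D music feature.
loss = diffusion_loss(DummyDenoiser(), torch.randn(4, 120, 147), torch.randn(4, 35))
```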
Transhuman Ansambl - Voice Beyond Language
Pub Date : 2024-05-06 DOI: arxiv-2405.03134
Lucija Ivsic, Jon McCormack, Vince Dziekan
In this paper we present the design and development of the Transhuman Ansambl, a novel interactive singing-voice interface which senses its environment and responds to vocal input with vocalisations using human voice. Designed for live performance with a human performer and as a standalone sound installation, the ansambl consists of sixteen bespoke virtual singers arranged in a circle. When performing live, the virtual singers listen to the human performer and respond to their singing by reading pitch, intonation and volume cues. In a standalone sound installation mode, singers use ultrasonic distance sensors to sense audience presence. Developed as part of the 1st author's practice-based PhD and artistic practice as a live performer, this work employs the singing-voice to explore voice interactions in HCI beyond language, and innovative ways of live performing. How is technology supporting the effect of intimacy produced through voice? Does the act of surrounding the audience with responsive virtual singers challenge the traditional roles of performer-listener? To answer these questions, we draw upon the 1st author's experience with the system, and the interdisciplinary field of voice studies that considers the voice as a sound medium independent of language, capable of enacting a reciprocal connection between bodies.
Citations: 0
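The virtual singers respond to pitch, intonation, and volume cues from the live performer; one minimal way to extract such cues from an audio buffer is sketched below with librosa. The frame sizes, pitch range, and the "slope" proxy for intonation are assumptions — the installation's actual analysis pipeline is not described in the abstract.

```python
import numpy as np
import librosa

def voice_cues(buffer, sr=22050):
    """Return a rough pitch (Hz), an intonation cue (pitch slope per frame),
    and a volume estimate (mean RMS) for one audio buffer."""
    f0, voiced_flag, _ = librosa.pyin(buffer,
                                      fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C6"),
                                      sr=sr)
    f0 = f0[voiced_flag]                                   # keep only voiced frames
    pitch = float(np.median(f0)) if f0.size else 0.0
    slope = float(np.polyfit(np.arange(f0.size), f0, 1)[0]) if f0.size > 1 else 0.0
    volume = float(librosa.feature.rms(y=buffer).mean())
    return pitch, slope, volume

# Example on a synthetic 440 Hz tone (a real system would read microphone buffers).
sr = 22050
tone = 0.2 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
print(voice_cues(tone, sr))
```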
Determined Multichannel Blind Source Separation with Clustered Source Model
Pub Date : 2024-05-06 DOI: arxiv-2405.03118
Jianyu Wang, Shanzheng Guan
The independent low-rank matrix analysis (ILRMA) method stands out as a prominent technique for multichannel blind audio source separation. It leverages nonnegative matrix factorization (NMF) and nonnegative canonical polyadic decomposition (NCPD) to model source parameters. While it effectively captures the low-rank structure of sources, the NMF model overlooks inter-channel dependencies. On the other hand, NCPD preserves intrinsic structure but lacks interpretable latent factors, making it challenging to incorporate prior information as constraints. To address these limitations, we introduce a clustered source model based on nonnegative block-term decomposition (NBTD). This model defines blocks as outer products of vectors (clusters) and matrices (for spectral structure modeling), offering interpretable latent vectors. Moreover, it enables straightforward integration of orthogonality constraints to ensure independence among source images. Experimental results demonstrate that our proposed method outperforms ILRMA and its extensions in anechoic conditions and surpasses the original ILRMA in simulated reverberant environments.
Citations: 0
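For readers unfamiliar with the models being contrasted, the standard ILRMA demixing and NMF source model, and the generic (L, L, 1) block-term decomposition that NBTD-style models build on, can be written as below. The paper's exact clustered formulation is not reproduced here; the third line is only the general block-term form, stated as an assumption about the family of models involved.

```latex
% Demixing: estimated sources at frequency bin i, time frame j, from mixture x.
\hat{\boldsymbol{y}}_{ij} = \boldsymbol{W}_i \, \boldsymbol{x}_{ij}

% ILRMA models the n-th source's power spectrogram with a rank-K NMF variance model:
r_{ij,n} = \sum_{k=1}^{K} t_{ik,n} \, v_{kj,n},
\qquad t_{ik,n} \ge 0,\; v_{kj,n} \ge 0

% Generic nonnegative (L,L,1) block-term decomposition of a tensor
% \mathcal{X} \in \mathbb{R}_{\ge 0}^{I \times J \times N}, with R blocks:
\mathcal{X} \approx \sum_{r=1}^{R} \bigl( \boldsymbol{A}_r \boldsymbol{B}_r^{\top} \bigr) \circ \boldsymbol{c}_r
```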
Whispy: Adapting STT Whisper Models to Real-Time Environments
Pub Date : 2024-05-06 DOI: arxiv-2405.03484
Antonio Bevilacqua, Paolo Saviano, Alessandro Amirante, Simon Pietro Romano
Large general-purpose transformer models have recently become the mainstay in the realm of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed to be used in real-time conditions, and this limitation makes them unsuitable for a vast plethora of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the Whisper pretrained models. As a result of a number of architectural optimisations, Whispy is able to consume live audio streams and generate high-level, coherent voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy impacts on the Whisper output. Experimental results show how Whispy excels in robustness, promptness, and accuracy.
Citations: 0
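Whispy's internal transcription mechanism is not detailed in the abstract; a naive baseline for pushing Whisper toward live use is to repeatedly transcribe a sliding buffer of the incoming stream with the open-source openai-whisper package, as sketched below. The window length, hop, and audio source are assumptions, and this is a baseline illustration rather than Whispy itself.

```python
import numpy as np
import whisper   # pip install openai-whisper

model = whisper.load_model("base")
SR = 16000                      # Whisper expects 16 kHz mono float32 audio
WINDOW_S, HOP_S = 8.0, 2.0      # assumed: re-transcribe an 8 s window every 2 s

def transcribe_stream(chunks):
    """chunks: iterable of float32 numpy arrays arriving from a live source.
    Yields an updated transcription of the most recent window after each hop."""
    buffer = np.zeros(0, dtype=np.float32)
    since_last = 0.0
    for chunk in chunks:
        buffer = np.concatenate([buffer, chunk])[-int(WINDOW_S * SR):]
        since_last += len(chunk) / SR
        if since_last >= HOP_S:
            since_last = 0.0
            result = model.transcribe(buffer, fp16=False, language="en")
            yield result["text"].strip()

# Example with silent placeholder chunks (a real system would read a microphone).
fake_chunks = [np.zeros(SR // 2, dtype=np.float32) for _ in range(20)]
for text in transcribe_stream(fake_chunks):
    print(text)
```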
Deep Space Separable Distillation for Lightweight Acoustic Scene Classification
Pub Date : 2024-05-06 DOI: arxiv-2405.03567
ShuQi Ye, Yuan Tian
Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are currently not lightweight enough, and their performance is not satisfactory. To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC, including Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to the currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.
Citations: 0
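The high-low frequency decomposition of the log-mel spectrogram and a separable-convolution operator can be illustrated roughly as below. The split point, channel counts, and the realization of SC as a depthwise-plus-pointwise convolution are assumptions, since the abstract does not define the SC, OSC, and SPC operators precisely.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Assumed SC operator: a depthwise conv followed by a 1x1 pointwise conv,
    which costs far fewer multiply-adds than a full convolution."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def split_high_low(log_mel, split_bin=64):
    """Split a (batch, 1, mel_bins, frames) log-mel spectrogram into low- and
    high-frequency halves along the mel axis (the split point is an assumption)."""
    return log_mel[:, :, :split_bin, :], log_mel[:, :, split_bin:, :]

x = torch.randn(8, 1, 128, 431)           # e.g. 128 mel bins, roughly 10 s of frames
low, high = split_high_low(x)
feats_low = SeparableConv2d(1, 16)(low)   # lightweight branch on low frequencies
feats_high = SeparableConv2d(1, 16)(high)
print(feats_low.shape, feats_high.shape)
```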
Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models
Pub Date : 2024-05-05 DOI: arxiv-2405.02801
Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose a multi-modal music generation framework, Mozart's Touch. It can generate music aligned with cross-modality inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problems between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at: https://github.com/WangTooNaive/MozartsTouch
Citations: 0
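The three-stage pipeline described above (captioning, then LLM bridging, then music generation) can be expressed as a thin orchestration layer; every component name below is a hypothetical placeholder standing in for whichever pre-trained captioner, LLM, and text-to-music model one plugs in — it is not the repository's actual API.

```python
from dataclasses import dataclass

@dataclass
class MozartsTouchStylePipeline:
    """Hedged sketch of a training-free multi-modal music generation pipeline.
    The three callables are assumed interfaces for pre-trained models."""
    caption_model: callable      # image/video -> descriptive caption text
    llm: callable                # prompt text -> text (the bridging step)
    music_model: callable        # music-style prompt -> audio waveform

    def generate(self, media):
        caption = self.caption_model(media)
        # Bridge heterogeneous descriptions: ask the LLM to rewrite a visual
        # caption into a musical description the music model understands.
        bridge_prompt = (
            "Rewrite this scene description as a short music prompt "
            f"(mood, tempo, instrumentation): {caption}"
        )
        music_prompt = self.llm(bridge_prompt)
        return self.music_model(music_prompt)

# Usage with stubbed components (real ones would be pre-trained models).
pipeline = MozartsTouchStylePipeline(
    caption_model=lambda media: "a crowded night market lit by paper lanterns",
    llm=lambda prompt: "upbeat lo-fi with plucked strings, 90 bpm, warm and busy",
    music_model=lambda prompt: f"<waveform generated from: {prompt}>",
)
print(pipeline.generate("night_market.jpg"))
```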
Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction
Pub Date : 2024-05-05 DOI: arxiv-2405.02821
Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman
Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. Sound differs from light in that it spans a much wider range of frequencies and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between the simulation and the real world by training AFP models that only take a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves the performance on the real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely from simulation, and transferring them to the real world.
Citations: 0
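The frequency-adaptive strategy above picks the subband used for acoustic field prediction by trading off the measured sim-to-real spectral gap against the energy of the received audio; one plausible scoring rule is sketched below. The band edges, the weighting factor, and the gap values are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_band(audio, sim_real_gap, sr=16000, n_bands=4, alpha=1.0):
    """Score each frequency subband by its normalized energy in the received audio
    minus a penalty proportional to the measured sim-to-real spectral gap, and
    return the index of the best band. A hedged sketch of the selection idea only."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    edges = np.linspace(0, sr / 2, n_bands + 1)
    scores = []
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        energy = spectrum[mask].sum() / (spectrum.sum() + 1e-12)
        scores.append(energy - alpha * sim_real_gap[b])
    return int(np.argmax(scores))

# Example: a 1 kHz tone with a hypothetical per-band sim-to-real gap measurement.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 1000 * t)
gap = np.array([0.1, 0.05, 0.3, 0.6])   # assumed gap per band (low -> high frequency)
print(select_band(audio, gap, sr))
```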