We present a framework to recognize Parkinson's disease (PD) from speech recordings of an English pangram utterance, collected through a web application in diverse recording settings and environments, including participants' homes. Our dataset comprises a global cohort of 1306 participants, 392 of whom were diagnosed with PD. Leveraging the diversity of the dataset, which spans various demographic properties (such as age, sex, and ethnicity), we used deep learning embeddings derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind to represent the speech dynamics associated with PD. Our novel fusion model for PD classification, which aligns the different speech embeddings into a cohesive feature space, demonstrated superior performance over standard concatenation-based fusion models and other baselines (including models built on traditional acoustic features). In a randomized data-split configuration, the model achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis confirmed that our model performs equitably across demographic subgroups defined by sex, ethnicity, and age, and remains robust regardless of disease duration. Furthermore, when tested on two entirely unseen test datasets, collected in clinical settings and at a PD care center, the model maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the model's robustness and its potential to enhance accessibility and health equity in real-world applications.
{"title":"A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings","authors":"Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu, Ekram Hossain, Sooyong Park, MD Saiful Islam, Ehsan Hoque","doi":"arxiv-2405.17206","DOIUrl":"https://doi.org/arxiv-2405.17206","url":null,"abstract":"We present a framework to recognize Parkinson's disease (PD) through an\u0000English pangram utterance speech collected using a web application from diverse\u0000recording settings and environments, including participants' homes. Our dataset\u0000includes a global cohort of 1306 participants, including 392 diagnosed with PD.\u0000Leveraging the diversity of the dataset, spanning various demographic\u0000properties (such as age, sex, and ethnicity), we used deep learning embeddings\u0000derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind\u0000representing the speech dynamics associated with PD. Our novel fusion model for\u0000PD classification, which aligns different speech embeddings into a cohesive\u0000feature space, demonstrated superior performance over standard\u0000concatenation-based fusion models and other baselines (including models built\u0000on traditional acoustic features). In a randomized data split configuration,\u0000the model achieved an Area Under the Receiver Operating Characteristic Curve\u0000(AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis\u0000confirmed that our model performs equitably across various demographic\u0000subgroups in terms of sex, ethnicity, and age, and remains robust regardless of\u0000disease duration. Furthermore, our model, when tested on two entirely unseen\u0000test datasets collected from clinical settings and from a PD care center,\u0000maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the\u0000model's robustness and it's potential to enhance accessibility and health\u0000equity in real-world applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ziyue Piao, Christian Frisson, Bavo Van Kerrebroeck, Marcelo M. Wanderley
This paper investigates the integration of force feedback in Digital Musical Instruments (DMI), specifically evaluating the reproduction of intricate vibrato techniques using haptic feedback controllers. We introduce our system for vibrato modulation using force feedback, composed of Bend-aid (a web-based sequencer platform using pre-designed haptic feedback models) and TorqueTuner (an open-source 1 Degree-of-Freedom (DoF) rotary haptic device for generating programmable haptic effects). We designed a formal user study to assess the impact of each haptic mode on user experience in a vibrato mimicry task. Twenty musically trained participants rated their user experience for the three haptic modes (Smooth, Detent, and Spring) using four Likert-scale scores: comfort, flexibility, ease of control, and helpfulness for the task. Finally, we asked participants to share their reflections. Our research indicates that while the Spring mode can help with light vibrato, preferences for haptic modes vary based on musical training background. This emphasizes the need for adaptable task interfaces and flexible haptic feedback in DMI design.
{"title":"Enhancing DMI Interactions by Integrating Haptic Feedback for Intricate Vibrato Technique","authors":"Ziyue Piao, Christian Frisson, Bavo Van Kerrebroeck, Marcelo M. Wanderley","doi":"arxiv-2405.10502","DOIUrl":"https://doi.org/arxiv-2405.10502","url":null,"abstract":"This paper investigates the integration of force feedback in Digital Musical\u0000Instruments (DMI), specifically evaluating the reproduction of intricate\u0000vibrato techniques using haptic feedback controllers. We introduce our system\u0000for vibrato modulation using force feedback, composed of Bend-aid (a web-based\u0000sequencer platform using pre-designed haptic feedback models) and TorqueTuner\u0000(an open-source 1 Degree-of-Freedom (DoF) rotary haptic device for generating\u0000programmable haptic effects). We designed a formal user study to assess the\u0000impact of each haptic mode on user experience in a vibrato mimicry task. Twenty\u0000musically trained participants rated their user experience for the three haptic\u0000modes (Smooth, Detent, and Spring) using four Likert-scale scores: comfort,\u0000flexibility, ease of control, and helpfulness for the task. Finally, we asked\u0000participants to share their reflections. Our research indicates that while the\u0000Spring mode can help with light vibrato, preferences for haptic modes vary\u0000based on musical training background. This emphasizes the need for adaptable\u0000task interfaces and flexible haptic feedback in DMI design.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analog electronic circuits are at the core of an important category of musical devices. The nonlinear features of their electronic components give analog musical devices a distinctive timbre and sound quality, making them highly desirable. Artificial neural networks, particularly recurrent networks, have rapidly gained popularity for the emulation of analog audio effects circuits. While neural approaches have been successful in accurately modeling distortion circuits, they require architectural improvements that account for parameter conditioning and low-latency response. In this article, we explore the application of recent machine learning advancements to virtual analog modeling. We compare State-Space models and Linear Recurrent Units against the more common Long Short-Term Memory networks. These architectures have shown promising results in sequence-to-sequence modeling tasks, with notable improvements in encoding signal history. Our comparative study applies these black-box neural modeling techniques to a variety of audio effects. We evaluate their performance and limitations using multiple metrics intended to assess the models' ability to accurately replicate energy envelopes, frequency content, and transients in the audio signal. To incorporate control parameters we employ the Feature-wise Linear Modulation method. Long Short-Term Memory networks exhibit better accuracy in emulating distortions and equalizers, while the State-Space model, followed by Long Short-Term Memory networks integrated in an encoder-decoder structure, outperforms the others in emulating saturation and compression. When considering long time-variant characteristics, the State-Space model demonstrates the greatest accuracy. The Long Short-Term Memory and, in particular, Linear Recurrent Unit networks are more prone to introducing audio artifacts.
{"title":"Comparative Study of Recurrent Neural Networks for Virtual Analog Audio Effects Modeling","authors":"Riccardo Simionato, Stefano Fasciani","doi":"arxiv-2405.04124","DOIUrl":"https://doi.org/arxiv-2405.04124","url":null,"abstract":"Analog electronic circuits are at the core of an important category of\u0000musical devices. The nonlinear features of their electronic components give\u0000analog musical devices a distinctive timbre and sound quality, making them\u0000highly desirable. Artificial neural networks have rapidly gained popularity for\u0000the emulation of analog audio effects circuits, particularly recurrent\u0000networks. While neural approaches have been successful in accurately modeling\u0000distortion circuits, they require architectural improvements that account for\u0000parameter conditioning and low latency response. In this article, we explore\u0000the application of recent machine learning advancements for virtual analog\u0000modeling. We compare State Space models and Linear Recurrent Units against the\u0000more common Long Short Term Memory networks. These have shown promising ability\u0000in sequence to sequence modeling tasks, showing a notable improvement in signal\u0000history encoding. Our comparative study uses these black box neural modeling\u0000techniques with a variety of audio effects. We evaluate the performance and\u0000limitations using multiple metrics aiming to assess the models' ability to\u0000accurately replicate energy envelopes, frequency contents, and transients in\u0000the audio signal. To incorporate control parameters we employ the Feature wise\u0000Linear Modulation method. Long Short Term Memory networks exhibit better\u0000accuracy in emulating distortions and equalizers, while the State Space model,\u0000followed by Long Short Term Memory networks when integrated in an encoder\u0000decoder structure, outperforms others in emulating saturation and compression.\u0000When considering long time variant characteristics, the State Space model\u0000demonstrates the greatest accuracy. The Long Short Term Memory and, in\u0000particular, Linear Recurrent Unit networks present more tendency to introduce\u0000audio artifacts.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140927682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenye Luo, Min Ren, Xuecai Hu, Yongzhen Huang, Li Yao
Generating dances that are both lifelike and well aligned with music continues to be a challenging task in the cross-modal domain. This paper introduces PopDanceSet, the first dataset tailored to the preferences of young audiences, enabling the generation of aesthetically oriented dances. It surpasses the AIST++ dataset in music genre diversity and in the intricacy and depth of dance movements. Moreover, the proposed POPDG model within the iDDPM framework enhances dance diversity and, through the Space Augmentation Algorithm, strengthens spatial physical connections between human body joints, ensuring that increased diversity does not compromise generation quality. A streamlined Alignment Module is also designed to improve the temporal alignment between dance and music. Extensive experiments show that POPDG achieves SOTA results on two datasets. Furthermore, the paper also expands on current evaluation metrics. The dataset and code are available at https://github.com/Luke-Luo1/POPDG.
{"title":"POPDG: Popular 3D Dance Generation with PopDanceSet","authors":"Zhenye Luo, Min Ren, Xuecai Hu, Yongzhen Huang, Li Yao","doi":"arxiv-2405.03178","DOIUrl":"https://doi.org/arxiv-2405.03178","url":null,"abstract":"Generating dances that are both lifelike and well-aligned with music\u0000continues to be a challenging task in the cross-modal domain. This paper\u0000introduces PopDanceSet, the first dataset tailored to the preferences of young\u0000audiences, enabling the generation of aesthetically oriented dances. And it\u0000surpasses the AIST++ dataset in music genre diversity and the intricacy and\u0000depth of dance movements. Moreover, the proposed POPDG model within the iDDPM\u0000framework enhances dance diversity and, through the Space Augmentation\u0000Algorithm, strengthens spatial physical connections between human body joints,\u0000ensuring that increased diversity does not compromise generation quality. A\u0000streamlined Alignment Module is also designed to improve the temporal alignment\u0000between dance and music. Extensive experiments show that POPDG achieves SOTA\u0000results on two datasets. Furthermore, the paper also expands on current\u0000evaluation metrics. The dataset and code are available at\u0000https://github.com/Luke-Luo1/POPDG.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we present the design and development of the Transhuman Ansambl, a novel interactive singing-voice interface that senses its environment and responds to vocal input with vocalisations of its own, rendered with human voice. Designed both for live performance with a human performer and as a standalone sound installation, the ansambl consists of sixteen bespoke virtual singers arranged in a circle. When performing live, the virtual singers listen to the human performer and respond to their singing by reading pitch, intonation and volume cues. In standalone sound installation mode, the singers use ultrasonic distance sensors to sense audience presence. Developed as part of the first author's practice-based PhD and artistic practice as a live performer, this work employs the singing voice to explore voice interactions in HCI beyond language, as well as innovative ways of performing live. How does technology support the effect of intimacy produced through voice? Does the act of surrounding the audience with responsive virtual singers challenge the traditional roles of performer and listener? To answer these questions, we draw upon the first author's experience with the system and on the interdisciplinary field of voice studies, which considers the voice as a sound medium independent of language, capable of enacting a reciprocal connection between bodies.
{"title":"Transhuman Ansambl - Voice Beyond Language","authors":"Lucija Ivsic, Jon McCormack, Vince Dziekan","doi":"arxiv-2405.03134","DOIUrl":"https://doi.org/arxiv-2405.03134","url":null,"abstract":"In this paper we present the design and development of the Transhuman\u0000Ansambl, a novel interactive singing-voice interface which senses its\u0000environment and responds to vocal input with vocalisations using human voice.\u0000Designed for live performance with a human performer and as a standalone sound\u0000installation, the ansambl consists of sixteen bespoke virtual singers arranged\u0000in a circle. When performing live, the virtual singers listen to the human\u0000performer and respond to their singing by reading pitch, intonation and volume\u0000cues. In a standalone sound installation mode, singers use ultrasonic distance\u0000sensors to sense audience presence. Developed as part of the 1st author's\u0000practice-based PhD and artistic practice as a live performer, this work employs\u0000the singing-voice to explore voice interactions in HCI beyond language, and\u0000innovative ways of live performing. How is technology supporting the effect of\u0000intimacy produced through voice? Does the act of surrounding the audience with\u0000responsive virtual singers challenge the traditional roles of\u0000performer-listener? To answer these questions, we draw upon the 1st author's\u0000experience with the system, and the interdisciplinary field of voice studies\u0000that consider the voice as the sound medium independent of language, capable of\u0000enacting a reciprocal connection between bodies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The independent low-rank matrix analysis (ILRMA) method stands out as a prominent technique for multichannel blind audio source separation. It leverages nonnegative matrix factorization (NMF) and nonnegative canonical polyadic decomposition (NCPD) to model source parameters. While it effectively captures the low-rank structure of sources, the NMF model overlooks inter-channel dependencies. On the other hand, NCPD preserves intrinsic structure but lacks interpretable latent factors, making it challenging to incorporate prior information as constraints. To address these limitations, we introduce a clustered source model based on nonnegative block-term decomposition (NBTD). This model defines blocks as outer products of vectors (clusters) and matrices (for spectral structure modeling), offering interpretable latent vectors. Moreover, it enables straightforward integration of orthogonality constraints to ensure independence among source images. Experimental results demonstrate that our proposed method outperforms ILRMA and its extensions in anechoic conditions and surpasses the original ILRMA in simulated reverberant environments.
{"title":"Determined Multichannel Blind Source Separation with Clustered Source Model","authors":"Jianyu Wang, Shanzheng Guan","doi":"arxiv-2405.03118","DOIUrl":"https://doi.org/arxiv-2405.03118","url":null,"abstract":"The independent low-rank matrix analysis (ILRMA) method stands out as a\u0000prominent technique for multichannel blind audio source separation. It\u0000leverages nonnegative matrix factorization (NMF) and nonnegative canonical\u0000polyadic decomposition (NCPD) to model source parameters. While it effectively\u0000captures the low-rank structure of sources, the NMF model overlooks\u0000inter-channel dependencies. On the other hand, NCPD preserves intrinsic\u0000structure but lacks interpretable latent factors, making it challenging to\u0000incorporate prior information as constraints. To address these limitations, we\u0000introduce a clustered source model based on nonnegative block-term\u0000decomposition (NBTD). This model defines blocks as outer products of vectors\u0000(clusters) and matrices (for spectral structure modeling), offering\u0000interpretable latent vectors. Moreover, it enables straightforward integration\u0000of orthogonality constraints to ensure independence among source images.\u0000Experimental results demonstrate that our proposed method outperforms ILRMA and\u0000its extensions in anechoic conditions and surpasses the original ILRMA in\u0000simulated reverberant environments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antonio Bevilacqua, Paolo Saviano, Alessandro Amirante, Simon Pietro Romano
Large general-purpose transformer models have recently become the mainstay of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed to be used in real-time conditions, and this limitation makes them unsuitable for a wide range of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the pretrained Whisper models. As a result of a number of architectural optimisations, Whispy is able to consume live audio streams and generate high-level, coherent voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy impacts the Whisper output. Experimental results show that Whispy excels in robustness, promptness, and accuracy.
{"title":"Whispy: Adapting STT Whisper Models to Real-Time Environments","authors":"Antonio Bevilacqua, Paolo Saviano, Alessandro Amirante, Simon Pietro Romano","doi":"arxiv-2405.03484","DOIUrl":"https://doi.org/arxiv-2405.03484","url":null,"abstract":"Large general-purpose transformer models have recently become the mainstay in\u0000the realm of speech analysis. In particular, Whisper achieves state-of-the-art\u0000results in relevant tasks such as speech recognition, translation, language\u0000identification, and voice activity detection. However, Whisper models are not\u0000designed to be used in real-time conditions, and this limitation makes them\u0000unsuitable for a vast plethora of practical applications. In this paper, we\u0000introduce Whispy, a system intended to bring live capabilities to the Whisper\u0000pretrained models. As a result of a number of architectural optimisations,\u0000Whispy is able to consume live audio streams and generate high level, coherent\u0000voice transcriptions, while still maintaining a low computational cost. We\u0000evaluate the performance of our system on a large repository of publicly\u0000available speech datasets, investigating how the transcription mechanism\u0000introduced by Whispy impacts on the Whisper output. Experimental results show\u0000how Whispy excels in robustness, promptness, and accuracy.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are currently neither lightweight enough nor satisfactory in performance. To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC: Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% over currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.
{"title":"Deep Space Separable Distillation for Lightweight Acoustic Scene Classification","authors":"ShuQi Ye, Yuan Tian","doi":"arxiv-2405.03567","DOIUrl":"https://doi.org/arxiv-2405.03567","url":null,"abstract":"Acoustic scene classification (ASC) is highly important in the real world.\u0000Recently, deep learning-based methods have been widely employed for acoustic\u0000scene classification. However, these methods are currently not lightweight\u0000enough as well as their performance is not satisfactory. To solve these\u0000problems, we propose a deep space separable distillation network. Firstly, the\u0000network performs high-low frequency decomposition on the log-mel spectrogram,\u0000significantly reducing computational complexity while maintaining model\u0000performance. Secondly, we specially design three lightweight operators for ASC,\u0000including Separable Convolution (SC), Orthonormal Separable Convolution (OSC),\u0000and Separable Partial Convolution (SPC). These operators exhibit highly\u0000efficient feature extraction capabilities in acoustic scene classification\u0000tasks. The experimental results demonstrate that the proposed method achieves a\u0000performance gain of 9.8% compared to the currently popular deep learning\u0000methods, while also having smaller parameter count and computational\u0000complexity.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations of the proposed model, and the results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at: https://github.com/WangTooNaive/MozartsTouch
{"title":"Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models","authors":"Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu","doi":"arxiv-2405.02801","DOIUrl":"https://doi.org/arxiv-2405.02801","url":null,"abstract":"In recent years, AI-Generated Content (AIGC) has witnessed rapid\u0000advancements, facilitating the generation of music, images, and other forms of\u0000artistic expression across various industries. However, researches on general\u0000multi-modal music generation model remain scarce. To fill this gap, we propose\u0000a multi-modal music generation framework Mozart's Touch. It could generate\u0000aligned music with the cross-modality inputs, such as images, videos and text.\u0000Mozart's Touch is composed of three main components: Multi-modal Captioning\u0000Module, Large Language Model (LLM) Understanding & Bridging Module, and Music\u0000Generation Module. Unlike traditional approaches, Mozart's Touch requires no\u0000training or fine-tuning pre-trained models, offering efficiency and\u0000transparency through clear, interpretable prompts. We also introduce\u0000\"LLM-Bridge\" method to resolve the heterogeneous representation problems\u0000between descriptive texts of different modalities. We conduct a series of\u0000objective and subjective evaluations on the proposed model, and results\u0000indicate that our model surpasses the performance of current state-of-the-art\u0000models. Our codes and examples is availble at:\u0000https://github.com/WangTooNaive/MozartsTouch","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman
Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks end-to-end in simulation. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. Sound differs from light in that it spans a much wider range of frequencies, and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between the simulation and the real world by training AFP models that take only a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves performance on real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely in simulation, and of transferring them to the real world.
{"title":"Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction","authors":"Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman","doi":"arxiv-2405.02821","DOIUrl":"https://doi.org/arxiv-2405.02821","url":null,"abstract":"Sim2real transfer has received increasing attention lately due to the success\u0000of learning robotic tasks in simulation end-to-end. While there has been a lot\u0000of progress in transferring vision-based navigation policies, the existing\u0000sim2real strategy for audio-visual navigation performs data augmentation\u0000empirically without measuring the acoustic gap. The sound differs from light in\u0000that it spans across much wider frequencies and thus requires a different\u0000solution for sim2real. We propose the first treatment of sim2real for\u0000audio-visual navigation by disentangling it into acoustic field prediction\u0000(AFP) and waypoint navigation. We first validate our design choice in the\u0000SoundSpaces simulator and show improvement on the Continuous AudioGoal\u0000navigation benchmark. We then collect real-world data to measure the spectral\u0000difference between the simulation and the real world by training AFP models\u0000that only take a specific frequency subband as input. We further propose a\u0000frequency-adaptive strategy that intelligently selects the best frequency band\u0000for prediction based on both the measured spectral difference and the energy\u0000distribution of the received audio, which improves the performance on the real\u0000data. Lastly, we build a real robot platform and show that the transferred\u0000policy can successfully navigate to sounding objects. This work demonstrates\u0000the potential of building intelligent agents that can see, hear, and act\u0000entirely from simulation, and transferring them to the real world.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}