
arXiv - CS - Sound: Latest Publications

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model
Pub Date : 2024-05-02 DOI: arxiv-2405.01730
Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman
Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available.
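A minimal sketch of the kind of conditional DDPM training step the abstract describes: a denoiser predicts the injected noise from a noisy mel frame plus content, emotion, and speaker conditioning. All module names, feature dimensions, and the concatenation-based conditioning are illustrative assumptions, not the authors' architecture.

```python
# Minimal, illustrative conditional-DDPM training step for expressive VC.
# Dimensions, module names, and the conditioning scheme are assumptions,
# not the implementation described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Predicts the added noise from a noisy mel frame plus conditioning."""
    def __init__(self, n_mels=80, unit_dim=256, emo_dim=128, spk_dim=192, hidden=512):
        super().__init__()
        cond_dim = unit_dim + emo_dim + spk_dim + 1  # +1 for the timestep feature
        self.net = nn.Sequential(
            nn.Linear(n_mels + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, noisy_mel, t, units, emo, spk):
        t_feat = t.float().unsqueeze(-1) / 1000.0          # crude timestep encoding
        cond = torch.cat([units, emo, spk, t_feat], dim=-1)
        return self.net(torch.cat([noisy_mel, cond], dim=-1))

def ddpm_training_step(model, mel, units, emo, spk, alphas_cumprod):
    """One DDPM training step: corrupt at a random timestep, predict the noise."""
    b = mel.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (b,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy_mel, t, units, emo, spk), noise)

# Toy usage with random tensors standing in for real features.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = ConditionalDenoiser()
loss = ddpm_training_step(
    model,
    mel=torch.randn(4, 80), units=torch.randn(4, 256),
    emo=torch.randn(4, 128), spk=torch.randn(4, 192),
    alphas_cumprod=alphas_cumprod,
)
loss.backward()
```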
{"title":"Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model","authors":"Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman","doi":"arxiv-2405.01730","DOIUrl":"https://doi.org/arxiv-2405.01730","url":null,"abstract":"Expressive voice conversion (VC) conducts speaker identity conversion for\u0000emotional speakers by jointly converting speaker identity and emotional style.\u0000Emotional style modeling for arbitrary speakers in expressive VC has not been\u0000extensively explored. Previous approaches have relied on vocoders for speech\u0000reconstruction, which makes speech quality heavily dependent on the performance\u0000of vocoders. A major challenge of expressive VC lies in emotion prosody\u0000modeling. To address these challenges, this paper proposes a fully end-to-end\u0000expressive VC framework based on a conditional denoising diffusion\u0000probabilistic model (DDPM). We utilize speech units derived from\u0000self-supervised speech models as content conditioning, along with deep features\u0000extracted from speech emotion recognition and speaker verification systems to\u0000model emotional style and speaker identity. Objective and subjective\u0000evaluations show the effectiveness of our framework. Codes and samples are\u0000publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
Pub Date : 2024-04-28 DOI: arxiv-2404.18094
Wenbin Wang, Yang Song, Sanjay Jha
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as "instant" and "fine-grained" adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.
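For the few-shot ("fine-grained") side, a common way to keep per-speaker storage small is to freeze the backbone and train only small residual adapters. The sketch below shows that generic pattern under assumed dimensions; it is not the two USAT adapters or their adaptation procedure.

```python
# Generic bottleneck adapter for parameter-efficient speaker adaptation.
# A common pattern for few-shot adaptation; the actual USAT adapters and
# adaptation procedure are not reproduced here.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted after a frozen backbone layer."""
    def __init__(self, dim=512, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact

class AdaptedLayer(nn.Module):
    """Wraps a frozen backbone layer with a trainable adapter."""
    def __init__(self, backbone_layer, dim=512):
        super().__init__()
        self.backbone = backbone_layer
        for p in self.backbone.parameters():
            p.requires_grad = False       # only the adapter is updated per speaker
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return self.adapter(self.backbone(x))

layer = AdaptedLayer(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # only adapter.* parameters, which bounds storage per new speaker
```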
{"title":"USAT: A Universal Speaker-Adaptive Text-to-Speech Approach","authors":"Wenbin Wang, Yang Song, Sanjay Jha","doi":"arxiv-2404.18094","DOIUrl":"https://doi.org/arxiv-2404.18094","url":null,"abstract":"Conventional text-to-speech (TTS) research has predominantly focused on\u0000enhancing the quality of synthesized speech for speakers in the training\u0000dataset. The challenge of synthesizing lifelike speech for unseen,\u0000out-of-dataset speakers, especially those with limited reference data, remains\u0000a significant and unresolved problem. While zero-shot or few-shot\u0000speaker-adaptive TTS approaches have been explored, they have many limitations.\u0000Zero-shot approaches tend to suffer from insufficient generalization\u0000performance to reproduce the voice of speakers with heavy accents. While\u0000few-shot methods can reproduce highly varying accents, they bring a significant\u0000storage burden and the risk of overfitting and catastrophic forgetting. In\u0000addition, prior approaches only provide either zero-shot or few-shot\u0000adaptation, constraining their utility across varied real-world scenarios with\u0000different demands. Besides, most current evaluations of speaker-adaptive TTS\u0000are conducted only on datasets of native speakers, inadvertently neglecting a\u0000vast portion of non-native speakers with diverse accents. Our proposed\u0000framework unifies both zero-shot and few-shot speaker adaptation strategies,\u0000which we term as \"instant\" and \"fine-grained\" adaptations based on their\u0000merits. To alleviate the insufficient generalization performance observed in\u0000zero-shot speaker adaptation, we designed two innovative discriminators and\u0000introduced a memory mechanism for the speech decoder. To prevent catastrophic\u0000forgetting and reduce storage implications for few-shot speaker adaptation, we\u0000designed two adapters and a unique adaptation procedure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition
Pub Date : 2024-04-26 DOI: arxiv-2404.17252
Houtan Ghaffari, Paul Devos
Transferring the weights of a pre-trained model to assist another task has become a crucial part of modern deep learning, particularly in data-scarce scenarios. Pre-training refers to the initial step of training models outside the current task of interest, typically on another dataset. It can be done via supervised models using human-annotated datasets or self-supervised models trained on unlabeled datasets. In both cases, many pre-trained models are available to fine-tune for the task of interest. Interestingly, research has shown that pre-trained models from ImageNet can be helpful for audio tasks despite being trained on image datasets. Hence, it's unclear whether in-domain models would be advantageous compared to competent out-domain models, such as convolutional neural networks from ImageNet. Our experiments will demonstrate the usefulness of in-domain models and datasets for bird species recognition by leveraging VICReg, a recent and powerful self-supervised method.
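VICReg's published objective combines an invariance term, a variance hinge, and a covariance penalty. Below is a compact reimplementation for illustration only, using commonly quoted default weights; the paper's exact pre-training setup is not reproduced.

```python
# VICReg objective in brief: invariance (MSE), variance (hinge on per-dim std),
# and covariance (off-diagonal suppression) between two augmented views.
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    n, d = z_a.shape
    inv = F.mse_loss(z_a, z_b)                                   # invariance term

    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.relu(gamma - std_a).mean() + torch.relu(gamma - std_b).mean()

    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return sim_w * inv + var_w * var + cov_w * cov

# z_a and z_b would be embeddings of two augmented views of the same bird recording.
loss = vicreg_loss(torch.randn(256, 128), torch.randn(256, 128))
```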
{"title":"Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition","authors":"Houtan Ghaffari, Paul Devos","doi":"arxiv-2404.17252","DOIUrl":"https://doi.org/arxiv-2404.17252","url":null,"abstract":"Transferring the weights of a pre-trained model to assist another task has\u0000become a crucial part of modern deep learning, particularly in data-scarce\u0000scenarios. Pre-training refers to the initial step of training models outside\u0000the current task of interest, typically on another dataset. It can be done via\u0000supervised models using human-annotated datasets or self-supervised models\u0000trained on unlabeled datasets. In both cases, many pre-trained models are\u0000available to fine-tune for the task of interest. Interestingly, research has\u0000shown that pre-trained models from ImageNet can be helpful for audio tasks\u0000despite being trained on image datasets. Hence, it's unclear whether in-domain\u0000models would be advantageous compared to competent out-domain models, such as\u0000convolutional neural networks from ImageNet. Our experiments will demonstrate\u0000the usefulness of in-domain models and datasets for bird species recognition by\u0000leveraging VICReg, a recent and powerful self-supervised method.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Speech Technology Services for Oral History Research
Pub Date : 2024-04-26 DOI: arxiv-2405.02333
Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka
Oral history is about oral sources of witnesses and commentators on historical events. Speech technology is an important instrument for processing such recordings in order to obtain transcriptions and further enhancements that structure the oral account. In this contribution we address the transcription portal and the web services associated with speech processing at BAS, speech solutions developed at LINDAT, how to do it yourself with Whisper, remaining challenges, and future developments.
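For the do-it-yourself route with Whisper mentioned above, the open-source openai-whisper package gives local transcription in a few lines; the model size and file name below are placeholders.

```python
# Transcribing an oral-history recording locally with openai-whisper
# (pip install openai-whisper); the file path is a placeholder.
import whisper

model = whisper.load_model("base")               # larger checkpoints trade speed for accuracy
result = model.transcribe("interview_1954.wav")  # language is auto-detected by default
print(result["text"])

# Segment-level timestamps help align the transcript with the recording.
for seg in result["segments"]:
    print(f'{seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"]}')
```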
{"title":"Speech Technology Services for Oral History Research","authors":"Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka","doi":"arxiv-2405.02333","DOIUrl":"https://doi.org/arxiv-2405.02333","url":null,"abstract":"Oral history is about oral sources of witnesses and commentors on historical\u0000events. Speech technology is an important instrument to process such recordings\u0000in order to obtain transcription and further enhancements to structure the oral\u0000account In this contribution we address the transcription portal and the\u0000webservices associated with speech processing at BAS, speech solutions\u0000developed at LINDAT, how to do it yourself with Whisper, remaining challenges,\u0000and future developments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Music Style Transfer With Diffusion Model
Pub Date : 2024-04-23 DOI: arxiv-2404.14771
Hong Huang, Yuyi Wang, Luyao Li, Jun Lin
Previous studies on music style transfer have mainly focused on one-to-one style conversion, which is relatively limited. When considering the conversion between multiple styles, previous methods required designing multiple modes to disentangle the complex style of the music, resulting in large computational costs and slow audio generation. The existing music style transfer methods generate spectrograms with artifacts, leading to significant noise in the generated audio. To address these issues, this study proposes a music style transfer framework based on diffusion models (DM) and uses spectrogram-based methods to achieve multi-to-multi music style transfer. The GuideDiff method is used to restore spectrograms to high-fidelity audio, accelerating audio generation speed and reducing noise in the generated audio. Experimental results show that our model has good performance in multi-mode music style transfer compared to the baseline and can generate high-quality audio in real-time on consumer-grade GPUs.
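As a point of reference for the spectrogram-to-audio step, the conventional baseline is a mel-spectrogram round trip with Griffin-Lim, sketched below with librosa (file paths and STFT parameters are placeholders). Diffusion-based vocoding such as GuideDiff aims to improve on the fidelity and noise of this kind of baseline.

```python
# Conventional mel-spectrogram analysis and Griffin-Lim inversion baseline.
# A style-transfer model would modify `mel` in between; here we only round-trip it.
import librosa
import soundfile as sf

y, sr = librosa.load("input_music.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("reconstructed.wav", y_hat, sr)
```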
{"title":"Music Style Transfer With Diffusion Model","authors":"Hong Huang, Yuyi Wang, Luyao Li, Jun Lin","doi":"arxiv-2404.14771","DOIUrl":"https://doi.org/arxiv-2404.14771","url":null,"abstract":"Previous studies on music style transfer have mainly focused on one-to-one\u0000style conversion, which is relatively limited. When considering the conversion\u0000between multiple styles, previous methods required designing multiple modes to\u0000disentangle the complex style of the music, resulting in large computational\u0000costs and slow audio generation. The existing music style transfer methods\u0000generate spectrograms with artifacts, leading to significant noise in the\u0000generated audio. To address these issues, this study proposes a music style\u0000transfer framework based on diffusion models (DM) and uses spectrogram-based\u0000methods to achieve multi-to-multi music style transfer. The GuideDiff method is\u0000used to restore spectrograms to high-fidelity audio, accelerating audio\u0000generation speed and reducing noise in the generated audio. Experimental\u0000results show that our model has good performance in multi-mode music style\u0000transfer compared to the baseline and can generate high-quality audio in\u0000real-time on consumer-grade GPUs.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140806306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone
Pub Date : 2024-04-21 DOI: arxiv-2404.15160
Jiabin Guo
This article discusses the application of single vector hydrophones in the field of underwater acoustic signal processing for Direction Of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this study introduces a Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This method involves reconstructing the signal model of a single vector hydrophone, converting its covariance matrix into a Toeplitz structure suitable for the Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it using the SPA algorithm to achieve more accurate DOA estimation. Through detailed simulation analysis, this research has confirmed the performance of the proposed algorithm in single and dual-target DOA estimation scenarios, especially under various signal-to-noise ratio (SNR) conditions. The simulation results show that, compared to traditional DOA estimation methods, this algorithm has significant advantages in estimation accuracy and resolution, particularly in multi-source signals and low SNR environments. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.
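One standard way to impose the Toeplitz structure mentioned above on a sample covariance matrix is to average its diagonals before handing it to a parametric estimator such as SPA. The sketch below shows that generic rectification step on synthetic snapshots; it is not the full VSRSPA pipeline.

```python
# Toeplitz rectification of a sample covariance matrix by diagonal averaging.
import numpy as np

def toeplitz_rectify(R):
    """Average each diagonal of R to obtain a Toeplitz-structured estimate."""
    n = R.shape[0]
    first_col = np.array([np.mean(np.diagonal(R, offset=-k)) for k in range(n)])
    first_row = np.array([np.mean(np.diagonal(R, offset=k)) for k in range(n)])
    T = np.empty_like(R)
    for i in range(n):
        for j in range(n):
            T[i, j] = first_row[j - i] if j >= i else first_col[i - j]
    return T

# Sample covariance from snapshots x (n_sensors x n_snapshots), then rectify.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200))
R_hat = (x @ x.conj().T) / x.shape[1]
R_toep = toeplitz_rectify(R_hat)
```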
{"title":"Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone","authors":"Jiabin Guo","doi":"arxiv-2404.15160","DOIUrl":"https://doi.org/arxiv-2404.15160","url":null,"abstract":"This article discusses the application of single vector hydrophones in the\u0000field of underwater acoustic signal processing for Direction Of Arrival (DOA)\u0000estimation. Addressing the limitations of traditional DOA estimation methods in\u0000multi-source environments and under noise interference, this study introduces a\u0000Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This\u0000method involves reconstructing the signal model of a single vector hydrophone,\u0000converting its covariance matrix into a Toeplitz structure suitable for the\u0000Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it\u0000using the SPA algorithm to achieve more accurate DOA estimation. Through\u0000detailed simulation analysis, this research has confirmed the performance of\u0000the proposed algorithm in single and dual-target DOA estimation scenarios,\u0000especially under various signal-to-noise ratio(SNR) conditions. The simulation\u0000results show that, compared to traditional DOA estimation methods, this\u0000algorithm has significant advantages in estimation accuracy and resolution,\u0000particularly in multi-source signals and low SNR environments. The contribution\u0000of this study lies in providing an effective new method for DOA estimation with\u0000single vector hydrophones in complex environments, introducing new research\u0000directions and solutions in the field of vector hydrophone signal processing.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data
Pub Date : 2024-04-19 DOI: arxiv-2404.13101
Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi
Image reconstruction is an essential step of every medical imaging method, including Photoacoustic Tomography (PAT), a promising imaging modality that unites the benefits of ultrasound and optical imaging. Reconstructing PAT images with conventional methods produces rough artifacts, especially when they are applied directly to sparse PAT data. In recent years, generative adversarial networks (GANs) have shown powerful performance in image generation as well as translation, making them a natural choice for reconstruction tasks. In this study, we propose an end-to-end method called DensePANet to solve the problem of PAT image reconstruction from sparse data. The proposed model employs a novel modification of UNet in its generator, called FD-UNet++, which considerably improves reconstruction performance. We evaluated the method on various in-vivo and simulated datasets. Quantitative and qualitative results show that our model outperforms other prevalent deep learning techniques.
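Reconstruction GANs of this kind typically train the generator with an adversarial term plus a pixel-wise reconstruction term. The sketch below shows that generic objective with stand-in networks and pix2pix-style weighting; the FD-UNet++ generator and the paper's exact losses are not reproduced.

```python
# Generic GAN-based reconstruction objective (adversarial + L1) with stand-in nets.
import torch
import torch.nn as nn
import torch.nn.functional as F

gen = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
disc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                     nn.Flatten(), nn.LazyLinear(1))

sparse_input = torch.randn(2, 1, 64, 64)   # reconstruction input from sparse data
target = torch.randn(2, 1, 64, 64)         # ground-truth PAT image

fake = gen(sparse_input)
adv = F.binary_cross_entropy_with_logits(disc(fake), torch.ones(2, 1))  # fool the discriminator
rec = F.l1_loss(fake, target)                                           # stay close to ground truth
g_loss = adv + 100.0 * rec                                              # pix2pix-style weighting
g_loss.backward()
```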
{"title":"DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data","authors":"Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi","doi":"arxiv-2404.13101","DOIUrl":"https://doi.org/arxiv-2404.13101","url":null,"abstract":"Image reconstruction is an essential step of every medical imaging method,\u0000including Photoacoustic Tomography (PAT), which is a promising modality of\u0000imaging, that unites the benefits of both ultrasound and optical imaging\u0000methods. Reconstruction of PAT images using conventional methods results in\u0000rough artifacts, especially when applied directly to sparse PAT data. In recent\u0000years, generative adversarial networks (GANs) have shown a powerful performance\u0000in image generation as well as translation, rendering them a smart choice to be\u0000applied to reconstruction tasks. In this study, we proposed an end-to-end\u0000method called DensePANet to solve the problem of PAT image reconstruction from\u0000sparse data. The proposed model employs a novel modification of UNet in its\u0000generator, called FD-UNet++, which considerably improves the reconstruction\u0000performance. We evaluated the method on various in-vivo and simulated datasets.\u0000Quantitative and qualitative results show the better performance of our model\u0000over other prevalent deep learning techniques.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"121 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Pub Date : 2024-04-08 DOI: arxiv-2404.05206
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
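The building block behind such multimodal embeddings is a symmetric InfoNCE loss between paired modalities; MC3 extends this across audio, language, and vision and gates the pairwise terms by consensus. The gating itself is not reproduced in the sketch below, and the embeddings are random stand-ins.

```python
# Symmetric InfoNCE between two modalities, summed over the three modality pairs.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature          # similarity of every a with every b
    targets = torch.arange(z_a.size(0))         # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

audio = torch.randn(32, 256)
text = torch.randn(32, 256)
video = torch.randn(32, 256)
loss = info_nce(audio, text) + info_nce(audio, video) + info_nce(text, video)
```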
{"title":"SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos","authors":"Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman","doi":"arxiv-2404.05206","DOIUrl":"https://doi.org/arxiv-2404.05206","url":null,"abstract":"We propose a novel self-supervised embedding to learn how actions sound from\u0000narrated in-the-wild egocentric videos. Whereas existing methods rely on\u0000curated data with known audio-visual correspondence, our multimodal\u0000contrastive-consensus coding (MC3) embedding reinforces the associations\u0000between audio, language, and vision when all modality pairs agree, while\u0000diminishing those associations when any one pair does not. We show our approach\u0000can successfully discover how the long tail of human actions sound from\u0000egocentric video, outperforming an array of recent multimodal embedding\u0000techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal\u0000tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SpeechAlign: Aligning Speech Generation to Human Preferences
Pub Date : 2024-04-08 DOI: arxiv-2404.05600
Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affects performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models to strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
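The preference-optimization step could be implemented with a DPO-style objective over (golden, synthetic) codec-token pairs, as sketched below. DPO is one common choice rather than necessarily the paper's exact formulation, and the log-probabilities here are stand-ins for codec-token likelihoods under the current and frozen reference models.

```python
# DPO-style preference loss over paired sequences.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Push the model to prefer golden codec tokens over its own synthetic ones."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Summed token log-probs for a batch of (golden, synthetic) pairs.
loss = preference_loss(
    logp_chosen=torch.tensor([-42.0, -40.5]),
    logp_rejected=torch.tensor([-41.0, -43.0]),
    ref_logp_chosen=torch.tensor([-43.0, -41.0]),
    ref_logp_rejected=torch.tensor([-40.0, -42.5]),
)
```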
{"title":"SpeechAlign: Aligning Speech Generation to Human Preferences","authors":"Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu","doi":"arxiv-2404.05600","DOIUrl":"https://doi.org/arxiv-2404.05600","url":null,"abstract":"Speech language models have significantly advanced in generating realistic\u0000speech, with neural codec language models standing out. However, the\u0000integration of human feedback to align speech outputs to human preferences is\u0000often neglected. This paper addresses this gap by first analyzing the\u0000distribution gap in codec language models, highlighting how it leads to\u0000discrepancies between the training and inference phases, which negatively\u0000affects performance. Then we explore leveraging learning from human feedback to\u0000bridge the distribution gap. We introduce SpeechAlign, an iterative\u0000self-improvement strategy that aligns speech language models to human\u0000preferences. SpeechAlign involves constructing a preference codec dataset\u0000contrasting golden codec tokens against synthetic tokens, followed by\u0000preference optimization to improve the codec language model. This cycle of\u0000improvement is carried out iteratively to steadily convert weak models to\u0000strong ones. Through both subjective and objective evaluations, we show that\u0000SpeechAlign can bridge the distribution gap and facilitating continuous\u0000self-improvement of the speech language model. Moreover, SpeechAlign exhibits\u0000robust generalization capabilities and works for smaller models. Code and\u0000models will be available at https://github.com/0nutation/SpeechGPT.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Test-Time Training for Depression Detection
Pub Date : 2024-04-07 DOI: arxiv-2404.05071
Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore
Previous works on depression detection use datasets collected in similar environments to train and test the models. In practice, however, the train and test distributions cannot be guaranteed to be identical. Distribution shifts can be introduced due to variations such as recording environment (e.g., background noise) and demographics (e.g., gender, age, etc). Such distributional shifts can surprisingly lead to severe performance degradation of the depression detection models. In this paper, we analyze the application of test-time training (TTT) to improve robustness of models trained for depression detection. When compared to regular testing of the models, we find TTT can significantly improve the robustness of the model under a variety of distributional shifts introduced due to: (a) background noise, (b) gender bias, and (c) data collection and curation procedure (i.e., train and test samples are from separate datasets).
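The general TTT recipe is: before predicting on a test sample, take a few gradient steps on a self-supervised auxiliary loss computed on that sample alone, then classify with the adapted encoder. The sketch below uses masked-feature reconstruction as the auxiliary task with stand-in models; the paper's choice of auxiliary objective may differ.

```python
# Generic test-time training loop with a masked-reconstruction auxiliary task.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
classifier = nn.Linear(64, 2)          # depressed vs. not depressed
decoder = nn.Linear(64, 128)           # auxiliary head used only for adaptation

def predict_with_ttt(x, steps=5, lr=1e-3):
    enc = copy.deepcopy(encoder)       # adapt a copy so each test sample starts fresh
    dec = copy.deepcopy(decoder)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        mask = (torch.rand_like(x) > 0.2).float()
        aux_loss = F.mse_loss(dec(enc(x * mask)), x)   # reconstruct masked features
        opt.zero_grad()
        aux_loss.backward()
        opt.step()
    with torch.no_grad():
        return classifier(enc(x)).argmax(dim=-1)

pred = predict_with_ttt(torch.randn(1, 128))
```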
{"title":"Test-Time Training for Depression Detection","authors":"Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore","doi":"arxiv-2404.05071","DOIUrl":"https://doi.org/arxiv-2404.05071","url":null,"abstract":"Previous works on depression detection use datasets collected in similar\u0000environments to train and test the models. In practice, however, the train and\u0000test distributions cannot be guaranteed to be identical. Distribution shifts\u0000can be introduced due to variations such as recording environment (e.g.,\u0000background noise) and demographics (e.g., gender, age, etc). Such\u0000distributional shifts can surprisingly lead to severe performance degradation\u0000of the depression detection models. In this paper, we analyze the application\u0000of test-time training (TTT) to improve robustness of models trained for\u0000depression detection. When compared to regular testing of the models, we find\u0000TTT can significantly improve the robustness of the model under a variety of\u0000distributional shifts introduced due to: (a) background-noise, (b) gender-bias,\u0000and (c) data collection and curation procedure (i.e., train and test samples\u0000are from separate datasets).","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"94 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0