Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman
Expressive voice conversion (VC) performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC remains largely unexplored, and a major challenge lies in modeling emotional prosody. Moreover, previous approaches rely on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We use speech units derived from self-supervised speech models as content conditioning, together with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations demonstrate the effectiveness of the framework. Code and samples are publicly available.
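To make the conditioning setup concrete, here is a minimal sketch of the standard epsilon-prediction DDPM training objective with content units, a speaker embedding, and an emotion embedding concatenated at the denoiser input. It is a generic illustration under stated assumptions, not the authors' architecture: the class and function names, feature dimensions, MLP denoiser, and linear noise schedule are all made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts the added noise from a noisy mel frame plus
    content, speaker, and emotion conditioning vectors (dimensions assumed)."""
    def __init__(self, mel_dim=80, unit_dim=256, spk_dim=192, emo_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + unit_dim + spk_dim + emo_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, noisy_mel, t_feat, units, spk_emb, emo_emb):
        # Conditioning is simply concatenated; t_feat is a normalized timestep.
        cond = torch.cat([noisy_mel, units, spk_emb, emo_emb, t_feat], dim=-1)
        return self.net(cond)

def ddpm_loss(model, mel, units, spk_emb, emo_emb, num_steps=1000):
    """Epsilon-prediction objective with a linear beta schedule (built per call
    for brevity). All tensors are (batch, dim)."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (mel.size(0),))
    a_bar = alphas_cumprod[t].unsqueeze(-1)               # (B, 1)
    noise = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * noise
    t_feat = (t.float() / num_steps).unsqueeze(-1)        # crude timestep encoding
    pred = model(noisy_mel, t_feat, units, spk_emb, emo_emb)
    return F.mse_loss(pred, noise)
```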
{"title":"Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model","authors":"Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman","doi":"arxiv-2405.01730","DOIUrl":"https://doi.org/arxiv-2405.01730","url":null,"abstract":"Expressive voice conversion (VC) conducts speaker identity conversion for\u0000emotional speakers by jointly converting speaker identity and emotional style.\u0000Emotional style modeling for arbitrary speakers in expressive VC has not been\u0000extensively explored. Previous approaches have relied on vocoders for speech\u0000reconstruction, which makes speech quality heavily dependent on the performance\u0000of vocoders. A major challenge of expressive VC lies in emotion prosody\u0000modeling. To address these challenges, this paper proposes a fully end-to-end\u0000expressive VC framework based on a conditional denoising diffusion\u0000probabilistic model (DDPM). We utilize speech units derived from\u0000self-supervised speech models as content conditioning, along with deep features\u0000extracted from speech emotion recognition and speaker verification systems to\u0000model emotional style and speaker identity. Objective and subjective\u0000evaluations show the effectiveness of our framework. Codes and samples are\u0000publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot and few-shot speaker-adaptive TTS approaches have been explored, they have notable limitations. Zero-shot approaches tend to suffer from insufficient generalization when reproducing the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches offer either zero-shot or few-shot adaptation, but not both, constraining their utility across real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting the vast population of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation, respectively. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we design two innovative discriminators and introduce a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce the storage footprint of few-shot speaker adaptation, we design two adapters and a unique adaptation procedure.
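The few-shot side of such a framework typically hinges on freezing the pretrained backbone and training only small adapter modules, which caps both the per-speaker storage cost and the risk of catastrophic forgetting. The sketch below shows a generic residual bottleneck adapter and a freezing routine; it is an assumption-laden illustration, not USAT's actual adapter design, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter: down-project, nonlinearity,
    up-project, residual add. Initialized so it starts as an identity map."""
    def __init__(self, dim=512, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def prepare_for_few_shot(backbone: nn.Module, adapters: nn.ModuleList):
    """Freeze the pretrained TTS backbone; only the adapters receive gradients,
    so a new speaker costs a few small weight matrices rather than a full model copy."""
    for p in backbone.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True
```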
{"title":"USAT: A Universal Speaker-Adaptive Text-to-Speech Approach","authors":"Wenbin Wang, Yang Song, Sanjay Jha","doi":"arxiv-2404.18094","DOIUrl":"https://doi.org/arxiv-2404.18094","url":null,"abstract":"Conventional text-to-speech (TTS) research has predominantly focused on\u0000enhancing the quality of synthesized speech for speakers in the training\u0000dataset. The challenge of synthesizing lifelike speech for unseen,\u0000out-of-dataset speakers, especially those with limited reference data, remains\u0000a significant and unresolved problem. While zero-shot or few-shot\u0000speaker-adaptive TTS approaches have been explored, they have many limitations.\u0000Zero-shot approaches tend to suffer from insufficient generalization\u0000performance to reproduce the voice of speakers with heavy accents. While\u0000few-shot methods can reproduce highly varying accents, they bring a significant\u0000storage burden and the risk of overfitting and catastrophic forgetting. In\u0000addition, prior approaches only provide either zero-shot or few-shot\u0000adaptation, constraining their utility across varied real-world scenarios with\u0000different demands. Besides, most current evaluations of speaker-adaptive TTS\u0000are conducted only on datasets of native speakers, inadvertently neglecting a\u0000vast portion of non-native speakers with diverse accents. Our proposed\u0000framework unifies both zero-shot and few-shot speaker adaptation strategies,\u0000which we term as \"instant\" and \"fine-grained\" adaptations based on their\u0000merits. To alleviate the insufficient generalization performance observed in\u0000zero-shot speaker adaptation, we designed two innovative discriminators and\u0000introduced a memory mechanism for the speech decoder. To prevent catastrophic\u0000forgetting and reduce storage implications for few-shot speaker adaptation, we\u0000designed two adapters and a unique adaptation procedure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transferring the weights of a pre-trained model to assist another task has become a crucial part of modern deep learning, particularly in data-scarce scenarios. Pre-training refers to the initial step of training models outside the current task of interest, typically on another dataset. It can be done with supervised models using human-annotated datasets or with self-supervised models trained on unlabeled datasets. In both cases, many pre-trained models are available to fine-tune for the task of interest. Interestingly, research has shown that pre-trained models from ImageNet can be helpful for audio tasks despite being trained on image datasets. It is therefore unclear whether in-domain models are advantageous compared to competent out-domain models, such as convolutional neural networks pre-trained on ImageNet. Our experiments demonstrate the usefulness of in-domain models and datasets for bird species recognition by leveraging VICReg, a recent and powerful self-supervised method.
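For reference, the VICReg objective that the study leverages combines an invariance term between two augmented views of the same clip with variance and covariance regularizers on each view's embeddings. Below is a minimal sketch of that loss for embeddings of shape (N, D); the default weights follow the original VICReg paper, and the function name and batch handling are simplifications.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg objective on two embedding views z_a, z_b of shape (N, D)."""
    n, d = z_a.shape
    # 1) Invariance: two views of the same recording should map close together.
    sim = F.mse_loss(z_a, z_b)
    # 2) Variance: keep each embedding dimension's std above 1 to avoid collapse.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))
    # 3) Covariance: decorrelate embedding dimensions (penalize off-diagonal terms).
    z_a = z_a - z_a.mean(dim=0)
    z_b = z_b - z_b.mean(dim=0)
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov
```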
{"title":"Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition","authors":"Houtan Ghaffari, Paul Devos","doi":"arxiv-2404.17252","DOIUrl":"https://doi.org/arxiv-2404.17252","url":null,"abstract":"Transferring the weights of a pre-trained model to assist another task has\u0000become a crucial part of modern deep learning, particularly in data-scarce\u0000scenarios. Pre-training refers to the initial step of training models outside\u0000the current task of interest, typically on another dataset. It can be done via\u0000supervised models using human-annotated datasets or self-supervised models\u0000trained on unlabeled datasets. In both cases, many pre-trained models are\u0000available to fine-tune for the task of interest. Interestingly, research has\u0000shown that pre-trained models from ImageNet can be helpful for audio tasks\u0000despite being trained on image datasets. Hence, it's unclear whether in-domain\u0000models would be advantageous compared to competent out-domain models, such as\u0000convolutional neural networks from ImageNet. Our experiments will demonstrate\u0000the usefulness of in-domain models and datasets for bird species recognition by\u0000leveraging VICReg, a recent and powerful self-supervised method.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka
Oral history concerns the spoken accounts of witnesses to, and commentators on, historical events. Speech technology is an important instrument for processing such recordings in order to obtain transcriptions and further enhancements that structure the oral account. In this contribution, we address the transcription portal and the web services associated with speech processing at BAS, the speech solutions developed at LINDAT, how to do it yourself with Whisper, remaining challenges, and future developments.
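For the "do it yourself with Whisper" part, a minimal transcription run with the open-source openai-whisper package looks roughly like the sketch below; the model size and the file name interview.wav are placeholders, not settings recommended in the paper.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("medium")           # choose a size that fits your GPU/CPU
result = model.transcribe("interview.wav")     # language is auto-detected if omitted

print(result["text"])                          # full transcript
for seg in result["segments"]:                 # timestamped segments, useful for alignment
    print(f'{seg["start"]:7.2f}-{seg["end"]:7.2f}  {seg["text"].strip()}')
```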
{"title":"Speech Technology Services for Oral History Research","authors":"Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka","doi":"arxiv-2405.02333","DOIUrl":"https://doi.org/arxiv-2405.02333","url":null,"abstract":"Oral history is about oral sources of witnesses and commentors on historical\u0000events. Speech technology is an important instrument to process such recordings\u0000in order to obtain transcription and further enhancements to structure the oral\u0000account In this contribution we address the transcription portal and the\u0000webservices associated with speech processing at BAS, speech solutions\u0000developed at LINDAT, how to do it yourself with Whisper, remaining challenges,\u0000and future developments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Previous studies on music style transfer have mainly focused on one-to-one style conversion, which is relatively limited. When converting between multiple styles, previous methods required designing multiple modes to disentangle the complex style of the music, resulting in large computational costs and slow audio generation. Existing music style transfer methods also generate spectrograms with artifacts, leading to significant noise in the generated audio. To address these issues, this study proposes a music style transfer framework based on diffusion models (DM) that uses spectrogram-based methods to achieve multi-to-multi music style transfer. The GuideDiff method is used to restore spectrograms to high-fidelity audio, accelerating audio generation and reducing noise in the output. Experimental results show that our model performs well in multi-mode music style transfer compared to the baseline and can generate high-quality audio in real time on consumer-grade GPUs.
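Since the framework operates on spectrograms, its input/output representation is typically a log-mel spectrogram. The sketch below computes one with librosa and shows a Griffin-Lim inversion as a crude baseline for the spectrogram-to-audio step that a GuideDiff-style decoder would replace; the file name and all parameter values (sample rate, FFT size, hop length, mel bands) are assumptions, not the paper's settings.

```python
import librosa
import numpy as np

# Placeholder file name; analysis settings are illustrative only.
y, sr = librosa.load("input_clip.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))      # log-mel space used as model input/output

# Crude baseline inversion (Griffin-Lim); a neural or diffusion-based decoder
# would replace this step to reduce artifacts and noise in the generated audio.
mel_hat = np.exp(log_mel)
audio = librosa.feature.inverse.mel_to_audio(mel_hat, sr=sr, n_fft=1024, hop_length=256)
```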
{"title":"Music Style Transfer With Diffusion Model","authors":"Hong Huang, Yuyi Wang, Luyao Li, Jun Lin","doi":"arxiv-2404.14771","DOIUrl":"https://doi.org/arxiv-2404.14771","url":null,"abstract":"Previous studies on music style transfer have mainly focused on one-to-one\u0000style conversion, which is relatively limited. When considering the conversion\u0000between multiple styles, previous methods required designing multiple modes to\u0000disentangle the complex style of the music, resulting in large computational\u0000costs and slow audio generation. The existing music style transfer methods\u0000generate spectrograms with artifacts, leading to significant noise in the\u0000generated audio. To address these issues, this study proposes a music style\u0000transfer framework based on diffusion models (DM) and uses spectrogram-based\u0000methods to achieve multi-to-multi music style transfer. The GuideDiff method is\u0000used to restore spectrograms to high-fidelity audio, accelerating audio\u0000generation speed and reducing noise in the generated audio. Experimental\u0000results show that our model has good performance in multi-mode music style\u0000transfer compared to the baseline and can generate high-quality audio in\u0000real-time on consumer-grade GPUs.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140806306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article discusses the application of single vector hydrophones to Direction of Arrival (DOA) estimation in underwater acoustic signal processing. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this study introduces a Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). The method reconstructs the signal model of a single vector hydrophone, converting its covariance matrix into a Toeplitz structure suitable for the Sparse and Parametric Approach (SPA) algorithm, and then optimizes it with the SPA algorithm to achieve more accurate DOA estimation. Through detailed simulation analysis, this research confirms the performance of the proposed algorithm in single- and dual-target DOA estimation scenarios, especially under various signal-to-noise ratio (SNR) conditions. The simulation results show that, compared to traditional DOA estimation methods, the algorithm has significant advantages in estimation accuracy and resolution, particularly with multi-source signals and in low-SNR environments. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.
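To ground the signal model, a single vector hydrophone measures acoustic pressure together with two orthogonal particle-velocity components, giving a three-element array manifold per source direction. The sketch below simulates snapshots under that standard model and forms the sample covariance matrix that methods such as SPA/VSRSPA operate on; it does not implement the Toeplitz reconstruction or the SPA optimization itself, and the function names and parameters are illustrative assumptions.

```python
import numpy as np

def steering_vector(theta):
    """Single vector hydrophone manifold: pressure plus x/y particle velocity."""
    return np.array([1.0, np.cos(theta), np.sin(theta)])

def simulate_snapshots(thetas_deg, snr_db, n_snap=1000, rng=None):
    """Complex Gaussian sources at the given bearings plus white sensor noise."""
    rng = rng or np.random.default_rng(0)
    thetas = np.deg2rad(np.atleast_1d(thetas_deg))
    A = np.stack([steering_vector(t) for t in thetas], axis=1)          # (3, K)
    s = (rng.standard_normal((len(thetas), n_snap))
         + 1j * rng.standard_normal((len(thetas), n_snap))) / np.sqrt(2)
    noise_pow = 10.0 ** (-snr_db / 10.0)
    n = np.sqrt(noise_pow / 2) * (rng.standard_normal((3, n_snap))
                                  + 1j * rng.standard_normal((3, n_snap)))
    return A @ s + n

def sample_covariance(y):
    """Sample covariance matrix fed to the downstream DOA estimator."""
    return (y @ y.conj().T) / y.shape[1]

R = sample_covariance(simulate_snapshots([30.0, 60.0], snr_db=0.0))
```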
{"title":"Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone","authors":"Jiabin Guo","doi":"arxiv-2404.15160","DOIUrl":"https://doi.org/arxiv-2404.15160","url":null,"abstract":"This article discusses the application of single vector hydrophones in the\u0000field of underwater acoustic signal processing for Direction Of Arrival (DOA)\u0000estimation. Addressing the limitations of traditional DOA estimation methods in\u0000multi-source environments and under noise interference, this study introduces a\u0000Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This\u0000method involves reconstructing the signal model of a single vector hydrophone,\u0000converting its covariance matrix into a Toeplitz structure suitable for the\u0000Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it\u0000using the SPA algorithm to achieve more accurate DOA estimation. Through\u0000detailed simulation analysis, this research has confirmed the performance of\u0000the proposed algorithm in single and dual-target DOA estimation scenarios,\u0000especially under various signal-to-noise ratio(SNR) conditions. The simulation\u0000results show that, compared to traditional DOA estimation methods, this\u0000algorithm has significant advantages in estimation accuracy and resolution,\u0000particularly in multi-source signals and low SNR environments. The contribution\u0000of this study lies in providing an effective new method for DOA estimation with\u0000single vector hydrophones in complex environments, introducing new research\u0000directions and solutions in the field of vector hydrophone signal processing.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image reconstruction is an essential step in every medical imaging method, including photoacoustic tomography (PAT), a promising imaging modality that unites the benefits of ultrasound and optical imaging. Reconstructing PAT images with conventional methods results in rough artifacts, especially when applied directly to sparse PAT data. In recent years, generative adversarial networks (GANs) have shown powerful performance in image generation and translation, making them a natural choice for reconstruction tasks. In this study, we propose an end-to-end method called DensePANet to solve the problem of PAT image reconstruction from sparse data. The proposed model employs a novel modification of UNet in its generator, called FD-UNet++, which considerably improves reconstruction performance. We evaluate the method on several in-vivo and simulated datasets. Quantitative and qualitative results show that our model outperforms other prevalent deep learning techniques.
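A GAN-based reconstruction model of this kind is usually trained with an adversarial term plus a pixel-wise reconstruction term. The sketch below shows one generator update in that generic, pix2pix-style setup; the FD-UNet++ generator itself is the paper's contribution and appears here only as a placeholder module, and the loss weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, opt_g, sparse_input, full_image, l1_weight=100.0):
    """One generator update: fool the discriminator and stay close to the
    ground-truth reconstruction in L1. generator and discriminator are
    placeholders (e.g., an FD-UNet++-like generator and a patch discriminator)."""
    opt_g.zero_grad()
    fake = generator(sparse_input)                                  # reconstructed PAT image
    pred_fake = discriminator(torch.cat([sparse_input, fake], dim=1))
    adv_loss = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    rec_loss = F.l1_loss(fake, full_image)
    loss = adv_loss + l1_weight * rec_loss
    loss.backward()
    opt_g.step()
    return loss.item()
```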
{"title":"DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data","authors":"Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi","doi":"arxiv-2404.13101","DOIUrl":"https://doi.org/arxiv-2404.13101","url":null,"abstract":"Image reconstruction is an essential step of every medical imaging method,\u0000including Photoacoustic Tomography (PAT), which is a promising modality of\u0000imaging, that unites the benefits of both ultrasound and optical imaging\u0000methods. Reconstruction of PAT images using conventional methods results in\u0000rough artifacts, especially when applied directly to sparse PAT data. In recent\u0000years, generative adversarial networks (GANs) have shown a powerful performance\u0000in image generation as well as translation, rendering them a smart choice to be\u0000applied to reconstruction tasks. In this study, we proposed an end-to-end\u0000method called DensePANet to solve the problem of PAT image reconstruction from\u0000sparse data. The proposed model employs a novel modification of UNet in its\u0000generator, called FD-UNet++, which considerably improves the reconstruction\u0000performance. We evaluated the method on various in-vivo and simulated datasets.\u0000Quantitative and qualitative results show the better performance of our model\u0000over other prevalent deep learning techniques.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"121 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show that our approach can successfully discover how the long tail of human actions sounds from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
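As a reference point for the pairwise associations such an embedding builds on, the sketch below implements a standard symmetric InfoNCE loss between two batches of modality embeddings (e.g., audio-language or audio-vision pairs). The consensus weighting across all three modality pairs is the paper's contribution and is not reproduced here; the temperature, shapes, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_x, z_y, temperature=0.07):
    """Symmetric InfoNCE between two batches of modality embeddings (N, D):
    matching rows are positives, all other rows in the batch are negatives."""
    z_x = F.normalize(z_x, dim=-1)
    z_y = F.normalize(z_y, dim=-1)
    logits = z_x @ z_y.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(z_x.size(0), device=z_x.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```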
{"title":"SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos","authors":"Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman","doi":"arxiv-2404.05206","DOIUrl":"https://doi.org/arxiv-2404.05206","url":null,"abstract":"We propose a novel self-supervised embedding to learn how actions sound from\u0000narrated in-the-wild egocentric videos. Whereas existing methods rely on\u0000curated data with known audio-visual correspondence, our multimodal\u0000contrastive-consensus coding (MC3) embedding reinforces the associations\u0000between audio, language, and vision when all modality pairs agree, while\u0000diminishing those associations when any one pair does not. We show our approach\u0000can successfully discover how the long tail of human actions sound from\u0000egocentric video, outperforming an array of recent multimodal embedding\u0000techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal\u0000tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs with human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases that negatively affect performance. We then explore leveraging learning from human feedback to bridge this distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models with human preferences. SpeechAlign constructs a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This improvement cycle is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
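One common way to carry out the preference-optimization step is a DPO-style objective over summed token log-probabilities of the golden (preferred) versus synthetic (rejected) codec sequences, contrasted against a frozen reference model. The sketch below shows that generic objective, not necessarily the exact variant used in the paper; beta is an assumed hyperparameter.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style preference loss: push the policy to prefer golden codec
    sequences over synthetic ones relative to a frozen reference model.
    Inputs are per-sequence summed log-probabilities (1-D tensors)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```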
{"title":"SpeechAlign: Aligning Speech Generation to Human Preferences","authors":"Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu","doi":"arxiv-2404.05600","DOIUrl":"https://doi.org/arxiv-2404.05600","url":null,"abstract":"Speech language models have significantly advanced in generating realistic\u0000speech, with neural codec language models standing out. However, the\u0000integration of human feedback to align speech outputs to human preferences is\u0000often neglected. This paper addresses this gap by first analyzing the\u0000distribution gap in codec language models, highlighting how it leads to\u0000discrepancies between the training and inference phases, which negatively\u0000affects performance. Then we explore leveraging learning from human feedback to\u0000bridge the distribution gap. We introduce SpeechAlign, an iterative\u0000self-improvement strategy that aligns speech language models to human\u0000preferences. SpeechAlign involves constructing a preference codec dataset\u0000contrasting golden codec tokens against synthetic tokens, followed by\u0000preference optimization to improve the codec language model. This cycle of\u0000improvement is carried out iteratively to steadily convert weak models to\u0000strong ones. Through both subjective and objective evaluations, we show that\u0000SpeechAlign can bridge the distribution gap and facilitating continuous\u0000self-improvement of the speech language model. Moreover, SpeechAlign exhibits\u0000robust generalization capabilities and works for smaller models. Code and\u0000models will be available at https://github.com/0nutation/SpeechGPT.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore
Previous works on depression detection use datasets collected in similar environments to train and test the models. In practice, however, the train and test distributions cannot be guaranteed to be identical. Distribution shifts can be introduced by variations such as the recording environment (e.g., background noise) and demographics (e.g., gender, age). Surprisingly, such distributional shifts can lead to severe performance degradation in depression detection models. In this paper, we analyze the application of test-time training (TTT) to improve the robustness of models trained for depression detection. Compared to regular testing, we find that TTT can significantly improve the robustness of the model under a variety of distributional shifts introduced by: (a) background noise, (b) gender bias, and (c) the data collection and curation procedure (i.e., train and test samples drawn from separate datasets).
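Generically, test-time training attaches a self-supervised auxiliary head to the shared encoder and, at inference, takes a few gradient steps on the auxiliary loss computed from the test input alone before predicting. The sketch below shows that loop; the auxiliary task (aux_loss_fn), step count, and learning rate are placeholders rather than the paper's configuration.

```python
import copy
import torch

def test_time_adapt(encoder, aux_head, clf_head, x_test, aux_loss_fn, steps=5, lr=1e-4):
    """Generic test-time training: adapt a copy of the shared encoder on a
    self-supervised auxiliary loss from the test sample(s) alone, then predict."""
    enc = copy.deepcopy(encoder)                       # never touch the deployed weights
    opt = torch.optim.SGD(enc.parameters(), lr=lr)
    enc.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = aux_loss_fn(aux_head(enc(x_test)))      # e.g., a masked-reconstruction loss
        loss.backward()
        opt.step()
    enc.eval()
    with torch.no_grad():
        return clf_head(enc(x_test))                   # depression prediction after adaptation
```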
{"title":"Test-Time Training for Depression Detection","authors":"Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore","doi":"arxiv-2404.05071","DOIUrl":"https://doi.org/arxiv-2404.05071","url":null,"abstract":"Previous works on depression detection use datasets collected in similar\u0000environments to train and test the models. In practice, however, the train and\u0000test distributions cannot be guaranteed to be identical. Distribution shifts\u0000can be introduced due to variations such as recording environment (e.g.,\u0000background noise) and demographics (e.g., gender, age, etc). Such\u0000distributional shifts can surprisingly lead to severe performance degradation\u0000of the depression detection models. In this paper, we analyze the application\u0000of test-time training (TTT) to improve robustness of models trained for\u0000depression detection. When compared to regular testing of the models, we find\u0000TTT can significantly improve the robustness of the model under a variety of\u0000distributional shifts introduced due to: (a) background-noise, (b) gender-bias,\u0000and (c) data collection and curation procedure (i.e., train and test samples\u0000are from separate datasets).","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"94 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}