
arXiv - EE - Audio and Speech Processing: Latest Publications

Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment
Pub Date: 2024-09-11 | arXiv:2409.07151
Tien-Hong Lo, Meng-Ting Tsai, Berlin Chen
Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when that speech aligns with their respective speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) in-depth investigations of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance improvements with respect to various assessment metrics relative to prior art. To our knowledge, this study is the first to explore the role of golden speech in both ZS-TTS and APA, offering a promising regime for computer-assisted pronunciation training (CAPT).
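As a rough illustration of how learner-specific golden speech could be turned into a pronunciation score, the sketch below (an assumption-laden toy, not the authors' framework) aligns frame-level features of a learner utterance against a synthesized golden reference with DTW and maps the alignment cost to a 0-100 score; the random feature arrays, the cosine frame distance, and the exponential score mapping are all illustrative choices.

```python
# Toy golden-speech scoring: DTW-align learner features to the golden reference,
# then convert the average alignment cost into a score. Not the paper's method.
import numpy as np

def dtw_cost(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Average per-step cost of the best monotonic alignment (cosine distance)."""
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + 1e-8)
    hyp_n = hyp / (np.linalg.norm(hyp, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - ref_n @ hyp_n.T                      # (T_ref, T_hyp) cosine distances
    acc = np.full(dist.shape, np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(dist.shape[0]):
        for j in range(dist.shape[1]):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = dist[i, j] + prev
    return acc[-1, -1] / (dist.shape[0] + dist.shape[1])

def pronunciation_score(golden_feats: np.ndarray, learner_feats: np.ndarray) -> float:
    """Map alignment cost to a 0-100 proficiency-style score (illustrative scaling)."""
    return float(100.0 * np.exp(-5.0 * dtw_cost(golden_feats, learner_feats)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    golden = rng.normal(size=(120, 39))               # e.g. 120 frames of 39-dim MFCCs
    learner = golden + 0.3 * rng.normal(size=(120, 39))
    print(f"score vs. golden speech: {pronunciation_score(golden, learner):.1f}")
```

In practice, such a similarity signal would more likely be fed into an assessment model as one feature than used directly as the final grade.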
Citations: 0
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Pub Date: 2024-09-11 | arXiv:2409.07450
Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
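The joint objective described above can be pictured with a short sketch (not the released VMAS code): an autoregressive next-token loss over music tokens plus an InfoNCE-style contrastive term between pooled video and music embeddings. The tensor shapes, the temperature, and the weighting factor `lambda_contrastive` are assumptions for illustration.

```python
# Sketch of a joint autoregressive + contrastive video-music objective.
import torch
import torch.nn.functional as F

def joint_video_music_loss(music_logits: torch.Tensor,   # (B, T, V) next-token logits
                           music_tokens: torch.Tensor,   # (B, T)    target token ids
                           video_emb: torch.Tensor,      # (B, D)    pooled video embedding
                           music_emb: torch.Tensor,      # (B, D)    pooled music embedding
                           temperature: float = 0.07,
                           lambda_contrastive: float = 0.5) -> torch.Tensor:
    # Autoregressive term: predict each music token from its preceding context.
    ar_loss = F.cross_entropy(music_logits.flatten(0, 1), music_tokens.flatten())

    # Contrastive term: matched (video, music) pairs on the diagonal are positives.
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    nce_loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    return ar_loss + lambda_contrastive * nce_loss

if __name__ == "__main__":
    B, T, V, D = 4, 16, 1024, 256
    loss = joint_video_music_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                                  torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```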
Citations: 0
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm
Pub Date: 2024-09-11 | arXiv:2409.07226
Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs, which offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module to imitate human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.
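The discrete-representation idea mentioned in the abstract can be sketched generically (this is not the Muskits-ESPnet API): frame-level features from a self-supervised model are quantized with k-means so that downstream SVS models consume compact unit sequences. The feature dimensions, cluster count, and random stand-in data are assumptions.

```python
# Generic sketch: learn a k-means codebook over SSL-style features and
# encode an utterance as a sequence of discrete unit ids.
import numpy as np
from sklearn.cluster import KMeans

def learn_discrete_units(ssl_features: np.ndarray, n_units: int = 100) -> KMeans:
    """Fit a k-means codebook over pooled frame-level features of shape (T_total, D)."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(ssl_features)

def encode_utterance(codebook: KMeans, utterance_features: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest codebook entry, yielding a unit sequence."""
    return codebook.predict(utterance_features)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pooled = rng.normal(size=(5000, 768))     # stand-in for HuBERT-style features
    codebook = learn_discrete_units(pooled, n_units=50)
    units = encode_utterance(codebook, rng.normal(size=(200, 768)))
    print(units[:20])
```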
Citations: 0
Enhancing CTC-Based Visual Speech Recognition
Pub Date: 2024-09-11 | arXiv:2409.07210
Hendrik Laux, Anke Schmeink
This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.
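Feature normalization inside a distillation loss, in the spirit of what the abstract describes (not the LiteVSR2 code), can be sketched as follows: teacher and student frame features are layer-normalized before the regression loss, so the student matches the teacher's feature shape rather than its raw activation scale. The shapes and the MSE objective are assumptions.

```python
# Sketch of feature-level distillation with normalization of both sides.
import torch
import torch.nn.functional as F

def normalized_distillation_loss(student_feats: torch.Tensor,   # (B, T, D)
                                 teacher_feats: torch.Tensor    # (B, T, D), frozen teacher
                                 ) -> torch.Tensor:
    # Layer-normalize along the feature dimension before regressing.
    s = F.layer_norm(student_feats, student_feats.shape[-1:])
    t = F.layer_norm(teacher_feats, teacher_feats.shape[-1:]).detach()
    return F.mse_loss(s, t)

if __name__ == "__main__":
    student = torch.randn(2, 50, 256, requires_grad=True)
    teacher = torch.randn(2, 50, 256)
    loss = normalized_distillation_loss(student, teacher)
    loss.backward()
    print(loss.item())
```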
Citations: 0
Rethinking Mamba in Speech Processing by Self-Supervised Models
Pub Date: 2024-09-11 | arXiv:2409.07273
Xiangyu Zhang, Jianbo Ma, Mostafa Shahin, Beena Ahmed, Julien Epps
The Mamba-based model has demonstrated outstanding performance across tasks in computer vision, natural language processing, and speech processing. However, in the realm of speech processing, the Mamba-based model's performance varies across different tasks. For instance, in tasks such as speech enhancement and spectrum reconstruction, the Mamba model performs well when used independently. However, for tasks like speech recognition, additional modules are required to surpass the performance of attention-based models. We propose the hypothesis that the Mamba-based model excels in "reconstruction" tasks within speech processing. However, for "classification tasks" such as Speech Recognition, additional modules are necessary to accomplish the "reconstruction" step. To validate our hypothesis, we analyze the previous Mamba-based Speech Models from an information theory perspective. Furthermore, we leveraged the properties of HuBERT in our study. We trained a Mamba-based HuBERT model, and the mutual information patterns, along with the model's performance metrics, confirmed our assumptions.
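The information-theoretic analysis is only summarized at a high level above, but the general flavor of estimating mutual information between two layer representations can be sketched with a generic histogram estimator (not the authors' analysis); reducing each frame to its L2 norm and the bin count are illustrative choices.

```python
# Generic histogram-based mutual-information estimate between two aligned
# scalar summaries of layer representations (in nats).
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 32) -> float:
    """MI between two aligned scalar sequences via a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_a = rng.normal(size=(2000, 768))                        # stand-in frame features
    layer_b = 0.8 * layer_a + 0.2 * rng.normal(size=(2000, 768))  # correlated second layer
    # Reduce each frame to its L2 norm as a simple scalar summary statistic.
    mi = mutual_information(np.linalg.norm(layer_a, axis=1),
                            np.linalg.norm(layer_b, axis=1))
    print(f"estimated MI: {mi:.3f} nats")
```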
Citations: 0
A Suite for Acoustic Language Model Evaluation
Pub Date: 2024-09-11 | arXiv:2409.07437
Gallil Maimon, Amit Roth, Yossi Adi
Speech language models have recently demonstrated great potential as universal speech processing systems. Such models have the ability to model the rich acoustic information existing in audio signals, beyond spoken content, such as emotion, background noise, etc. Despite this, evaluation benchmarks which evaluate awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response. The proposed benchmarks both evaluate the consistency of the inspected element and how much it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method. Code and data are publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/salmon/.
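The modelling-based protocol is simple enough to sketch directly: for each benchmark pair, the speech LM should assign the acoustically consistent sample a higher score than the mismatched one, and the metric is the fraction of pairs ordered correctly. In the sketch below, `score_fn` stands in for whatever likelihood or score the evaluated model exposes.

```python
# Pairwise-ordering metric: how often does the model score the correct
# sample above the incorrect one?
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(pairs: Iterable[Tuple[object, object]],
                      score_fn: Callable[[object], float]) -> float:
    """Fraction of (correct, incorrect) pairs where the correct sample scores higher."""
    wins, total = 0, 0
    for correct_sample, incorrect_sample in pairs:
        wins += score_fn(correct_sample) > score_fn(incorrect_sample)
        total += 1
    return wins / max(total, 1)

if __name__ == "__main__":
    # Toy example: "samples" are numbers and the score is the value itself.
    toy_pairs = [(0.9, 0.2), (0.4, 0.7), (0.8, 0.1)]
    print(pairwise_accuracy(toy_pairs, score_fn=float))   # 2 of 3 pairs ordered correctly
```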
Citations: 0
ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages
Pub Date: 2024-09-11 | arXiv:2409.07259
Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.
Citations: 0
D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack
Pub Date: 2024-09-11 | arXiv:2409.07390
Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
The advancements in generative AI have enabled the improvement of audio synthesis models, including text-to-speech and voice conversion. This raises concerns about its potential misuse in social manipulation and political interference, as synthetic speech has become indistinguishable from natural human speech. Several speech-generation programs are utilized for malicious purposes, especially impersonating individuals through phone calls. Therefore, detecting fake audio is crucial to maintain social security and safeguard the integrity of information. Recent research has proposed a D-CAPTCHA system based on the challenge-response protocol to differentiate fake phone calls from real ones. In this work, we study the resilience of this system and introduce a more robust version, D-CAPTCHA++, to defend against fake calls. Specifically, we first expose the vulnerability of the D-CAPTCHA system under transferable imperceptible adversarial attack. Secondly, we mitigate this vulnerability by improving the robustness of the system through adversarial training of the D-CAPTCHA deepfake detectors and task classifiers.
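One concrete way to realize the adversarial-training defence mentioned above (an illustrative sketch, not the authors' system) is a PGD-style inner loop that crafts an imperceptible L-infinity perturbation of the waveform and then updates the detector on the perturbed batch; the toy detector architecture and the epsilon/step-size values are assumptions.

```python
# Sketch: PGD adversarial training step for a toy waveform deepfake detector.
import torch
import torch.nn as nn
import torch.nn.functional as F

detector = nn.Sequential(nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)

def pgd_perturb(wave, label, eps=0.001, alpha=0.0003, steps=5):
    """Gradient-ascent perturbation of the waveform, kept within ||delta||_inf <= eps."""
    delta = torch.zeros_like(wave, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(detector(wave + delta), label)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (wave + delta).detach()

def adversarial_training_step(wave, label):
    """One detector update on the adversarially perturbed batch."""
    adv_wave = pgd_perturb(wave, label)
    optimizer.zero_grad()        # clear gradients accumulated during the PGD loop
    loss = F.cross_entropy(detector(adv_wave), label)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    wave = torch.randn(4, 1, 16000)          # 4 one-second clips at 16 kHz
    label = torch.randint(0, 2, (4,))        # 0 = bona fide, 1 = deepfake
    print(adversarial_training_step(wave, label))
```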
Citations: 0
The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction
Pub Date: 2024-09-11 | arXiv:2409.07001
Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao
We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.
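As an illustration of the "non-self-supervised representations" credited above, the sketch below turns a frame-level F0 track (from any pitch tracker) into a fixed-length, normalized log-pitch histogram that a MOS predictor could consume alongside spectrogram features; the bin count and frequency range are assumptions.

```python
# Sketch: fixed-length log-pitch histogram from a frame-level F0 track.
import numpy as np

def pitch_histogram(f0_hz: np.ndarray, n_bins: int = 32,
                    fmin: float = 50.0, fmax: float = 600.0) -> np.ndarray:
    """Normalized histogram of voiced log-F0 values; zeros/NaNs are treated as unvoiced."""
    voiced = f0_hz[np.isfinite(f0_hz) & (f0_hz > 0)]
    if voiced.size == 0:
        return np.zeros(n_bins)
    hist, _ = np.histogram(np.log(voiced),
                           bins=n_bins, range=(np.log(fmin), np.log(fmax)))
    total = hist.sum()
    return hist / total if total else np.zeros(n_bins)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = np.where(rng.random(500) > 0.3, rng.normal(180, 30, 500), 0.0)  # toy F0 track
    feat = pitch_histogram(f0)
    print(feat.shape, round(float(feat.sum()), 3))   # (32,) 1.0
```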
Citations: 0
Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition
Pub Date: 2024-09-11 | arXiv:2409.07165
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Batthacharya
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
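The linear-versus-quadratic point can be made concrete with a small sketch of summary-style token mixing (an illustration of the general idea, not the authors' exact SummaryMixing layer or its streaming variant): each frame is fused with a single mean-pooled summary of the utterance, so the cost grows linearly with sequence length rather than with its square.

```python
# Sketch: linear-time token mixing via a pooled utterance summary.
import torch
import torch.nn as nn

class SummaryMixingBlock(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())    # per-frame transform
        self.summary = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())  # transform before pooling
        self.combine = nn.Linear(2 * hidden, dim)                        # fuse local + global

    def forward(self, x: torch.Tensor) -> torch.Tensor:                  # x: (B, T, D)
        local = self.local(x)                                            # (B, T, H)
        global_summary = self.summary(x).mean(dim=1, keepdim=True)       # (B, 1, H)
        fused = torch.cat([local, global_summary.expand_as(local)], dim=-1)
        return x + self.combine(fused)                                   # residual connection

if __name__ == "__main__":
    block = SummaryMixingBlock(dim=256)
    frames = torch.randn(2, 1000, 256)    # 1000 frames: O(T) work, no (T x T) attention map
    print(block(frames).shape)            # torch.Size([2, 1000, 256])
```

For the streaming mode discussed in the abstract, the pooled summary would have to be computed causally or per chunk rather than over the whole utterance; that detail is omitted here.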
Citations: 0