Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when that speech aligns with their own speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) the design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) an in-depth investigation of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that the proposed modeling yields significant performance improvements on various assessment metrics relative to prior art. To our knowledge, this study is the first to explore the role of golden speech in both ZS-TTS and APA, offering a promising avenue for computer-assisted pronunciation training (CAPT).
{"title":"Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment","authors":"Tien-Hong Lo, Meng-Ting Tsai, Berlin Chen","doi":"arxiv-2409.07151","DOIUrl":"https://doi.org/arxiv-2409.07151","url":null,"abstract":"Second language (L2) learners can improve their pronunciation by imitating\u0000golden speech, especially when the speech that aligns with their respective\u0000speech characteristics. This study explores the hypothesis that\u0000learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS)\u0000techniques can be harnessed as an effective metric for measuring the\u0000pronunciation proficiency of L2 learners. Building on this exploration, the\u0000contributions of this study are at least two-fold: 1) design and development of\u0000a systematic framework for assessing the ability of a synthesis model to\u0000generate golden speech, and 2) in-depth investigations of the effectiveness of\u0000using golden speech in automatic pronunciation assessment (APA). Comprehensive\u0000experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets\u0000suggest that our proposed modeling can yield significant performance\u0000improvements with respect to various assessment metrics in relation to some\u0000prior arts. To our knowledge, this study is the first to explore the role of\u0000golden speech in both ZS-TTS and APA, offering a promising regime for\u0000computer-assisted pronunciation training (CAPT).","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
{"title":"VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos","authors":"Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang","doi":"arxiv-2409.07450","DOIUrl":"https://doi.org/arxiv-2409.07450","url":null,"abstract":"We present a framework for learning to generate background music from video\u0000inputs. Unlike existing works that rely on symbolic musical annotations, which\u0000are limited in quantity and diversity, our method leverages large-scale web\u0000videos accompanied by background music. This enables our model to learn to\u0000generate realistic and diverse music. To accomplish this goal, we develop a\u0000generative video-music Transformer with a novel semantic video-music alignment\u0000scheme. Our model uses a joint autoregressive and contrastive learning\u0000objective, which encourages the generation of music aligned with high-level\u0000video content. We also introduce a novel video-beat alignment scheme to match\u0000the generated music beats with the low-level motions in the video. Lastly, to\u0000capture fine-grained visual cues in a video needed for realistic background\u0000music generation, we introduce a new temporal video encoder architecture,\u0000allowing us to efficiently process videos consisting of many densely sampled\u0000frames. We train our framework on our newly curated DISCO-MV dataset,\u0000consisting of 2.2M video-music samples, which is orders of magnitude larger\u0000than any prior datasets used for video music generation. Our method outperforms\u0000existing approaches on the DISCO-MV and MusicCaps datasets according to various\u0000music generation evaluation metrics, including human evaluation. Results are\u0000available at https://genjib.github.io/project_page/VMAs/index.html","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs, which offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.
{"title":"Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm","authors":"Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin","doi":"arxiv-2409.07226","DOIUrl":"https://doi.org/arxiv-2409.07226","url":null,"abstract":"This research presents Muskits-ESPnet, a versatile toolkit that introduces\u0000new paradigms to Singing Voice Synthesis (SVS) through the application of\u0000pretrained audio models in both continuous and discrete approaches.\u0000Specifically, we explore discrete representations derived from SSL models and\u0000audio codecs and offer significant advantages in versatility and intelligence,\u0000supporting multi-format inputs and adaptable data processing workflows for\u0000various SVS models. The toolkit features automatic music score error detection\u0000and correction, as well as a perception auto-evaluation module to imitate human\u0000subjective evaluating scores. Muskits-ESPnet is available at\u0000url{https://github.com/espnet/espnet}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"110 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.
{"title":"Enhancing CTC-Based Visual Speech Recognition","authors":"Hendrik Laux, Anke Schmeink","doi":"arxiv-2409.07210","DOIUrl":"https://doi.org/arxiv-2409.07210","url":null,"abstract":"This paper presents LiteVSR2, an enhanced version of our previously\u0000introduced efficient approach to Visual Speech Recognition (VSR). Building upon\u0000our knowledge distillation framework from a pre-trained Automatic Speech\u0000Recognition (ASR) model, we introduce two key improvements: a stabilized video\u0000preprocessing technique and feature normalization in the distillation process.\u0000These improvements yield substantial performance gains on the LRS2 and LRS3\u0000benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model\u0000without increasing the volume of training data or computational resources\u0000utilized. Furthermore, we explore the scalability of our approach by examining\u0000performance metrics across varying model complexities and training data\u0000volumes. LiteVSR2 maintains the efficiency of its predecessor while\u0000significantly enhancing accuracy, thereby demonstrating the potential for\u0000resource-efficient advancements in VSR technology.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Mamba-based model has demonstrated outstanding performance across tasks in computer vision, natural language processing, and speech processing. However, in the realm of speech processing, the Mamba-based model's performance varies across different tasks. For instance, in tasks such as speech enhancement and spectrum reconstruction, the Mamba model performs well when used independently, whereas for tasks like speech recognition, additional modules are required to surpass the performance of attention-based models. We propose the hypothesis that the Mamba-based model excels at "reconstruction" tasks within speech processing, whereas for "classification" tasks such as speech recognition, additional modules are necessary to accomplish the "reconstruction" step. To validate our hypothesis, we analyze previous Mamba-based speech models from an information theory perspective. Furthermore, we leverage the properties of HuBERT: we train a Mamba-based HuBERT model, and the mutual information patterns, along with the model's performance metrics, confirm our assumptions.
{"title":"Rethinking Mamba in Speech Processing by Self-Supervised Models","authors":"Xiangyu Zhang, Jianbo Ma, Mostafa Shahin, Beena Ahmed, Julien Epps","doi":"arxiv-2409.07273","DOIUrl":"https://doi.org/arxiv-2409.07273","url":null,"abstract":"The Mamba-based model has demonstrated outstanding performance across tasks\u0000in computer vision, natural language processing, and speech processing.\u0000However, in the realm of speech processing, the Mamba-based model's performance\u0000varies across different tasks. For instance, in tasks such as speech\u0000enhancement and spectrum reconstruction, the Mamba model performs well when\u0000used independently. However, for tasks like speech recognition, additional\u0000modules are required to surpass the performance of attention-based models. We\u0000propose the hypothesis that the Mamba-based model excels in \"reconstruction\"\u0000tasks within speech processing. However, for \"classification tasks\" such as\u0000Speech Recognition, additional modules are necessary to accomplish the\u0000\"reconstruction\" step. To validate our hypothesis, we analyze the previous\u0000Mamba-based Speech Models from an information theory perspective. Furthermore,\u0000we leveraged the properties of HuBERT in our study. We trained a Mamba-based\u0000HuBERT model, and the mutual information patterns, along with the model's\u0000performance metrics, confirmed our assumptions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech language models have recently demonstrated great potential as universal speech processing systems. Such models can capture the rich acoustic information in audio signals beyond spoken content, such as emotion and background noise. Despite this, evaluation benchmarks that assess awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, highlighting the strengths and weaknesses of each evaluated method. Code and data are publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .
{"title":"A Suite for Acoustic Language Model Evaluation","authors":"Gallil Maimon, Amit Roth, Yossi Adi","doi":"arxiv-2409.07437","DOIUrl":"https://doi.org/arxiv-2409.07437","url":null,"abstract":"Speech language models have recently demonstrated great potential as\u0000universal speech processing systems. Such models have the ability to model the\u0000rich acoustic information existing in audio signals, beyond spoken content,\u0000such as emotion, background noise, etc. Despite this, evaluation benchmarks\u0000which evaluate awareness to a wide range of acoustic aspects, are lacking. To\u0000help bridge this gap, we introduce SALMon, a novel evaluation suite\u0000encompassing background noise, emotion, speaker identity and room impulse\u0000response. The proposed benchmarks both evaluate the consistency of the\u0000inspected element and how much it matches the spoken text. We follow a\u0000modelling based approach, measuring whether a model gives correct samples\u0000higher scores than incorrect ones. This approach makes the benchmark fast to\u0000compute even for large models. We evaluated several speech language models on\u0000SALMon, thus highlighting the strengths and weaknesses of each evaluated\u0000method. Code and data are publicly available at\u0000https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.
{"title":"ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages","authors":"Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee","doi":"arxiv-2409.07259","DOIUrl":"https://doi.org/arxiv-2409.07259","url":null,"abstract":"In this study, we introduce ManaTTS, the most extensive publicly accessible\u0000single-speaker Persian corpus, and a comprehensive framework for collecting\u0000transcribed speech datasets for the Persian language. ManaTTS, released under\u0000the open CC-0 license, comprises approximately 86 hours of audio with a\u0000sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the\u0000VirgoolInformal dataset to evaluate Persian speech recognition models used for\u0000forced alignment, extending over 5 hours of audio. The datasets are supported\u0000by a fully transparent, MIT-licensed pipeline, a testament to innovation in the\u0000field. It includes unique tools for sentence tokenization, bounded audio\u0000segmentation, and a novel forced alignment method. This alignment technique is\u0000specifically designed for low-resource languages, addressing a crucial need in\u0000the field. With this dataset, we trained a Tacotron2-based TTS model, achieving\u0000a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of\u00003.86 for the utterances generated by the same vocoder and natural spectrogram,\u0000and the MOS of 4.01 for the natural waveform, demonstrating the exceptional\u0000quality and effectiveness of the corpus.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"273 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The advancements in generative AI have enabled the improvement of audio synthesis models, including text-to-speech and voice conversion. This raises concerns about potential misuse for social manipulation and political interference, as synthetic speech has become indistinguishable from natural human speech. Several speech-generation programs are used for malicious purposes, especially to impersonate individuals through phone calls. Detecting fake audio is therefore crucial to maintaining social security and safeguarding the integrity of information. Recent research has proposed D-CAPTCHA, a system based on the challenge-response protocol, to differentiate fake phone calls from real ones. In this work, we study the resilience of this system and introduce a more robust version, D-CAPTCHA++, to defend against fake calls. Specifically, we first expose the vulnerability of the D-CAPTCHA system to a transferable imperceptible adversarial attack. Second, we mitigate this vulnerability by hardening the system's deepfake detectors and task classifiers with adversarial training.
{"title":"D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack","authors":"Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac","doi":"arxiv-2409.07390","DOIUrl":"https://doi.org/arxiv-2409.07390","url":null,"abstract":"The advancements in generative AI have enabled the improvement of audio\u0000synthesis models, including text-to-speech and voice conversion. This raises\u0000concerns about its potential misuse in social manipulation and political\u0000interference, as synthetic speech has become indistinguishable from natural\u0000human speech. Several speech-generation programs are utilized for malicious\u0000purposes, especially impersonating individuals through phone calls. Therefore,\u0000detecting fake audio is crucial to maintain social security and safeguard the\u0000integrity of information. Recent research has proposed a D-CAPTCHA system based\u0000on the challenge-response protocol to differentiate fake phone calls from real\u0000ones. In this work, we study the resilience of this system and introduce a more\u0000robust version, D-CAPTCHA++, to defend against fake calls. Specifically, we\u0000first expose the vulnerability of the D-CAPTCHA system under transferable\u0000imperceptible adversarial attack. Secondly, we mitigate such vulnerability by\u0000improving the robustness of the system by using adversarial training in\u0000D-CAPTCHA deepfake detectors and task classifiers.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.
{"title":"The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction","authors":"Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao","doi":"arxiv-2409.07001","DOIUrl":"https://doi.org/arxiv-2409.07001","url":null,"abstract":"We present the third edition of the VoiceMOS Challenge, a scientific\u0000initiative designed to advance research into automatic prediction of human\u0000speech ratings. There were three tracks. The first track was on predicting the\u0000quality of ``zoomed-in'' high-quality samples from speech synthesis systems.\u0000The second track was to predict ratings of samples from singing voice synthesis\u0000and voice conversion with a large variety of systems, listeners, and languages.\u0000The third track was semi-supervised quality prediction for noisy, clean, and\u0000enhanced speech, where a very small amount of labeled training data was\u0000provided. Among the eight teams from both academia and industry, we found that\u0000many were able to outperform the baseline systems. Successful techniques\u0000included retrieval-based methods and the use of non-self-supervised\u0000representations like spectrograms and pitch histograms. These results showed\u0000that the challenge has advanced the field of subjective speech rating\u0000prediction.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Batthacharya
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time-complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time-complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
{"title":"Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition","authors":"Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Batthacharya","doi":"arxiv-2409.07165","DOIUrl":"https://doi.org/arxiv-2409.07165","url":null,"abstract":"Automatic speech recognition (ASR) with an encoder equipped with\u0000self-attention, whether streaming or non-streaming, takes quadratic time in the\u0000length of the speech utterance. This slows down training and decoding, increase\u0000their cost, and limit the deployment of the ASR in constrained devices.\u0000SummaryMixing is a promising linear-time complexity alternative to\u0000self-attention for non-streaming speech recognition that, for the first time,\u0000preserves or outperforms the accuracy of self-attention models. Unfortunately,\u0000the original definition of SummaryMixing is not suited to streaming speech\u0000recognition. Hence, this work extends SummaryMixing to a Conformer Transducer\u0000that works in both a streaming and an offline mode. It shows that this new\u0000linear-time complexity speech encoder outperforms self-attention in both\u0000scenarios while requiring less compute and memory during training and decoding.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"114 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}