Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng (arXiv:2409.11630, 2024-09-18)
The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", the CLM pays insufficient attention to coarse-grained information at higher temporal scales, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach that employs multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation comprising multiple token sequences with different time resolutions. We then propose CoFi-LM, which can generate this representation in two modes: single-LM-based chain-of-scale generation and multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems in naturalness and speaker similarity for zero-shot TTS. The analysis of multi-scale coding demonstrates that CoFi-Codec learns multi-scale discrete speech representations while preserving high-quality speech reconstruction. Coarse-to-fine multi-scale generation, especially the stack-of-scale approach, is also validated as crucial for building a high-quality neural codec language model for TTS.
Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee (arXiv:2409.12117, 2024-09-18)
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
EverestAI: Sijin Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jingjing Yin, Jianhao Ye, Jixun Yao, Quanlei Yan, Yuguang Yang (arXiv:2409.12139, 2024-09-18)
With the advent of the era of big data and large language models, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly comprising Takin TTS, Takin VC, and Takin Morphing, designed specifically for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and allowing individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model built upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot manner. For Takin VC, we adopt a joint content and timbre modeling approach to improve speaker similarity, together with a conditional flow-matching-based decoder to further enhance naturalness and expressiveness. Finally, we propose the Takin Morphing system, with highly decoupled and advanced timbre and prosody modeling, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of the Takin AudioLLM series of models. For detailed demos, please refer to https://takinaudiollm.github.io.
Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)
Tashfain Ahmed, Josh Siegel (arXiv:2409.12112, 2024-09-18)
This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required to enable machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies MVD to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficiencies in IoT applications, such as sensor overprovisioning, overprecision, and signal oversampling, and proposes scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology characterizes acoustic data after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results show that up to 95% of performance can be maintained with sample rates reduced by 75% and bit depths and clip lengths reduced by 50%, which translates into substantial cost and resource savings. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including its potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing, improving access and multiplying the benefits of data-driven insights.
M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong, Yong Qin (arXiv:2409.11889, 2024-09-18)
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-Whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation
Dinh-Viet-Toan Le, Yi-Hsuan Yang (arXiv:2409.11753, 2024-09-18)
Western music is often characterized by a homophonic texture, in which the musical content can be organized into a melody and an accompaniment. In orchestral music, in particular, the composer can select specific characteristics for each instrument's part within the accompaniment, while also needing to adapt the melody to suit the capabilities of the instruments performing it. In this work, we propose METEOR, a model for Melody-aware Texture-controllable Orchestral music generation. This model performs symbolic multi-track music style transfer with a focus on melodic fidelity. We allow bar- and track-level controllability of the accompaniment with various textural attributes while keeping a homophonic texture. We show that the model can achieve controllability comparable to strong baselines while greatly improving melodic fidelity.
SALT: Standardized Audio event Label Taxonomy
Paraskevas Stamatiadis, Michel Olvera, Slim Essid (IDS, S2A, LTCI) (arXiv:2409.11746, 2024-09-18)
Machine listening systems often rely on fixed taxonomies to organize and label audio data, which are key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent, predefined categories, which hinders the integration of new or varied sounds, and they exhibit limited cross-dataset compatibility due to inconsistent labeling standards. To overcome these limitations, we introduce SALT: Standardized Audio event Label Taxonomy. Building upon the hierarchical structure of AudioSet's ontology, our taxonomy extends and standardizes labels across 24 publicly available environmental sound datasets, allowing the mapping of class labels from diverse datasets to a unified system. Our proposal comes with a new Python package designed for navigating and utilizing this taxonomy, easing cross-dataset label searching and hierarchical exploration. Notably, our package allows effortless data aggregation from diverse sources, and hence easy experimentation with combined datasets.
Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems
Anusha Prakash, Hema A Murthy (arXiv:2409.11915, 2024-09-18)
Sentences in Indian languages are generally longer than those in English. Indian languages are also considered to be phrase-based, wherein semantically complete phrases are concatenated to make up sentences. Long utterances lead to poor training of text-to-speech models and result in poor prosody during synthesis. In this work, we explore an inter-pausal unit (IPU) based approach in the end-to-end (E2E) framework, focusing on synthesising conversational-style text. We consider both the autoregressive Tacotron2 and the non-autoregressive FastSpeech2 architectures in our study and perform experiments with three Indian languages, namely, Hindi, Tamil and Telugu. With the IPU-based Tacotron2 approach, we see a reduction in insertion and deletion errors in the synthesised audio, providing an alternative to the FastSpeech2 network in terms of error reduction. The IPU-based approach requires less computational resources and produces prosodically richer synthesis compared to conventional sentence-based systems.
Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations
Haopeng Geng, Daisuke Saito, Minematsu Nobuaki (arXiv:2409.11742, 2024-09-18)
Evaluating speech intelligibility is a critical task in computer-aided language learning systems. Traditional methods often rely on word error rates (WER) provided by automatic speech recognition (ASR) as intelligibility scores. However, this approach has significant limitations due to notable differences between human speech recognition (HSR) and ASR. A promising alternative is to involve a native (L1) speaker in shadowing what nonnative (L2) speakers say. Breakdowns or mispronunciations in the L1 speaker's shadowing utterance can serve as indicators for assessing L2 speech intelligibility. In this study, we propose a speech generation system that simulates the L1 shadowing process using voice conversion (VC) techniques and latent speech representations. Our experimental results demonstrate that this method effectively replicates the L1 shadowing process, offering an innovative tool to evaluate L2 speech intelligibility. Notably, systems that utilize self-supervised speech representations (S3R) show a higher degree of similarity to real L1 shadowing utterances in both linguistic accuracy and naturalness.
Adaptive Large Language Models By Layerwise Attention Shortcuts
Prateek Verma, Mert Pilanci (arXiv:2409.10870, 2024-09-17)
Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computation for LLM-like setups, allowing the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational "attention shortcuts". These shortcuts can thus make the architecture adaptive in depth and context. We showcase four different datasets, spanning acoustic tokens, natural language, and symbolic music, and achieve superior performance for a GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.