Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng (arXiv:2409.11630, 2024-09-18)
The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", the CLM pays insufficient attention to coarse-grained information at higher temporal scales, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach that employs multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation comprising multiple token sequences with different time resolutions. We then propose CoFi-LM, which can generate this representation in two modes: single-LM-based chain-of-scale generation and multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems in naturalness and speaker similarity for zero-shot TTS. The analysis of multi-scale coding demonstrates that CoFi-Codec learns multi-scale discrete speech representations while preserving high-quality speech reconstruction. Coarse-to-fine multi-scale generation, especially the stack-of-scale approach, is also validated as crucial for building a high-quality neural codec language model for TTS.
Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee (arXiv:2409.12117, 2024-09-18)
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
EverestAI: Sijin Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jingjing Yin, Jianhao Ye, Jixun Yao, Quanlei Yan, Yuguang Yang (arXiv:2409.12139, 2024-09-18)
With the advent of the era of big data and large language models, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly comprising Takin TTS, Takin VC, and Takin Morphing, designed specifically for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and allowing individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model built upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot manner. For Takin VC, we adopt a joint content and timbre modeling approach to improve speaker similarity, together with a conditional flow-matching-based decoder to further enhance naturalness and expressiveness. Finally, we propose the Takin Morphing system, with highly decoupled and advanced timbre and prosody modeling, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of the Takin AudioLLM series of models. For detailed demos, please refer to https://takinaudiollm.github.io.
Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)
Tashfain Ahmed, Josh Siegel (arXiv:2409.12112, 2024-09-18)
This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required to enable machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies MVD to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficiencies in IoT applications, such as sensor overprovisioning, overprecision, and signal oversampling, and proposes scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology characterizes acoustic data after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results show that up to 95% of performance can be maintained with sample rates reduced by 75% and bit depths and clip lengths reduced by 50%, which translates into substantial cost and resource savings. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including its potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing, improving access and multiplying the benefits of data-driven insights.
M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong, Yong Qin (arXiv:2409.11889, 2024-09-18)
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-Whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation
Dinh-Viet-Toan Le, Yi-Hsuan Yang (arXiv:2409.11753, 2024-09-18)
Western music is often characterized by a homophonic texture, in which the musical content can be organized into a melody and an accompaniment. In orchestral music, in particular, the composer can select specific characteristics for each instrument's part within the accompaniment, while also needing to adapt the melody to suit the capabilities of the instruments performing it. In this work, we propose METEOR, a model for Melody-aware Texture-controllable Orchestral music generation. This model performs symbolic multi-track music style transfer with a focus on melodic fidelity. We allow bar- and track-level controllability of the accompaniment with various textural attributes while keeping a homophonic texture. We show that the model can achieve controllability comparable to strong baselines while greatly improving melodic fidelity.
SALT: Standardized Audio event Label Taxonomy
Paraskevas Stamatiadis, Michel Olvera, Slim Essid (IDS, S2A, LTCI) (arXiv:2409.11746, 2024-09-18)
Machine listening systems often rely on fixed taxonomies to organize and label audio data, which are key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent, predefined categories, which hinders the integration of new or varied sounds, and they exhibit limited cross-dataset compatibility due to inconsistent labeling standards. To overcome these limitations, we introduce SALT: Standardized Audio event Label Taxonomy. Building upon the hierarchical structure of AudioSet's ontology, our taxonomy extends and standardizes labels across 24 publicly available environmental sound datasets, allowing the mapping of class labels from diverse datasets to a unified system. Our proposal comes with a new Python package designed for navigating and utilizing this taxonomy, easing cross-dataset label searching and hierarchical exploration. Notably, our package allows effortless data aggregation from diverse sources, and hence easy experimentation with combined datasets.
Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems
Anusha Prakash, Hema A Murthy (arXiv:2409.11915, 2024-09-18)
Sentences in Indian languages are generally longer than those in English. Indian languages are also considered to be phrase-based, wherein semantically complete phrases are concatenated to make up sentences. Long utterances lead to poor training of text-to-speech models and result in poor prosody during synthesis. In this work, we explore an inter-pausal unit (IPU) based approach in the end-to-end (E2E) framework, focusing on synthesising conversational-style text. We consider both the autoregressive Tacotron2 and the non-autoregressive FastSpeech2 architectures in our study and perform experiments with three Indian languages, namely, Hindi, Tamil and Telugu. With the IPU-based Tacotron2 approach, we see a reduction in insertion and deletion errors in the synthesised audio, providing an alternative to the FastSpeech2 network in terms of error reduction. The IPU-based approach requires less computational resources and produces prosodically richer synthesis compared to conventional sentence-based systems.
Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations
Haopeng Geng, Daisuke Saito, Minematsu Nobuaki (arXiv:2409.11742, 2024-09-18)
Evaluating speech intelligibility is a critical task in computer-aided language learning systems. Traditional methods often rely on word error rates (WER) provided by automatic speech recognition (ASR) as intelligibility scores. However, this approach has significant limitations due to notable differences between human speech recognition (HSR) and ASR. A promising alternative is to involve a native (L1) speaker in shadowing what nonnative (L2) speakers say. Breakdowns or mispronunciations in the L1 speaker's shadowing utterance can serve as indicators for assessing L2 speech intelligibility. In this study, we propose a speech generation system that simulates the L1 shadowing process using voice conversion (VC) techniques and latent speech representations. Our experimental results demonstrate that this method effectively replicates the L1 shadowing process, offering an innovative tool to evaluate L2 speech intelligibility. Notably, systems that utilize self-supervised speech representations (S3R) show a higher degree of similarity to real L1 shadowing utterances in both linguistic accuracy and naturalness.
Adaptive Large Language Models By Layerwise Attention Shortcuts
Prateek Verma, Mert Pilanci (arXiv:2409.10870, 2024-09-17)
Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computation for LLM-like setups, allowing the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational "attention shortcuts". These shortcuts can thus make the architecture adaptive in depth and context. We showcase four different datasets, spanning acoustic tokens, natural language, and symbolic music, and achieve superior performance for a GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.