
Latest articles from IEEE/ACM Transactions on Audio, Speech, and Language Processing

Graph-Based Cross-Granularity Message Passing on Knowledge-Intensive Text
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-10-02 | DOI: 10.1109/TASLP.2024.3473308
Chenwei Yan;Xiangling Fu;Xinxin You;Ji Wu;Xien Liu
In knowledge-intensive fields such as medicine, the text often contains numerous professional terms, specific text fragments, and multidimensional information. However, most existing text representation methods ignore this specialized knowledge and instead adopt methods similar to those used in the general domain. In this paper, we focus on developing a learning module to enhance the representation ability of knowledge-intensive text by leveraging a graph-based cross-granularity message passing mechanism. To this end, we propose a novel learning framework, the Multi-Granularity Graph Neural Network (MG-GNN), to integrate fine-grained and coarse-grained knowledge at the character, word, and phrase levels. The MG-GNN performs learning in two stages: 1) inter-granularity learning and 2) intra-granularity learning. During inter-granularity learning, semantic knowledge is extracted from character, word, and phrase granularity graphs, whereas intra-granularity learning focuses on fusing knowledge across different granularity graphs to achieve comprehensive message integration. To enhance the fusion performance, we propose a context-based gating mechanism to guide cross-graph propagation learning. Furthermore, we apply MG-GNN to address two important medical applications. Experimental results demonstrate that our proposed MG-GNN model significantly enhances the performance in both diagnosis prediction and medical named entity recognition tasks.
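As an illustration of the context-based gating idea, the sketch below fuses character-, word-, and phrase-level node embeddings through sigmoid gates conditioned on a shared context vector. It is a minimal PyTorch sketch under assumed shapes; the class and argument names (GatedGranularityFusion, char_h, word_h, phrase_h, context) are hypothetical and not taken from the MG-GNN implementation.

```python
import torch
import torch.nn as nn

class GatedGranularityFusion(nn.Module):
    """Illustrative context-gated fusion of character-, word-, and
    phrase-level node embeddings (hypothetical layer, not the authors' code)."""

    def __init__(self, dim: int):
        super().__init__()
        # One gate per granularity, conditioned on a shared context vector.
        self.gates = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(3)])

    def forward(self, char_h, word_h, phrase_h, context):
        fused = 0
        for gate, h in zip(self.gates, (char_h, word_h, phrase_h)):
            # The sigmoid gate decides how much of each granularity's message
            # flows into the fused representation.
            g = torch.sigmoid(gate(torch.cat([h, context], dim=-1)))
            fused = fused + g * h
        return fused

if __name__ == "__main__":
    dim, n_nodes = 64, 10
    layer = GatedGranularityFusion(dim)
    char_h = torch.randn(n_nodes, dim)
    word_h = torch.randn(n_nodes, dim)
    phrase_h = torch.randn(n_nodes, dim)
    context = torch.randn(n_nodes, dim)
    print(layer(char_h, word_h, phrase_h, context).shape)  # torch.Size([10, 64])
```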
Citations: 0
Cross-Utterance Conditioned VAE for Speech Generation
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-30 | DOI: 10.1109/TASLP.2024.3453598
Yang Li;Cheng Yu;Guangzhi Sun;Weiqin Zu;Zheng Tian;Ying Wen;Wei Pan;Chao Zhang;Jun Wang;Yang Yang;Fanglei Sun
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
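The cross-utterance conditioning can be pictured as a conditional VAE whose prosody latent is inferred from the current utterance's acoustic features together with a context vector summarizing surrounding utterances. The sketch below is a toy version under assumed dimensions; CrossUtteranceCVAE and its layer sizes are illustrative and much simpler than the actual CUC-VAE S2 model.

```python
import torch
import torch.nn as nn

class CrossUtteranceCVAE(nn.Module):
    """Toy cross-utterance conditional VAE: the prosody latent for the current
    utterance is inferred from its acoustic features plus a context vector
    summarizing surrounding utterances (hypothetical shapes and names)."""

    def __init__(self, acoustic_dim=80, context_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(acoustic_dim + context_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + context_dim, 128), nn.ReLU(),
            nn.Linear(128, acoustic_dim),
        )

    def forward(self, acoustic, context):
        h = self.encoder(torch.cat([acoustic, context], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick keeps sampling differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, context], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

if __name__ == "__main__":
    model = CrossUtteranceCVAE()
    recon, kl = model(torch.randn(4, 80), torch.randn(4, 256))
    print(recon.shape, kl.item())
```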
Citations: 0
Cross Domain Optimization for Speech Enhancement: Parallel or Cascade?
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-26 | DOI: 10.1109/TASLP.2024.3468026
Liang Wan;Hongqing Liu;Liming Shi;Yi Zhou;Lu Gan
This paper introduces five novel deep-learning architectures for speech enhancement. Existing methods typically operate on time-domain representations, time-frequency representations, or a hybrid of the two. Recognizing the unique contributions of each domain to feature extraction and model design, this study investigates the integration of waveform and complex spectrogram models through cross-domain fusion to enhance speech feature learning and noise reduction, thereby improving speech quality. We examine both cascading and parallel configurations of waveform and complex spectrogram models to assess their effectiveness in speech enhancement. Additionally, we employ an orthogonal projection-based error decomposition technique and manage the inputs of individual sub-models to analyze factors affecting speech quality. The network is trained by optimizing three specific loss functions applied across all sub-models. Our experiments, using the DNS Challenge (ICASSP 2021) dataset, reveal that the proposed models surpass existing benchmarks in speech enhancement, offering superior speech quality and intelligibility. These results highlight the efficacy of our cross-domain fusion strategy.
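The cascade-versus-parallel question comes down to how a waveform-domain model and a complex-spectrogram model are wired together. The sketch below uses stand-in one-layer enhancers (TinyWaveModel and TinySpecModel, both hypothetical) purely to show the two wiring patterns; the parallel branch fuses outputs with a plain average rather than a learned fusion.

```python
import torch
import torch.nn as nn

class TinyWaveModel(nn.Module):
    """Stand-in waveform-domain enhancer (a single 1-D conv), not the paper's network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)
    def forward(self, wav):                               # wav: (batch, samples)
        return self.net(wav.unsqueeze(1)).squeeze(1)

class TinySpecModel(nn.Module):
    """Stand-in complex-spectrogram enhancer operating on STFT real/imag channels."""
    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Conv2d(2, 2, kernel_size=3, padding=1)
    def forward(self, wav):
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=window, return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)     # (batch, 2, freq, frames)
        y = self.net(x)
        spec_hat = torch.complex(y[:, 0], y[:, 1])
        return torch.istft(spec_hat, self.n_fft, self.hop, window=window,
                           length=wav.shape[-1])

def cascade(wav, wave_model, spec_model):
    """Cascade: the waveform stage runs first, its output is refined in the spectrogram domain."""
    return spec_model(wave_model(wav))

def parallel(wav, wave_model, spec_model):
    """Parallel: both domains enhance the noisy input; outputs are fused (here averaged)."""
    return 0.5 * (wave_model(wav) + spec_model(wav))

if __name__ == "__main__":
    wav = torch.randn(2, 16000)
    wm, sm = TinyWaveModel(), TinySpecModel()
    print(cascade(wav, wm, sm).shape, parallel(wav, wm, sm).shape)
```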
Citations: 0
Sound Field Estimation Based on Physics-Constrained Kernel Interpolation Adapted to Environment
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-25 | DOI: 10.1109/TASLP.2024.3467951
Juliano G. C. Ribeiro;Shoichi Koyama;Ryosuke Horiuchi;Hiroshi Saruwatari
A sound field estimation method based on kernel interpolation with an adaptive kernel function is proposed. The kernel-interpolation-based sound field estimation methods enable physics-constrained interpolation from pressure measurements of distributed microphones with a linear estimator, which constrains interpolation functions to satisfy the Helmholtz equation. However, a fixed kernel function would not be capable of adapting to the acoustic environment in which the measurement is performed, limiting their applicability. To make the kernel function adaptive, we represent it with a sum of directed and residual trainable kernel functions. The directed kernel is defined by a weight function composed of a superposition of exponential functions to capture highly directional components. The weight function for the residual kernel is represented by neural networks to capture unpredictable spatial patterns of the residual components. Experimental results using simulated and real data indicate that the proposed method outperforms the current kernel-interpolation-based methods and a method based on physics-informed neural networks.
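For context, the non-adaptive baseline that the adaptive kernel extends can be written as kernel ridge regression with the Helmholtz-consistent kernel kappa(r, r') = sinc(k * ||r - r'||). The sketch below interpolates pressure at query points from microphone measurements under that fixed kernel; the paper's directed-plus-residual kernel would replace helmholtz_kernel, and all function names here are hypothetical.

```python
import numpy as np

def helmholtz_kernel(r1, r2, k):
    """Baseline isotropic kernel j0(k * distance); interpolants built from it
    satisfy the homogeneous Helmholtz equation in the interior region."""
    d = np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1)
    out = np.ones_like(d)                      # sinc(0) = 1 at zero distance
    nz = d > 0
    out[nz] = np.sin(k * d[nz]) / (k * d[nz])
    return out

def estimate_pressure(mic_pos, mic_pressure, query_pos, k, reg=1e-3):
    """Kernel ridge regression: fit weights on the microphone measurements,
    then interpolate the pressure at arbitrary query points."""
    K = helmholtz_kernel(mic_pos, mic_pos, k)
    weights = np.linalg.solve(K + reg * np.eye(len(mic_pos)), mic_pressure)
    Kq = helmholtz_kernel(query_pos, mic_pos, k)
    return Kq @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 2 * np.pi * 500 / 343.0                      # wavenumber at 500 Hz
    mics = rng.uniform(-0.5, 0.5, size=(32, 3))      # 32 microphones in a 1 m cube
    src = np.array([2.0, 0.0, 0.0])
    def point_source(r):                             # free-field point-source pressure
        d = np.linalg.norm(r - src, axis=-1)
        return np.exp(1j * k * d) / (4 * np.pi * d)
    p_mics = point_source(mics)
    query = rng.uniform(-0.5, 0.5, size=(5, 3))
    p_hat = estimate_pressure(mics, p_mics, query, k)
    print(np.abs(p_hat - point_source(query)))       # interpolation error
```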
Citations: 0
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoders
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-25 | DOI: 10.1109/TASLP.2024.3468005
Yicheng Gu;Xueyao Zhang;Liumeng Xue;Haizhou Li;Zhizheng Wu
Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention across different frequency bands and time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF resolution for different frequency bands. Between the two, CQT better models pitch information, while CWT better models short-time transients. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
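A single-scale version of the idea, computing a constant-Q transform and scoring its magnitude with a small convolutional discriminator, can be sketched as follows. It assumes librosa for the CQT and PyTorch for the discriminator; the multi-scale sub-band and temporal-compression components of MS-SB-CQT and MS-TC-CWT are omitted, and the layer sizes are illustrative.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

class CQTDiscriminator(nn.Module):
    """Minimal single-scale CQT-based discriminator (illustrative only; the paper
    uses multi-scale sub-band variants combined with STFT/CWT discriminators)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, (3, 3), stride=(2, 2), padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, (3, 3), stride=(2, 2), padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 1, (3, 3), padding=1),   # patch-wise real/fake logits
        )

    def forward(self, cqt_mag):                    # (batch, 1, bins, frames)
        return self.net(cqt_mag)

def cqt_features(wav: np.ndarray, sr: int) -> torch.Tensor:
    """Constant-Q magnitude: log-spaced bins give finer frequency resolution at
    low frequencies and finer time resolution at high frequencies."""
    C = librosa.cqt(wav, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
    return torch.from_numpy(np.abs(C)).float()[None, None]   # add batch/channel dims

if __name__ == "__main__":
    sr = 22050
    wav = np.random.randn(sr).astype(np.float32)   # 1 s of noise as a stand-in signal
    disc = CQTDiscriminator()
    print(disc(cqt_features(wav, sr)).shape)
```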
Citations: 0
Three-Dimensional Room Transfer Function Parameterization Based on Multiple Concentric Planar Circular Arrays
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-25 | DOI: 10.1109/TASLP.2024.3468025
Lu Li;Maoshen Jia;Changchun Bao
This study proposes a three-dimensional room transfer function (RTF) parameterization method based on multiple concentric planar circular arrays, which exhibits robustness to variations in the positions of both the receiver and source. According to the harmonic solution to the wave equation, the RTFs between two spherical regions (sound source and receiver) in a room can be expressed as a weighted sum of spherical harmonics, whose weight coefficients serve as the RTF parameters, which can be estimated by placing multiple concentric planar circular arrays composed of monopole-source pairs (MSPs) and multiple concentric planar circular arrays composed of omnidirectional-microphone pairs (OMPs) in respective source and receiver regions. We use MSP arrays to generate required outgoing soundfields originating from a source region. We derive a method to use OMP arrays to estimate RTF parameters that are concealed within the captured soundfield, which can be employed to reconstruct the RTF from any point in the source region to any point in the receiver region. The accuracy of the RTF parameterization method is validated through simulation testing.
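The interior sound field implied by the parameterization is a weighted sum of spherical harmonics and spherical Bessel functions, p(r) = sum over (n, m) of c_nm * j_n(k r) * Y_n^m(theta, phi). The sketch below evaluates such an expansion at arbitrary points from given coefficients; in the paper the coefficients would be estimated from the OMP arrays, whereas here they are random placeholders.

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn

def interior_field(coeffs, points, k):
    """Evaluate p(r) = sum_{n,m} c_nm * j_n(k r) * Y_n^m(theta, phi) at Cartesian
    points inside the receiver region. `coeffs` is a dict mapping (n, m) to a
    complex coefficient (a hypothetical container, for illustration only)."""
    r = np.linalg.norm(points, axis=-1)
    theta = np.arctan2(points[:, 1], points[:, 0])                      # azimuth
    phi = np.arccos(np.clip(points[:, 2] / np.maximum(r, 1e-12), -1, 1))  # polar angle
    p = np.zeros(len(points), dtype=complex)
    for (n, m), c in coeffs.items():
        p += c * spherical_jn(n, k * r) * sph_harm(m, n, theta, phi)
    return p

if __name__ == "__main__":
    k = 2 * np.pi * 300 / 343.0                     # wavenumber at 300 Hz
    rng = np.random.default_rng(1)
    # Toy coefficients up to order N = 2; in practice they are estimated from measurements.
    coeffs = {(n, m): rng.standard_normal() + 1j * rng.standard_normal()
              for n in range(3) for m in range(-n, n + 1)}
    pts = rng.uniform(-0.1, 0.1, size=(4, 3))
    print(interior_field(coeffs, pts, k))
```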
Citations: 0
On the Quantization of Neural Models for Speaker Verification
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-20 | DOI: 10.1109/TASLP.2024.3463430
Vishal Kumar;Vinayak Abrol;Mathew Magimai Doss
This paper addresses the sub-optimality of current post-training quantization (PTQ) and quantization-aware training (QAT) methods for state-of-the-art speaker verification (SV) models featuring intricate architectural elements such as channel aggregation and squeeze excitation modules. To address these limitations, we propose 1) a data-independent PTQ technique employing iterative low-precision calibration on pre-trained models; and 2) a data-dependent QAT method designed to reduce the performance gap between full-precision and integer models. Our QAT involves two progressive stages where FP-32 weights are initially transformed into FP-8, adapting precision based on the gradient norm, followed by the learning of quantizer parameters (scale and zero-point) for INT8 conversion. Experimental validation underscores the ingenuity of our method in model quantization, demonstrating reduced floating-point operations and INT8 inference time, all while maintaining performance on par with full-precision models.
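The INT8 machinery shared by PTQ and QAT is an affine quantizer defined by a scale and a zero-point. The sketch below shows symmetric per-tensor quantization with naive min-max calibration; the paper's iterative low-precision calibration and gradient-norm-based FP-8 staging would replace the hypothetical calibrate_scale helper.

```python
import torch

def quantize_int8(x: torch.Tensor, scale: torch.Tensor, zero_point: int = 0):
    """Affine quantization to INT8: q = clamp(round(x / scale) + zero_point, -128, 127)."""
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return q.to(torch.int8)

def dequantize(q: torch.Tensor, scale: torch.Tensor, zero_point: int = 0):
    """Map INT8 codes back to floating point for simulated (fake) quantization."""
    return (q.float() - zero_point) * scale

def calibrate_scale(x: torch.Tensor) -> torch.Tensor:
    """Simple min-max calibration for a symmetric quantizer; an iterative
    low-precision calibration would refine this estimate."""
    return x.abs().max() / 127.0

if __name__ == "__main__":
    w = torch.randn(256, 256)            # a weight matrix from a pre-trained model
    scale = calibrate_scale(w)
    q = quantize_int8(w, scale)
    w_hat = dequantize(q, scale)
    print("max abs quantization error:", (w - w_hat).abs().max().item())
```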
Citations: 0
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463395
Haolin Chen;Philip N. Garner
We are motivated primarily by the adaptation of text-to-speech synthesis models; however, we argue that more generic parameter-efficient fine-tuning (PEFT) is an appropriate framework to do such adaptation. Nevertheless, catastrophic forgetting remains an issue with PEFT, damaging the pre-trained model's inherent capabilities. We demonstrate that existing Bayesian learning techniques can be applied to PEFT to prevent catastrophic forgetting as long as the parameter shift of the fine-tuned layers can be calculated differentiably. In a principled series of experiments on language modeling and speech synthesis tasks, we utilize established Laplace approximations, including diagonal and Kronecker-factored approaches, to regularize PEFT with the low-rank adaptation (LoRA) and compare their performance in pre-training knowledge preservation. Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and using the Kronecker-factored approximation produces a better preservation of the pre-training knowledge than the diagonal one.
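Because LoRA writes the fine-tuned weights as W0 + BA, the parameter shift BA is available in closed, differentiable form, so a Laplace-style penalty can weight it by an estimate of the pre-training posterior's curvature. The sketch below shows the diagonal variant; a Kronecker-factored variant would multiply the shift by left and right factor matrices instead. The layer and function names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a low-rank adaptation: W_eff = W0 + B @ A.
    Only A and B are trained; the base weights W0 stay frozen."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))

    def delta_w(self):
        return self.B @ self.A                      # differentiable parameter shift

    def forward(self, x):
        return self.base(x) + x @ self.delta_w().t()

def diagonal_laplace_penalty(layer: LoRALinear, precision: torch.Tensor):
    """Diagonal-Laplace regularizer: 0.5 * sum_i precision_i * (delta_theta_i)^2,
    where `precision` approximates the curvature of the pre-training posterior
    (e.g., accumulated squared gradients)."""
    return 0.5 * (precision * layer.delta_w().pow(2)).sum()

if __name__ == "__main__":
    lora = LoRALinear(nn.Linear(32, 32))
    precision = torch.rand(32, 32)                  # stand-in curvature estimate
    x, y = torch.randn(8, 32), torch.randn(8, 32)
    task_loss = nn.functional.mse_loss(lora(x), y)
    loss = task_loss + 1e-2 * diagonal_laplace_penalty(lora, precision)
    loss.backward()
    print(task_loss.item(), loss.item())
```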
Citations: 0
NeuroHeed: Neuro-Steered Speaker Extraction Using EEG Signals
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463498
Zexu Pan;Marvin Borsdorf;Siqi Cai;Tanja Schultz;Haizhou Li
Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities. In this work, we study such brain activities measured using affordable and non-intrusive electroencephalography (EEG) devices. We present NeuroHeed, a speaker extraction model that leverages the listener's synchronized EEG signals to extract the attended speech signal in a cocktail party scenario, in which the extraction process is conditioned on a neuronal attractor encoded from the EEG signal. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor, that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. Experimental results on KUL dataset two-speaker scenario demonstrate that NeuroHeed effectively extracts brain-attended speech signals with an average scale-invariant signal-to-noise ratio improvement (SI-SDRi) of 14.3 dB and extraction accuracy of 90.8% in offline settings, and SI-SDRi of 11.2 dB and extraction accuracy of 85.1% in online settings.
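Conceptually, the EEG is encoded into an attractor embedding that steers a mask-estimation network applied to the mixture representation. The sketch below uses a FiLM-style scale-and-shift conditioning as a stand-in for NeuroHeed's conditioning; all layers, dimensions, and names are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class EEGConditionedExtractor(nn.Module):
    """Toy neuro-steered extractor: an EEG encoder produces an attractor
    embedding that modulates a mask-estimation network over the mixture's
    learned representation (shapes and layers are illustrative)."""

    def __init__(self, n_eeg_ch=64, feat_dim=128):
        super().__init__()
        self.speech_enc = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.eeg_enc = nn.GRU(n_eeg_ch, feat_dim, batch_first=True)
        self.film = nn.Linear(feat_dim, 2 * feat_dim)      # scale and shift
        self.mask_net = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, eeg):
        feats = self.speech_enc(mixture.unsqueeze(1))       # (batch, feat, frames)
        _, h = self.eeg_enc(eeg)                            # attractor from EEG
        scale, shift = self.film(h[-1]).chunk(2, dim=-1)
        cond = feats * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        mask = torch.sigmoid(self.mask_net(cond))           # attended-speaker mask
        return self.decoder(feats * mask).squeeze(1)

if __name__ == "__main__":
    model = EEGConditionedExtractor()
    mixture = torch.randn(2, 16000)                         # 1 s mixture at 16 kHz
    eeg = torch.randn(2, 128, 64)                           # 128 EEG frames, 64 channels
    print(model(mixture, eeg).shape)
```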
Citations: 0
Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children
IF 4.1 | Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463503
Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee
Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with appropriate language background. Given the unmet demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. Phonological error is identified by contrasting a test speech segment to reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to disorders. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach can achieve a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard the information related to the speaker's personal identity.
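The phone-level detection can be pictured as contrasting a test consonant segment's embedding with reference realizations of the same phone and aggregating the scores per speaker. The sketch below uses cosine similarity and a simple mean as stand-ins for the learned similarity and aggregation described in the paper; the embeddings are random placeholders.

```python
import numpy as np

def phone_error_score(test_emb: np.ndarray, ref_embs: np.ndarray) -> float:
    """Contrast a test consonant-segment embedding against reference (typical)
    realizations of the target phone: a low maximum cosine similarity suggests a
    phonological error. Embeddings are assumed to come from a trained DNN."""
    test = test_emb / np.linalg.norm(test_emb)
    refs = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    return 1.0 - float(np.max(refs @ test))          # higher = more error-like

def speaker_level_score(phone_scores: list) -> float:
    """Aggregate phone-level scores into one speaker-level SSD indicator;
    a simple mean stands in for the aggregation used in the paper."""
    return float(np.mean(phone_scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.standard_normal((20, 256))            # 20 reference segment embeddings
    scores = [phone_error_score(rng.standard_normal(256), refs) for _ in range(30)]
    print("speaker-level score:", speaker_level_score(scores))
```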
Citations: 0