FxLMS/F Based Tap Decomposed Adaptive Filter for Decentralized Active Noise Control System
Pub Date: 2024-10-31 | DOI: 10.1109/TASLP.2024.3473294
Munukutla L. N. Srinivas Karthik;Joel S.;Nithin V. George
Decentralized systems are appealing due to their reduced complexity and flexibility. This paper develops a class of decentralized multi-channel active noise control (MCANC) systems. In the first part of the study, a modified filtered-x least mean square/fourth (FxLMS/F) algorithm, which offers improved noise reduction over the conventional FxLMS/F algorithm, is developed for MCANC. To further reduce the computational complexity of the proposed MCANC system, a nearest Kronecker product (NKP) decomposition strategy is incorporated to derive decentralized versions of the FxLMS/F algorithm. The proposed algorithms are shown to offer enhanced noise reduction at reduced computational complexity when applied to narrowband noise, bandlimited white noise, traffic noise, and wind noise.
{"title":"FxLMS/F Based Tap Decomposed Adaptive Filter for Decentralized Active Noise Control System","authors":"Munukutla L. N. Srinivas Karthik;Joel S.;Nithin V. George","doi":"10.1109/TASLP.2024.3473294","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473294","url":null,"abstract":"Decentralized systems are appealing due to their reduced complexity and flexibility. A class of decentralized multi-channel active noise control (MCANC) systems has been developed in this paper. In the first part of the study, a modified filtered-x least mean square/fourth (FxLMS/F) algorithm, which offers improved noise reduction performance over the conventional FxLMS/F algorithm, was developed for MCANC. Further, to reduce the computational complexity of the proposed MCANC system, a nearest Kronecker product (NKP) decomposition strategy has been incorporated to develop decentralized versions of FxLMS/F algorithms. The proposed algorithms have been shown to offer enhanced noise reduction at reduced computational complexity when applied for noise control for narrowband noise, bandlimited white noise, traffic noise and wind noise.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4691-4699"},"PeriodicalIF":4.1,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeFTAN-II: Efficient Multichannel Speech Enhancement With Subgroup Processing
Pub Date: 2024-10-30 | DOI: 10.1109/TASLP.2024.3488564
Dongheon Lee;Jung-Woo Choi
In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on a transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they struggle to capture local relations and to keep computational complexity and memory usage low. To address these limitations, we introduce subgroup processing in our model, combining subgroups of locally emphasized features with other subgroups containing the original features. Subgroup processing is implemented in several blocks of the proposed network. In the proposed split dense blocks, which extract spatial features, a pair of subgroups is sequentially concatenated and processed by convolution layers to effectively reduce computational complexity and memory usage. For the F- and T-transformers, which extract spectral and temporal relations, we introduce cross-attention between subgroups to identify relationships between locally emphasized and non-emphasized features. The dual-path feedforward network then aggregates the attended features by gating them with local features processed by dilated convolutions. Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model's generalization capability on real-world data without fine-tuning, further demonstrating its effectiveness in practical scenarios.
{"title":"DeFTAN-II: Efficient Multichannel Speech Enhancement With Subgroup Processing","authors":"Dongheon Lee;Jung-Woo Choi","doi":"10.1109/TASLP.2024.3488564","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3488564","url":null,"abstract":"In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations, reducing the high computational complexity, and lowering memory usage. To address these limitations, we introduce subgroup processing in our model, combining subgroups of locally emphasized features with other subgroups containing original features. The subgroup processing is implemented in several blocks of the proposed network. In the proposed split dense blocks extracting spatial features, a pair of subgroups is sequentially concatenated and processed by convolution layers to effectively reduce the computational complexity and memory usage. For the F- and T-transformers extracting temporal and spectral relations, we introduce cross-attention between subgroups to identify relationships between locally emphasized and non-emphasized features. The dual-path feedforward network then aggregates attended features in terms of the gating of local features processed by dilated convolutions. Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model's generalization capability on real-world data without fine-tuning, which further demonstrates its effectiveness in practical scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4850-4866"},"PeriodicalIF":4.1,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
Pub Date: 2024-10-29 | DOI: 10.1109/TASLP.2024.3487410
Luca Della Libera;Pooneh Mousavi;Salah Zaiem;Cem Subakan;Mirco Ravanelli
Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new languages without losing valuable information from previous data. Furthermore, existing continual learning benchmarks focus mostly on vision and language tasks, leaving continual learning for multilingual ASR largely unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for studying multilingual ASR in a continual learning setting. CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics to assess the effectiveness of learning new languages while addressing the issue of catastrophic forgetting. To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.
{"title":"CL-MASR: A Continual Learning Benchmark for Multilingual ASR","authors":"Luca Della Libera;Pooneh Mousavi;Salah Zaiem;Cem Subakan;Mirco Ravanelli","doi":"10.1109/TASLP.2024.3487410","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3487410","url":null,"abstract":"Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new languages without losing valuable information from previous data. Furthermore, existing continual learning benchmarks focus mostly on vision and language tasks, leaving continual learning for multilingual ASR largely unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for studying multilingual ASR in a continual learning setting. CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics to assess the effectiveness of learning new languages while addressing the issue of catastrophic forgetting. To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4931-4944"},"PeriodicalIF":4.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WEDA: Exploring Copyright Protection for Large Language Model Downstream Alignment
Pub Date: 2024-10-29 | DOI: 10.1109/TASLP.2024.3487419
Shen Wang;Jialiang Dong;Longfei Wu;Zhitao Guan
Large Language Models (LLMs) have shown incomparable representation and generalization capabilities, which have led to significant advancements in Natural Language Processing (NLP). Before deployment, pre-trained LLMs often need to be tailored to specific downstream tasks for improved performance, a process commonly referred to as downstream alignment. This is a costly effort considering the manpower, training resources, and downstream-specific data required. While much attention has been paid to protecting the copyright of the models themselves, the copyright protection of LLM alignment has been largely overlooked. In this paper, we present the Watermark Embedding for Downstream Alignment (WEDA) scheme, which can provide effective copyright protection for two popular LLM alignment techniques: parameter-efficient fine-tuning (PEFT) and in-context learning (ICL). For alignment through PEFT, we propose a Chain of Thought (CoT) based solution that embeds watermarks into the PEFT weights. Furthermore, we extend this solution to safeguard alignment through ICL by utilizing the prefix-integrated CoT to watermark examples embedded within ICL prompts. We conduct an extensive experimental evaluation to demonstrate the effectiveness of our proposed scheme.
{"title":"WEDA: Exploring Copyright Protection for Large Language Model Downstream Alignment","authors":"Shen Wang;Jialiang Dong;Longfei Wu;Zhitao Guan","doi":"10.1109/TASLP.2024.3487419","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3487419","url":null,"abstract":"Large Language Models (LLMs) have shown incomparable representation and generalization capabilities, which have led to significant advancements in Natural Language Processing (NLP). Before deployment, the pre-trained LLMs often need to be tailored to specific downstream tasks for improved performance, which is commonly referred to as downstream alignment. This is a costly effort considering the needed manpower, training resources, and downstream-specific data. While much attention has been paid to protecting the copyright of the models themselves, the copyright protection of LLM alignment has been largely overlooked. In this paper, we present Watermark Embedding for Downstream Alignment (WEDA) scheme, which can provide effective copyright protection for two popular LLM alignment techniques parameter-efficient fine-tuning (PEFT) and in-context learning (ICL). For alignment through PEFT, we propose a Chain of Thought (CoT) based solution to embed watermarks into the PEFT weights. Furthermore, we extend this solution to safeguard alignment through ICL by utilizing the prefix-integrated CoT to watermark examples embedded within ICL prompts. We conduct an extensive experimental evaluation to demonstrate the effectiveness of our proposed scheme.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4755-4767"},"PeriodicalIF":4.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry
Pub Date: 2024-10-29 | DOI: 10.1109/TASLP.2024.3487409
Yuting Wei;Linmei Hu;Yangfu Zhu;Jiaqi Zhao;Bin Wu
Theme and emotion classification is essential for understanding and organizing Chinese classical poetry. Existing works often overlook the rich semantic knowledge contained in poem annotations, which offer crucial insights into themes and emotions and are instrumental for semantic understanding. Additionally, the complex interdependence and diversity of themes and emotions within poems are frequently disregarded. Hence, this paper introduces a Poetry Knowledge-augmented Joint Model (Poka) specifically designed for the multi-label classification of themes and emotions in Chinese classical poetry. Specifically, we first employ an automated approach to construct two semantic knowledge graphs, one for theme and one for emotion. These graphs facilitate a deeper understanding of the poems by bridging the semantic gap between obscure ancient words and their modern Chinese counterparts. Representations related to themes and emotions are then acquired through a knowledge-guided mask-transformer. Moreover, Poka leverages the inherent correlations between themes and emotions by adopting a joint classification strategy with shared training parameters. Extensive experiments demonstrate that our model achieves state-of-the-art performance on both theme and emotion classification, especially on tail labels.
{"title":"Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry","authors":"Yuting Wei;Linmei Hu;Yangfu Zhu;Jiaqi Zhao;Bin Wu","doi":"10.1109/TASLP.2024.3487409","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3487409","url":null,"abstract":"The classifications of the theme and emotion are essential for understanding and organizing Chinese classical poetry. Existing works often overlook the rich semantic knowledge derived from poem annotations, which contain crucial insights into themes and emotions and are instrumental in semantic understanding. Additionally, the complex interdependence and diversity of themes and emotions within poems are frequently disregarded. Hence, this paper introduces a Poetry Knowledge-augmented Joint Model (Poka) specifically designed for the multi-label classification of themes and emotions in Chinese classical poetry. Specifically, we first employ an automated approach to construct two semantic knowledge graphs for theme and emotion. These graphs facilitate a deeper understanding of the poems by bridging the semantic gap between the obscure ancient words and their modern Chinese counterparts. Representations related to themes and emotions are then acquired through a knowledge-guided mask-transformer. Moreover, Poka leverages the inherent correlations between themes and emotions by adopting a joint classification strategy with shared training parameters. Extensive experiments demonstrate that our model achieves state-of-the-art performance on both theme and emotion classifications, especially on tail labels.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4783-4794"},"PeriodicalIF":4.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1109/TASLP.2024.3485547
Jun-Yu Ma;Jia-Chen Gu;Zhen-Hua Ling;Quan Liu;Cong Liu;Guoping Hu
Zero-shot cross-lingual information extraction (IE) aims at constructing an IE model for some low-resource target languages, given annotations exclusively in some rich-resource languages. Recent studies have shown language-universal features can bridge the gap between languages. However, prior work has neither explored the potential of establishing interactions between language-universal features and contextual representations nor incorporated features that can effectively model constituent span attributes and relationships between multiple spans. In this study, a s