
Latest Publications in IEEE/ACM Transactions on Audio, Speech, and Language Processing

Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-11 | DOI: 10.1109/TASLP.2024.3496317
Thomas Dietzen;Enzo De Sena;Toon van Waterschoot
The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, referred to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can be achieved, oftentimes enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.
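To make the complexity argument concrete, here is a minimal sketch of the low-rank idea using a truncated SVD, which yields the optimal rank-R factorization in the Frobenius norm. The dense matrix A below is a random stand-in (the paper factors the interpolation matrix specifically, after splitting off an iFFT-based sampling matrix), and all sizes are illustrative:

```python
import numpy as np

# Illustrative sizes: J candidate locations, K frequency bins, rank R.
J, K, R = 4000, 257, 16
rng = np.random.default_rng(0)
A = rng.standard_normal((J, K))      # stand-in for a dense SRP-style transform

# Truncated SVD gives the optimal rank-R factor pair A ~ L_factor @ R_factor.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
L_factor = U[:, :R] * s[:R]          # shape (J, R)
R_factor = Vt[:R, :]                 # shape (R, K)

gcc = rng.standard_normal(K)         # stand-in frequency-domain GCC vector
srp_dense = A @ gcc                  # O(J*K) per frame
srp_lowrank = L_factor @ (R_factor @ gcc)   # O((J + K) * R) per frame
print(np.linalg.norm(srp_dense - srp_lowrank) / np.linalg.norm(srp_dense))
```

With R much smaller than K, the two thin products replace one dense product, which is where the orders-of-magnitude savings come from as J grows.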
{"title":"Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation","authors":"Thomas Dietzen;Enzo De Sena;Toon van Waterschoot","doi":"10.1109/TASLP.2024.3496317","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3496317","url":null,"abstract":"The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, refered to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can achieved, often times enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5024-5039"},"PeriodicalIF":4.1,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-08 | DOI: 10.1109/TASLP.2024.3492807
Spandan Dey;Md Sahidullah;Goutam Saha
Low-resource spoken language identification (LID) systems are prone to poor generalization across unknown domains. In this study, using multiple widely used low-resourced South Asian LID corpora, we conduct an in-depth analysis for understanding the key non-lingual bias factors that create corpora mismatch and degrade LID generalization. To quantify the biases, we extract different data-driven and rule-based summary vectors that capture non-lingual aspects, such as speaker characteristics, spoken context, accents or dialects, recording channels, background noise, and environments. We then conduct a statistical analysis to identify the most crucial non-lingual bias factors and corpora mismatch components that impact LID performance. Following these analyses, we propose effective bias compensation approaches for the most relevant summary vectors. We generate pseudo-labels using hierarchical clustering over language-domain-gender constrained summary vectors and use them to train adversarial networks with a conditioned metric loss. The compensations learn invariance to the corpora mismatches caused by the non-lingual biases and help to improve generalization. With the proposed compensation method, we improve the equal error rate by up to 5.22% and 8.14% for the same-corpora and cross-corpora evaluations, respectively.
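As a minimal sketch of the pseudo-labeling step, hierarchical clustering over summary vectors can produce domain targets for an adversarial branch; the vectors below are random stand-ins for the paper's data-driven and rule-based descriptors, and the cluster count is illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
summary_vecs = rng.standard_normal((1000, 64))   # one summary vector per utterance

# Hierarchical (agglomerative) clustering assigns each utterance a pseudo
# domain label; the adversarial network is then trained so that the LID
# embedding becomes invariant to these labels.
clusterer = AgglomerativeClustering(n_clusters=8, linkage="ward")
pseudo_labels = clusterer.fit_predict(summary_vecs)
print(np.bincount(pseudo_labels))                # cluster occupancy
```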
{"title":"Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification","authors":"Spandan Dey;Md Sahidullah;Goutam Saha","doi":"10.1109/TASLP.2024.3492807","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492807","url":null,"abstract":"Low-resource spoken language identification (LID) systems are prone to poor generalization across unknown domains. In this study, using multiple widely used low-resourced South Asian LID corpora, we conduct an in-depth analysis for understanding the key non-lingual bias factors that create corpora mismatch and degrade LID generalization. To quantify the biases, we extract different data-driven and rule-based summary vectors that capture non-lingual aspects, such as speaker characteristics, spoken context, accents or dialects, recording channels, background noise, and environments. We then conduct a statistical analysis to identify the most crucial non-lingual bias factors and corpora mismatch components that impact LID performance. Following these analyses, we then propose effective bias compensation approaches for the most relevant summary vectors. We generate pseudo-labels using hierarchical clustering over language-domain-gender constrained summary vectors and use them to train adversarial networks with conditioned metric loss. The compensations learn invariance for the corpora mismatches due to the non-lingual biases and help to improve the generalization. With the proposed compensation method, we improve equal error rate up to 5.22% and 8.14% for the same-corpora and cross-corpora evaluations, respectively.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5040-5050"},"PeriodicalIF":4.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-08 | DOI: 10.1109/TASLP.2024.3486206
Chuxuan Tong;Iynkaran Natgunanathan;Yong Xiang;Jianhua Li;Tianrui Zong;Xi Zheng;Longxiang Gao
Digital watermarking serves as an effective approach for safeguarding speech signal copyrights, achieved by incorporating ownership information into the original signal and subsequently extracting it from the watermarked signal. While traditional watermarking methods can embed and extract watermarks successfully when the watermarked signals are not exposed to severe alterations, these methods cannot withstand attacks such as de-synchronization. In this work, we introduce a novel transformer-based framework designed to enhance the imperceptibility and robustness of speech watermarking. This framework incorporates encoders and decoders built on multi-scale transformer blocks to effectively capture local and long-range features from inputs, such as acoustic features extracted by the Short-Time Fourier Transform (STFT). Further, a deep neural network (DNN)-based generator, notably the Transformer architecture, is employed to adaptively embed imperceptible watermarks. Perturbations that simulate noise are applied during the training phase, thereby bolstering watermark robustness. Experimental results show the superiority of our proposed framework in terms of watermark imperceptibility and robustness against various watermark attacks. When compared to currently available related techniques, the framework exhibits an eightfold increase in embedding rate. Further, it also offers superior practicality, with scalability and reduced inference time of the DNN models.
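A minimal sketch of the embed-attack-extract training loop follows, assuming a toy linear embed/extract pair in place of the paper's multi-scale transformer blocks; the perturbation scale, noise level, and feature sizes are all illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWatermarker(nn.Module):
    """Toy embed/extract pair; stands in for the paper's transformer encoders/decoders."""
    def __init__(self, n_feat=257, n_bits=32):
        super().__init__()
        self.embedder = nn.Linear(n_feat + n_bits, n_feat)
        self.extractor = nn.Linear(n_feat, n_bits)

    def forward(self, spec, bits):
        # A small additive perturbation keeps the watermark imperceptible.
        delta = torch.tanh(self.embedder(torch.cat([spec, bits], dim=-1)))
        marked = spec + 1e-2 * delta
        # A simulated noise attack during training bolsters robustness.
        attacked = marked + 0.05 * torch.randn_like(marked)
        return marked, self.extractor(attacked)

model = TinyWatermarker()
spec = torch.randn(8, 257)                   # stand-in STFT-magnitude frames
bits = torch.randint(0, 2, (8, 32)).float()  # watermark payload
marked, logits = model(spec, bits)
loss = F.binary_cross_entropy_with_logits(logits, bits)
loss.backward()
```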
{"title":"Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features","authors":"Chuxuan Tong;Iynkaran Natgunanathan;Yong Xiang;Jianhua Li;Tianrui Zong;Xi Zheng;Longxiang Gao","doi":"10.1109/TASLP.2024.3486206","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3486206","url":null,"abstract":"Digital watermarking serves as an effective approach for safeguarding speech signal copyrights, achieved by the incorporation of ownership information into the original signal and its subsequent extraction from the watermarked signal. While traditional watermarking methods can embed and extract watermarks successfully when the watermarked signals are not exposed to severe alterations, these methods cannot withstand attacks such as de-synchronization. In this work, we introduce a novel transformer-based framework designed to enhance the imperceptibility and robustness of speech watermarking. This framework incorporates encoders and decoders built on multi-scale transformer blocks to effectively capture local and long-range features from inputs, such as acoustic features extracted by Short-Time Fourier Transformation (STFT). Further, a deep neural networks (DNNs) based generator, notably the Transformer architecture, is employed to adaptively embed imperceptible watermarks. These perturbations serve as a step for simulating noise, thereby bolstering the watermark robustness during the training phase. Experimental results show the superiority of our proposed framework in terms of watermark imperceptibility and robustness against various watermark attacks. When compared to the currently available related techniques, the framework exhibits an eightfold increase in embedding rate. Further, it also presents superior practicality with scalability and reduced inference time of DNN models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4822-4837"},"PeriodicalIF":4.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-07 | DOI: 10.1109/TASLP.2024.3492796
Bo Wang;Yeling Tang;Fei Wei;Zhongjie Ba;Kui Ren
In recent years, the field of audio deepfake detection has witnessed significant advancements. Nonetheless, the majority of solutions have concentrated on high-quality audio, largely overlooking the challenge of low-quality compressed audio in real-world scenarios. Low-quality compressed audio typically suffers from a loss of high-frequency details and time-domain information, which significantly undermines the performance of advanced deepfake detection systems when confronted with such data. In this paper, we introduce a deepfake detection model that employs knowledge distillation across the frequency and time domains. Our approach aims to train a teacher model with high-quality data and a student model with low-quality compressed data. Subsequently, we implement frequency-domain and time-domain distillation to facilitate the student model's learning of high-frequency information and time-domain details from the teacher model. Experimental evaluations on the ASVspoof 2019 LA and ASVspoof 2021 DF datasets illustrate the effectiveness of our methodology. On the ASVspoof 2021 DF dataset, which consists of low-quality compressed audio, we achieved an Equal Error Rate (EER) of 2.82%. To our knowledge, this performance is the best among all deepfake voice detection systems tested on the ASVspoof 2021 DF dataset. Additionally, our method proves to be versatile, showing notable performance on high-quality data with an EER of 0.30% on the ASVspoof 2019 LA dataset, closely approaching state-of-the-art results.
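The dual-domain distillation can be sketched as two losses between teacher and student feature maps; the split below is a generic reading of "frequency-domain and time-domain distillation" rather than the paper's exact losses, and the feature maps are random stand-ins:

```python
import torch
import torch.nn.functional as F

# Stand-in intermediate activations: teacher sees high-quality audio,
# student sees low-quality compressed audio. Shape: (batch, channel, time).
teacher_feat = torch.randn(8, 64, 200)
student_feat = torch.randn(8, 64, 200, requires_grad=True)

# Frequency-domain distillation: match magnitude spectra along time,
# recovering the high-frequency detail lost to compression.
fd_loss = F.mse_loss(torch.fft.rfft(student_feat, dim=-1).abs(),
                     torch.fft.rfft(teacher_feat, dim=-1).abs())
# Time-domain distillation: match the raw feature trajectories.
td_loss = F.mse_loss(student_feat, teacher_feat)
(fd_loss + td_loss).backward()
```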
{"title":"FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection","authors":"Bo Wang;Yeling Tang;Fei Wei;Zhongjie Ba;Kui Ren","doi":"10.1109/TASLP.2024.3492796","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492796","url":null,"abstract":"In recent years, the field of audio deepfake detection has witnessed significant advancements. Nonetheless, the majority of solutions have concentrated on high-quality audio, largely overlooking the challenge of low-quality compressed audio in real-world scenarios. Low-quality compressed audio typically suffers from a loss of high-frequency details and time-domain information, which significantly undermines the performance of advanced deepfake detection systems when confronted with such data. In this paper, we introduce a deepfake detection model that employs knowledge distillation across the frequency and time domains. Our approach aims to train a teacher model with high-quality data and a student model with low-quality compressed data. Subsequently, we implement frequency-domain and time-domain distillation to facilitate the student model's learning of high-frequency information and time-domain details from the teacher model. Experimental evaluations on the ASVspoof 2019 LA and ASVspoof 2021 DF datasets illustrate the effectiveness of our methodology. On the ASVspoof 2021 DF dataset, which consists of low-quality compressed audio, we achieved an Equal Error Rate (EER) of 2.82%. To our knowledge, this performance is the best among all deepfake voice detection systems tested on the ASVspoof 2021 DF dataset. Additionally, our method proves to be versatile, showing notable performance on high-quality data with an EER of 0.30% on the ASVspoof 2019 LA dataset, closely approaching state-of-the-art results.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4905-4918"},"PeriodicalIF":4.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-07 | DOI: 10.1109/TASLP.2024.3492800
Zhanbiao Zhu;Peijie Huang;Haojing Huang;Yuhong Xu;Piyuan Lin;Leyi Lao;Shaoshen Chen;Haojie Xie;Shangjian Yin
Multi-intent spoken language understanding (SLU), which handles multiple intents in a single utterance, has attracted increasing attention. Previous studies treat slot filling as a token-level sequence labeling task, which results in a lack of entity-related information. In this paper, we propose an Entity-Level Slot Filling (ELSF) framework for joint multiple intent detection and slot filling. In our framework, two entity-oriented auxiliary tasks, entity boundary detection and entity type assignment, are introduced as regularization to capture the entity boundary and the context of the entity type, respectively. In addition, to better exploit entity interactions, we design an effective entity-level coordination mechanism that models both entity-entity and intent-entity relationships. Experiments on five datasets demonstrate the effectiveness and generalizability of our ELSF.
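One way to read the auxiliary-task regularization is as a weighted multi-task objective over a shared encoder; the heads, label schemes, and 0.5 weights below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn.functional as F

B, T, H, n_slots, n_types = 4, 32, 128, 20, 8
enc = torch.randn(B, T, H, requires_grad=True)   # shared encoder states (stand-in)
slot_head = torch.nn.Linear(H, n_slots)          # main slot-filling head
boundary_head = torch.nn.Linear(H, 3)            # auxiliary: O / entity-start / entity-end
type_head = torch.nn.Linear(H, n_types)          # auxiliary: entity type assignment

slot_gold = torch.randint(0, n_slots, (B, T))
boundary_gold = torch.randint(0, 3, (B, T))
type_gold = torch.randint(0, n_types, (B, T))

# cross_entropy expects (batch, classes, time), hence the transpose.
loss = (F.cross_entropy(slot_head(enc).transpose(1, 2), slot_gold)
        + 0.5 * F.cross_entropy(boundary_head(enc).transpose(1, 2), boundary_gold)
        + 0.5 * F.cross_entropy(type_head(enc).transpose(1, 2), type_gold))
loss.backward()
```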
{"title":"ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling","authors":"Zhanbiao Zhu;Peijie Huang;Haojing Huang;Yuhong Xu;Piyuan Lin;Leyi Lao;Shaoshen Chen;Haojie Xie;Shangjian Yin","doi":"10.1109/TASLP.2024.3492800","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492800","url":null,"abstract":"Multi-intent spoken language understanding (SLU) that can handle multiple intents in an utterance has attracted increasing attention. Previous studies treat the slot filling task as a token-level sequence labeling task, which results in a lack of entity-related information. In our paper, we propose an \u0000<bold>E</b>\u0000ntity-\u0000<bold>L</b>\u0000evel \u0000<bold>S</b>\u0000lot \u0000<bold>F</b>\u0000illing (ELSF) framework for joint multiple intent detection and slot filling. In our framework, two entity-oriented auxiliary tasks, entity boundary detection and entity type assignment, are introduced as the regularization to capture the entity boundary and the context of type, respectively. Besides, to better utilize the entity interaction, we design an effective entity-level coordination mechanism for modeling the interaction in both entity-entity and intent-entity relationships. Experiments on five datasets demonstrate the effectiveness and generalizability of our ELSF.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4880-4893"},"PeriodicalIF":4.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-06 | DOI: 10.1109/TASLP.2024.3492799
Mun-Hak Lee;Joon-Hyuk Chang
An attention-based automatic speech recognition (ASR) model generates a probability distribution over the token set at each time step. Recent studies have shown that calibration errors exist in the output probability distributions of attention-based ASR models trained to minimize the negative log-likelihood. This study analyzes the causes of calibration errors in ASR model outputs and their impact on model performance. Based on the analysis, we argue that conventional methods for estimating calibration errors at the token level are unsuitable for ASR tasks. Accordingly, we propose a new calibration measure that estimates the calibration error at the sequence level. Moreover, we present a new post-hoc calibration function and training objective to mitigate the calibration error of the ASR model at the sequence level. Through experiments on an ASR benchmark, we show that the proposed methods effectively alleviate the calibration error of the ASR model and improve generalization performance.
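For intuition, a simple sequence-level calibration estimator can bin whole-hypothesis confidences (e.g., the product of per-token probabilities) against exact-match correctness; this is one illustrative estimator, not the measure the paper defines:

```python
import numpy as np

def sequence_ece(confidence, correct, n_bins=10):
    """Binned expected calibration error over whole sequences.
    confidence: per-hypothesis confidence in [0, 1]; correct: 0/1 exact match."""
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            # |accuracy - confidence| inside the bin, weighted by bin mass
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.random(1000)
corr = rng.random(1000) < conf        # toy data that is roughly calibrated
print(sequence_ece(conf, corr))       # small value for well-calibrated data
```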
{"title":"Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models","authors":"Mun-Hak Lee;Joon-Hyuk Chang","doi":"10.1109/TASLP.2024.3492799","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492799","url":null,"abstract":"An attention-based automatic speech recognition (ASR) model generates a probability distribution of the tokens set at each time step. Recent studies have shown that calibration errors exist in the output probability distributions of attention-based ASR models trained to minimize the negative log likelihood. This study analyzes the causes of calibration errors in ASR model outputs and their impact on model performance. Based on the analysis, we argue that conventional methods for estimating calibration errors at the token level are unsuitable for ASR tasks. Accordingly, we propose a new calibration measure that estimates the calibration error at the sequence level. Moreover, we present a new post-hoc calibration function and training objective to mitigate the calibration error of the ASR model at the sequence level. Through experiments using the ASR benchmark, we show that the proposed methods effectively alleviate the calibration error of the ASR model and improve the generalization performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4919-4930"},"PeriodicalIF":4.1,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-06 | DOI: 10.1109/TASLP.2024.3492803
Vahid Ahmadi Kalkhorani;DeLiang Wang
We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
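One plausible reading of random chunk positional encoding is to offset positional indices by a random start during training, so positions beyond the typical training length are still visited and long test utterances do not expose unseen embeddings; the sketch below follows that reading, and the function name and sizes are illustrative:

```python
import torch

def random_chunk_positions(seq_len, max_pos, training=True):
    # During training, start the position indices at a random offset within
    # [0, max_pos - seq_len]; at inference, start at 0 as usual.
    start = int(torch.randint(0, max_pos - seq_len + 1, (1,))) if training else 0
    return torch.arange(start, start + seq_len)

pos_ids = random_chunk_positions(seq_len=200, max_pos=4000)
# pos_ids would index a learned positional-embedding table of size max_pos.
```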
{"title":"TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation","authors":"Vahid Ahmadi Kalkhorani;DeLiang Wang","doi":"10.1109/TASLP.2024.3492803","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492803","url":null,"abstract":"We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4999-5009"},"PeriodicalIF":4.1,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-11-04 | DOI: 10.1109/TASLP.2024.3486227
Anup Singh;Kris Demuynck;Vipul Arora
Nearest neighbor search on context representation vectors is a formidable task due to challenges posed by high dimensionality, scalability issues, and potential noise within query vectors. Our novel approach leverages normalizing flow within a self-supervised learning framework to effectively tackle these challenges, specifically in the context of audio fingerprinting tasks. Audio fingerprinting systems incorporate two key components: audio encoding and indexing. The existing systems consider these components independently, resulting in suboptimal performance. Our approach optimizes the interplay between these components, facilitating the adaptation of vectors to the indexing structure. Additionally, we distribute vectors in the latent $\mathbb{R}^{K}$ space using normalizing flow, resulting in balanced $K$-bit hash codes. This allows indexing vectors using a balanced hash table, where vectors are uniformly distributed across all possible $2^{K}$ hash buckets. This significantly accelerates retrieval, achieving speedups of up to 2× and 1.4× compared to Locality-Sensitive Hashing (LSH) and Product Quantization (PQ), respectively. We empirically demonstrate that our system is scalable, highly effective, and efficient in identifying short audio queries ($\leq$ 2 s), particularly at high noise and reverberation levels.
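The hashing step itself is simple once the flow has produced a latent vector with balanced per-dimension signs: the $K$-bit code is the sign pattern, and it doubles as a bucket id in $[0, 2^{K})$. A minimal sketch, with a random stand-in for the flow output:

```python
import numpy as np

def to_bucket(z):
    # Sign pattern of the K latent dimensions -> K-bit code -> integer bucket id.
    bits = (z > 0).astype(int)
    return int("".join(map(str, bits)), 2)

rng = np.random.default_rng(0)
z = rng.standard_normal(16)            # stand-in latent vector, K = 16
print(to_bucket(z))                    # bucket id in [0, 2**16)
```

Because the flow balances the sign statistics per dimension, the resulting bucket ids spread uniformly over the hash table, which is what yields the reported retrieval speedups.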
{"title":"FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow","authors":"Anup Singh;Kris Demuynck;Vipul Arora","doi":"10.1109/TASLP.2024.3486227","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3486227","url":null,"abstract":"Nearest neighbor search on context representation vectors is a formidable task due to challenges posed by high dimensionality, scalability issues, and potential noise within query vectors. Our novel approach leverages normalizing flow within a self-supervised learning framework to effectively tackle these challenges, specifically in the context of audio fingerprinting tasks. Audio fingerprinting systems incorporate two key components: audio encoding and indexing. The existing systems consider these components independently, resulting in suboptimal performance. Our approach optimizes the interplay between these components, facilitating the adaptation of vectors to the indexing structure. Additionally, we distribute vectors in the latent \u0000<inline-formula><tex-math>$mathbb {R}^{K}$</tex-math></inline-formula>\u0000 space using normalizing flow, resulting in balanced \u0000<inline-formula><tex-math>$K$</tex-math></inline-formula>\u0000-bit hash codes. This allows indexing vectors using a balanced hash table, where vectors are uniformly distributed across all possible \u0000<inline-formula><tex-math>$2^{K}$</tex-math></inline-formula>\u0000 hash buckets. This significantly accelerates retrieval, achieving speedups of up to 2× and 1.4× compared to the Locality-Sensitive Hashing (LSH) and Product Quantization (PQ), respectively. We empirically demonstrate that our system is scalable, highly effective, and efficient in identifying short audio queries (\u0000<inline-formula><tex-math>$leq$</tex-math></inline-formula>\u00002 s), particularly at high noise and reverberation levels.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4961-4970"},"PeriodicalIF":4.1,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Understanding and Mitigating the Uncertainty in Zero-Shot Translation
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-10-31 | DOI: 10.1109/TASLP.2024.3485555
Wenxuan Wang;Wenxiang Jiao;Shuo Wang;Zhaopeng Tu;Michael R. Lyu
Zero-shot translation is a promising direction for building a comprehensive multilingual neural machine translation (MNMT) system. However, its quality is still not satisfactory due to off-target issues. In this paper, we aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation. By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues, namely, extrinsic data uncertainty and intrinsic model uncertainty. Based on these observations, we propose two lightweight and complementary approaches: denoising the training data, and explicitly penalizing off-target translations through unlikelihood training. Extensive experiments on both balanced and imbalanced datasets show that our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines.
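The unlikelihood term can be sketched as pushing down the probability mass the model assigns to off-target tokens (e.g., vocabulary items of the wrong output language); shapes and token ids below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 32, 32000, requires_grad=True)   # (batch, time, vocab)
log_probs = F.log_softmax(logits, dim=-1)

off_target_ids = torch.tensor([17, 98, 523])   # hypothetical wrong-language token ids
p_off = log_probs[..., off_target_ids].exp()   # probability of each off-target token

# Unlikelihood loss: -log(1 - p) per off-target token, added to the usual NLL.
ul_loss = -torch.log1p(-p_off.clamp(max=1.0 - 1e-6)).mean()
ul_loss.backward()
```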
{"title":"Understanding and Mitigating the Uncertainty in Zero-Shot Translation","authors":"Wenxuan Wang;Wenxiang Jiao;Shuo Wang;Zhaopeng Tu;Michael R. Lyu","doi":"10.1109/TASLP.2024.3485555","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485555","url":null,"abstract":"Zero-shottranslation is a promising direction for building a comprehensive multilingual neural machine translation (MNMT) system. However, its quality is still not satisfactory due to off-target issues. In this paper, we aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation. By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues, namely, extrinsic data uncertainty and intrinsic model uncertainty. Based on the observations, we propose two lightweight and complementary approaches to denoise the training data for model training and explicitly penalize the off-target translations by unlikelihood training during model training. Extensive experiments on both balanced and imbalanced datasets show that our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4894-4904"},"PeriodicalIF":4.1,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning
IF 4.1 | CAS Tier 2, Computer Science | Q1 ACOUSTICS | Pub Date: 2024-10-31 | DOI: 10.1109/TASLP.2024.3490373
Ren Li;Qiao Xiao;Jianxi Yang;Luyi Zhang;Yu Chen
The rapid development of pre-trained language models (PLMs) has significantly enhanced the performance of machine reading comprehension (MRC). Nevertheless, traditional fine-tuning approaches necessitate extensive labeled data, and MRC remains a challenging task in few-shot settings and low-resource scenarios. This study proposes a novel few-shot MRC approach via post-training and answer span-oriented contrastive learning, termed MRC-PASCL. Specifically, in the post-training module, a novel noun-entity-aware data selection and generation strategy is proposed according to the characteristics of the MRC task and data, focusing on masking nouns and named entities in the context. In terms of fine-tuning, the proposed answer span-oriented contrastive learning scheme selects spans around the golden answers as negative examples and performs multi-task learning together with the standard MRC answer prediction task. Experimental results show that MRC-PASCL outperforms the PLM-based baseline models and the 7B and 13B large language models (LLMs) across most MRQA 2019 datasets. Further analyses show that our approach achieves better inference efficiency with lower computational resource requirements. The analysis results also indicate that the proposed method adapts better to domain-specific scenarios.
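The span-oriented contrastive objective can be sketched as an InfoNCE loss in which the pooled gold-answer span is the positive and spans shifted around the gold answer are negatives; all representations below are random stand-ins for encoder outputs, and the temperature is illustrative:

```python
import torch
import torch.nn.functional as F

def span_contrastive_loss(anchor, positive, negatives, tau=0.1):
    # InfoNCE: -log( exp(sim+) / (exp(sim+) + sum exp(sim-)) )
    pos = F.cosine_similarity(anchor, positive, dim=-1) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / tau
    return -pos + torch.logsumexp(torch.cat([pos.view(1), neg]), dim=0)

anchor = torch.randn(256)              # question-aware context representation
positive = torch.randn(256)            # pooled gold answer span
negatives = torch.randn(5, 256)        # near-miss spans around the gold span
loss = span_contrastive_loss(anchor, positive, negatives)
```

Hard negatives drawn from the immediate neighborhood of the gold span force the model to discriminate exact answer boundaries, which is what the main answer prediction task benefits from.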
{"title":"MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning","authors":"Ren Li;Qiao Xiao;Jianxi Yang;Luyi Zhang;Yu Chen","doi":"10.1109/TASLP.2024.3490373","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3490373","url":null,"abstract":"The rapid development of pre-trained language models (PLMs) has significantly enhanced the performance of machine reading comprehension (MRC). Nevertheless, the traditional fine-tuning approaches necessitate extensive labeled data. MRC remains a challenging task in the few-shot settings or low-resource scenarios. This study proposes a novel few-shot MRC approach via post-training and answer span-oriented contrastive learning, termed MRC-PASCL. Specifically, in the post-training module, a novel noun-entity-aware data selection and generation strategy is proposed according to characteristics of MRC task and data, focusing on masking nouns and named entities in the context. In terms of fine-tuning, the proposed answer span-oriented contrastive learning manner selects spans around the golden answers as negative examples, and performs multi-task learning together with the standard MRC answer prediction task. Experimental results show that MRC-PASCL outperforms the PLMs-based baseline models and the 7B and 13B large language models (LLMs) cross most MRQA 2019 datasets. Further analyses show that our approach achieves better inference efficiency with lower computational resource requirement. The analysis results also indicate that the proposed method can better adapt to the domain-specific scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4838-4849"},"PeriodicalIF":4.1,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0