IEEE/ACM Transactions on Audio, Speech, and Language Processing最新文献_第4页

Cacophony: An Improved Contrastive Audio-Text Model Cacophony：改进的音频文本对比模型

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485170

Ge Zhu;Jordan Darefsky;Zhiyao Duan

Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then train a contrastive model with an auxiliary captioning objective with the audio encoder initialized from the MAE model. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.

尽管最近取得了一些进步，但音频-文本模型在规模和性能上仍然落后于图像-文本模型。在本文中，我们建议改进音频-文本对比模型的数据规模和训练程序。具体来说，我们制作了一个大规模的音频-文本数据集，其中包含 13,000 个小时的文本标注音频，使用预训练语言模型来处理噪声文本描述，并使用自动字幕来获取未标注音频样本的文本描述。我们首先使用掩码自动编码器（MAE）目标对纯音频数据进行训练，这使我们能够从未标明音频数据集的可扩展性中获益。然后，我们使用从 MAE 模型初始化的音频编码器，训练一个带有辅助字幕目标的对比模型。我们的最终模型被命名为 Cacophony，它在音频文本检索任务中取得了最先进的性能，并在 HEAR 基准和其他下游任务（如零镜头分类）中表现出极具竞争力的结果。

引用次数: 0

Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold 基于深度学习干扰矩阵的干扰控制型最大降噪波束成形器

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485551

Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen

Beamforming has been used in a wide range of applications to extract the signal of interest from microphone array observations, which consist of not only the signal of interest, but also noise, interference, and reverberation. The recently proposed interference-controlled maximum noise reduction (ICMR) beamformer provides a flexible way to control the specified amount of the interference attenuation and noise suppression; but it requires accurate estimation of the manifold vector of the interference sources, which is challenging to achieve in real-world applications. To address this issue, we introduce an interference-controlled maximum noise reduction network (ICMRNet) in this study, which is a deep neural network (DNN)-based method for manifold vector estimation. With densely connected modified conformer blocks and the end-to-end training strategy, the interference manifold is learned directly from the observation signals. This approach, akin to ICMR, adeptly adapts to time-varying interference and demonstrates superior convergence rate and extraction efficacy as compared to the linearly constrained minimum variance (LCMV)-based neural beamformers when appropriate attenuation factors are selected. Moreover, via learning-based extraction, ICMRNet effectively suppresses reverberation components within the target signal. Comparative analysis against baseline methods validates the efficacy of the proposed method.

波束成形已被广泛应用于从麦克风阵列观测数据中提取感兴趣的信号，这些观测数据不仅包括感兴趣的信号，还包括噪声、干扰和混响。最近提出的干扰控制最大降噪（ICMR）波束成形器提供了一种灵活的方法来控制干扰衰减和噪声抑制的指定量，但它需要精确估计干扰源的流形向量，这在实际应用中很难实现。为了解决这个问题，我们在本研究中引入了干扰控制最大降噪网络（ICMRNet），这是一种基于深度神经网络（DNN）的流形向量估计方法。通过密集连接的修正构象块和端到端训练策略，干扰流形可直接从观测信号中学习。这种方法与 ICMR 相似，能很好地适应时变干扰，与基于线性约束最小方差（LCMV）的神经波束成形器相比，在选择适当的衰减因子时，具有更高的收敛速度和提取效率。此外，通过基于学习的提取，ICMRNet 还能有效抑制目标信号中的混响成分。与基准方法的对比分析验证了所提方法的有效性。

{"title":"Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold","authors":"Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen","doi":"10.1109/TASLP.2024.3485551","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485551","url":null,"abstract":"Beamforming has been used in a wide range of applications to extract the signal of interest from microphone array observations, which consist of not only the signal of interest, but also noise, interference, and reverberation. The recently proposed interference-controlled maximum noise reduction (ICMR) beamformer provides a flexible way to control the specified amount of the interference attenuation and noise suppression; but it requires accurate estimation of the manifold vector of the interference sources, which is challenging to achieve in real-world applications. To address this issue, we introduce an interference-controlled maximum noise reduction network (ICMRNet) in this study, which is a deep neural network (DNN)-based method for manifold vector estimation. With densely connected modified conformer blocks and the end-to-end training strategy, the interference manifold is learned directly from the observation signals. This approach, akin to ICMR, adeptly adapts to time-varying interference and demonstrates superior convergence rate and extraction efficacy as compared to the linearly constrained minimum variance (LCMV)-based neural beamformers when appropriate attenuation factors are selected. Moreover, via learning-based extraction, ICMRNet effectively suppresses reverberation components within the target signal. Comparative analysis against baseline methods validates the efficacy of the proposed method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4676-4690"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning 为基于外推法的时态知识图谱推理学习动态和静态表征

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485500

Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang

Temporal knowledge graph reasoning aims to predict the missing links (facts) in the future timestamps. However, most existing methods have a common limitation: they focus on learning dynamic representations of temporal knowledge graphs and rarely consider static characteristics that remain unchanged over time. To address the above issues, we propose to learn the dynamic and static representations for temporal knowledge graph reasoning (DSTKG), which introduces two latent variables to capture the dynamic and static characteristics of entities in temporal knowledge graphs. First, we use a Bi-GRU-based inference network to learn the static latent representation of historical facts and a nonlinear discrete-time transition-based inference network to learn the dynamic latent representation. Then, we sample the latent variables multiple times using re-parameterization tricks to obtain high-quality embeddings and make predictions in the future timestamps. The empirical results on four benchmark datasets show that our model is more effective than state-of-the-art approaches. Compared with the strong baseline model DBKGE (RotatE), the proposed model achieves performance improvements of 2.69%,

$1.59%$

, 1.18% and 1.22% on Yago11k, Wikidata12k, ICEWS14 and ICEWS05-15 respectively, regarding the evaluation metric MRR.

时态知识图谱推理旨在预测未来时间戳中缺失的环节（事实）。然而，大多数现有方法都有一个共同的局限性：它们侧重于学习时态知识图谱的动态表征，而很少考虑随时间变化而保持不变的静态特征。为解决上述问题，我们提出了学习时态知识图推理（DSTKG）的动态和静态表征，引入两个潜变量来捕捉时态知识图中实体的动态和静态特征。首先，我们使用基于 Bi-GRU 的推理网络来学习历史事实的静态潜表征，并使用基于离散时间转换的非线性推理网络来学习动态潜表征。然后，我们使用重参数化技巧对潜变量进行多次采样，以获得高质量的嵌入，并对未来的时间戳进行预测。四个基准数据集的实证结果表明，我们的模型比最先进的方法更有效。与强大的基线模型 DBKGE (RotatE) 相比，就评估指标 MRR 而言，所提出的模型在 Yago11k、Wikidata12k、ICEWS14 和 ICEWS05-15 上的性能分别提高了 2.69%、1.59%%$、1.18% 和 1.22%。

{"title":"Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning","authors":"Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang","doi":"10.1109/TASLP.2024.3485500","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485500","url":null,"abstract":"Temporal knowledge graph reasoning aims to predict the missing links (facts) in the future timestamps. However, most existing methods have a common limitation: they focus on learning dynamic representations of temporal knowledge graphs and rarely consider static characteristics that remain unchanged over time. To address the above issues, we propose to learn the dynamic and static representations for temporal knowledge graph reasoning (DSTKG), which introduces two latent variables to capture the dynamic and static characteristics of entities in temporal knowledge graphs. First, we use a Bi-GRU-based inference network to learn the static latent representation of historical facts and a nonlinear discrete-time transition-based inference network to learn the dynamic latent representation. Then, we sample the latent variables multiple times using re-parameterization tricks to obtain high-quality embeddings and make predictions in the future timestamps. The empirical results on four benchmark datasets show that our model is more effective than state-of-the-art approaches. Compared with the strong baseline model DBKGE (RotatE), the proposed model achieves performance improvements of 2.69%, \u0000<inline-formula><tex-math>$1.59%$</tex-math></inline-formula>\u0000, 1.18% and 1.22% on Yago11k, Wikidata12k, ICEWS14 and ICEWS05-15 respectively, regarding the evaluation metric MRR.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4741-4754"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models 在大型语言模型中进行无衍生优化以实现低ank自适应

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477330

Feihu Jin;Yifan Liu;Ying Tan

Parameter-efficient tuning methods such as LoRA could achieve comparable performance to model tuning by tuning a small portion of the parameters. However, substantial computational resources are still required, as this process involves calculating gradients and performing back-propagation throughout the model. Much effort has recently been devoted to utilizing the derivative-free optimization methods to eschew the computation of gradients and showcase an augmented level of robustness in few-shot settings. In this paper, we prepend the low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules at each layer alternately. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvement and exhibits clear advantages in memory usage and convergence speed compared to existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.

LoRA 等参数高效调整方法只需调整一小部分参数，就能获得与模型调整相当的性能。然而，由于这一过程涉及计算梯度和在整个模型中执行反向传播，因此仍需要大量计算资源。最近，很多人致力于利用无导数优化方法，以避免梯度计算，并在少次测量设置中展示更高水平的鲁棒性。在本文中，我们将低阶模块预置到模型的每个自注意层中，并采用两种无导数优化方法交替优化各层的低阶模块。在各种任务和语言模型上取得的大量结果表明，与现有的基于梯度的参数高效调整方法和少次触发设置下的无导数优化方法相比，我们提出的方法取得了实质性的改进，并在内存使用和收敛速度方面表现出明显的优势。

引用次数: 0

Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks 分布式声学传感器网络中用于传感器选择的平滑帧级 SINR 及其估算

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477277

Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty

Distributed acoustic sensor network (DASN) refers to a sound acquisition system that consists of a collection of microphones randomly distributed across a wide acoustic area. Theory and methods for DASN are gaining increasing attention as the associated technologies can be used in a broad range of applications to solve challenging problems. However, unlike traditional microphone arrays or centralized systems, properly exploiting the redundancy among different channels in DASN is facing many challenges including but not limited to variations in pre-amplification gains, clocks, sensors' response, and signal-to-interference-plus-noise ratios (SINRs). Selecting appropriate sensors relevant to the task at hand is therefore crucial in DASN. In this work, we propose a speaker-dependent smoothed frame-level SINR estimation method for sensor selection in multi-speaker scenarios, specifically addressing source movement within DASN. Additionally, we devise an approach for similarity measurement to generate dynamic speaker embeddings resilient to variations in reference speech levels. Furthermore, we introduce a novel loss function that integrates classification and ordinal regression within a unified framework. Extensive simulations are performed and the results demonstrate the efficacy of the proposed method in accurately estimating smoothed frame-level SINR dynamically, yielding state-of-the-art performance.

分布式声学传感器网络（DASN）是指一种声音采集系统，由随机分布在广阔声学区域的传声器组成。由于相关技术可广泛应用于解决各种挑战性问题，DASN 的理论和方法日益受到关注。然而，与传统的麦克风阵列或集中式系统不同，在 DASN 中正确利用不同信道之间的冗余面临着许多挑战，包括但不限于前置放大增益、时钟、传感器响应和信号干扰加噪声比 (SINR) 的变化。因此，选择与当前任务相关的适当传感器对于 DASN 至关重要。在这项工作中，我们提出了一种依赖于扬声器的平滑帧级 SINR 估算方法，用于在多扬声器场景中选择传感器，特别是解决 DASN 中的信号源移动问题。此外，我们还设计了一种相似性测量方法，用于生成动态扬声器嵌入，以适应参考语音电平的变化。此外，我们还引入了一种新的损失函数，将分类和顺序回归整合到一个统一的框架中。我们进行了广泛的模拟，结果表明所提出的方法在动态准确估计平滑帧级 SINR 方面非常有效，达到了最先进的性能。

{"title":"Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks","authors":"Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty","doi":"10.1109/TASLP.2024.3477277","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477277","url":null,"abstract":"Distributed acoustic sensor network (DASN) refers to a sound acquisition system that consists of a collection of microphones randomly distributed across a wide acoustic area. Theory and methods for DASN are gaining increasing attention as the associated technologies can be used in a broad range of applications to solve challenging problems. However, unlike traditional microphone arrays or centralized systems, properly exploiting the redundancy among different channels in DASN is facing many challenges including but not limited to variations in pre-amplification gains, clocks, sensors' response, and signal-to-interference-plus-noise ratios (SINRs). Selecting appropriate sensors relevant to the task at hand is therefore crucial in DASN. In this work, we propose a speaker-dependent smoothed frame-level SINR estimation method for sensor selection in multi-speaker scenarios, specifically addressing source movement within DASN. Additionally, we devise an approach for similarity measurement to generate dynamic speaker embeddings resilient to variations in reference speech levels. Furthermore, we introduce a novel loss function that integrates classification and ordinal regression within a unified framework. Extensive simulations are performed and the results demonstrate the efficacy of the proposed method in accurately estimating smoothed frame-level SINR dynamically, yielding state-of-the-art performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4554-4568"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142517851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model 改进客观感知音频质量评估 - 第一部分：新颖的数据驱动认知模型

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477291

Pablo M. Delgado;Jürgen Herre

Efficient audioquality assessment is vital for streamlining audio codec development. Objective assessment tools have been developed over time to algorithmically predict quality ratings from subjective assessments, the gold standard for quality judgment. Many of these tools use perceptual auditory models to extract audio features that are mapped to a basic audio quality score prediction using machine learning algorithms and subjective scores as training data. However, existing tools struggle with generalization in quality prediction, especially when faced with unknown signal and distortion types. This is particularly evident in the presence of signals coded using non-waveform-preserving parametric techniques. Addressing these challenges, this two-part work proposes extensions to the Perceptual Evaluation of Audio Quality (PEAQ - ITU-R BS.1387-1) recommendation. Part 1 focuses on increasing generalization, while Part 2 targets accurate spatial audio quality measurement in audio coding. To enhance prediction generalization, this paper (Part 1) introduces a novel machine learning approach that uses subjective data to model cognitive aspects of audio quality perception. The proposed method models the perceived severity of audible distortions by adaptively weighting different distortion metrics. The weights are determined using an interaction cost function that captures relationships between distortion salience and cognitive effects. Compared to other machine learning methods and established tools, the proposed architecture achieves higher prediction accuracy on large databases of previously unseen subjective quality scores. The perceptually-motivated model offers a more manageable alternative to general-purpose machine learning algorithms, allowing potential extensions and improvements to multi-dimensional quality measurement without complete retraining.

高效的音频质量评估对于简化音频编解码器的开发至关重要。随着时间的推移，人们已经开发出客观的评估工具，通过算法从主观评估（质量判断的黄金标准）中预测质量评级。其中许多工具使用感知听觉模型来提取音频特征，然后使用机器学习算法和主观评分作为训练数据，将这些特征映射到基本的音频质量评分预测中。然而，现有工具在质量预测的通用性方面存在困难，尤其是在面对未知信号和失真类型时。这一点在使用非波形保留参数技术编码的信号中尤为明显。为应对这些挑战，本工作分为两部分，对音频质量感知评估（PEAQ - ITU-R BS.1387-1）建议进行了扩展。第 1 部分侧重于提高通用性，而第 2 部分则针对音频编码中精确的空间音频质量测量。为了提高预测的通用性，本文（第 1 部分）介绍了一种新颖的机器学习方法，该方法利用主观数据对音频质量感知的认知方面进行建模。所提出的方法通过对不同失真指标进行自适应加权，对可感知的听觉失真严重程度进行建模。权重使用交互成本函数确定，该函数可捕捉失真显著性与认知效果之间的关系。与其他机器学习方法和成熟的工具相比，所提出的架构在以前未见过的主观质量评分的大型数据库上实现了更高的预测准确性。以感知为导向的模型为通用机器学习算法提供了更易于管理的替代方案，使多维质量测量的潜在扩展和改进无需完全重新训练。

{"title":"Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model","authors":"Pablo M. Delgado;Jürgen Herre","doi":"10.1109/TASLP.2024.3477291","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477291","url":null,"abstract":"Efficient audioquality assessment is vital for streamlining audio codec development. Objective assessment tools have been developed over time to algorithmically predict quality ratings from subjective assessments, the gold standard for quality judgment. Many of these tools use perceptual auditory models to extract audio features that are mapped to a basic audio quality score prediction using machine learning algorithms and subjective scores as training data. However, existing tools struggle with generalization in quality prediction, especially when faced with unknown signal and distortion types. This is particularly evident in the presence of signals coded using non-waveform-preserving parametric techniques. Addressing these challenges, this two-part work proposes extensions to the Perceptual Evaluation of Audio Quality (PEAQ - ITU-R BS.1387-1) recommendation. Part 1 focuses on increasing generalization, while Part 2 targets accurate spatial audio quality measurement in audio coding. To enhance prediction generalization, this paper (Part 1) introduces a novel machine learning approach that uses subjective data to model cognitive aspects of audio quality perception. The proposed method models the perceived severity of audible distortions by adaptively weighting different distortion metrics. The weights are determined using an interaction cost function that captures relationships between distortion salience and cognitive effects. Compared to other machine learning methods and established tools, the proposed architecture achieves higher prediction accuracy on large databases of previously unseen subjective quality scores. The perceptually-motivated model offers a more manageable alternative to general-purpose machine learning algorithms, allowing potential extensions and improvements to multi-dimensional quality measurement without complete retraining.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4661-4675"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data 利用从音频和超声波舌头成像数据中学习的嵌入式技术进行纯音频音段分类

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-07 DOI: 10.1109/TASLP.2024.3473316

Ilhan Aytutuldu;Yakup Genc;Yusuf Sinan Akgul

This paper presents a phonetic segment classification method based on joint embeddings learned from processing Ultrasound Tongue Imaging (UTI) and audio data. For constructing the embeddings, we compiled an ultrasound image dataset synchronized with audio that encompasses common speech scenarios. The embeddings are obtained from artificial neural network models trained on this dataset. During testing, our model processes only audio data, making it practical for speech therapy as no ultrasound imaging is required. Experiments show that our method yields similar performance compared to methods that simultaneously use both audio and UTI data. However, it outperforms the methods utilizing solely audio or UTI data in real-time classification.

本文介绍了一种音段分类方法，该方法基于从处理舌部超声波成像（UTI）和音频数据中学到的联合嵌入。为了构建嵌入式，我们编制了一个超声波图像数据集，该数据集与包含常见语音场景的音频同步。嵌入信息来自在该数据集上训练的人工神经网络模型。在测试过程中，我们的模型只处理音频数据，无需超声波成像，因此适用于语音治疗。实验表明，与同时使用音频和 UTI 数据的方法相比，我们的方法具有相似的性能。不过，在实时分类方面，它优于仅使用音频或UTI 数据的方法。

引用次数: 0

Investigating the Design Space of Diffusion Models for Speech Enhancement 调查用于语音增强的扩散模型设计空间

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-03 DOI: 10.1109/TASLP.2024.3473319

Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May

Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.

扩散模型是一类新的生成模型，在图像生成方面表现出色。因此，研究人员尝试将扩散模型应用到语音增强等其他任务中。将扩散模型应用于语音增强的一种流行方法是在干净语音信号和噪声语音信号之间建立渐进转换模型。然而，之前在图像生成文献中使用的一个流行的扩散模型框架并没有考虑到系统输入的这种转换，这就阻碍了将现有的基于扩散的语音增强系统与上述扩散模型框架联系起来。为了解决这一问题，我们扩展了这一框架，以考虑干净语音信号和噪声语音信号之间的渐进转换。这使我们能够应用图像生成文献中的最新进展，并系统地研究扩散模型的设计方面，这些方面在语音增强方面大多仍未被探索，例如神经网络预处理、训练损失加权、随机微分方程（SDE）或反向过程中注入的随机性量。我们证明，以往基于扩散的语音增强系统的性能不能归因于干净语音信号和噪声语音信号之间的渐进转换。此外，我们还证明，适当选择前置条件、训练损耗加权、SDE 和采样器，可以在使用较少采样步骤的情况下超越流行的基于扩散的语音增强系统，从而将计算成本降低四倍。

{"title":"Investigating the Design Space of Diffusion Models for Speech Enhancement","authors":"Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May","doi":"10.1109/TASLP.2024.3473319","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473319","url":null,"abstract":"Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4486-4500"},"PeriodicalIF":4.1,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704960","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios 助听器中的实时多通道深度语音增强：比较复杂声学场景中的单声道和双声道处理技术

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473315

Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer

Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.

深度学习有可能为助听器用户增强语音信号并提高其清晰度。适合实际应用的深度模型应具有计算复杂度低、处理延迟小（仅几毫秒）的特点。在本文中，我们探讨了符合这些要求的深度语音增强技术，并对比了两种复杂声学场景中的单耳和双耳处理算法。这两种算法都通过客观指标进行了评估，并在听力受损的听者进行噪声语音测试的实验中进行了评估。结果与两种传统增强策略（即自适应差分麦克风处理和双耳波束成形）进行了比较。虽然在弥散噪声中，所有算法的表现相似，但双耳深度学习方法在存在空间干扰时表现最佳。通过后期分析，这可以归因于在低信噪比时的改进和精确的空间过滤。

{"title":"Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios","authors":"Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer","doi":"10.1109/TASLP.2024.3473315","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473315","url":null,"abstract":"Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4596-4606"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704042","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction RISC：用于呼喊类型分类和呼喊强度预测的语料库

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473302

Takahiro Fukumori;Taito Ishida;Yoichi Yamashita

The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains wide variety types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.

在音频监控中，对喊叫语音的检测至关重要。虽然安防系统最好能够识别紧急情况，但现有的语料库只为每个语音样本提供二进制标签（即喊叫或正常），因此很难预测喊叫强度。此外，大多数语料库只包含危险情况下的典型语句，这意味着分类器无法学习如何将此类语句与欢呼等危险性较低情况下的典型喊叫区分开来。因此，本文提出了一个新的研究来源--RItsumeikan 喊声语料库（RISC），其中包含在录音实验中收集的各种类型的喊声语音样本。RISC 中的每个呐喊语音样本都有一个呐喊类型，并通过众包服务为其分配了呐喊强度评级。我们还介绍了深度学习方法在语音类型分类任务和呐喊强度预测任务中的综合性能比较。结果表明，无论使用哪种网络架构，基于频谱和倒频谱域的特征学习都能实现高性能。结果还表明，喊叫声类型分类和强度预测仍然是具有挑战性的任务，RISC有望为这一研究领域的进一步发展做出贡献。

{"title":"RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction","authors":"Takahiro Fukumori;Taito Ishida;Yoichi Yamashita","doi":"10.1109/TASLP.2024.3473302","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473302","url":null,"abstract":"The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains wide variety types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4434-4444"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704045","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0