IEEE/ACM Transactions on Audio, Speech, and Language Processing最新文献

List of Reviewers 审稿人名单

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2025-01-08 DOI: 10.1109/TASLP.2024.3520736

引用次数: 0

IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization 一种用于声源定位的通用直接路径IPD估计网络

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-28 DOI: 10.1109/TASLP.2024.3507560

Yabo Wang;Bing Yang;Xiaofei Li

Extracting direct-path spatial feature is crucial for sound source localization in adverse acoustic environments. This paper proposes IPDnet, a neural network that estimates direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The estimated DP-IPD can be easily translated to source location based on the known microphone array geometry. First, a full-band and narrow-band fusion network is adopted for DP-IPD estimation, in which combined narrow-band and full-band layers are responsible for estimating the raw DP-IPD information in one frequency band and capturing the frequency correlations of DP-IPD, respectively. Second, a new multi-track DP-IPD learning target is proposed for the localization of a flexible number of sound sources. Third, the network is extended to handle variable microphone arrays. This version of IPDnet is trained with a large set of different microphone arrays, and then it is able to infer the source locations using new microphone arrays not seen at training time. Experiments with multiple number of moving speakers are conducted on both simulated and real-world data, which show that the full-band and narrow-band fusion network and the proposed multi-track DP-IPD learning target together achieve excellent sound source localization performance. Moreover, the proposed variable-array model generalizes well to unseen microphone arrays.

在不利声环境下，直接路径空间特征的提取是声源定位的关键。本文提出了一种基于麦克风阵列信号估计声源直接路径信道间相位差（DP-IPD）的神经网络IPDnet。估计的DP-IPD可以根据已知的麦克风阵列几何形状很容易地转换为源位置。首先，采用全频带和窄带融合网络进行DP-IPD估计，其中窄带和全频带联合层分别负责估计一个频带的DP-IPD原始信息和捕获DP-IPD的频率相关性。其次，提出了一种新的多轨道DP-IPD学习目标，用于定位灵活数量的声源。第三，将网络扩展到处理可变麦克风阵列。这个版本的IPDnet使用大量不同的麦克风阵列进行训练，然后它能够使用训练时未见的新麦克风阵列推断源位置。在模拟和现实数据上进行了多运动扬声器的实验，结果表明，全频带和窄带融合网络以及所提出的多轨DP-IPD学习目标共同实现了良好的声源定位性能。此外，所提出的可变阵列模型可以很好地推广到隐形麦克风阵列。

{"title":"IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization","authors":"Yabo Wang;Bing Yang;Xiaofei Li","doi":"10.1109/TASLP.2024.3507560","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507560","url":null,"abstract":"Extracting direct-path spatial feature is crucial for sound source localization in adverse acoustic environments. This paper proposes IPDnet, a neural network that estimates direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The estimated DP-IPD can be easily translated to source location based on the known microphone array geometry. First, a full-band and narrow-band fusion network is adopted for DP-IPD estimation, in which combined narrow-band and full-band layers are responsible for estimating the raw DP-IPD information in one frequency band and capturing the frequency correlations of DP-IPD, respectively. Second, a new multi-track DP-IPD learning target is proposed for the localization of a flexible number of sound sources. Third, the network is extended to handle variable microphone arrays. This version of IPDnet is trained with a large set of different microphone arrays, and then it is able to infer the source locations using new microphone arrays not seen at training time. Experiments with multiple number of moving speakers are conducted on both simulated and real-world data, which show that the full-band and narrow-band fusion network and the proposed multi-track DP-IPD learning target together achieve excellent sound source localization performance. Moreover, the proposed variable-array model generalizes well to unseen microphone arrays.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5051-5064"},"PeriodicalIF":4.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation MO-Transformer：用于神经机器翻译的词间高级关系提取

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-27 DOI: 10.1109/TASLP.2024.3507556

Sufeng Duan;Hai Zhao

In this paper, we propose an explanation of representation for self-attention network (SAN) based neural sequence encoders, which regards the information captured by the model and the encoding of the model as graph structure and the generation of these graph structures respectively. The proposed explanation applies to existing works on SAN-based models and can explain the relationship among the ability to capture the structural or linguistic information, depth of model, and length of sentence, and can also be extended to other models such as recurrent neural network based models. We also propose a revisited multigraph called Multi-order-Graph (MoG) based on our explanation to model the graph structures in the SAN-based model as subgraphs in MoG and convert the encoding of the SAN-based model to the generation of MoG. Based on our explanation, we further introduce an MO-Transformer by enhancing the ability to capture multiple subgraphs of different orders and focusing on subgraphs of high orders. Experimental results on multiple neural machine translation tasks show that the MO-Transformer can yield effective performance improvement.

本文提出了一种基于自注意网络（SAN）的神经序列编码器的表示解释，将模型捕获的信息和模型的编码分别视为图结构和图结构的生成。所提出的解释适用于基于san的模型的现有工作，可以解释捕获结构或语言信息的能力、模型深度和句子长度之间的关系，也可以扩展到其他模型，如基于循环神经网络的模型。在此基础上，我们提出了一种被称为多阶图（Multi-order-Graph, MoG）的多重图，将基于san的模型中的图结构建模为MoG中的子图，并将基于san的模型的编码转换为MoG的生成。根据我们的解释，我们通过增强捕获不同阶的多个子图的能力并专注于高阶子图，进一步介绍了MO-Transformer。在多个神经网络机器翻译任务上的实验结果表明，MO-Transformer可以有效地提高机器翻译的性能。

{"title":"MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation","authors":"Sufeng Duan;Hai Zhao","doi":"10.1109/TASLP.2024.3507556","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507556","url":null,"abstract":"In this paper, we propose an explanation of representation for self-attention network (SAN) based neural sequence encoders, which regards the information captured by the model and the encoding of the model as graph structure and the generation of these graph structures respectively. The proposed explanation applies to existing works on SAN-based models and can explain the relationship among the ability to capture the structural or linguistic information, depth of model, and length of sentence, and can also be extended to other models such as recurrent neural network based models. We also propose a revisited multigraph called Multi-order-Graph (MoG) based on our explanation to model the graph structures in the SAN-based model as subgraphs in MoG and convert the encoding of the SAN-based model to the generation of MoG. Based on our explanation, we further introduce an MO-Transformer by enhancing the ability to capture multiple subgraphs of different orders and focusing on subgraphs of high orders. Experimental results on multiple neural machine translation tasks show that the MO-Transformer can yield effective performance improvement.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5065-5077"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach 盲音频带宽扩展：基于扩散的零镜头方法

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-27 DOI: 10.1109/TASLP.2024.3507566

Eloi Moliner;Filip Elvander;Vesa Välimäki

Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/

音频带宽扩展涉及从带宽有限的观测中真实地重建高频频谱。在低通降级未知的情况下，例如在恢复历史音频记录时，这就变成了一个盲目的问题。本文介绍了一种名为BABE（盲音频带宽扩展）的新方法，该方法利用预训练无条件扩散模型的生成先验来解决零射击设置中的盲问题。在推理过程中，BABE使用了一种广义的扩散后验抽样，其中退化算子是未知的，但被参数化并迭代推断。使用客观和主观指标对所提出方法的性能进行了评估，结果表明，在使用合成数据进行测试时，与知情方法相比，BABE超越了最先进的盲带宽扩展基线，取得了具有竞争力的性能。此外，BABE在增强真实历史记录时表现出强大的泛化能力，有效地重建缺失的高频内容，同时保持与原始记录的一致性。主观偏好测试证实，BABE显著提高了历史音乐录音的音频质量。使用所建议的方法恢复的历史记录的示例可在配套网页上找到：http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/

{"title":"Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach","authors":"Eloi Moliner;Filip Elvander;Vesa Välimäki","doi":"10.1109/TASLP.2024.3507566","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507566","url":null,"abstract":"Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: \u0000<uri>http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/</uri>","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5092-5105"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10768977","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Online Neural Speaker Diarization With Target Speaker Tracking 基于目标说话人跟踪的在线神经说话人划分

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-27 DOI: 10.1109/TASLP.2024.3507559

Weiqing Wang;Ming Li

This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.

本文提出了一种针对说话人语音化任务的在线目标说话人语音活动检测（TS-VAD）系统，该系统不依赖于基于聚类的语音化系统的先验知识来获取目标说话人嵌入。通过将传统的TS-VAD用于实时操作，我们的框架使用自生成的嵌入来识别说话人的活动，确保了一致的性能，并避免了推理过程中的排列不一致。在推理阶段，我们采用前端模型为每个输入信号块提取帧级扬声器嵌入。随后，我们基于这些帧级嵌入和先前估计的目标说话人嵌入来预测每个说话人的检测状态。然后根据当前块的预测聚合帧级嵌入来更新目标说话人嵌入。我们的模型逐块预测结果，并迭代更新目标说话人嵌入，直到到达信号的末端。实验结果表明，该方法在DIHARD III和AliMeeting数据集上优于基于离线聚类的分类系统。此外，该方法可扩展到多通道数据，实现与最先进的离线拨号系统相当的性能。

{"title":"Online Neural Speaker Diarization With Target Speaker Tracking","authors":"Weiqing Wang;Ming Li","doi":"10.1109/TASLP.2024.3507559","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507559","url":null,"abstract":"This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5078-5091"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Interpretable Deep Mutual Information Curriculum Metric for a Robust and Generalized Speech Emotion Recognition System 鲁棒广义语音情感识别系统的可解释深度互信息课程度量

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-27 DOI: 10.1109/TASLP.2024.3507562

Wei-Cheng Lin;Kusha Sridhar;Carlos Busso

It is difficult to achieve robust and well-generalized models for tasks involving subjective concepts such as emotion. It is inevitable to deal with noisy labels, given the ambiguous nature of human perception. Methodologies relying on semi-supervised learning (SSL) and curriculum learning have been proposed to enhance the generalization of the models. This study proposes a novel deep mutual information (DeepMI) metric, built with the SSL pre-trained DeepEmoCluster framework to establish the difficulty of samples. The DeepMI metric quantifies the relationship between the acoustic patterns and emotional attributes (e.g., arousal, valence, and dominance). The DeepMI metric provides a better curriculum, achieving state-of-the-art performance that is higher than results obtained with existing curriculum metrics for speech emotion recognition (SER). We evaluate the proposed method with three emotional datasets in matched and mismatched testing conditions. The experimental evaluations systematically show that a model trained with the DeepMI metric not only obtains competitive generalization performances, but also maintains convergence stability. Furthermore, the extracted DeepMI values are highly interpretable, reflecting information ranks of the training samples.

对于涉及主观概念（如情感）的任务，很难实现鲁棒性和良好的泛化模型。考虑到人类感知的模糊性，处理有噪声的标签是不可避免的。基于半监督学习（SSL）和课程学习的方法被提出来增强模型的泛化。本研究提出了一种新的深度互信息（DeepMI）度量，该度量使用SSL预训练的DeepEmoCluster框架构建，以确定样本的难度。DeepMI度量量化了声学模式和情感属性之间的关系（例如，唤醒、效价和支配）。DeepMI指标提供了更好的课程，实现了比语音情感识别（SER）现有课程指标更高的最先进性能。我们在匹配和不匹配的测试条件下用三个情感数据集评估了所提出的方法。系统的实验评估表明，使用DeepMI度量训练的模型不仅获得了具有竞争力的泛化性能，而且保持了收敛稳定性。此外，提取的DeepMI值具有很高的可解释性，反映了训练样本的信息等级。

{"title":"An Interpretable Deep Mutual Information Curriculum Metric for a Robust and Generalized Speech Emotion Recognition System","authors":"Wei-Cheng Lin;Kusha Sridhar;Carlos Busso","doi":"10.1109/TASLP.2024.3507562","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507562","url":null,"abstract":"It is difficult to achieve robust and well-generalized models for tasks involving subjective concepts such as emotion. It is inevitable to deal with noisy labels, given the ambiguous nature of human perception. Methodologies relying on \u0000<italic>semi-supervised learning</i>\u0000 (SSL) and curriculum learning have been proposed to enhance the generalization of the models. This study proposes a novel \u0000<italic>deep mutual information</i>\u0000 (DeepMI) metric, built with the SSL pre-trained DeepEmoCluster framework to establish the difficulty of samples. The DeepMI metric quantifies the relationship between the acoustic patterns and emotional attributes (e.g., arousal, valence, and dominance). The DeepMI metric provides a better curriculum, achieving state-of-the-art performance that is higher than results obtained with existing curriculum metrics for \u0000<italic>speech emotion recognition</i>\u0000 (SER). We evaluate the proposed method with three emotional datasets in matched and mismatched testing conditions. The experimental evaluations systematically show that a model trained with the DeepMI metric not only obtains competitive generalization performances, but also maintains convergence stability. Furthermore, the extracted DeepMI values are highly interpretable, reflecting information ranks of the training samples.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5117-5130"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10768985","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models 利用神经自回归模型实现高效实时钢琴转录

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-27 DOI: 10.1109/TASLP.2024.3507568

Taegyun Kwon;Dasaem Jeong;Juhan Nam

In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time piano transcription with a focus on achieving both high performance and a lightweight model. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we demonstrate that the proposed components are necessary for achieving high performance in an autoregressive model. Additionally, we provide experiments on real-time latency.

近年来，神经网络设计的进步和大规模标记数据集的可用性导致钢琴转录模型的准确性有了显着提高。然而，大多数先前的工作都集中在高性能离线转录上，忽略了对模型大小的刻意考虑。这项工作的目标是实现实时钢琴转录，重点是实现高性能和轻量级模型。为此，我们提出了卷积递归神经网络的新架构，重新设计了现有的自回归钢琴转录模型。首先，我们通过在CNN模块中添加频率调节的FiLM层来扩展声学模块，以适应频率轴上的卷积滤波器。其次，我们通过使用音调方向的LSTM来改进音符状态序列建模，该LSTM专注于音符内的音符状态转换。此外，我们用增强的递归上下文增强了自回归连接。利用这些组件，我们提出了两种类型的模型；一个是高性能，另一个是高紧凑性。通过大量的实验，我们证明了所提出的组件对于实现自回归模型的高性能是必要的。此外，我们还提供了实时延迟的实验。

{"title":"Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models","authors":"Taegyun Kwon;Dasaem Jeong;Juhan Nam","doi":"10.1109/TASLP.2024.3507568","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507568","url":null,"abstract":"In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time piano transcription with a focus on achieving both high performance and a lightweight model. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we demonstrate that the proposed components are necessary for achieving high performance in an autoregressive model. Additionally, we provide experiments on real-time latency.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5106-5116"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning 扬声器建模及其应用概述：从深度扬声器表征学习的角度看扬声器建模及其应用

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-21 DOI: 10.1109/TASLP.2024.3492793

Shuai Wang;Zhengyang Chen;Kong Aik Lee;Yanmin Qian;Haizhou Li

Speaker individuality information is among the most critical elements within speech signals. By thoroughly and accurately modeling this information, it can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this overview, we present a comprehensive review of neural approaches to speaker representation learning from both theoretical and practical perspectives. Theoretically, we discuss speaker encoders ranging from supervised to self-supervised learning algorithms, standalone models to large pretrained models, pure speaker embedding learning to joint optimization with downstream tasks, and efforts toward interpretability. Practically, we systematically examine approaches for robustness and effectiveness, introduce and compare various open-source toolkits in the field. Through the systematic and comprehensive review of the relevant literature, research activities, and resources, we provide a clear reference for researchers in the speaker characterization and modeling field, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.

说话人个性信息是语音信号中最关键的要素之一。通过对这些信息进行全面而准确的建模，可以将其用于各种智能语音应用中，如说话人识别、说话人日记化、语音合成和目标说话人提取。在本综述中，我们将从理论和实践两个角度全面评述扬声器表征学习的神经方法。在理论方面，我们讨论了从监督学习算法到自我监督学习算法的扬声器编码器、从独立模型到大型预训练模型、从纯粹的扬声器嵌入学习到与下游任务的联合优化，以及为实现可解释性所做的努力。在实践中，我们系统地检查了各种方法的鲁棒性和有效性，介绍并比较了该领域的各种开源工具包。通过对相关文献、研究活动和资源进行系统而全面的回顾，我们为扬声器表征和建模领域的研究人员以及希望将扬声器建模技术应用于特定下游任务的人员提供了清晰的参考。

{"title":"Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning","authors":"Shuai Wang;Zhengyang Chen;Kong Aik Lee;Yanmin Qian;Haizhou Li","doi":"10.1109/TASLP.2024.3492793","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492793","url":null,"abstract":"Speaker individuality information is among the most critical elements within speech signals. By thoroughly and accurately modeling this information, it can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this overview, we present a comprehensive review of neural approaches to speaker representation learning from both theoretical and practical perspectives. Theoretically, we discuss speaker encoders ranging from supervised to self-supervised learning algorithms, standalone models to large pretrained models, pure speaker embedding learning to joint optimization with downstream tasks, and efforts toward interpretability. Practically, we systematically examine approaches for robustness and effectiveness, introduce and compare various open-source toolkits in the field. Through the systematic and comprehensive review of the relevant literature, research activities, and resources, we provide a clear reference for researchers in the speaker characterization and modeling field, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4971-4998"},"PeriodicalIF":4.1,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction CLAPSep：利用对比预训练模型进行多模态查询条件下的目标声音提取

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-13 DOI: 10.1109/TASLP.2024.3497586

Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu

Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.

通用声音分离（USS）旨在从真实世界的录音中提取任意类型的声音。这可以通过语言查询目标声音提取（TSE）来实现，TSE 通常由两个部分组成：一个是将用户查询转换为条件嵌入的查询网络，另一个是相应提取目标声音的分离网络。现有方法通常从头开始训练模型。因此，要使随机初始化的模型理解声音事件并进行相应的分离，需要大量的数据和计算资源。在本文中，我们建议将预先训练好的模型集成到 TSE 模型中，以解决上述问题。具体来说，我们为 USS 定制并调整了强大的对比语言音频预训练模型（CLAP），称为 CLAPSep；CLAPSep 还接受灵活的用户输入，可接受单模态和/或多模态的正反两方面用户提示来提取目标声音。CLAPSep 的这些关键功能不仅能提高提取性能，还能改善其应用的多样性。我们在 5 个不同的数据集上进行了广泛的实验，证明了我们所提出的 CLAPSep 性能优越，具有零次和少次通用性，训练收敛速度快，大大超过了以前的方法。我们还发布了完整的代码和一些音频示例，以供复制和评估。

{"title":"CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction","authors":"Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu","doi":"10.1109/TASLP.2024.3497586","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3497586","url":null,"abstract":"Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4945-4960"},"PeriodicalIF":4.1,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

$mathcal {P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis $mathcal {P}$owMix：用于多模态情感分析的多功能正则化器

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-11-11 DOI: 10.1109/TASLP.2024.3496316

Efthymios Georgiou;Yannis Avrithis;Alexandros Potamianos

Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. Despite significant progress in multimodal architecture design, the field lacks comprehensive regularization methods. This paper introduces

$mathcal {P}$

owMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches and introduces novel algorithmic components that are specifically tailored to multimodal tasks.

$mathcal {P}$

owMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.

$mathcal {P}$

owMix consists of five components: 1) a varying number of generated mixed examples, 2) mixing factor reweighting, 3) anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing. Extensive experimentation across benchmark MSA datasets and a broad spectrum of diverse architectural designs demonstrate the efficacy of

$mathcal {P}$

owMix, as evidenced by consistent performance improvements over baselines and existing mixing methods. An in-depth ablation study highlights the critical contribution of each

$mathcal {P}$

owMix component and how they synergistically enhance performance. Furthermore, algorithmic analysis demonstrates how

$mathcal {P}$

owMix behaves in different scenarios, particularly comparing early versus late fusion architectures. Notably,

$mathcal {P}$

owMix enhances overall performance without sacrificing model robustness or magnifying text dominance. It also retains its strong performance in situations of limited data. Our findings position

$mathcal {P}$

owMix as a promising versatile regularization strategy for MSA.

多模态情感分析（MSA）利用异构数据源来解读人类情感的复杂本质。尽管多模态架构设计取得了重大进展，但该领域仍缺乏全面的正则化方法。本文介绍了$mathcal {P}$owMix，这是一种多功能嵌入空间正则化方法，它借鉴了基于单模态混合的正则化方法的优点，并引入了专门为多模态任务定制的新型算法组件。$mathcal {P}$owMix 集成在多模态架构的融合阶段之前，并促进模态内混合，如文本与文本混合，以充当正则化器。{P}$owMix由五个部分组成：1）生成混合示例的不同数量；2）混合因子重新加权；3）各向异性混合；4）动态混合；5）跨模态标签混合。在基准 MSA 数据集和各种不同的架构设计中进行的广泛实验证明了 $mathcal {P}$owMix 的功效，其性能比基线和现有混合方法有了持续改善。一项深入的烧蚀研究强调了每个 $mathcal {P}$owMix 组件的关键贡献，以及它们如何协同提高性能。此外，算法分析展示了$mathcal {P}$owMix在不同场景下的表现，尤其是早期与晚期融合架构的比较。值得注意的是，$mathcal {P}$owMix 在不牺牲模型鲁棒性或放大文本优势的情况下提高了整体性能。在数据有限的情况下，它也能保持强劲的性能。我们的研究结果表明，$mathcal {P}$owMix 是一种很有前途的 MSA 多用途正则化策略。

{"title":"$mathcal {P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis","authors":"Efthymios Georgiou;Yannis Avrithis;Alexandros Potamianos","doi":"10.1109/TASLP.2024.3496316","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3496316","url":null,"abstract":"Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. Despite significant progress in multimodal architecture design, the field lacks comprehensive regularization methods. This paper introduces \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches and introduces novel algorithmic components that are specifically tailored to multimodal tasks. \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer. \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix consists of five components: 1) a varying number of generated mixed examples, 2) mixing factor reweighting, 3) anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing. Extensive experimentation across benchmark MSA datasets and a broad spectrum of diverse architectural designs demonstrate the efficacy of \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix, as evidenced by consistent performance improvements over baselines and existing mixing methods. An in-depth ablation study highlights the critical contribution of each \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix component and how they synergistically enhance performance. Furthermore, algorithmic analysis demonstrates how \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix behaves in different scenarios, particularly comparing early versus late fusion architectures. Notably, \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix enhances overall performance without sacrificing model robustness or magnifying text dominance. It also retains its strong performance in situations of limited data. Our findings position \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix as a promising versatile regularization strategy for MSA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5010-5023"},"PeriodicalIF":4.1,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0