Pub Date: 2024-08-02 | DOI: 10.1109/TASLP.2024.3436702
Matteo Scerbo;Lauri Savioja;Enzo De Sena
Room acoustic synthesis can be used in virtual reality (VR), augmented reality (AR) and gaming applications to enhance listeners' sense of immersion, realism and externalisation. A common approach is to use geometrical acoustics (GA) models to compute impulse responses at interactive speed, and fast convolution methods to apply said responses in real time. Alternatively, delay-network-based models are capable of modeling certain aspects of room acoustics, but with a significantly lower computational cost. In order to bridge the gap between these classes of models, recent work introduced delay network designs that approximate Acoustic Radiance Transfer (ART), a geometrical acoustics (GA) model that simulates the transfer of acoustic energy between discrete surface patches in an environment. This paper presents two key extensions of such designs. The first extension involves a new physically-based and stability-preserving design of the feedback matrices, enabling more accurate control of scattering and, more generally, of late reverberation properties. The second extension allows an arbitrary number of early reflections to be modeled with high accuracy, meaning the network can be scaled at will to trade computational cost against early-reverberation precision. The proposed extensions are compared to the baseline ART-approximating delay network as well as two reference GA models. The evaluation is based on objective measures of perceptually-relevant features, including frequency-dependent reverberation times, echo density build-up, and early decay time. Results show that the proposed extensions yield a significant improvement over the baseline model, especially for the case of non-convex geometries or the case of unevenly distributed wall absorption, both scenarios of broad practical interest.
{"title":"Room Acoustic Rendering Networks With Control of Scattering and Early Reflections","authors":"Matteo Scerbo;Lauri Savioja;Enzo De Sena","doi":"10.1109/TASLP.2024.3436702","DOIUrl":"10.1109/TASLP.2024.3436702","url":null,"abstract":"Room acoustic synthesis can be used in virtual reality (VR), augmented reality (AR) and gaming applications to enhance listeners' sense of immersion, realism and externalisation. A common approach is to use geometrical acoustics (GA) models to compute impulse responses at interactive speed, and fast convolution methods to apply said responses in real time. Alternatively, delay-network-based models are capable of modeling certain aspects of room acoustics, but with a significantly lower computational cost. In order to bridge the gap between these classes of models, recent work introduced delay network designs that approximate Acoustic Radiance Transfer (ART), a geometrical acoustics (GA) model that simulates the transfer of acoustic energy between discrete surface patches in an environment. This paper presents two key extensions of such designs. The first extension involves a new physically-based and stability-preserving design of the feedback matrices, enabling more accurate control of scattering and, more in general, of late reverberation properties. The second extension allows an arbitrary number of early reflections to be modeled with high accuracy, meaning the network can be scaled at will between computational cost and early reverberation precision. The proposed extensions are compared to the baseline ART-approximating delay network as well as two reference GA models. The evaluation is based on objective measures of perceptually-relevant features, including frequency-dependent reverberation times, echo density build-up, and early decay time. Results show how the proposed extensions result in a significant improvement over the baseline model, especially for the case of non-convex geometries or the case of unevenly distributed wall absorption, both scenarios of broad practical interest.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3745-3758"},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet only a small number of studies have explored audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.
{"title":"A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism","authors":"Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng","doi":"10.1109/TASLP.2024.3426303","DOIUrl":"10.1109/TASLP.2024.3426303","url":null,"abstract":"Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet there is a small number of studies based on audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3618-3630"},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10614622","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-29 | DOI: 10.1109/TASLP.2024.3434497
Anurag Das;Ricardo Gutierrez-Osuna
Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text-to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate on phonetic information, and we examine whether a boost in MDD performance can be obtained by training the two tasks jointly. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed them to the MDD system along with the text input. The MDD system then predicts the target annotated phones and synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss and an MDD loss. Comparing the proposed system against an identical system without speech reconstruction and against another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the proposed system also achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones contain mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than speech generated from phones without mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.
{"title":"Improving Mispronunciation Detection Using Speech Reconstruction","authors":"Anurag Das;Ricardo Gutierrez-Osuna","doi":"10.1109/TASLP.2024.3434497","DOIUrl":"10.1109/TASLP.2024.3434497","url":null,"abstract":"Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text- to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate using phonetic information and we wanted to examine whether a boost in MDD performance can be by two tasks. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed it to the MDD system along with the text input. The MDD system then predicts the target annotated phones and then synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss as well as an MDD loss. Comparing the proposed systems against an identical system but without speech reconstruction and another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the and speaker verification simultaneously can lead to improve proposed system achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones have mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than phones without any mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4420-4433"},"PeriodicalIF":4.1,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA