IEEE/ACM Transactions on Audio, Speech, and Language Processing最新文献_第6页

Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children 自动检测粤语学龄前儿童的语音障碍

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-18 DOI: 10.1109/TASLP.2024.3463503

Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee

Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with appropriate language background. With the unsatisfied demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. Phonological error is identified by contrasting a test speech segment to reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to disorders. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach can achieve a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard the information related to speaker's personal identity.

言语发声障碍（SSD）是一种发育障碍，儿童在正确发出某些言语声音时会遇到持续性困难。传统上，对 SSD 的评估主要依赖于具有相应语言背景的言语和语言病理学家（SLPs）。由于对合格语言病理学家的需求得不到满足，自动检测 SSD 对辅助临床工作、提高服务效率和质量非常有必要。本文研究了全自动检测幼儿 SSD 的方法和系统。本文开发了一种微观方法和一种宏观方法。微观系统基于检测受损儿童语音中的语音错误。通过训练深度神经网络（DNN）模型来学习辅音片段之间的相似度和对比度。通过将测试语音片段与参考片段进行对比来识别语音错误。电话级的相似性分数被汇总到扬声器级的 SSD 检测中。宏观方法利用与障碍有关的语音特征的整体变化。研究并比较了各种类型的扬声器级嵌入。实验结果表明，所提出的微观系统在电话级错误检测方面实现了 84.0% 到 91.9% 的非加权平均召回率（UAR）。所提出的宏观方法在扬声器级 SSD 检测方面的 UAR 可达到 89.0%。宏观固态硬盘检测所采用的说话人嵌入可以有效地剔除与说话人个人身份相关的信息。

{"title":"Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children","authors":"Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee","doi":"10.1109/TASLP.2024.3463503","DOIUrl":"10.1109/TASLP.2024.3463503","url":null,"abstract":"Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with appropriate language background. With the unsatisfied demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. Phonological error is identified by contrasting a test speech segment to reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to disorders. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach can achieve a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard the information related to speaker's personal identity.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4355-4368"},"PeriodicalIF":4.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation 利用时态卷积注意力网络进行音视频融合以实现语音分离

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-18 DOI: 10.1109/TASLP.2024.3463411

Debang Liu;Tianqi Zhang;Mads Græsbøll Christensen;Chen Yi;Zeliang An

Currently, audio-visual speech separation methods utilize the speaker's audio and visual correlation information to help separate the speech of the target speaker. However, these methods commonly use the approach of feature concatenation with linear mapping to obtain the fused audio-visual features, which prompts us to conduct a deeper exploration for audio-visual fusion. Therefore, in this paper, according to the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with temporal convolution attention network for speech separation model (AVTCA). In this method, we design temporal convolution attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct sequence learning and fusion network. In the whole deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then we use the TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. Afterwards, the fused audio-visual features sequences will be used as input to the separation network to predict mask and separate the source of each speaker. Finally, this paper conducts comparative experiments on Vox2, GRID, LRS2 and TCD-TIMIT datasets, indicating that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it exhibits greater efficiency in computational performance and model size.

目前，视听语音分离方法利用说话人的视听相关信息来帮助分离目标说话人的语音。然而，这些方法通常使用线性映射的特征串联方法来获得融合的视听特征，这促使我们对视听融合进行更深入的探索。因此，本文根据说话人在说话过程中的嘴部地标运动，提出了一种新颖的时域单通道视听语音分离方法：视听融合与时空卷积注意力网络语音分离模型（AVTCA）。在该方法中，我们设计了基于注意力机制的时空卷积注意力网络（TCANet）来模拟音频和视觉序列之间的上下文关系，并以 TCANet 为基本单元来构建序列学习和融合网络。在整个深度分离框架中，我们首先利用交叉注意力关注视听序列的交叉相关信息，然后利用 TCANet 融合具有时间依赖性和交叉相关性的视听特征序列。之后，融合后的视听特征序列将作为分离网络的输入，用于预测掩码和分离每个说话者的声源。最后，本文在 Vox2、GRID、LRS2 和 TCD-TIMIT 数据集上进行了对比实验，结果表明 AVTCA 优于其他最先进的（SOTA）分离方法。此外，它在计算性能和模型大小方面也表现出更高的效率。

{"title":"Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation","authors":"Debang Liu;Tianqi Zhang;Mads Græsbøll Christensen;Chen Yi;Zeliang An","doi":"10.1109/TASLP.2024.3463411","DOIUrl":"10.1109/TASLP.2024.3463411","url":null,"abstract":"Currently, audio-visual speech separation methods utilize the speaker's audio and visual correlation information to help separate the speech of the target speaker. However, these methods commonly use the approach of feature concatenation with linear mapping to obtain the fused audio-visual features, which prompts us to conduct a deeper exploration for audio-visual fusion. Therefore, in this paper, according to the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with temporal convolution attention network for speech separation model (AVTCA). In this method, we design temporal convolution attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct sequence learning and fusion network. In the whole deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then we use the TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. Afterwards, the fused audio-visual features sequences will be used as input to the separation network to predict mask and separate the source of each speaker. Finally, this paper conducts comparative experiments on Vox2, GRID, LRS2 and TCD-TIMIT datasets, indicating that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it exhibits greater efficiency in computational performance and model size.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4647-4660"},"PeriodicalIF":4.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps 利用广播式 CNN 变换器和自我注意力地图的知识蒸馏训练实现高效轻量级扬声器验证

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-18 DOI: 10.1109/TASLP.2024.3463491

Jeong-Hwan Choi;Joon-Young Yang;Joon-Hyuk Chang

Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical implementation of automatic speaker verification (ASV) systems. To this end, we recently introduced broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within the hybrid CNN-Transformer architecture to maintain a small number of model parameters. We proposed three BC-CMT-based SEE with three different sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend our previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational costs and latency of the BC-CMT, the two-dimensional (2D) SA operations in the BC-CMT, which calculate the SA maps in the frequency–time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to have smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used for the KD to guide those of a smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps with different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model having 271.44K model parameters achieved 36.8% and 9.3% reduction in floating point operations on 1s signals and equal error rate (EER) on VoxCeleb 1 testset, respectively, compared to the conventional BC-CMT-Tiny. The CPU and GPU running time of the proposed BC-CMT-Tiny ranges of 1 to 10 s signals were 29.07 to 146.32 ms and 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% with improved attention capability.

开发轻量级扬声器嵌入提取器（SEE）对于实际应用自动扬声器验证（ASV）系统至关重要。为此，我们最近推出了广播卷积神经网络（CNN）-视觉变换器（BC-CMT），这是一种轻量级 SEE，它在混合 CNN-变换器架构中利用广播残差学习（BRL）来保持少量模型参数。我们提出了三种不同规模的基于 BC-CMT 的 SEE：BC-CMT-Tiny、-Small 和-Base。在本研究中，我们对之前提出的 BC-CMT 进行了扩展，引入了改进的模型架构和基于知识提炼（KD）的训练策略，并使用了自我注意（SA）地图。首先，为了降低 BC-CMT 的计算成本和延迟，我们将 BC-CMT 中计算频率-时间维度 SA 地图的二维 (2D) SA 操作简化为只考虑时间重要性的一维 SA 操作。此外，为了增强 BC-CMT 的 SA 能力，还将 SA 块中的分组卷积层调整为较少的分组数，并与 BRL 运算相结合。其次，为提高改进型 BC-CMT-Tiny 的训练效果，在 KD 中使用预训练的大型 BC-CMT-Base 的 SA 地图来指导小型 BC-CMT-Tiny 的 SA 地图。由于修改后的 BC-CMT 模型的注意图大小并不取决于频带数或卷积通道数，因此所提出的策略可以在不同大小的特征图之间进行 KD。实验结果表明，与传统的 BC-CMT-Tiny 模型相比，拥有 271.44K 模型参数的 BC-CMT-Tiny 模型在 VoxCeleb 1 测试集上的 1s 信号浮点运算和等错误率（EER）分别减少了 36.8% 和 9.3%。建议的 BC-CMT-Tiny 在 1 至 10 s 信号范围内的 CPU 和 GPU 运行时间分别为 29.07 至 146.32 ms 和 36.01 至 206.43 ms。随着注意力能力的提高，拟议的 KD 进一步将 EER 降低了 15.5%。

{"title":"Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps","authors":"Jeong-Hwan Choi;Joon-Young Yang;Joon-Hyuk Chang","doi":"10.1109/TASLP.2024.3463491","DOIUrl":"10.1109/TASLP.2024.3463491","url":null,"abstract":"Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical implementation of automatic speaker verification (ASV) systems. To this end, we recently introduced \u0000<italic>broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers</i>\u0000 (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within the hybrid CNN-Transformer architecture to maintain a small number of model parameters. We proposed three BC-CMT-based SEE with three different sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend our previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational costs and latency of the BC-CMT, the two-dimensional (2D) SA operations in the BC-CMT, which calculate the SA maps in the frequency–time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to have smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used for the KD to guide those of a smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps with different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model having 271.44K model parameters achieved 36.8% and 9.3% reduction in floating point operations on 1s signals and equal error rate (EER) on VoxCeleb 1 testset, respectively, compared to the conventional BC-CMT-Tiny. The CPU and GPU running time of the proposed BC-CMT-Tiny ranges of 1 to 10 s signals were 29.07 to 146.32 ms and 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% with improved attention capability.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4580-4595"},"PeriodicalIF":4.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PHAIN: Audio Inpainting via Phase-Aware Optimization With Instantaneous Frequency PHAIN：通过瞬时频率相位感知优化进行音频绘制

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-18 DOI: 10.1109/TASLP.2024.3463415

Tomoro Tanaka;Kohei Yatabe;Yasuhiro Oikawa

Audio inpainting restores locally corrupted parts of digital audio signals. Sparsity-based methods achieve this by promoting sparsity in the time-frequency (T-F) domain, assuming short-time audio segments consist of a few sinusoids. However, such sparsity promotion reduces the magnitudes of the resulting waveforms; moreover, it often ignores the temporal connections of sinusoidal components. To address these problems, we propose a novel phase-aware audio inpainting method. Our method minimizes the time variations of a particular T-F representation calculated using the time derivative of the phase. This promotes sinusoidal components that coherently fit in the corrupted parts without directly suppressing the magnitudes. Both objective and subjective experiments confirmed the superiority of the proposed method compared with state-of-the-art methods.

音频绘制可恢复数字音频信号中局部损坏的部分。基于稀疏性的方法通过提高时频（T-F）域的稀疏性来实现这一目标，假定短时音频片段由几个正弦波组成。然而，这种稀疏性提升降低了所得到的波形的幅度，而且往往忽略了正弦成分的时间联系。为了解决这些问题，我们提出了一种新颖的相位感知音频绘制方法。我们的方法能最大限度地减少利用相位的时间导数计算出的特定 T-F 表示法的时间变化。这样就能在不直接抑制幅度的情况下，促进正弦波成分与损坏部分的一致性。客观和主观实验都证实，与最先进的方法相比，我们提出的方法更胜一筹。

引用次数: 0

AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events 音频网：有监督的深度散列检索相似音频事件

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-17 DOI: 10.1109/TASLP.2024.3446232

Sagar Dutta;Vipul Arora

This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compare to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.

这项研究提出了一种用于检索相似音频事件的有监督深度散列方法。所提出的方法名为 AudioNet，是一种基于深度学习的系统，可使用音频示例作为查询，对类似音频事件进行高效散列和检索。通过为相似音频事件生成二进制散列码，AudioNet 在多个标准数据集上实现了较高的检索性能，在该领域树立了新的标杆，并凸显了其与其他散列方法相比的功效和有效性。通过对标准数据集的全面实验，我们的研究在评估相似音频事件的检索性能方面做出了开创性的努力。我们提出了一种新的损失函数，它结合了加权对比损失和加权成对损失以及哈希码平衡，以提高音频事件检索的效率。该方法采用离散梯度传播，允许在反向传播过程中通过离散变量传播梯度。这样，网络就能使用通常用于连续变量的基于梯度的标准优化算法来优化离散散列码。实验结果表明，即使在处理不平衡数据集时，所提出的方法也能显示出良好的检索性能。本研究中进行的系统分析进一步证实了所提方法在多个数据集检索性能方面的显著优势。本研究的发现为今后利用深度音频嵌入高效检索相似音频事件的研究奠定了基础。

{"title":"AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events","authors":"Sagar Dutta;Vipul Arora","doi":"10.1109/TASLP.2024.3446232","DOIUrl":"10.1109/TASLP.2024.3446232","url":null,"abstract":"This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compare to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4526-4536"},"PeriodicalIF":4.1,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-Task Multi-Attention Transformer for Generative Named Entity Recognition 用于生成式命名实体识别的多任务多注意转换器

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-12 DOI: 10.1109/TASLP.2024.3458796

Ying Mo;Jiahao Liu;Hongyin Tang;Qifan Wang;Zenglin Xu;Jingang Wang;Xiaojun Quan;Wei Wu;Zhoujun Li

Most previous sequential labeling models are task-specific, while recent years have witnessed the rise of generative models due to the advantage of unifying all named entity recognition (NER) tasks into the encoder-decoder framework. Although achieving promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. In this paper, we propose a multi-task Transformer, which incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate the information into the model via the self- and cross-attention mechanisms. We perform experiments on extensive NER benchmarks, including flat, nested, and discontinuous NER datasets involving long entities. It substantially increases nearly

$+0.3 sim +1.5;{F_1}$

scores across a broad spectrum or performs closely to the best generative NER model. Experimental results show that our approach improves the performance of the generative NER model considerably.

以前的大多数顺序标注模型都是针对特定任务的，而近年来，由于将所有命名实体识别（NER）任务统一到编码器-解码器框架中的优势，生成模型开始兴起。尽管取得了可喜的成绩，但我们的试验研究表明，现有的生成模型在检测实体边界和估计实体类型方面效果不佳。在本文中，我们提出了一种多任务转换器，它将实体边界检测任务纳入命名实体识别任务中。更具体地说，我们通过对句子中标记之间的关系进行分类来实现实体边界检测。为了提高解码过程中实体类型映射的准确性，我们采用外部知识库来计算先验实体类型分布，然后通过自关注和交叉关注机制将这些信息纳入模型。我们在广泛的 NER 基准上进行了实验，包括涉及长实体的平面、嵌套和不连续 NER 数据集。它在广泛的范围内大幅提高了近 $+0.3 （sim +1.5）;{F_1}$ 的得分，或与最佳生成式 NER 模型的表现接近。实验结果表明，我们的方法大大提高了生成式 NER 模型的性能。

{"title":"Multi-Task Multi-Attention Transformer for Generative Named Entity Recognition","authors":"Ying Mo;Jiahao Liu;Hongyin Tang;Qifan Wang;Zenglin Xu;Jingang Wang;Xiaojun Quan;Wei Wu;Zhoujun Li","doi":"10.1109/TASLP.2024.3458796","DOIUrl":"10.1109/TASLP.2024.3458796","url":null,"abstract":"Most previous sequential labeling models are task-specific, while recent years have witnessed the rise of generative models due to the advantage of unifying all named entity recognition (NER) tasks into the encoder-decoder framework. Although achieving promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. In this paper, we propose a multi-task Transformer, which incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate the information into the model via the self- and cross-attention mechanisms. We perform experiments on extensive NER benchmarks, including flat, nested, and discontinuous NER datasets involving long entities. It substantially increases nearly \u0000<inline-formula><tex-math>$+0.3 sim +1.5;{F_1}$</tex-math></inline-formula>\u0000 scores across a broad spectrum or performs closely to the best generative NER model. Experimental results show that our approach improves the performance of the generative NER model considerably.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4171-4183"},"PeriodicalIF":4.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Filtered-X Quasi Affine Projection Algorithm for Active Noise Control Networks 用于主动噪声控制网络的滤波-X 准仿射投影算法

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-12 DOI: 10.1109/TASLP.2024.3458806

Miguel Ferrer;María de Diego;Alberto Gonzalez

The affine projection (AP) algorithm enhances the performance of gradient-based adaptive algorithms when dealing with colored reference signals, which is typically the case with filtered-X type algorithms. This enhancement is achieved by using various delayed versions of the reference signal data vector, which are appropriately orthogonalized and normalized to optimize convergence performance. The number of these vectors, known as the projection order of the AP, increases the computational requirements, mainly due to the calculation of a matrix inversion whose dimensions are proportional to this projection order. When used in distributed systems, the AP algorithm typically requires each acoustic node in the system to compute the complete matrix inversion, even though they only need a specific set of data (a subblock) from it. This means that the AP does not offer much advantage in terms of computational savings when used in distributed collaborative networks. To address this issue, an approximate version of the filtered-X affine projection (FXAP) algorithm is introduced in this work. This approximate version avoids the matrix inversion computation in each iteration using a precalculated inverse matrix. This strategy provides computational savings and enables easy distribution of the algorithm. Additionally, a variable step-size approach is proposed to mitigate the deviation caused by a precalculated matrix, which provides good performance, high robustness, and cost-effective distribution.

仿射投影（AP）算法可提高基于梯度的自适应算法在处理彩色参考信号时的性能，而滤波 X 型算法通常就是这种情况。这种增强是通过使用参考信号数据向量的各种延迟版本来实现的，这些延迟版本经过适当的正交化和归一化处理，以优化收敛性能。这些向量的数量，即 AP 的投影阶数，会增加计算需求，主要是因为需要计算矩阵反演，而矩阵反演的维数与投影阶数成正比。在分布式系统中使用时，AP 算法通常要求系统中的每个声学节点计算完整的矩阵反演，即使它们只需要其中的特定数据集（子块）。这意味着，在分布式协作网络中使用 AP 算法时，在节省计算量方面并没有太大优势。为了解决这个问题，本文引入了近似版本的过滤-X 仿射投影算法（FXAP）。该近似版本使用预先计算好的逆矩阵，避免了每次迭代中的矩阵反转计算。这一策略不仅节省了计算量，而且便于算法的推广。此外，还提出了一种步长可变的方法，以减轻预计算矩阵造成的偏差，从而提供良好的性能、高鲁棒性和经济高效的分布。

{"title":"Filtered-X Quasi Affine Projection Algorithm for Active Noise Control Networks","authors":"Miguel Ferrer;María de Diego;Alberto Gonzalez","doi":"10.1109/TASLP.2024.3458806","DOIUrl":"10.1109/TASLP.2024.3458806","url":null,"abstract":"The affine projection (AP) algorithm enhances the performance of gradient-based adaptive algorithms when dealing with colored reference signals, which is typically the case with filtered-X type algorithms. This enhancement is achieved by using various delayed versions of the reference signal data vector, which are appropriately orthogonalized and normalized to optimize convergence performance. The number of these vectors, known as the projection order of the AP, increases the computational requirements, mainly due to the calculation of a matrix inversion whose dimensions are proportional to this projection order. When used in distributed systems, the AP algorithm typically requires each acoustic node in the system to compute the complete matrix inversion, even though they only need a specific set of data (a subblock) from it. This means that the AP does not offer much advantage in terms of computational savings when used in distributed collaborative networks. To address this issue, an approximate version of the filtered-X affine projection (FXAP) algorithm is introduced in this work. This approximate version avoids the matrix inversion computation in each iteration using a precalculated inverse matrix. This strategy provides computational savings and enables easy distribution of the algorithm. Additionally, a variable step-size approach is proposed to mitigate the deviation caused by a precalculated matrix, which provides good performance, high robustness, and cost-effective distribution.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4237-4252"},"PeriodicalIF":4.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10679717","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Deep Kronecker Product Beamforming for Large-Scale Microphone Arrays 用于大规模麦克风阵列的深克罗内克乘积波束成形

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-12 DOI: 10.1109/TASLP.2024.3459430

Weixin Meng;Xiaoyu Li;Andong Li;Xiaoxue Luo;Shefeng Yan;Xiaodong Li;Chengshi Zheng

Although deep learning based beamformers have achieved promising performance using small microphone arrays, they suffer from performance degradation in very challenging environments, such as extremely low Signal-to-Noise Ratio (SNR) environments, e.g., SNR

$le$

−10 dB. A large-scale microphone array with dozens or hundreds of microphones can improve the performance of beamformers in these challenging scenarios because of its high spatial resolution. While a dramatic increase in the number of microphones leads to feature redundancy, causing difficulties in feature extraction and network training. As an attempt to improve the performance of deep beamformers for speech extraction in very challenging scenarios, this paper proposes a novel all neural Kronecker product beamforming denoted by ANKP-BF for large-scale microphone arrays by taking the following two aspects into account. Firstly, a larger microphone array can provide higher performance of spatial filtering when compared with a small microphone array, and deep neural networks are introduced for their powerful non-linear modeling capability in the speech extraction task. Secondly, the feature redundancy problem is solved by introducing the Kronecker product rule to decompose the original one high-dimension weight vector into the Kronecker product of two much lower-dimensional weight vectors. The proposed ANKP-BF is designed to operate in an end-to-end manner. Extensive experiments are conducted on simulated large-scale microphone-array signals using the DNS-Challenge corpus and WSJ0-SI84 corpus, and the real recordings in a semi-anechoic room and outdoor scenes are also used to evaluate and compare the performance of different methods. Quantitative results demonstrate that the proposed method outperforms existing advanced baselines in terms of multiple objective metrics, especially in very low SNR environments.

尽管基于深度学习的波束成形器在使用小型麦克风阵列时取得了可喜的性能，但在极具挑战性的环境中，例如信噪比（SNR）极低的环境中（例如，SNR $le$-10 dB），它们的性能会下降。由数十个或数百个麦克风组成的大规模麦克风阵列具有很高的空间分辨率，因此可以提高波束成形器在这些具有挑战性的场景中的性能。但麦克风数量的急剧增加会导致特征冗余，给特征提取和网络训练带来困难。为了提高深度波束成形器在极具挑战性的场景中进行语音提取的性能，本文从以下两个方面入手，提出了一种适用于大规模麦克风阵列的新型全神经克朗克积波束成形方法（ANKP-BF）。首先，与小型麦克风阵列相比，大型麦克风阵列能提供更高的空间滤波性能，而深度神经网络在语音提取任务中具有强大的非线性建模能力，因此本文引入了深度神经网络。其次，通过引入 Kronecker 乘积规则，将原始的一个高维权重向量分解为两个低得多的权重向量的 Kronecker 乘积，解决了特征冗余问题。所提出的 ANKP-BF 设计为端到端方式。利用 DNS-Challenge 语料库和 WSJ0-SI84 语料库对模拟的大规模麦克风阵列信号进行了广泛的实验，同时还利用半消声室和室外场景中的真实录音来评估和比较不同方法的性能。定量结果表明，所提出的方法在多个客观指标上都优于现有的先进基线，尤其是在信噪比非常低的环境中。

{"title":"Deep Kronecker Product Beamforming for Large-Scale Microphone Arrays","authors":"Weixin Meng;Xiaoyu Li;Andong Li;Xiaoxue Luo;Shefeng Yan;Xiaodong Li;Chengshi Zheng","doi":"10.1109/TASLP.2024.3459430","DOIUrl":"10.1109/TASLP.2024.3459430","url":null,"abstract":"Although deep learning based beamformers have achieved promising performance using small microphone arrays, they suffer from performance degradation in very challenging environments, such as extremely low Signal-to-Noise Ratio (SNR) environments, e.g., SNR \u0000<inline-formula><tex-math>$le$</tex-math></inline-formula>\u0000−10 dB. A large-scale microphone array with dozens or hundreds of microphones can improve the performance of beamformers in these challenging scenarios because of its high spatial resolution. While a dramatic increase in the number of microphones leads to feature redundancy, causing difficulties in feature extraction and network training. As an attempt to improve the performance of deep beamformers for speech extraction in very challenging scenarios, this paper proposes a novel all neural Kronecker product beamforming denoted by ANKP-BF for large-scale microphone arrays by taking the following two aspects into account. Firstly, a larger microphone array can provide higher performance of spatial filtering when compared with a small microphone array, and deep neural networks are introduced for their powerful non-linear modeling capability in the speech extraction task. Secondly, the feature redundancy problem is solved by introducing the Kronecker product rule to decompose the original one high-dimension weight vector into the Kronecker product of two much lower-dimensional weight vectors. The proposed ANKP-BF is designed to operate in an end-to-end manner. Extensive experiments are conducted on simulated large-scale microphone-array signals using the DNS-Challenge corpus and WSJ0-SI84 corpus, and the real recordings in a semi-anechoic room and outdoor scenes are also used to evaluate and compare the performance of different methods. Quantitative results demonstrate that the proposed method outperforms existing advanced baselines in terms of multiple objective metrics, especially in very low SNR environments.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4537-4553"},"PeriodicalIF":4.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC 利用预训练语言模型、嵌入式蒸馏和上采样策略提高 CTC 的非自回归翻译质量

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-12 DOI: 10.1109/TASLP.2024.3451977

Shen-sian Syu;Juncheng Xie;Hung-yi Lee

Non-autoregressive approaches, especially those that generate output in a one-pass forward manner, have shown great potential in improving the inference speed of translation models. However, these approaches often suffer from a significant drop in translation quality compared to autoregressive models (AT). To tackle this challenge, this paper introduces a series of innovative techniques to enhance the translation quality of non-autoregressive neural machine translation (NAT) models while still maintaining a substantial acceleration in inference speed. Specifically, we propose a method called CTCPMLM, which involves fine-tuning Pretrained Multilingual Language Models (PMLMs) with the Connectionist Temporal Classification (CTC) loss to effectively train NAT models. Additionally, we adopt the MASK insertion scheme instead of token duplication for up-sampling and present an embedding distillation method to further enhance the performance of NAT models. In our experiments, CTCPMLM surpasses the performance of the baseline autoregressive model (Transformer base) on various datasets, including WMT'14 DE

$leftrightarrow$

EN, WMT'16 RO

$leftrightarrow$

EN, and IWSLT'14 DE

$leftrightarrow$

EN. Moreover, CTCPMLM represents the current state-of-the-art among NAT models. Notably, our model achieves superior results compared to the baseline autoregressive model on the IWSLT'14 En

$leftrightarrow$

De and WMT'16 En

$leftrightarrow$

Ro datasets, even without using distillation data during training. Particularly, on the IWSLT'14 DE

$rightarrow$

EN dataset, our model achieves an impressive BLEU score of 39.93, surpassing AT models and establishing a new state-of-the-art. Additionally, our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.

非自回归方法，特别是那些以单程前向方式生成输出的方法，在提高翻译模型的推理速度方面显示出巨大的潜力。然而，与自回归模型（AT）相比，这些方法的翻译质量往往会大幅下降。为了应对这一挑战，本文介绍了一系列创新技术，以提高非自回归神经机器翻译（NAT）模型的翻译质量，同时仍能保持推理速度的大幅提升。具体来说，我们提出了一种名为 CTCPMLM 的方法，该方法涉及利用连接时序分类（CTC）损失对预处理多语言语言模型（PMLM）进行微调，从而有效地训练 NAT 模型。此外，我们还采用 MASK 插入方案代替标记复制进行上采样，并提出了一种嵌入蒸馏方法，以进一步提高 NAT 模型的性能。在我们的实验中，CTCPMLM 在各种数据集（包括 WMT'14 DE $leftrightarrow$ EN、WMT'16 RO $leftrightarrow$ EN 和 IWSLT'14 DE $leftrightarrow$ EN）上的性能都超过了基准自回归模型（Transformer base）。此外，CTCPMLM 代表了当前 NAT 模型的最先进水平。值得注意的是，与基线自回归模型相比，我们的模型在 IWSLT'14 En $leftrightarrow$ De 和 WMT'16 En $leftrightarrow$ Ro 数据集上取得了更好的结果，即使在训练过程中不使用蒸馏数据。特别是在IWSLT'14 DE $rightarrow$ EN数据集上，我们的模型取得了令人印象深刻的BLEU分数39.93，超过了AT模型，建立了新的先进水平。此外，与自回归模型相比，我们的模型速度显著提高了 16.35 倍。

{"title":"Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC","authors":"Shen-sian Syu;Juncheng Xie;Hung-yi Lee","doi":"10.1109/TASLP.2024.3451977","DOIUrl":"10.1109/TASLP.2024.3451977","url":null,"abstract":"Non-autoregressive approaches, especially those that generate output in a one-pass forward manner, have shown great potential in improving the inference speed of translation models. However, these approaches often suffer from a significant drop in translation quality compared to autoregressive models (AT). To tackle this challenge, this paper introduces a series of innovative techniques to enhance the translation quality of non-autoregressive neural machine translation (NAT) models while still maintaining a substantial acceleration in inference speed. Specifically, we propose a method called CTCPMLM, which involves fine-tuning Pretrained Multilingual Language Models (PMLMs) with the Connectionist Temporal Classification (CTC) loss to effectively train NAT models. Additionally, we adopt the MASK insertion scheme instead of token duplication for up-sampling and present an embedding distillation method to further enhance the performance of NAT models. In our experiments, CTCPMLM surpasses the performance of the baseline autoregressive model (Transformer \u0000<italic>base</i>\u0000) on various datasets, including WMT'14 DE \u0000<inline-formula><tex-math>$leftrightarrow$</tex-math></inline-formula>\u0000 EN, WMT'16 RO \u0000<inline-formula><tex-math>$leftrightarrow$</tex-math></inline-formula>\u0000 EN, and IWSLT'14 DE \u0000<inline-formula><tex-math>$leftrightarrow$</tex-math></inline-formula>\u0000 EN. Moreover, CTCPMLM represents the current state-of-the-art among NAT models. Notably, our model achieves superior results compared to the baseline autoregressive model on the IWSLT'14 En \u0000<inline-formula><tex-math>$leftrightarrow$</tex-math></inline-formula>\u0000 De and WMT'16 En \u0000<inline-formula><tex-math>$leftrightarrow$</tex-math></inline-formula>\u0000 Ro datasets, even without using distillation data during training. Particularly, on the IWSLT'14 DE \u0000<inline-formula><tex-math>$rightarrow$</tex-math></inline-formula>\u0000 EN dataset, our model achieves an impressive BLEU score of 39.93, surpassing AT models and establishing a new state-of-the-art. Additionally, our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4121-4133"},"PeriodicalIF":4.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TriSAT: Trimodal Representation Learning for Multimodal Sentiment Analysis TriSAT：多模态情感分析的三模态表征学习

IF 4.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-11 DOI: 10.1109/TASLP.2024.3458812

Ruohong Huan;Guowei Zhong;Peng Chen;Ronghua Liang

Transformer-based multimodal sentiment analysis frameworks commonly facilitate cross-modal interactions between two modalities through the attention mechanism. However, such interactions prove inadequate when dealing with three or more modalities, leading to increased computational complexity and network redundancy. To address this challenge, this paper introduces a novel framework, Trimodal representations for Sentiment Analysis from Transformers (TriSAT), tailored for multimodal sentiment analysis. TriSAT incorporates a trimodal transformer featuring a module called Trimodal Multi-Head Attention (TMHA). TMHA considers language as the primary modality, combines information from language, video, and audio using a single computation, and analyzes sentiment from a trimodal perspective. This approach significantly reduces the computational complexity while delivering high performance. Moreover, we propose Attraction-Repulsion (AR) loss and Trimodal Supervised Contrastive (TSC) loss to further enhance sentiment analysis performance. We conduct experiments on three public datasets to evaluate TriSAT's performance, which consistently demonstrates its competitiveness compared to state-of-the-art approaches.

基于变换器的多模态情感分析框架通常通过注意力机制促进两种模态之间的跨模态交互。然而，在处理三种或更多模态时，这种交互被证明是不够的，从而导致计算复杂性和网络冗余增加。为了应对这一挑战，本文介绍了一个为多模态情感分析量身定制的新框架--"来自变形器的情感分析三模态表征"（TriSAT）。TriSAT 包含一个三模态变换器，其特色模块是三模态多头注意力（TMHA）。TMHA 将语言视为主要模态，通过一次计算将语言、视频和音频信息结合起来，并从三模态角度进行情感分析。这种方法在提供高性能的同时大大降低了计算复杂度。此外，我们还提出了吸引-排斥（AR）损失和三模态监督对比（TSC）损失，以进一步提高情感分析性能。我们在三个公开数据集上进行了实验，以评估 TriSAT 的性能，结果表明 TriSAT 与最先进的方法相比始终具有竞争力。

{"title":"TriSAT: Trimodal Representation Learning for Multimodal Sentiment Analysis","authors":"Ruohong Huan;Guowei Zhong;Peng Chen;Ronghua Liang","doi":"10.1109/TASLP.2024.3458812","DOIUrl":"10.1109/TASLP.2024.3458812","url":null,"abstract":"Transformer-based multimodal sentiment analysis frameworks commonly facilitate cross-modal interactions between two modalities through the attention mechanism. However, such interactions prove inadequate when dealing with three or more modalities, leading to increased computational complexity and network redundancy. To address this challenge, this paper introduces a novel framework, Trimodal representations for Sentiment Analysis from Transformers (TriSAT), tailored for multimodal sentiment analysis. TriSAT incorporates a trimodal transformer featuring a module called Trimodal Multi-Head Attention (TMHA). TMHA considers language as the primary modality, combines information from language, video, and audio using a single computation, and analyzes sentiment from a trimodal perspective. This approach significantly reduces the computational complexity while delivering high performance. Moreover, we propose Attraction-Repulsion (AR) loss and Trimodal Supervised Contrastive (TSC) loss to further enhance sentiment analysis performance. We conduct experiments on three public datasets to evaluate TriSAT's performance, which consistently demonstrates its competitiveness compared to state-of-the-art approaches.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4105-4120"},"PeriodicalIF":4.1,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0