利用时态卷积注意力网络进行音视频融合以实现语音分离

IF 5.1 2区计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-09-18 DOI:10.1109/TASLP.2024.3463411

Debang Liu;Tianqi Zhang;Mads Græsbøll Christensen;Chen Yi;Zeliang An

{"title":"利用时态卷积注意力网络进行音视频融合以实现语音分离","authors":"Debang Liu;Tianqi Zhang;Mads Græsbøll Christensen;Chen Yi;Zeliang An","doi":"10.1109/TASLP.2024.3463411","DOIUrl":null,"url":null,"abstract":"Currently, audio-visual speech separation methods utilize the speaker's audio and visual correlation information to help separate the speech of the target speaker. However, these methods commonly use the approach of feature concatenation with linear mapping to obtain the fused audio-visual features, which prompts us to conduct a deeper exploration for audio-visual fusion. Therefore, in this paper, according to the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with temporal convolution attention network for speech separation model (AVTCA). In this method, we design temporal convolution attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct sequence learning and fusion network. In the whole deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then we use the TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. Afterwards, the fused audio-visual features sequences will be used as input to the separation network to predict mask and separate the source of each speaker. Finally, this paper conducts comparative experiments on Vox2, GRID, LRS2 and TCD-TIMIT datasets, indicating that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it exhibits greater efficiency in computational performance and model size.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4647-4660"},"PeriodicalIF":5.1000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation\",\"authors\":\"Debang Liu;Tianqi Zhang;Mads Græsbøll Christensen;Chen Yi;Zeliang An\",\"doi\":\"10.1109/TASLP.2024.3463411\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Currently, audio-visual speech separation methods utilize the speaker's audio and visual correlation information to help separate the speech of the target speaker. However, these methods commonly use the approach of feature concatenation with linear mapping to obtain the fused audio-visual features, which prompts us to conduct a deeper exploration for audio-visual fusion. Therefore, in this paper, according to the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with temporal convolution attention network for speech separation model (AVTCA). In this method, we design temporal convolution attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct sequence learning and fusion network. In the whole deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then we use the TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. Afterwards, the fused audio-visual features sequences will be used as input to the separation network to predict mask and separate the source of each speaker. Finally, this paper conducts comparative experiments on Vox2, GRID, LRS2 and TCD-TIMIT datasets, indicating that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it exhibits greater efficiency in computational performance and model size.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"4647-4660\"},\"PeriodicalIF\":5.1000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10683978/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10683978/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

摘要

目前，视听语音分离方法利用说话人的视听相关信息来帮助分离目标说话人的语音。然而，这些方法通常使用线性映射的特征串联方法来获得融合的视听特征，这促使我们对视听融合进行更深入的探索。因此，本文根据说话人在说话过程中的嘴部地标运动，提出了一种新颖的时域单通道视听语音分离方法：视听融合与时空卷积注意力网络语音分离模型（AVTCA）。在该方法中，我们设计了基于注意力机制的时空卷积注意力网络（TCANet）来模拟音频和视觉序列之间的上下文关系，并以 TCANet 为基本单元来构建序列学习和融合网络。在整个深度分离框架中，我们首先利用交叉注意力关注视听序列的交叉相关信息，然后利用 TCANet 融合具有时间依赖性和交叉相关性的视听特征序列。之后，融合后的视听特征序列将作为分离网络的输入，用于预测掩码和分离每个说话者的声源。最后，本文在 Vox2、GRID、LRS2 和 TCD-TIMIT 数据集上进行了对比实验，结果表明 AVTCA 优于其他最先进的（SOTA）分离方法。此外，它在计算性能和模型大小方面也表现出更高的效率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Audio-Visual Fusion With Temporal Convolutional Attention Network for Speech Separation

Currently, audio-visual speech separation methods utilize the speaker's audio and visual correlation information to help separate the speech of the target speaker. However, these methods commonly use the approach of feature concatenation with linear mapping to obtain the fused audio-visual features, which prompts us to conduct a deeper exploration for audio-visual fusion. Therefore, in this paper, according to the speaker's mouth landmark movements during speech, we propose a novel time-domain single-channel audio-visual speech separation method: audio-visual fusion with temporal convolution attention network for speech separation model (AVTCA). In this method, we design temporal convolution attention network (TCANet) based on the attention mechanism to model the contextual relationships between audio and visual sequences, and use TCANet as the basic unit to construct sequence learning and fusion network. In the whole deep separation framework, we first use cross attention to focus on the cross-correlation information of the audio and visual sequences, and then we use the TCANet to fuse the audio-visual feature sequences with temporal dependencies and cross-correlations. Afterwards, the fused audio-visual features sequences will be used as input to the separation network to predict mask and separate the source of each speaker. Finally, this paper conducts comparative experiments on Vox2, GRID, LRS2 and TCD-TIMIT datasets, indicating that AVTCA outperforms other state-of-the-art (SOTA) separation methods. Furthermore, it exhibits greater efficiency in computational performance and model size.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

期刊最新文献

List of Reviewers IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation Online Neural Speaker Diarization With Target Speaker Tracking Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach