Audio-Visual Scene Classification Based on Multi-modal Graph Fusion

Hancheng Lei, Ning-qiang Chen
{"title":"基于多模态图融合的视听场景分类","authors":"Hancheng Lei, Ning-qiang Chen","doi":"10.21437/interspeech.2022-741","DOIUrl":null,"url":null,"abstract":"Audio-Visual Scene Classification (AVSC) task tries to achieve scene classification through joint analysis of the audio and video modalities. Most of the existing AVSC models are based on feature-level or decision-level fusion. The possible problems are: i) Due to the distribution difference of the corresponding features in different modalities is large, the direct concatenation of them in the feature-level fusion may not result in good performance. ii) The decision-level fusion cannot take full advantage of the common as well as complementary properties between the features and corresponding similarities of different modalities. To solve these problems, Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for AVSC task. First, the Deep Neural Network (DNN) is trained to extract essential feature from each modality. Then, the Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed based on each modality features. Finally, the DynaMic GCN (DM-GCN) and the ATtention GCN (AT-GCN) are introduced respectively to realize both feature-level and similarity-level fusion to ensure the classification accuracy. Experimental results on TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4157-4161"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Audio-Visual Scene Classification Based on Multi-modal Graph Fusion\",\"authors\":\"Hancheng Lei, Ning-qiang Chen\",\"doi\":\"10.21437/interspeech.2022-741\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio-Visual Scene Classification (AVSC) task tries to achieve scene classification through joint analysis of the audio and video modalities. Most of the existing AVSC models are based on feature-level or decision-level fusion. The possible problems are: i) Due to the distribution difference of the corresponding features in different modalities is large, the direct concatenation of them in the feature-level fusion may not result in good performance. ii) The decision-level fusion cannot take full advantage of the common as well as complementary properties between the features and corresponding similarities of different modalities. To solve these problems, Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for AVSC task. First, the Deep Neural Network (DNN) is trained to extract essential feature from each modality. Then, the Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed based on each modality features. Finally, the DynaMic GCN (DM-GCN) and the ATtention GCN (AT-GCN) are introduced respectively to realize both feature-level and similarity-level fusion to ensure the classification accuracy. 
Experimental results on TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"4157-4161\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-741\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-741","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

The Audio-Visual Scene Classification (AVSC) task aims to classify scenes through joint analysis of the audio and video modalities. Most existing AVSC models are based on feature-level or decision-level fusion, which raises two problems: i) because the distributions of corresponding features differ greatly across modalities, directly concatenating them in feature-level fusion may not yield good performance; ii) decision-level fusion cannot take full advantage of the common and complementary properties of the features, and of the corresponding similarities, across modalities. To solve these problems, a Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for the AVSC task. First, a Deep Neural Network (DNN) is trained to extract essential features from each modality. Then, a Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed from the features of each modality. Finally, the DynaMic GCN (DM-GCN) and the ATtention GCN (AT-GCN) are introduced to realize feature-level and similarity-level fusion, respectively, ensuring classification accuracy. Experimental results on the TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN, achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.
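
As a rough illustration of the pipeline the abstract describes, the following PyTorch sketch builds a sample-to-sample cross-similarity graph from audio and video embeddings and fuses them with a single GCN layer. This is a minimal sketch under stated assumptions, not the paper's implementation: the cosine-similarity graph, the vanilla GCN propagation rule, and all function names, shapes, and dimensions are illustrative stand-ins for the SSCSG construction and the DM-GCN/AT-GCN modules.

# Minimal sketch (not the authors' implementation): build a
# sample-to-sample cross-similarity graph from audio/video
# embeddings and fuse them with one plain GCN layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_similarity_graph(audio_feats, video_feats):
    # Cosine similarity between every audio sample and every video
    # sample; a simple stand-in for the SSCSG construction.
    a = F.normalize(audio_feats, dim=1)          # (N, d)
    v = F.normalize(video_feats, dim=1)          # (N, d)
    sim = (a @ v.t()).clamp(min=0)               # (N, N), non-negative edges
    adj = sim + torch.eye(sim.size(0))           # add self-loops
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    # Symmetric normalisation: D^{-1/2} (A + I) D^{-1/2}
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

class GCNFusionLayer(nn.Module):
    # One vanilla GCN layer (Kipf & Welling style); the paper's
    # DM-GCN and AT-GCN refine this basic propagation rule.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, norm_adj):
        return F.relu(self.linear(norm_adj @ x))

# Usage with illustrative shapes: 8 samples whose audio and video
# embeddings are assumed to be projected to 128 dimensions each.
audio = torch.randn(8, 128)
video = torch.randn(8, 128)
norm_adj = cross_similarity_graph(audio, video)
fused_in = torch.cat([audio, video], dim=1)      # feature-level concat
layer = GCNFusionLayer(256, 64)
out = layer(fused_in, norm_adj)                  # (8, 64) fused features

In the paper's terms, the concatenation step corresponds to feature-level fusion, while propagating over the cross-modal similarity graph corresponds to similarity-level fusion; DM-GCN and AT-GCN replace the fixed adjacency and plain propagation used in this sketch.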