Sec2Sec Co-attention for Video-Based Apparent Affective Prediction

Mingwei Sun, Kunpeng Zhang
{"title":"基于视频的显性情感预测的 Sec2Sec 协同关注","authors":"Mingwei Sun, Kunpeng Zhang","doi":"arxiv-2408.15209","DOIUrl":null,"url":null,"abstract":"Video-based apparent affect detection plays a crucial role in video\nunderstanding, as it encompasses various elements such as vision, audio,\naudio-visual interactions, and spatiotemporal information, which are essential\nfor accurate video predictions. However, existing approaches often focus on\nextracting only a subset of these elements, resulting in the limited predictive\ncapacity of their models. To address this limitation, we propose a novel\nLSTM-based network augmented with a Transformer co-attention mechanism for\npredicting apparent affect in videos. We demonstrate that our proposed Sec2Sec\nCo-attention Transformer surpasses multiple state-of-the-art methods in\npredicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First\nImpressions. Notably, our model offers interpretability, allowing us to examine\nthe contributions of different time points to the overall prediction. The\nimplementation is available at: https://github.com/nestor-sun/sec2sec.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sec2Sec Co-attention for Video-Based Apparent Affective Prediction\",\"authors\":\"Mingwei Sun, Kunpeng Zhang\",\"doi\":\"arxiv-2408.15209\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video-based apparent affect detection plays a crucial role in video\\nunderstanding, as it encompasses various elements such as vision, audio,\\naudio-visual interactions, and spatiotemporal information, which are essential\\nfor accurate video predictions. However, existing approaches often focus on\\nextracting only a subset of these elements, resulting in the limited predictive\\ncapacity of their models. To address this limitation, we propose a novel\\nLSTM-based network augmented with a Transformer co-attention mechanism for\\npredicting apparent affect in videos. We demonstrate that our proposed Sec2Sec\\nCo-attention Transformer surpasses multiple state-of-the-art methods in\\npredicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First\\nImpressions. Notably, our model offers interpretability, allowing us to examine\\nthe contributions of different time points to the overall prediction. 
The\\nimplementation is available at: https://github.com/nestor-sun/sec2sec.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"59 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.15209\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.15209","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Video-based apparent affect detection plays a crucial role in video understanding, as it encompasses various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models. To address this limitation, we propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent affect in videos. We demonstrate that our proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First Impressions. Notably, our model offers interpretability, allowing us to examine the contributions of different time points to the overall prediction. The implementation is available at: https://github.com/nestor-sun/sec2sec.
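Below is a minimal illustrative sketch, not the authors' released code (see the repository linked above), of the kind of architecture the abstract describes: per-second visual and audio features fused with a Transformer-style co-attention block and aggregated over time by an LSTM. The class name `Sec2SecSketch`, the feature dimensions, and the two-dimensional output (e.g., valence/arousal) are assumptions made purely for illustration.

```python
# Illustrative sketch only; hypothetical names and shapes, not the paper's implementation.
import torch
import torch.nn as nn


class Sec2SecSketch(nn.Module):
    """Toy LSTM network with a Transformer-style co-attention block.

    Each video is a sequence of per-second visual and audio feature vectors
    (assumed pre-extracted). Co-attention lets each modality attend to the
    other before an LSTM aggregates the fused sequence over time.
    """

    def __init__(self, vis_dim=512, aud_dim=128, d_model=256, n_heads=4, out_dim=2):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # Co-attention: visual queries attend to audio keys/values and vice versa.
        self.vis2aud = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aud2vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lstm = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, out_dim)  # e.g., valence and arousal scores

    def forward(self, vis, aud):
        # vis: (batch, seconds, vis_dim), aud: (batch, seconds, aud_dim)
        v = self.vis_proj(vis)
        a = self.aud_proj(aud)
        v_att, _ = self.vis2aud(v, a, a)        # visual attends to audio
        a_att, attn_w = self.aud2vis(a, v, v)   # audio attends to visual
        fused = torch.cat([v_att, a_att], dim=-1)
        _, (h_n, _) = self.lstm(fused)          # last hidden state summarizes the video
        return self.head(h_n[-1]), attn_w       # attention weights hint at per-second contributions


if __name__ == "__main__":
    model = Sec2SecSketch()
    vis = torch.randn(2, 30, 512)   # 2 videos, 30 one-second segments each
    aud = torch.randn(2, 30, 128)
    pred, attn = model(vis, aud)
    print(pred.shape, attn.shape)   # torch.Size([2, 2]) torch.Size([2, 30, 30])
```

The returned attention weights loosely mirror the interpretability claim in the abstract: inspecting them shows which time points each modality emphasizes, though the paper's actual mechanism may differ in detail.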