{"title":"基于视频的显性情感预测的 Sec2Sec 协同关注","authors":"Mingwei Sun, Kunpeng Zhang","doi":"arxiv-2408.15209","DOIUrl":null,"url":null,"abstract":"Video-based apparent affect detection plays a crucial role in video\nunderstanding, as it encompasses various elements such as vision, audio,\naudio-visual interactions, and spatiotemporal information, which are essential\nfor accurate video predictions. However, existing approaches often focus on\nextracting only a subset of these elements, resulting in the limited predictive\ncapacity of their models. To address this limitation, we propose a novel\nLSTM-based network augmented with a Transformer co-attention mechanism for\npredicting apparent affect in videos. We demonstrate that our proposed Sec2Sec\nCo-attention Transformer surpasses multiple state-of-the-art methods in\npredicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First\nImpressions. Notably, our model offers interpretability, allowing us to examine\nthe contributions of different time points to the overall prediction. The\nimplementation is available at: https://github.com/nestor-sun/sec2sec.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sec2Sec Co-attention for Video-Based Apparent Affective Prediction\",\"authors\":\"Mingwei Sun, Kunpeng Zhang\",\"doi\":\"arxiv-2408.15209\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video-based apparent affect detection plays a crucial role in video\\nunderstanding, as it encompasses various elements such as vision, audio,\\naudio-visual interactions, and spatiotemporal information, which are essential\\nfor accurate video predictions. However, existing approaches often focus on\\nextracting only a subset of these elements, resulting in the limited predictive\\ncapacity of their models. To address this limitation, we propose a novel\\nLSTM-based network augmented with a Transformer co-attention mechanism for\\npredicting apparent affect in videos. We demonstrate that our proposed Sec2Sec\\nCo-attention Transformer surpasses multiple state-of-the-art methods in\\npredicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First\\nImpressions. Notably, our model offers interpretability, allowing us to examine\\nthe contributions of different time points to the overall prediction. 
The\\nimplementation is available at: https://github.com/nestor-sun/sec2sec.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"59 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.15209\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.15209","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Sec2Sec Co-attention for Video-Based Apparent Affective Prediction
Video-based apparent affect detection plays a crucial role in video understanding, as it encompasses various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, all of which are essential for accurate video predictions. However, existing approaches often extract only a subset of these elements, which limits the predictive capacity of their models. To address this limitation, we propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent affect in videos. We demonstrate that the proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First Impressions. Notably, our model offers interpretability, allowing us to examine the contributions of different time points to the overall prediction. The implementation is available at: https://github.com/nestor-sun/sec2sec.
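
The abstract names only the high-level ingredients (per-second visual and audio representations, a Transformer co-attention mechanism between the two modalities, and an LSTM over time); the exact architecture is in the linked repository. The sketch below is a minimal, hypothetical PyTorch illustration of that combination, not the authors' implementation: feature dimensions, layer sizes, the two-dimensional output head, and all names are assumptions for illustration.

    # Illustrative sketch only: per-second visual and audio features are fused by
    # cross-modal attention, aggregated over seconds with an LSTM, and mapped to
    # an affect score. All dimensions and layer choices are assumed, not the paper's.
    import torch
    import torch.nn as nn

    class Sec2SecCoAttentionSketch(nn.Module):
        def __init__(self, vis_dim=512, aud_dim=128, hidden_dim=256, num_heads=4):
            super().__init__()
            # Project both modalities into a shared space for attention.
            self.vis_proj = nn.Linear(vis_dim, hidden_dim)
            self.aud_proj = nn.Linear(aud_dim, hidden_dim)
            # Co-attention: each modality attends to the other across seconds.
            self.vis_to_aud = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.aud_to_vis = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            # LSTM aggregates the fused per-second representations over time.
            self.lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
            # Regression head (e.g. valence/arousal); output size is an assumption.
            self.head = nn.Linear(hidden_dim, 2)

        def forward(self, vis_feats, aud_feats):
            # vis_feats: (batch, seconds, vis_dim); aud_feats: (batch, seconds, aud_dim)
            v = self.vis_proj(vis_feats)
            a = self.aud_proj(aud_feats)
            # Visual queries attend over audio keys/values, and vice versa.
            v_attn, _ = self.vis_to_aud(query=v, key=a, value=a)
            a_attn, _ = self.aud_to_vis(query=a, key=v, value=v)
            fused = torch.cat([v_attn, a_attn], dim=-1)   # (batch, seconds, 2*hidden_dim)
            _, (h_n, _) = self.lstm(fused)                # final hidden state summarizes the clip
            return self.head(h_n[-1])                     # (batch, 2)

    # Example usage with random per-second features for 10-second clips.
    model = Sec2SecCoAttentionSketch()
    scores = model(torch.randn(8, 10, 512), torch.randn(8, 10, 128))
    print(scores.shape)  # torch.Size([8, 2])

Returning the attention weights from the two MultiheadAttention calls (instead of discarding them) is one straightforward way to inspect how much each second contributes to the prediction, in the spirit of the interpretability analysis described in the abstract.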