Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa
{"title":"用于视频问题解答的自顶向下活动表示学习","authors":"Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa","doi":"arxiv-2409.07748","DOIUrl":null,"url":null,"abstract":"Capturing complex hierarchical human activities, from atomic actions (e.g.,\npicking up one present, moving to the sofa, unwrapping the present) to\ncontextual events (e.g., celebrating Christmas) is crucial for achieving\nhigh-performance video question answering (VideoQA). Recent works have expanded\nmultimodal models (e.g., CLIP, LLaVA) to process continuous video sequences,\nenhancing the model's temporal reasoning capabilities. However, these\napproaches often fail to capture contextual events that can be decomposed into\nmultiple atomic actions non-continuously distributed over relatively long-term\nsequences. In this paper, to leverage the spatial visual context representation\ncapability of the CLIP model for obtaining non-continuous visual\nrepresentations in terms of contextual events in videos, we convert long-term\nvideo sequences into a spatial image domain and finetune the multimodal model\nLLaVA for the VideoQA task. Our approach achieves competitive performance on\nthe STAR task, in particular, with a 78.4% accuracy score, exceeding the\ncurrent state-of-the-art score by 2.8 points on the NExTQA task.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Top-down Activity Representation Learning for Video Question Answering\",\"authors\":\"Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa\",\"doi\":\"arxiv-2409.07748\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Capturing complex hierarchical human activities, from atomic actions (e.g.,\\npicking up one present, moving to the sofa, unwrapping the present) to\\ncontextual events (e.g., celebrating Christmas) is crucial for achieving\\nhigh-performance video question answering (VideoQA). Recent works have expanded\\nmultimodal models (e.g., CLIP, LLaVA) to process continuous video sequences,\\nenhancing the model's temporal reasoning capabilities. However, these\\napproaches often fail to capture contextual events that can be decomposed into\\nmultiple atomic actions non-continuously distributed over relatively long-term\\nsequences. In this paper, to leverage the spatial visual context representation\\ncapability of the CLIP model for obtaining non-continuous visual\\nrepresentations in terms of contextual events in videos, we convert long-term\\nvideo sequences into a spatial image domain and finetune the multimodal model\\nLLaVA for the VideoQA task. 
Our approach achieves competitive performance on\\nthe STAR task, in particular, with a 78.4% accuracy score, exceeding the\\ncurrent state-of-the-art score by 2.8 points on the NExTQA task.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"58 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07748\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07748","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Top-down Activity Representation Learning for Video Question Answering
Capturing complex hierarchical human activities, from atomic actions (e.g., picking up a present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have extended multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing their temporal reasoning capabilities. However, these approaches often fail to capture contextual events that decompose into multiple atomic actions distributed non-continuously over relatively long sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations of contextual events in videos, we convert long video sequences into a spatial image domain and fine-tune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task and, in particular, a 78.4% accuracy score on the NExTQA task, exceeding the current state-of-the-art by 2.8 points.
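
The core step described in the abstract, converting a long video sequence into a spatial image domain, can be illustrated with a minimal sketch. The Python snippet below is a hypothetical example, not the authors' released implementation: it uniformly samples frames from a video and tiles them into a single grid image, which an image-based multimodal model (e.g., LLaVA with a CLIP vision encoder) could then consume as one spatial input. The grid layout, tile size, and sampling scheme are assumptions made for illustration.

# Illustrative sketch (assumed details, not the paper's exact pipeline):
# tile uniformly sampled video frames into one spatial grid image so that
# non-continuous temporal context becomes visible in a single image input.

from PIL import Image


def video_frames_to_grid(frames, grid_cols=4, grid_rows=4, tile_size=(224, 224)):
    """Tile uniformly sampled frames into one spatial grid image.

    frames: list of PIL.Image objects covering the full video, in temporal order.
    Returns a single PIL.Image of size (grid_cols * w, grid_rows * h).
    """
    num_tiles = grid_cols * grid_rows
    if len(frames) < num_tiles:
        raise ValueError("Need at least grid_cols * grid_rows frames")

    # Uniform temporal sampling over the whole sequence, so atomic actions
    # scattered across a long video all land somewhere in the grid.
    step = len(frames) / num_tiles
    sampled = [frames[int(i * step)] for i in range(num_tiles)]

    w, h = tile_size
    grid = Image.new("RGB", (grid_cols * w, grid_rows * h))
    for idx, frame in enumerate(sampled):
        row, col = divmod(idx, grid_cols)
        grid.paste(frame.resize(tile_size), (col * w, row * h))
    return grid


if __name__ == "__main__":
    # Hypothetical usage: 256 dummy frames stand in for a decoded video clip.
    dummy_frames = [Image.new("RGB", (320, 240), (i % 256, 0, 0)) for i in range(256)]
    grid_image = video_frames_to_grid(dummy_frames)
    grid_image.save("video_as_grid.jpg")  # this grid image would be the model's visual input

The resulting grid image trades temporal resolution for spatial coverage: each tile preserves the local appearance of one moment, while the grid as a whole exposes frames from distant parts of the video side by side, which is the property the abstract relies on for representing contextual events.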