Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu
arXiv:2408.15542 (arXiv - CS - Multimedia), 2024-08-28
Abstract
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline that gradually increases the resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance on a variety of video understanding benchmarks while remaining competitive on others. Notably, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters as well as proprietary models.
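For illustration, the curriculum training pipeline described in the abstract could be organized along the following lines. This is a minimal sketch in Python, assuming hypothetical stage names, resolutions, and frame counts; the actual schedule used by Kangaroo is not specified here.

from dataclasses import dataclass

@dataclass
class CurriculumStage:
    name: str
    resolution: int   # input frame resolution in pixels per side (assumed values)
    num_frames: int   # number of sampled video frames per clip (assumed values)

# Illustrative schedule: each stage raises the resolution and/or frame count,
# so the model adapts to longer, higher-fidelity video inputs step by step.
STAGES = [
    CurriculumStage("image-text pre-training", resolution=224, num_frames=1),
    CurriculumStage("short-video pre-training", resolution=224, num_frames=16),
    CurriculumStage("long-video pre-training", resolution=448, num_frames=64),
    CurriculumStage("instruction tuning", resolution=448, num_frames=160),
]

def run_curriculum(train_one_stage):
    """Run the stages in order, with each stage resuming from the previous
    stage's weights (train_one_stage is a user-supplied training function)."""
    for stage in STAGES:
        print(f"Stage '{stage.name}': {stage.num_frames} frames @ {stage.resolution}px")
        train_one_stage(resolution=stage.resolution, num_frames=stage.num_frames)

The design point of such a schedule is that early, cheap stages teach basic vision-language alignment, while later stages spend compute only after the model can already exploit longer, higher-resolution inputs.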