Mingyu Yao, Yu Bai, Wei Du, Xuejun Zhang, Heng Quan, Fuli Cai, Hongwei Kang
{"title":"Multi-Level Spatiotemporal Network for Video Summarization","authors":"Mingyu Yao, Yu Bai, Wei Du, Xuejun Zhang, Heng Quan, Fuli Cai, Hongwei Kang","doi":"10.1145/3503161.3548105","DOIUrl":null,"url":null,"abstract":"With the increasing of ubiquitous devices with cameras, video content is widely produced in the industry. Automation video summarization allows content consumers effectively retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. As a natural phenomenon, video fragments in different shots are richer in semantics than frames. We leverage this as a free latent supervision signal and introduce a novel model named multi-level spatiotemporal network (MLSN). Our approach contains Multi-Level Feature Representations (MLFR) and Local Relative Loss (LRL). MLFR module consists of frame-level features, fragment-level features, and shot-level features with relative position encoding. For videos of different shot durations, it can flexibly capture and accommodate semantic information of different spatiotemporal granularities; LRL utilizes the partial ordering relations among frames of each fragment to capture highly discriminative features to improve the sensitivity of the model. Our method substantially improves the best existing published method by 7% on our industrial products dataset LSVD. Meanwhile, experimental results on two widely used benchmark datasets SumMe and TVSum demonstrate that our method outperforms most state-of-the-art ones.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3548105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
With the increasing ubiquity of camera-equipped devices, video content is produced at scale in industry. Automatic video summarization allows content consumers to efficiently retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. As a natural phenomenon, video fragments within different shots are richer in semantics than individual frames. We leverage this as a free latent supervision signal and introduce a novel model named the multi-level spatiotemporal network (MLSN). Our approach contains Multi-Level Feature Representations (MLFR) and a Local Relative Loss (LRL). The MLFR module consists of frame-level features, fragment-level features, and shot-level features with relative position encoding. For videos whose shots differ in duration, it can flexibly capture and accommodate semantic information at different spatiotemporal granularities. LRL utilizes the partial ordering relations among the frames of each fragment to capture highly discriminative features and improve the sensitivity of the model. Our method substantially improves on the best previously published method, by 7%, on our industrial products dataset LSVD. Meanwhile, experimental results on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our method outperforms most state-of-the-art methods.
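The abstract does not spell out how the Local Relative Loss is computed. As a rough, hedged illustration only, the sketch below assumes a pairwise margin-ranking formulation over frames within each fragment, so that the predicted importance scores respect the ground-truth partial ordering inside a fragment; the function name, margin value, and exact form are our assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def local_relative_loss(scores, fragment_ids, gt_scores, margin=0.05):
    """Hypothetical sketch of a Local Relative Loss (LRL).

    For each fragment, every pair of frames whose ground-truth importance
    differs is pushed apart in predicted-score space by at least `margin`,
    so the model learns the partial ordering of frames inside a fragment
    rather than only absolute frame scores. This is an assumed pairwise
    formulation, not the paper's exact loss.

    scores:       (N,) predicted frame importance scores
    fragment_ids: (N,) fragment index of each frame
    gt_scores:    (N,) ground-truth frame importance scores
    """
    losses = []
    for frag in fragment_ids.unique():
        idx = (fragment_ids == frag).nonzero(as_tuple=True)[0]
        s, g = scores[idx], gt_scores[idx]
        # Pairwise differences within the fragment.
        diff_gt = g.unsqueeze(1) - g.unsqueeze(0)      # (k, k)
        diff_pred = s.unsqueeze(1) - s.unsqueeze(0)    # (k, k)
        # Keep only ordered pairs (i, j) where frame i should outrank frame j.
        mask = diff_gt > 0
        if mask.any():
            # Hinge penalty when the predicted margin is too small.
            losses.append(F.relu(margin - diff_pred[mask]).mean())
    return torch.stack(losses).mean() if losses else scores.sum() * 0.0
```

In practice such a ranking term would be added to the usual frame-level regression loss, which is consistent with the abstract's claim that LRL sharpens discrimination between frames of similar but not equal importance.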