Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames

Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras
{"title":"Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames","authors":"Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras","doi":"10.1145/3512527.3531404","DOIUrl":null,"url":null,"abstract":"In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames' dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512527.3531404","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, namely the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range dependencies between frames, and the limited ability to parallelize the training of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks on the main diagonal of the attention matrix, and enriches this information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method produces better estimates of the significance of different parts of the video and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.
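
To make the abstract's two key ideas more concrete, the following is a minimal, self-contained sketch of (i) self-attention restricted to non-overlapping blocks on the main diagonal of the attention matrix and (ii) attention-based uniqueness and diversity estimates per frame. This is not the paper's implementation: the block size, the cosine-based diversity proxy, and the way the three signals are fused into a frame score are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def concentrated_attention(feats: torch.Tensor, block_size: int) -> torch.Tensor:
    """feats: (T, D) precomputed frame features; returns (T,) importance scores."""
    T, D = feats.shape
    scores = feats @ feats.T / D ** 0.5              # (T, T) raw attention scores

    # Keep only non-overlapping blocks on the main diagonal of the matrix.
    idx = torch.arange(T)
    same_block = (idx.unsqueeze(0) // block_size) == (idx.unsqueeze(1) // block_size)
    attn = F.softmax(scores.masked_fill(~same_block, float("-inf")), dim=-1)

    # Uniqueness proxy (assumed, not the paper's formula): a frame that receives
    # little attention from the other frames of its block is treated as more unique.
    received = attn.sum(dim=0) - attn.diag()
    uniqueness = 1.0 - received / block_size

    # Diversity proxy (assumed): average cosine dissimilarity to the block's frames.
    unit = F.normalize(feats, dim=1)
    sim = unit @ unit.T
    diversity = 1.0 - (sim * same_block).sum(dim=1) / same_block.sum(dim=1)

    # Fuse the attended features with the two estimates into per-frame scores.
    context = attn @ feats                            # (T, D) attended features
    return torch.sigmoid(context.mean(dim=1) + uniqueness + diversity)

# Toy usage: 240 frames of 1024-d CNN features, attention blocks of 60 frames.
frame_importance = concentrated_attention(torch.randn(240, 1024), block_size=60)
```

Note how the block-diagonal mask makes each frame interact with only block_size neighbors rather than all T frames; restricting the modeled portion of the attention matrix in this way is consistent with the abstract's claim that concentrated attention drastically reduces what the network has to learn.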