Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization.

Journal of Imaging · IF 2.7 · Q3 (Imaging Science & Photographic Technology) · Published 2024-09-14 · DOI: 10.3390/jimaging10090229
Zongshang Pang, Yuta Nakashima, Mayu Otani, Hajime Nagahara
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11433058/pdf/

Abstract

Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics characterizing a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that the metrics capture well the diversity and representativeness of frames commonly used for the unsupervised generation of video summaries, demonstrating competitive or better performance compared to past methods while requiring no training. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and, thus, improve the evaluated performance on the public benchmarks TVSum and SumMe.
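The abstract does not spell out how the three metrics are computed; the sketch below is one plausible cosine-similarity reading of the terms, not the paper's actual definitions. It assumes each frame is represented by a pre-trained feature vector, and scores "local dissimilarity" as distance to temporal neighbors, "global consistency" as similarity to the video-level mean feature, and "uniqueness" as distance to the most similar other frame.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def keyframe_scores(features, window=5):
    """Score frames with three heuristic metrics (hypothetical definitions).

    features : array of shape (n_frames, dim), e.g. pre-trained image features.
    Returns (local_dissimilarity, global_consistency, uniqueness),
    each of shape (n_frames,); higher values suggest a more desirable keyframe
    under that criterion (lower for global_consistency outliers).
    """
    f = l2_normalize(np.asarray(features, dtype=np.float64))
    n = f.shape[0]
    sim = f @ f.T  # pairwise cosine similarities

    # Local dissimilarity: 1 - mean similarity to frames in a temporal window.
    local_dissim = np.zeros(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbors = [j for j in range(lo, hi) if j != i]
        local_dissim[i] = 1.0 - sim[i, neighbors].mean()

    # Global consistency: similarity to the (normalized) mean feature.
    global_center = l2_normalize(f.mean(axis=0))
    global_consistency = f @ global_center

    # Uniqueness: 1 - similarity to the most similar *other* frame.
    np.fill_diagonal(sim, -np.inf)
    uniqueness = 1.0 - sim.max(axis=1)

    return local_dissim, global_consistency, uniqueness
```

In a zero-shot setting like the one the abstract describes, scores of this kind could be combined (e.g. summed after normalization) and the top-scoring frames taken as the summary, with no summary annotations needed.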
