Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Yuan Tian, Guo Lu, Guangtao Zhai
{"title":"Free-VSC:来自视觉基础模型的自由语义,用于无监督视频语义压缩","authors":"Yuan Tian, Guo Lu, Guangtao Zhai","doi":"arxiv-2409.11718","DOIUrl":null,"url":null,"abstract":"Unsupervised video semantic compression (UVSC), i.e., compressing videos to\nbetter support various analysis tasks, has recently garnered attention.\nHowever, the semantic richness of previous methods remains limited, due to the\nsingle semantic learning objective, limited training data, etc. To address\nthis, we propose to boost the UVSC task by absorbing the off-the-shelf rich\nsemantics from VFMs. Specifically, we introduce a VFMs-shared semantic\nalignment layer, complemented by VFM-specific prompts, to flexibly align\nsemantics between the compressed video and various VFMs. This allows different\nVFMs to collaboratively build a mutually-enhanced semantic space, guiding the\nlearning of the compression model. Moreover, we introduce a dynamic\ntrajectory-based inter-frame compression scheme, which first estimates the\nsemantic trajectory based on the historical content, and then traverses along\nthe trajectory to predict the future semantics as the coding context. This\nreduces the overall bitcost of the system, further improving the compression\nefficiency. Our approach outperforms previous coding methods on three\nmainstream tasks and six datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression\",\"authors\":\"Yuan Tian, Guo Lu, Guangtao Zhai\",\"doi\":\"arxiv-2409.11718\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Unsupervised video semantic compression (UVSC), i.e., compressing videos to\\nbetter support various analysis tasks, has recently garnered attention.\\nHowever, the semantic richness of previous methods remains limited, due to the\\nsingle semantic learning objective, limited training data, etc. To address\\nthis, we propose to boost the UVSC task by absorbing the off-the-shelf rich\\nsemantics from VFMs. Specifically, we introduce a VFMs-shared semantic\\nalignment layer, complemented by VFM-specific prompts, to flexibly align\\nsemantics between the compressed video and various VFMs. This allows different\\nVFMs to collaboratively build a mutually-enhanced semantic space, guiding the\\nlearning of the compression model. Moreover, we introduce a dynamic\\ntrajectory-based inter-frame compression scheme, which first estimates the\\nsemantic trajectory based on the historical content, and then traverses along\\nthe trajectory to predict the future semantics as the coding context. This\\nreduces the overall bitcost of the system, further improving the compression\\nefficiency. 
Our approach outperforms previous coding methods on three\\nmainstream tasks and six datasets.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"40 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11718\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11718","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, owing to their single semantic learning objective, limited training data, and so on. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from visual foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space that guides the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory from the historical content and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
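To make the first idea concrete, below is a minimal, hypothetical PyTorch sketch of a VFM-shared alignment layer in the spirit the abstract describes: one projection shared across all VFMs, plus a small learnable prompt and output head per VFM, used to compute an alignment loss between the compressed video's features and each frozen VFM's features. The class, argument, and dictionary-key names (`SharedSemanticAlignment`, `vfm_dims`, `prompt_len`, the cosine loss) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSemanticAlignment(nn.Module):
    """Sketch: one alignment layer shared by all VFMs, plus a learnable
    prompt and head per VFM, aligning decoded-video features to each
    frozen VFM's semantic space (hypothetical design)."""

    def __init__(self, feat_dim: int, vfm_dims: dict, prompt_len: int = 4):
        super().__init__()
        # Shared projection applied to the compressed video's features.
        self.shared = nn.Linear(feat_dim, feat_dim)
        # One learnable prompt and output head per VFM (names are placeholders).
        self.prompts = nn.ParameterDict({
            name: nn.Parameter(torch.randn(prompt_len, feat_dim) * 0.02)
            for name in vfm_dims
        })
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, dim) for name, dim in vfm_dims.items()
        })

    def forward(self, video_feat: torch.Tensor, vfm_feats: dict):
        """video_feat: (B, D) pooled feature of the decoded video.
        vfm_feats: target features from each frozen VFM, keyed by name."""
        z = self.shared(video_feat)
        losses = {}
        for name, target in vfm_feats.items():
            # Condition the shared feature on this VFM's prompt (mean-pooled here).
            conditioned = z + self.prompts[name].mean(dim=0)
            pred = self.heads[name](conditioned)
            # Cosine alignment loss against the frozen VFM feature.
            losses[name] = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        return sum(losses.values()) / len(losses), losses
```

In such a setup, the averaged alignment loss would be added to the rate-distortion objective of the compression model, so that gradients from several frozen VFMs jointly shape the compressed representation.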
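Likewise, a minimal sketch of trajectory-based inter-frame prediction, assuming a recurrent model (a GRU here) that extrapolates the next frame's semantics from the coded history so that only the residual needs to be signaled; the module, its dimensions, and the residual coding step are placeholders, not the paper's actual design.

```python
import torch
import torch.nn as nn


class TrajectoryPredictor(nn.Module):
    """Sketch: summarize the semantic features of past frames and
    extrapolate the next frame's semantics, which serves as the coding
    context so only the residual is coded (hypothetical design)."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.to_next = nn.Linear(hidden, dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T, D) semantic features of previously coded frames.
        _, h = self.rnn(history)           # h: (1, B, hidden)
        return self.to_next(h.squeeze(0))  # predicted semantics for frame T+1


if __name__ == "__main__":
    B, T, D = 2, 5, 128
    predictor = TrajectoryPredictor(D)
    past = torch.randn(B, T, D)       # already-coded frames
    current = torch.randn(B, D)       # semantics of the frame being coded
    residual = current - predictor(past)  # smaller residual -> lower bit cost
```

The better the trajectory prediction, the smaller the residual that must be entropy-coded, which is how such a scheme would reduce the overall bit cost.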