Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2024-09-18. DOI: arxiv-2409.11718

Yuan Tian, Guo Lu, Guangtao Zhai


Citations: 0

Abstract

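Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, owing to a single semantic learning objective, limited training data, and similar constraints. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics of visual foundation models (VFMs). Specifically, we introduce a semantic alignment layer shared across VFMs, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space that guides the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory from the historical content and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.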
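The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`alignment_loss`, `predict_context`, `shared_w`, `prompts`) is hypothetical, and the abstract does not specify these details. The sketch assumes all VFMs emit features of the same dimension, uses cosine distance as the alignment objective, and swaps in plain constant-velocity extrapolation for the paper's trajectory estimation.

```python
import numpy as np


def alignment_loss(compressed_feat, vfm_feats, shared_w, prompts):
    """Mean cosine distance between projected video features and VFM features.

    Hypothetical stand-in for a shared alignment layer with per-VFM prompts:
      compressed_feat: (d_c,) feature of the decoded (compressed) video
      vfm_feats:       dict name -> (d_v,) target feature from that VFM
      shared_w:        (d_v, d_c) projection shared by all VFMs
      prompts:         dict name -> (d_c,) VFM-specific offset added before projection
    """
    losses = []
    for name, target in vfm_feats.items():
        # Same projection for every VFM; only the additive prompt differs.
        projected = shared_w @ (compressed_feat + prompts[name])
        denom = np.linalg.norm(projected) * np.linalg.norm(target) + 1e-8
        losses.append(1.0 - float(projected @ target) / denom)
    return float(np.mean(losses))


def predict_context(history):
    """Predict the next semantic feature from past ones.

    Assumption: the trajectory is approximated by constant velocity, i.e. the
    difference of the last two features is extrapolated one step forward.
    """
    history = np.asarray(history, dtype=float)
    if len(history) < 2:
        return history[-1].copy()
    velocity = history[-1] - history[-2]
    return history[-1] + velocity


# Usage: code only the residual between the current feature and the prediction.
past = [[0.0, 0.0], [1.0, 2.0]]
current = np.array([2.1, 3.9])
residual = current - predict_context(past)
```

In this toy setting, the better the predicted context matches the current frame's semantics, the smaller the residual that must be coded, which is the intuition behind the bit-cost reduction claimed in the abstract.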


