
arXiv - CS - Multimedia: Latest Publications

Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
Pub Date : 2024-09-03 DOI: arxiv-2409.02101
Jiaqi Xu, Mengyang Wu, Xiaowei Hu, Chi-Wing Fu, Qi Dou, Pheng-Ann Heng
This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
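To make the clearness pseudo-labeling idea concrete, here is a minimal sketch (not the authors' code): `vlm_similarity` stands in for any CLIP-style image-text scorer, and the prompt wording and threshold are illustrative assumptions.

```python
import math

CLEAR_PROMPTS = ["a clear photo taken in good weather"]                 # assumed prompt wording
DEGRADED_PROMPTS = ["a rainy photo", "a hazy photo", "a snowy photo"]   # assumed prompt wording

def clearness_pseudo_label(vlm_similarity, image) -> float:
    """Softmax over mean clear vs. degraded similarities: probability the image looks clear."""
    clear = sum(vlm_similarity(image, p) for p in CLEAR_PROMPTS) / len(CLEAR_PROMPTS)
    degraded = sum(vlm_similarity(image, p) for p in DEGRADED_PROMPTS) / len(DEGRADED_PROMPTS)
    e_clear, e_degraded = math.exp(clear), math.exp(degraded)
    return e_clear / (e_clear + e_degraded)

def select_pseudo_labels(vlm_similarity, restored_images, threshold=0.7):
    """Keep restored real-world outputs the VLM judges sufficiently clear, to supervise training."""
    return [img for img in restored_images
            if clearness_pseudo_label(vlm_similarity, img) >= threshold]
```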
Citations: 0
Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning
Pub Date : 2024-09-03 DOI: arxiv-2409.02108
Xiaowei Hu, Zhenghao Xing, Tianyu Wang, Chi-Wing Fu, Pheng-Ann Heng
Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.
Citations: 0
Privacy-Preserving Multimedia Mobile Cloud Computing Using Protective Perturbation
Pub Date : 2024-09-03 DOI: arxiv-2409.01710
Zhongze Tang, Mengmei Ye, Yao Liu, Sheng Wei
Mobile cloud computing has been adopted in many multimedia applications, where the resource-constrained mobile device sends multimedia data (e.g., images) to remote cloud servers to request computation-intensive multimedia services (e.g., image recognition). While significantly improving the performance of the mobile applications, the cloud-based mechanism often causes privacy concerns as the multimedia data and services are offloaded from the trusted user device to untrusted cloud servers. Several recent studies have proposed perturbation-based privacy preserving mechanisms, which obfuscate the offloaded multimedia data to eliminate privacy exposures without affecting the functionality of the remote multimedia services. However, the existing privacy protection approaches require the deployment of computation-intensive perturbation generation on the resource-constrained mobile devices. Also, the obfuscated images are typically not compliant with the standard image compression algorithms and suffer from significant bandwidth consumption. In this paper, we develop a novel privacy-preserving multimedia mobile cloud computing framework, namely $PMC^2$, to address the resource and bandwidth challenges. $PMC^2$ employs secure confidential computing in the cloud to deploy the perturbation generator, which addresses the resource challenge while maintaining the privacy. Furthermore, we develop a neural compressor specifically trained to compress the perturbed images in order to address the bandwidth challenge. We implement $PMC^2$ in an end-to-end mobile cloud computing system, based on which our evaluations demonstrate superior latency, power efficiency, and bandwidth consumption achieved by $PMC^2$ while maintaining high accuracy in the target multimedia service.
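As a rough illustration of the perturbation-then-offload pipeline the abstract describes (not the $PMC^2$ implementation: the noise model below is a stand-in for the paper's perturbation generator, and the codec and recognition service are placeholders):

```python
import numpy as np

def protective_perturbation(image: np.ndarray, strength: float = 8.0) -> np.ndarray:
    """Obfuscate the image before the untrusted service sees it (illustrative noise model only)."""
    noise = np.random.uniform(-strength, strength, size=image.shape)
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def offload(image: np.ndarray, compress, recognize) -> str:
    """Perturb, compress with a learned codec trained on perturbed images, then query the service."""
    perturbed = protective_perturbation(image)
    bitstream = compress(perturbed)   # placeholder for the neural compressor
    return recognize(bitstream)       # placeholder for the cloud multimedia service
```

In $PMC^2$ itself, per the abstract, the perturbation generator runs inside cloud confidential computing rather than on the resource-constrained device.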
Citations: 0
Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation
Pub Date : 2024-09-03 DOI: arxiv-2409.02049
Ruixin Shi, Weijia Guo, Shiming Ge
Low-resolution face recognition is a challenging task due to the lack of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can well guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address that, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In the approach, the student distills knowledge from the high-resolution teacher at both the instance level and the relation level, providing sufficient cross-resolution knowledge transfer. Then, the learned student can adapt to recognize low-resolution faces with adaptive batch normalization during inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
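A minimal sketch of instance-level and relation-level distillation terms as described in the abstract (an interpretation, not the authors' released code; the loss weights are assumptions):

```python
import torch
import torch.nn.functional as F

def instance_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Instance level: match each low-resolution student embedding to its high-resolution teacher."""
    return F.mse_loss(F.normalize(student_feat, dim=1), F.normalize(teacher_feat, dim=1))

def relation_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Relation level: match the pairwise similarity structure across the batch."""
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1)
    return F.mse_loss(s @ s.t(), t @ t.t())

def distillation_loss(student_feat, teacher_feat, alpha=1.0, beta=1.0):
    # Weighted sum of the two terms; alpha and beta are illustrative hyperparameters.
    return alpha * instance_loss(student_feat, teacher_feat) + beta * relation_loss(student_feat, teacher_feat)
```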
Citations: 0
PRoGS: Progressive Rendering of Gaussian Splats
Pub Date : 2024-09-03 DOI: arxiv-2409.01761
Brent Zoomers, Maarten Wijnants, Ivan Molenaers, Joni Vanherck, Jeroen Put, Lode Jorissen, Nick Michiels
Over the past year, 3D Gaussian Splatting (3DGS) has received significant attention for its ability to represent 3D scenes in a perceptually accurate manner. However, it can require a substantial amount of storage since each splat's individual data must be stored. While compression techniques offer a potential solution by reducing the memory footprint, they still necessitate retrieving the entire scene before any part of it can be rendered. In this work, we introduce a novel approach for progressively rendering such scenes, aiming to display visible content that closely approximates the final scene as early as possible without loading the entire scene into memory. This approach benefits both on-device rendering applications limited by memory constraints and streaming applications where minimal bandwidth usage is preferred. To achieve this, we approximate the contribution of each Gaussian to the final scene and construct an order of prioritization on their inclusion in the rendering process. Additionally, we demonstrate that our approach can be combined with existing compression methods to progressively render (and stream) 3DGS scenes, optimizing bandwidth usage by focusing on the most important splats within a scene. Overall, our work establishes a foundation for making remotely hosted 3DGS content more quickly accessible to end-users in over-the-top consumption scenarios, with our results showing significant improvements in quality across all metrics compared to existing methods.
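The prioritization step can be pictured with a simple heuristic sketch (the contribution score below, opacity times a volume proxy, is an assumption; the paper derives its own approximation of each Gaussian's contribution to the final scene):

```python
import numpy as np

def prioritize_splats(opacity: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Return splat indices ordered from most to least important (crude contribution proxy)."""
    contribution = opacity * scales.prod(axis=1)   # opacity x volume proxy, illustrative only
    return np.argsort(-contribution)

def progressive_chunks(order: np.ndarray, chunk_size: int = 50_000):
    """Yield index batches so rendering can begin before the full scene has been transferred."""
    for start in range(0, len(order), chunk_size):
        yield order[start:start + chunk_size]
```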
Citations: 0
Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition
Pub Date : 2024-09-03 DOI: arxiv-2409.01534
Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
We propose a new strategy called think twice before recognizing to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to the complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions with center coordinate prompt optimization help the LMM to locate the target traffic sign in the original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.
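A minimal sketch of the three-step prompting flow (hypothetical prompt wording; `ask_lmm` is a placeholder for whichever large multimodal model is queried, not an API from the paper):

```python
def recognize_sign(ask_lmm, road_image, sign_crop, template_signs, center_xy):
    # 1) Context description: locate the target sign near the given center coordinates
    #    and discard answers about unrelated signs in the road image.
    context = ask_lmm(road_image,
                      f"Describe the traffic sign closest to {center_xy}; ignore other signs.")
    # 2) Characteristic description: few-shot in-context comparison with template signs.
    characteristic = ask_lmm(sign_crop,
                             "Describe this sign's shape, color, and symbol, following the "
                             "template examples.", examples=template_signs)
    # 3) Differential description: contrast with similar classes before the final answer.
    return ask_lmm(sign_crop,
                   f"Context: {context}\nCharacteristics: {characteristic}\n"
                   "List differences from visually similar sign classes, then output the class.")
```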
Citations: 0
Multi-Reference Generative Face Video Compression with Contrastive Learning
Pub Date : 2024-09-02 DOI: arxiv-2409.01029
Goluck Konuko, Giuseppe Valenzise
Generative face video coding (GFVC) has been demonstrated as a potential approach to low-latency, low bitrate video conferencing. GFVC frameworks achieve an extreme gain in coding efficiency with over 70% bitrate savings when compared to conventional codecs at bitrates below 10 kbps. In recent MPEG/JVET standardization efforts, all the information required to reconstruct video sequences using GFVC frameworks is adopted as part of the supplemental enhancement information (SEI) in existing compression pipelines. In light of this development, we aim to address a challenge that has been weakly addressed in prior GFVC frameworks, i.e., reconstruction drift as the distance between the reference and target frames increases. This challenge creates the need to update the reference buffer more frequently by transmitting more Intra-refresh frames, which are the most expensive element of the GFVC bitstream. To overcome this problem, we propose instead multiple reference animation as a robust approach to minimizing reconstruction drift, especially when used in a bi-directional prediction mode. Further, we propose a contrastive learning formulation for multi-reference animation. We observe that using a contrastive learning framework enhances the representation capabilities of the animation generator. The resulting framework, MRDAC (Multi-Reference Deep Animation Codec), can therefore be used to compress longer sequences with fewer reference frames or achieve a significant gain in reconstruction accuracy at comparable bitrates to previous frameworks. Quantitative and qualitative results show significant coding and reconstruction quality gains compared to previous GFVC methods, and more accurate animation quality in the presence of large pose and facial expression changes.
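The contrastive formulation can be sketched as a standard InfoNCE loss over animation-generator features (an assumption based on the abstract, not MRDAC's actual training code; the temperature is illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(animated_feat: torch.Tensor, target_feat: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """animated_feat, target_feat: (B, D) batches; matching pairs share the same row index."""
    a = F.normalize(animated_feat, dim=1)
    t = F.normalize(target_feat, dim=1)
    logits = a @ t.t() / temperature                      # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```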
Citations: 0
Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model
Pub Date : 2024-09-01 DOI: arxiv-2409.00597
Fuqiang Niu, Zebang Cheng, Xianghua Fu, Xiaojiang Peng, Genan Dai, Yin Chen, Hu Huang, Bowen Zhang
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content including text and images, multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD) that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.
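For intuition only, a generic fusion head over text and image embeddings (this is not the MLLM-SD architecture; the embedding dimensions and the three-way stance label set are assumptions):

```python
import torch
import torch.nn as nn

class StanceFusionHead(nn.Module):
    """Toy joint text-image stance classifier for a single conversation turn."""
    def __init__(self, text_dim: int = 768, image_dim: int = 512, num_stances: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_stances),   # e.g., favor / against / neutral (assumed labels)
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([text_emb, image_emb], dim=-1))
```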
Citations: 0
MetaDigiHuman: Haptic Interfaces for Digital Humans in Metaverse
Pub Date : 2024-09-01 DOI: arxiv-2409.00615
Senthil Kumar Jagatheesaperumal, Praveen Sathikumar, Harikrishnan Rajan
The way we engage with digital spaces and the digital world has undergone rapid changes in recent years, largely due to the emergence of the Metaverse. As technology continues to advance, the demand for sophisticated and immersive interfaces to interact with the Metaverse has become increasingly crucial. Haptic interfaces have been developed to meet this need and provide users with tactile feedback and realistic touch sensations. These interfaces play a vital role in creating a more authentic and immersive experience within the Metaverse. This article introduces the concept of MetaDigiHuman, a groundbreaking framework that combines blended digital humans and haptic interfaces. By harnessing cutting-edge technologies, MetaDigiHuman enables seamless and immersive interaction within the Metaverse. Through this framework, users can simulate the sensation of touching, feeling, and interacting with digital beings as if they were physically present in the environments, offering a more compelling and immersive experience within the Metaverse.
Citations: 0
Video to Music Moment Retrieval
Pub Date : 2024-08-30 DOI: arxiv-2408.16990
Zijie Xin, Minquan Wang, Ye Ma, Bo Wang, Quan Chen, Peng Jiang, Xirong Li
Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music has to be cut to a shorter moment, there is a clear gap between the practical need and VMR. In order to bridge the gap, we propose in this paper video to music moment retrieval (VMMR) as a new task. To tackle the new task, we build a comprehensive dataset, Ad-Moment, which contains 50K short videos annotated with music moments, and develop a two-stage approach. In particular, given a test video, the most similar music is retrieved from a given collection. Then, a Transformer-based music moment localization is performed. We term this approach Retrieval and Localization (ReaL). Extensive experiments on real-world datasets verify the effectiveness of the proposed method for VMMR.
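A minimal sketch of the two-stage flow, retrieval followed by moment localization (placeholder embeddings and localizer, not the ReaL model itself):

```python
import torch
import torch.nn.functional as F

def retrieve_track(video_emb: torch.Tensor, track_embs: torch.Tensor) -> int:
    """Stage 1: pick the music track whose embedding is closest to the video embedding."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), track_embs, dim=1)
    return int(torch.argmax(sims))

def localize_moment(localizer, video_emb: torch.Tensor, track_segments) -> int:
    """Stage 2: score candidate segments of the retrieved track and return the best moment."""
    scores = localizer(video_emb, track_segments)   # placeholder Transformer-based localizer
    return int(torch.argmax(scores))
```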
Citations: 0