This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework that employs vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach uses vision-language models to assess image clearness and provide semantics on real data, which serve as supervision signals for training restoration models. For clearness enhancement, we train on real-world data with a dual-step strategy that combines pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, as demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
{"title":"Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models","authors":"Jiaqi Xu, Mengyang Wu, Xiaowei Hu, Chi-Wing Fu, Qi Dou, Pheng-Ann Heng","doi":"arxiv-2409.02101","DOIUrl":"https://doi.org/arxiv-2409.02101","url":null,"abstract":"This paper addresses the limitations of adverse weather image restoration\u0000approaches trained on synthetic data when applied to real-world scenarios. We\u0000formulate a semi-supervised learning framework employing vision-language models\u0000to enhance restoration performance across diverse adverse weather conditions in\u0000real-world settings. Our approach involves assessing image clearness and\u0000providing semantics using vision-language models on real data, serving as\u0000supervision signals for training restoration models. For clearness enhancement,\u0000we use real-world data, utilizing a dual-step strategy with pseudo-labels\u0000assessed by vision-language models and weather prompt learning. For semantic\u0000enhancement, we integrate real-world data by adjusting weather conditions in\u0000vision-language model descriptions while preserving semantic meaning.\u0000Additionally, we introduce an effective training strategy to bootstrap\u0000restoration performance. Our approach achieves superior results in real-world\u0000adverse weather image restoration, demonstrated through qualitative and\u0000quantitative comparisons with state-of-the-art works.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.
{"title":"Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning","authors":"Xiaowei Hu, Zhenghao Xing, Tianyu Wang, Chi-Wing Fu, Pheng-Ann Heng","doi":"arxiv-2409.02108","DOIUrl":"https://doi.org/arxiv-2409.02108","url":null,"abstract":"Shadows are formed when light encounters obstacles, leading to areas of\u0000diminished illumination. In computer vision, shadow detection, removal, and\u0000generation are crucial for enhancing scene understanding, refining image\u0000quality, ensuring visual consistency in video editing, and improving virtual\u0000environments. This paper presents a comprehensive survey of shadow detection,\u0000removal, and generation in images and videos within the deep learning landscape\u0000over the past decade, covering tasks, deep models, datasets, and evaluation\u0000metrics. Our key contributions include a comprehensive survey of shadow\u0000analysis, standardization of experimental comparisons, exploration of the\u0000relationships among model size, speed, and performance, a cross-dataset\u0000generalization study, identification of open issues and future directions, and\u0000provision of publicly available resources to support further research.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile cloud computing has been adopted in many multimedia applications, where the resource-constrained mobile device sends multimedia data (e.g., images) to remote cloud servers to request computation-intensive multimedia services (e.g., image recognition). While significantly improving the performance of mobile applications, the cloud-based mechanism often raises privacy concerns, as the multimedia data and services are offloaded from the trusted user device to untrusted cloud servers. Several recent studies have proposed perturbation-based privacy-preserving mechanisms, which obfuscate the offloaded multimedia data to eliminate privacy exposures without affecting the functionality of the remote multimedia services. However, the existing privacy protection approaches require the deployment of computation-intensive perturbation generation on the resource-constrained mobile devices. Also, the obfuscated images are typically not compliant with standard image compression algorithms and suffer from significant bandwidth consumption. In this paper, we develop a novel privacy-preserving multimedia mobile cloud computing framework, namely $PMC^2$, to address the resource and bandwidth challenges. $PMC^2$ employs secure confidential computing in the cloud to deploy the perturbation generator, which addresses the resource challenge while maintaining privacy. Furthermore, we develop a neural compressor specifically trained to compress the perturbed images in order to address the bandwidth challenge. We implement $PMC^2$ in an end-to-end mobile cloud computing system, and our evaluations demonstrate the superior latency, power efficiency, and bandwidth consumption achieved by $PMC^2$ while maintaining high accuracy in the target multimedia service.
{"title":"Privacy-Preserving Multimedia Mobile Cloud Computing Using Protective Perturbation","authors":"Zhongze Tang, Mengmei Ye, Yao Liu, Sheng Wei","doi":"arxiv-2409.01710","DOIUrl":"https://doi.org/arxiv-2409.01710","url":null,"abstract":"Mobile cloud computing has been adopted in many multimedia applications,\u0000where the resource-constrained mobile device sends multimedia data (e.g.,\u0000images) to remote cloud servers to request computation-intensive multimedia\u0000services (e.g., image recognition). While significantly improving the\u0000performance of the mobile applications, the cloud-based mechanism often causes\u0000privacy concerns as the multimedia data and services are offloaded from the\u0000trusted user device to untrusted cloud servers. Several recent studies have\u0000proposed perturbation-based privacy preserving mechanisms, which obfuscate the\u0000offloaded multimedia data to eliminate privacy exposures without affecting the\u0000functionality of the remote multimedia services. However, the existing privacy\u0000protection approaches require the deployment of computation-intensive\u0000perturbation generation on the resource-constrained mobile devices. Also, the\u0000obfuscated images are typically not compliant with the standard image\u0000compression algorithms and suffer from significant bandwidth consumption. In\u0000this paper, we develop a novel privacy-preserving multimedia mobile cloud\u0000computing framework, namely $PMC^2$, to address the resource and bandwidth\u0000challenges. $PMC^2$ employs secure confidential computing in the cloud to\u0000deploy the perturbation generator, which addresses the resource challenge while\u0000maintaining the privacy. Furthermore, we develop a neural compressor\u0000specifically trained to compress the perturbed images in order to address the\u0000bandwidth challenge. We implement $PMC^2$ in an end-to-end mobile cloud\u0000computing system, based on which our evaluations demonstrate superior latency,\u0000power efficiency, and bandwidth consumption achieved by $PMC^2$ while\u0000maintaining high accuracy in the target multimedia service.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-resolution face recognition is a challenging task due to the absence of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can effectively guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address this, we split the knowledge transfer process into distillation and adaptation steps and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In our approach, the student distills knowledge from the high-resolution teacher at both the instance level and the relation level, providing sufficient cross-resolution knowledge transfer. The learned student can then adapt to recognize low-resolution faces via adaptive batch normalization during inference. In this manner, the capability of recovering missing details of familiar low-resolution faces is effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
{"title":"Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation","authors":"Ruixin Shi, Weijia Guo, Shiming Ge","doi":"arxiv-2409.02049","DOIUrl":"https://doi.org/arxiv-2409.02049","url":null,"abstract":"Low-resolution face recognition is a challenging task due to the missing of\u0000informative details. Recent approaches based on knowledge distillation have\u0000proven that high-resolution clues can well guide low-resolution face\u0000recognition via proper knowledge transfer. However, due to the distribution\u0000difference between training and testing faces, the learned models often suffer\u0000from poor adaptability. To address that, we split the knowledge transfer\u0000process into distillation and adaptation steps, and propose an adaptable\u0000instance-relation distillation approach to facilitate low-resolution face\u0000recognition. In the approach, the student distills knowledge from\u0000high-resolution teacher in both instance level and relation level, providing\u0000sufficient cross-resolution knowledge transfer. Then, the learned student can\u0000be adaptable to recognize low-resolution faces with adaptive batch\u0000normalization in inference. In this manner, the capability of recovering\u0000missing details of familiar low-resolution faces can be effectively enhanced,\u0000leading to a better knowledge transfer. Extensive experiments on low-resolution\u0000face recognition clearly demonstrate the effectiveness and adaptability of our\u0000approach.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brent Zoomers, Maarten Wijnants, Ivan Molenaers, Joni Vanherck, Jeroen Put, Lode Jorissen, Nick Michiels
Over the past year, 3D Gaussian Splatting (3DGS) has received significant attention for its ability to represent 3D scenes in a perceptually accurate manner. However, it can require a substantial amount of storage, since the individual attributes of every splat must be kept. While compression techniques offer a potential solution by reducing the memory footprint, they still necessitate retrieving the entire scene before any part of it can be rendered. In this work, we introduce a novel approach for progressively rendering such scenes, aiming to display visible content that closely approximates the final scene as early as possible without loading the entire scene into memory. This approach benefits both on-device rendering applications limited by memory constraints and streaming applications where minimal bandwidth usage is preferred. To achieve this, we approximate the contribution of each Gaussian to the final scene and construct a prioritization order for their inclusion in the rendering process. Additionally, we demonstrate that our approach can be combined with existing compression methods to progressively render (and stream) 3DGS scenes, optimizing bandwidth usage by focusing on the most important splats within a scene. Overall, our work establishes a foundation for making remotely hosted 3DGS content more quickly accessible to end-users in over-the-top consumption scenarios, with our results showing significant improvements in quality across all metrics compared to existing methods.
{"title":"PRoGS: Progressive Rendering of Gaussian Splats","authors":"Brent Zoomers, Maarten Wijnants, Ivan Molenaers, Joni Vanherck, Jeroen Put, Lode Jorissen, Nick Michiels","doi":"arxiv-2409.01761","DOIUrl":"https://doi.org/arxiv-2409.01761","url":null,"abstract":"Over the past year, 3D Gaussian Splatting (3DGS) has received significant\u0000attention for its ability to represent 3D scenes in a perceptually accurate\u0000manner. However, it can require a substantial amount of storage since each\u0000splat's individual data must be stored. While compression techniques offer a\u0000potential solution by reducing the memory footprint, they still necessitate\u0000retrieving the entire scene before any part of it can be rendered. In this\u0000work, we introduce a novel approach for progressively rendering such scenes,\u0000aiming to display visible content that closely approximates the final scene as\u0000early as possible without loading the entire scene into memory. This approach\u0000benefits both on-device rendering applications limited by memory constraints\u0000and streaming applications where minimal bandwidth usage is preferred. To\u0000achieve this, we approximate the contribution of each Gaussian to the final\u0000scene and construct an order of prioritization on their inclusion in the\u0000rendering process. Additionally, we demonstrate that our approach can be\u0000combined with existing compression methods to progressively render (and stream)\u00003DGS scenes, optimizing bandwidth usage by focusing on the most important\u0000splats within a scene. Overall, our work establishes a foundation for making\u0000remotely hosted 3DGS content more quickly accessible to end-users in\u0000over-the-top consumption scenarios, with our results showing significant\u0000improvements in quality across all metrics compared to existing methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a new strategy, called "think twice before recognizing", to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions, with center-coordinate prompt optimization, help the LMM locate the target traffic sign in original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.
{"title":"Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition","authors":"Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama","doi":"arxiv-2409.01534","DOIUrl":"https://doi.org/arxiv-2409.01534","url":null,"abstract":"We propose a new strategy called think twice before recognizing to improve\u0000fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is\u0000difficult due to the complex road conditions, and existing approaches\u0000particularly struggle with cross-country TSR when data is lacking. Our strategy\u0000achieves effective fine-grained TSR by stimulating the multiple-thinking\u0000capability of large multimodal models (LMM). We introduce context,\u0000characteristic, and differential descriptions to design multiple thinking\u0000processes for the LMM. The context descriptions with center coordinate prompt\u0000optimization help the LMM to locate the target traffic sign in the original\u0000road images containing multiple traffic signs and filter irrelevant answers\u0000through the proposed prior traffic sign hypothesis. The characteristic\u0000description is based on few-shot in-context learning of template traffic signs,\u0000which decreases the cross-domain difference and enhances the fine-grained\u0000recognition capability of the LMM. The differential descriptions of similar\u0000traffic signs optimize the multimodal thinking capability of the LMM. The\u0000proposed method is independent of training data and requires only simple and\u0000uniform instructions. We conducted extensive experiments on three benchmark\u0000datasets and two real-world datasets from different countries, and the proposed\u0000method achieves state-of-the-art TSR results on all five datasets.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generative face video coding (GFVC) has been demonstrated as a potential approach to low-latency, low-bitrate video conferencing. GFVC frameworks achieve an extreme gain in coding efficiency, with over 70% bitrate savings compared to conventional codecs at bitrates below 10 kbps. In recent MPEG/JVET standardization efforts, all the information required to reconstruct video sequences using GFVC frameworks is adopted as part of the supplemental enhancement information (SEI) in existing compression pipelines. In light of this development, we aim to address a challenge that has been weakly addressed in prior GFVC frameworks: reconstruction drift as the distance between the reference and target frames increases. This challenge creates the need to update the reference buffer more frequently by transmitting more Intra-refresh frames, which are the most expensive element of the GFVC bitstream. To overcome this problem, we instead propose multiple-reference animation as a robust approach to minimizing reconstruction drift, especially when used in a bi-directional prediction mode. Further, we propose a contrastive learning formulation for multi-reference animation. We observe that using a contrastive learning framework enhances the representation capabilities of the animation generator. The resulting framework, MRDAC (Multi-Reference Deep Animation Codec), can therefore be used to compress longer sequences with fewer reference frames or to achieve a significant gain in reconstruction accuracy at bitrates comparable to previous frameworks. Quantitative and qualitative results show significant coding and reconstruction quality gains compared to previous GFVC methods, and more accurate animation quality in the presence of large pose and facial expression changes.
{"title":"Multi-Reference Generative Face Video Compression with Contrastive Learning","authors":"Goluck Konuko, Giuseppe Valenzise","doi":"arxiv-2409.01029","DOIUrl":"https://doi.org/arxiv-2409.01029","url":null,"abstract":"Generative face video coding (GFVC) has been demonstrated as a potential\u0000approach to low-latency, low bitrate video conferencing. GFVC frameworks\u0000achieve an extreme gain in coding efficiency with over 70% bitrate savings when\u0000compared to conventional codecs at bitrates below 10kbps. In recent MPEG/JVET\u0000standardization efforts, all the information required to reconstruct video\u0000sequences using GFVC frameworks are adopted as part of the supplemental\u0000enhancement information (SEI) in existing compression pipelines. In light of\u0000this development, we aim to address a challenge that has been weakly addressed\u0000in prior GFVC frameworks, i.e., reconstruction drift as the distance between\u0000the reference and target frames increases. This challenge creates the need to\u0000update the reference buffer more frequently by transmitting more Intra-refresh\u0000frames, which are the most expensive element of the GFVC bitstream. To overcome\u0000this problem, we propose instead multiple reference animation as a robust\u0000approach to minimizing reconstruction drift, especially when used in a\u0000bi-directional prediction mode. Further, we propose a contrastive learning\u0000formulation for multi-reference animation. We observe that using a contrastive\u0000learning framework enhances the representation capabilities of the animation\u0000generator. The resulting framework, MRDAC (Multi-Reference Deep Animation\u0000Codec) can therefore be used to compress longer sequences with fewer reference\u0000frames or achieve a significant gain in reconstruction accuracy at comparable\u0000bitrates to previous frameworks. Quantitative and qualitative results show\u0000significant coding and reconstruction quality gains compared to previous GFVC\u0000methods, and more accurate animation quality in presence of large pose and\u0000facial expression changes.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content, including text and images, multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD) that learns joint stance representations from the textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.
{"title":"Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model","authors":"Fuqiang Niu, Zebang Cheng, Xianghua Fu, Xiaojiang Peng, Genan Dai, Yin Chen, Hu Huang, Bowen Zhang","doi":"arxiv-2409.00597","DOIUrl":"https://doi.org/arxiv-2409.00597","url":null,"abstract":"Stance detection, which aims to identify public opinion towards specific\u0000targets using social media data, is an important yet challenging task. With the\u0000proliferation of diverse multimodal social media content including text, and\u0000images multimodal stance detection (MSD) has become a crucial research area.\u0000However, existing MSD studies have focused on modeling stance within individual\u0000text-image pairs, overlooking the multi-party conversational contexts that\u0000naturally occur on social media. This limitation stems from a lack of datasets\u0000that authentically capture such conversational scenarios, hindering progress in\u0000conversational MSD. To address this, we introduce a new multimodal multi-turn\u0000conversational stance detection dataset (called MmMtCSD). To derive stances\u0000from this challenging dataset, we propose a novel multimodal large language\u0000model stance detection framework (MLLM-SD), that learns joint stance\u0000representations from textual and visual modalities. Experiments on MmMtCSD show\u0000state-of-the-art performance of our proposed MLLM-SD approach for multimodal\u0000stance detection. We believe that MmMtCSD will contribute to advancing\u0000real-world applications of stance detection research.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Senthil Kumar Jagatheesaperumal, Praveen Sathikumar, Harikrishnan Rajan
The way we engage with digital spaces and the digital world has undergone rapid changes in recent years, largely due to the emergence of the Metaverse. As technology continues to advance, the demand for sophisticated and immersive interfaces to interact with the Metaverse has become increasingly crucial. Haptic interfaces have been developed to meet this need and provide users with tactile feedback and realistic touch sensations. These interfaces play a vital role in creating a more authentic and immersive experience within the Metaverse. This article introduces the concept of MetaDigiHuman, a groundbreaking framework that combines blended digital humans and haptic interfaces. By harnessing cutting-edge technologies, MetaDigiHuman enables seamless and immersive interaction within the Metaverse. Through this framework, users can simulate the sensation of touching, feeling, and interacting with digital beings as if they were physically present in the environments, offering a more compelling and immersive experience within the Metaverse.
{"title":"MetaDigiHuman: Haptic Interfaces for Digital Humans in Metaverse","authors":"Senthil Kumar Jagatheesaperumal, Praveen Sathikumar, Harikrishnan Rajan","doi":"arxiv-2409.00615","DOIUrl":"https://doi.org/arxiv-2409.00615","url":null,"abstract":"The way we engage with digital spaces and the digital world has undergone\u0000rapid changes in recent years, largely due to the emergence of the Metaverse.\u0000As technology continues to advance, the demand for sophisticated and immersive\u0000interfaces to interact with the Metaverse has become increasingly crucial.\u0000Haptic interfaces have been developed to meet this need and provide users with\u0000tactile feedback and realistic touch sensations. These interfaces play a vital\u0000role in creating a more authentic and immersive experience within the\u0000Metaverse. This article introduces the concept of MetaDigiHuman, a\u0000groundbreaking framework that combines blended digital humans and haptic\u0000interfaces. By harnessing cutting-edge technologies, MetaDigiHuman enables\u0000seamless and immersive interaction within the Metaverse. Through this\u0000framework, users can simulate the sensation of touching, feeling, and\u0000interacting with digital beings as if they were physically present in the\u0000environments, offering a more compelling and immersive experience within the\u0000Metaverse.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zijie Xin, Minquan Wang, Ye Ma, Bo Wang, Quan Chen, Peng Jiang, Xirong Li
Adding proper background music helps make a short video ready to share. Towards automating this task, previous research focuses on video-to-music retrieval (VMR), which aims to find, within a music collection, the track that best matches the content of a given video. Since music tracks are typically much longer than short videos, the returned music has to be cut to a shorter moment, leaving a clear gap between the practical need and VMR. To bridge this gap, we propose video to music moment retrieval (VMMR) as a new task. To tackle the new task, we build a comprehensive dataset, Ad-Moment, which contains 50K short videos annotated with music moments, and develop a two-stage approach. In particular, given a test video, the most similar music is first retrieved from a given collection. Then, Transformer-based music moment localization is performed. We term this approach Retrieval and Localization (ReaL). Extensive experiments on real-world datasets verify the effectiveness of the proposed method for VMMR.
{"title":"Video to Music Moment Retrieval","authors":"Zijie Xin, Minquan Wang, Ye Ma, Bo Wang, Quan Chen, Peng Jiang, Xirong Li","doi":"arxiv-2408.16990","DOIUrl":"https://doi.org/arxiv-2408.16990","url":null,"abstract":"Adding proper background music helps complete a short video to be shared.\u0000Towards automating the task, previous research focuses on video-to-music\u0000retrieval (VMR), aiming to find amidst a collection of music the one best\u0000matching the content of a given video. Since music tracks are typically much\u0000longer than short videos, meaning the returned music has to be cut to a shorter\u0000moment, there is a clear gap between the practical need and VMR. In order to\u0000bridge the gap, we propose in this paper video to music moment retrieval (VMMR)\u0000as a new task. To tackle the new task, we build a comprehensive dataset\u0000Ad-Moment which contains 50K short videos annotated with music moments and\u0000develop a two-stage approach. In particular, given a test video, the most\u0000similar music is retrieved from a given collection. Then, a Transformer based\u0000music moment localization is performed. We term this approach Retrieval and\u0000Localization (ReaL). Extensive experiments on real-world datasets verify the\u0000effectiveness of the proposed method for VMMR.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}