Traditional in-the-wild image quality assessment (IQA) models are generally trained with mean opinion score (MOS) quality labels alone, missing the rich subjective quality information contained in the raw quality ratings, for example, the standard deviation of opinion scores (SOS) or even the distribution of opinion scores (DOS). In this paper, we propose a novel IQA method named RichIQA that explores the rich subjective rating information beyond MOS to predict image quality in the wild. RichIQA is characterized by two key novel designs: (1) a three-stage image quality prediction network which exploits the powerful feature representation capability of the Convolutional vision Transformer (CvT) and mimics the short-term and long-term memory mechanisms of the human brain; (2) a multi-label training strategy in which rich subjective quality information such as MOS, SOS, and DOS is used concurrently to train the quality prediction network. Powered by these two designs, RichIQA predicts image quality as a distribution, from which the mean image quality can subsequently be obtained. Extensive experimental results verify that the three-stage network is tailored to predicting rich quality information, while the multi-label training strategy fully exploits the potential within subjective quality ratings and enhances the prediction performance and generalizability of the network. RichIQA outperforms state-of-the-art competitors on multiple large-scale in-the-wild IQA databases with rich subjective rating labels. The code of RichIQA will be made publicly available on GitHub.
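As a rough illustration of how a predicted distribution of opinion scores yields both MOS and SOS, and of how the multi-label idea could be expressed as a combined objective, here is a minimal NumPy sketch; the five-point rating scale, bin values, EMD-style distance, and loss weights are assumptions for the example, not details taken from the paper.

```python
import numpy as np

RATING_BINS = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # assumed 5-point rating scale

def mos_sos_from_dos(dos: np.ndarray) -> tuple[float, float]:
    """Derive MOS (mean) and SOS (standard deviation) from a DOS histogram."""
    dos = dos / dos.sum()                    # normalize to a probability distribution
    mos = float((dos * RATING_BINS).sum())   # mean opinion score
    sos = float(np.sqrt((dos * (RATING_BINS - mos) ** 2).sum()))  # spread of opinions
    return mos, sos

def multi_label_loss(pred_dos, true_dos, w_mos=1.0, w_sos=1.0, w_dos=1.0):
    """Toy multi-label objective: supervise DOS, MOS, and SOS jointly."""
    pred_mos, pred_sos = mos_sos_from_dos(pred_dos)
    true_mos, true_sos = mos_sos_from_dos(true_dos)
    dos_term = np.sum((np.cumsum(pred_dos / pred_dos.sum())
                       - np.cumsum(true_dos / true_dos.sum())) ** 2)  # EMD-style distance
    return (w_dos * dos_term
            + w_mos * (pred_mos - true_mos) ** 2
            + w_sos * (pred_sos - true_sos) ** 2)

# Example: a predicted and a ground-truth rating histogram
print(multi_label_loss(np.array([0.1, 0.2, 0.4, 0.2, 0.1]),
                       np.array([0.05, 0.15, 0.4, 0.3, 0.1])))
```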
{"title":"Exploring Rich Subjective Quality Information for Image Quality Assessment in the Wild","authors":"Xiongkuo Min, Yixuan Gao, Yuqin Cao, Guangtao Zhai, Wenjun Zhang, Huifang Sun, Chang Wen Chen","doi":"arxiv-2409.05540","DOIUrl":"https://doi.org/arxiv-2409.05540","url":null,"abstract":"Traditional in the wild image quality assessment (IQA) models are generally\u0000trained with the quality labels of mean opinion score (MOS), while missing the\u0000rich subjective quality information contained in the quality ratings, for\u0000example, the standard deviation of opinion scores (SOS) or even distribution of\u0000opinion scores (DOS). In this paper, we propose a novel IQA method named\u0000RichIQA to explore the rich subjective rating information beyond MOS to predict\u0000image quality in the wild. RichIQA is characterized by two key novel designs:\u0000(1) a three-stage image quality prediction network which exploits the powerful\u0000feature representation capability of the Convolutional vision Transformer (CvT)\u0000and mimics the short-term and long-term memory mechanisms of human brain; (2) a\u0000multi-label training strategy in which rich subjective quality information like\u0000MOS, SOS and DOS are concurrently used to train the quality prediction network.\u0000Powered by these two novel designs, RichIQA is able to predict the image\u0000quality in terms of a distribution, from which the mean image quality can be\u0000subsequently obtained. Extensive experimental results verify that the\u0000three-stage network is tailored to predict rich quality information, while the\u0000multi-label training strategy can fully exploit the potentials within\u0000subjective quality rating and enhance the prediction performance and\u0000generalizability of the network. RichIQA outperforms state-of-the-art\u0000competitors on multiple large-scale in the wild IQA databases with rich\u0000subjective rating labels. The code of RichIQA will be made publicly available\u0000on GitHub.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the great success achieved by recent deep models in many image recognition tasks, directly applying them to recognize low-resolution images may suffer from low accuracy due to the loss of informative details during resolution degradation. However, these images are still recognizable to subjects who are familiar with the corresponding high-resolution ones. Inspired by this, we propose a teacher-student learning approach to facilitate low-resolution image recognition via hybrid order relational knowledge distillation. The approach involves three streams: the teacher stream is pretrained to recognize high-resolution images with high accuracy, the student stream learns to identify low-resolution images by mimicking the teacher's behaviors, and an extra assistant stream is introduced as a bridge to help transfer knowledge from the teacher to the student. To extract sufficient knowledge for reducing the loss in accuracy, the learning of the student is supervised with multiple losses, which preserve the similarities in various order relational structures. In this way, the capability of recovering missing details of familiar low-resolution images can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on metric learning, low-resolution image classification, and low-resolution face recognition tasks show the effectiveness of our approach while using reduced models.
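To make the relational-distillation idea concrete, the sketch below matches the pairwise (second-order) similarity structure of student embeddings to that of teacher embeddings. It covers only one relational term, not the paper's full hybrid-order losses or assistant stream, and the embedding dimensions and batch size are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_relation(embeddings: torch.Tensor) -> torch.Tensor:
    """Second-order relational structure: cosine similarity between every pair in the batch."""
    z = F.normalize(embeddings, dim=1)
    return z @ z.t()

def relational_distillation_loss(student_emb, teacher_emb):
    """Match the student's pairwise relation matrix to the teacher's (teacher is frozen)."""
    return F.mse_loss(pairwise_relation(student_emb),
                      pairwise_relation(teacher_emb).detach())

# Toy batch: teacher embeddings from high-resolution inputs,
# student embeddings from the corresponding low-resolution inputs.
teacher_emb = torch.randn(8, 128)
student_emb = torch.randn(8, 128, requires_grad=True)
loss = relational_distillation_loss(student_emb, teacher_emb)
loss.backward()
print(loss.item())
```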
{"title":"Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition","authors":"Shiming Ge, Kangkai Zhang, Haolin Liu, Yingying Hua, Shengwei Zhao, Xin Jin, Hao Wen","doi":"arxiv-2409.05384","DOIUrl":"https://doi.org/arxiv-2409.05384","url":null,"abstract":"In spite of great success in many image recognition tasks achieved by recent\u0000deep models, directly applying them to recognize low-resolution images may\u0000suffer from low accuracy due to the missing of informative details during\u0000resolution degradation. However, these images are still recognizable for\u0000subjects who are familiar with the corresponding high-resolution ones. Inspired\u0000by that, we propose a teacher-student learning approach to facilitate\u0000low-resolution image recognition via hybrid order relational knowledge\u0000distillation. The approach refers to three streams: the teacher stream is\u0000pretrained to recognize high-resolution images in high accuracy, the student\u0000stream is learned to identify low-resolution images by mimicking the teacher's\u0000behaviors, and the extra assistant stream is introduced as bridge to help\u0000knowledge transfer across the teacher to the student. To extract sufficient\u0000knowledge for reducing the loss in accuracy, the learning of student is\u0000supervised with multiple losses, which preserves the similarities in various\u0000order relational structures. In this way, the capability of recovering missing\u0000details of familiar low-resolution images can be effectively enhanced, leading\u0000to a better knowledge transfer. Extensive experiments on metric learning,\u0000low-resolution image classification and low-resolution face recognition tasks\u0000show the effectiveness of our approach, while taking reduced models.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the soaring popularity of video applications and the consequent rise in video traffic on the Internet, technologies like HTTP Adaptive Streaming (HAS) are crucial for delivering high Quality of Experience (QoE) to consumers. HAS technology enables video players on consumer devices to enhance viewer engagement by dynamically adapting video content quality based on network conditions. This is especially relevant for consumer electronics, as it ensures an optimized viewing experience across a variety of devices, from smartphones to smart TVs. This paper introduces REVISION, an efficient roadmap designed to enhance adaptive video streaming, a core feature of modern consumer electronics. The REVISION optimization triangle highlights three essential aspects for improving streaming: Objective, Input Space, and Action Domain. Additionally, REVISION proposes a novel layer-based architecture tailored to refining video streaming systems, comprising Application, Control and Management, and Resource layers. Each layer is designed to optimize different components of the streaming process, which is directly linked to the performance and efficiency of consumer devices. By adopting the principles of REVISION, manufacturers and developers can significantly improve the streaming capabilities of consumer electronics, thereby enriching the consumer's multimedia experience and accommodating the increasing demand for high-quality, real-time video content. This approach addresses the complexities of today's diverse video streaming ecosystem and paves the way for future advancements in consumer technology.
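The abstract's starting point, a HAS player adapting quality to network conditions, can be pictured with a deliberately simplified throughput-based bitrate selection rule; the bitrate ladder, the 0.8 safety margin, and the function name below are illustrative assumptions and not part of REVISION itself.

```python
# Illustrative throughput-based adaptation rule for an HTTP Adaptive Streaming player.
# The bitrate ladder and the safety margin are assumptions for this sketch.
BITRATE_LADDER_KBPS = [400, 1000, 2500, 5000, 8000]

def select_bitrate(measured_throughput_kbps: float, safety_margin: float = 0.8) -> int:
    """Pick the highest rendition whose bitrate fits under the discounted throughput estimate."""
    budget = measured_throughput_kbps * safety_margin
    feasible = [b for b in BITRATE_LADDER_KBPS if b <= budget]
    return feasible[-1] if feasible else BITRATE_LADDER_KBPS[0]

# Example: throughput estimates (kbps) observed over consecutive segments
for throughput in (600, 3200, 9000, 1800):
    print(throughput, "->", select_bitrate(throughput), "kbps")
```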
{"title":"REVISION: A Roadmap on Adaptive Video Streaming Optimization","authors":"Farzad Tashtarian, Christian Timmerer","doi":"arxiv-2409.06051","DOIUrl":"https://doi.org/arxiv-2409.06051","url":null,"abstract":"Due to the soaring popularity of video applications and the consequent rise\u0000in video traffic on the Internet, technologies like HTTP Adaptive Streaming\u0000(HAS) are crucial for delivering high Quality of Experience (QoE) to consumers.\u0000HAS technology enables video players on consumer devices to enhance viewer\u0000engagement by dynamically adapting video content quality based on network\u0000conditions. This is especially relevant for consumer electronics as it ensures\u0000an optimized viewing experience across a variety of devices, from smartphones\u0000to smart TVs. This paper introduces REVISION, an efficient roadmap designed to\u0000enhance adaptive video streaming, a core feature of modern consumer\u0000electronics. The REVISION optimization triangle highlights three essential\u0000aspects for improving streaming: Objective, Input Space, and Action Domain.\u0000Additionally, REVISION proposes a novel layer-based architecture tailored to\u0000refine video streaming systems, comprising Application, Control and Management,\u0000and Resource layers. Each layer is designed to optimize different components of\u0000the streaming process, which is directly linked to the performance and\u0000efficiency of consumer devices. By adopting the principles of the REVISION,\u0000manufacturers and developers can significantly improve the streaming\u0000capabilities of consumer electronics, thereby enriching the consumer's\u0000multimedia experience and accommodating the increasing demand for high-quality,\u0000real-time video content. This approach addresses the complexities of today's\u0000diverse video streaming ecosystem and paves the way for future advancements in\u0000consumer technology.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, images, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large-model era, which has also witnessed surveys on multimodal learning and vision-language models with transformers published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, and it serves as a timely complement on multimodal fusion to existing reviews. To help readers quickly track this field, we build a project page for this survey, which can be found at https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.
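For readers new to the topic, the core setup of image-text composite retrieval (query = reference image + modification text) can be sketched as fusing two embeddings and ranking a gallery by cosine similarity; the additive fusion, random embeddings, and dimensions here are simplifying assumptions used only to illustrate the task, not a method surveyed in the paper.

```python
import torch
import torch.nn.functional as F

def composite_query(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Naive composition: add the normalized reference-image and modifier-text embeddings."""
    return F.normalize(F.normalize(image_emb, dim=-1) + F.normalize(text_emb, dim=-1), dim=-1)

def rank_gallery(query_emb: torch.Tensor, gallery_embs: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted by cosine similarity to the composite query."""
    sims = F.normalize(gallery_embs, dim=-1) @ query_emb
    return torch.argsort(sims, descending=True)

# Toy example with random 512-d embeddings standing in for encoder outputs.
image_emb, text_emb = torch.randn(512), torch.randn(512)
gallery = torch.randn(1000, 512)
print(rank_gallery(composite_query(image_emb, text_emb), gallery)[:5])
```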
{"title":"A Survey of Multimodal Composite Editing and Retrieval","authors":"Suyan Li, Fuxiang Huang, Lei Zhang","doi":"arxiv-2409.05405","DOIUrl":"https://doi.org/arxiv-2409.05405","url":null,"abstract":"In the real world, where information is abundant and diverse across different\u0000modalities, understanding and utilizing various data types to improve retrieval\u0000systems is a key focus of research. Multimodal composite retrieval integrates\u0000diverse modalities such as text, image and audio, etc. to provide more\u0000accurate, personalized, and contextually relevant results. To facilitate a\u0000deeper understanding of this promising direction, this survey explores\u0000multimodal composite editing and retrieval in depth, covering image-text\u0000composite editing, image-text composite retrieval, and other multimodal\u0000composite retrieval. In this survey, we systematically organize the application\u0000scenarios, methods, benchmarks, experiments, and future directions. Multimodal\u0000learning is a hot topic in large model era, and have also witnessed some\u0000surveys in multimodal learning and vision-language models with transformers\u0000published in the PAMI journal. To the best of our knowledge, this survey is the\u0000first comprehensive review of the literature on multimodal composite retrieval,\u0000which is a timely complement of multimodal fusion to existing reviews. To help\u0000readers' quickly track this field, we build the project page for this survey,\u0000which can be found at\u0000https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images based on unique subjects. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues the specific image's irrelevant attributes (e.g., view, pose, and background) as the subject's intrinsic attributes. This misconstruction leads to both overfitting and underfitting of irrelevant and intrinsic attributes of the subject, i.e., these attributes are over-represented or under-represented simultaneously, causing a trade-off between similarity and controllability. In this study, we argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., by decoupling subject intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus more on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinguished). Specifically, we propose CustomContrast, a novel framework, which includes a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is used to extract intrinsic features of subjects from high-level semantics to low-level appearance through cross-modal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.
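The intra-consistency / inter-distinctiveness idea maps naturally onto a standard InfoNCE-style contrastive objective, where features of the same subject act as positives and features of different subjects as negatives. The sketch below is that generic objective, not CustomContrast's multilevel formulation; the temperature, feature dimension, and batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def subject_contrastive_loss(features: torch.Tensor,
                             subject_ids: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: pull features of the same subject together (intra-consistency)
    and push features of different subjects apart (inter-distinctiveness)."""
    z = F.normalize(features, dim=1)
    eye = torch.eye(len(z), dtype=torch.bool)
    logits = (z @ z.t() / temperature).masked_fill(eye, float("-inf"))  # drop self-pairs
    positives = (subject_ids.unsqueeze(0) == subject_ids.unsqueeze(1)) & ~eye
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_sample = log_prob.masked_fill(~positives, 0.0).sum(1) / positives.sum(1).clamp(min=1)
    return -per_sample.mean()

# Toy batch: six feature vectors from three subjects (two views each).
features = torch.randn(6, 256, requires_grad=True)
subject_ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(subject_contrastive_loss(features, subject_ids).item())
```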
{"title":"CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization","authors":"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao","doi":"arxiv-2409.05606","DOIUrl":"https://doi.org/arxiv-2409.05606","url":null,"abstract":"Subject-driven text-to-image (T2I) customization has drawn significant\u0000interest in academia and industry. This task enables pre-trained models to\u0000generate novel images based on unique subjects. Existing studies adopt a\u0000self-reconstructive perspective, focusing on capturing all details of a single\u0000image, which will misconstrue the specific image's irrelevant attributes (e.g.,\u0000view, pose, and background) as the subject intrinsic attributes. This\u0000misconstruction leads to both overfitting or underfitting of irrelevant and\u0000intrinsic attributes of the subject, i.e., these attributes are\u0000over-represented or under-represented simultaneously, causing a trade-off\u0000between similarity and controllability. In this study, we argue an ideal\u0000subject representation can be achieved by a cross-differential perspective,\u0000i.e., decoupling subject intrinsic attributes from irrelevant attributes via\u0000contrastive learning, which allows the model to focus more on intrinsic\u0000attributes through intra-consistency (features of the same subject are\u0000spatially closer) and inter-distinctiveness (features of different subjects\u0000have distinguished differences). Specifically, we propose CustomContrast, a\u0000novel framework, which includes a Multilevel Contrastive Learning (MCL)\u0000paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\u0000used to extract intrinsic features of subjects from high-level semantics to\u0000low-level appearance through crossmodal semantic contrastive learning and\u0000multiscale appearance contrastive learning. To facilitate contrastive learning,\u0000we introduce the MFI encoder to capture cross-modal representations. Extensive\u0000experiments show the effectiveness of CustomContrast in subject similarity and\u0000text controllability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-driven talking face generation is a widely researched topic due to its high applicability. Reconstructing a talking face from audio contributes significantly to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Early studies often focused solely on changing the mouth movements, which resulted in outcomes with limited practical applications. Recently, researchers have proposed a new approach of constructing the entire face, including face pose, neck, and shoulders. To achieve this, they need to generate facial landmarks as an intermediate representation. However, creating stable landmarks that align well with the audio is a challenge. In this paper, we propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio. We separate the audio into two distinct domains to learn emotional information and facial context, then use a fusion mechanism based on the KAN model. Our model demonstrates high efficiency compared to recent models. This will lay the groundwork for the development of the audio-driven talking face generation problem in the future.
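To make the "two audio domains plus a fusion mechanism" pipeline concrete, here is a heavily simplified sketch in which two encoders process the same audio features and a small learned gate merges them before a landmark regressor; the gate is only a stand-in for the paper's KAN-based fusion, and all layer sizes, the 68-landmark output, and module names are assumptions.

```python
import torch
import torch.nn as nn

class DualDomainLandmarkNet(nn.Module):
    """Two parallel audio encoders (e.g., emotion vs. facial context) merged by a learned gate.
    The gate is a simple stand-in for a KAN-based fusion mechanism."""
    def __init__(self, audio_dim: int = 80, hidden: int = 128, n_landmarks: int = 68):
        super().__init__()
        self.emotion_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.context_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())
        self.regressor = nn.Linear(hidden, n_landmarks * 2)   # (x, y) per landmark

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        e = self.emotion_enc(audio_feat)
        c = self.context_enc(audio_feat)
        g = self.gate(torch.cat([e, c], dim=-1))
        fused = g * e + (1.0 - g) * c                          # gated fusion of the two domains
        return self.regressor(fused).view(*audio_feat.shape[:-1], -1, 2)

model = DualDomainLandmarkNet()
mel_frames = torch.randn(4, 25, 80)   # assumed batch of 25 mel-spectrogram frames per clip
print(model(mel_frames).shape)        # -> torch.Size([4, 25, 68, 2])
```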
{"title":"KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation","authors":"Hoang-Son Vo-Thanh, Quang-Vinh Nguyen, Soo-Hyung Kim","doi":"arxiv-2409.05330","DOIUrl":"https://doi.org/arxiv-2409.05330","url":null,"abstract":"Audio-driven talking face generation is a widely researched topic due to its\u0000high applicability. Reconstructing a talking face using audio significantly\u0000contributes to fields such as education, healthcare, online conversations,\u0000virtual assistants, and virtual reality. Early studies often focused solely on\u0000changing the mouth movements, which resulted in outcomes with limited practical\u0000applications. Recently, researchers have proposed a new approach of\u0000constructing the entire face, including face pose, neck, and shoulders. To\u0000achieve this, they need to generate through landmarks. However, creating stable\u0000landmarks that align well with the audio is a challenge. In this paper, we\u0000propose the KFusion of Dual-Domain model, a robust model that generates\u0000landmarks from audio. We separate the audio into two distinct domains to learn\u0000emotional information and facial context, then use a fusion mechanism based on\u0000the KAN model. Our model demonstrates high efficiency compared to recent\u0000models. This will lay the groundwork for the development of the audio-driven\u0000talking face generation problem in the future.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memes are an increasingly prevalent element of online discourse in social networks, especially among young audiences. They carry ideas and messages that range from humorous to hateful, and are widely consumed. Their potentially high impact requires adequate means of control to moderate their use at large scale. In this work, we propose SimCLIP, a deep learning-based architecture for cross-modal understanding of memes, leveraging a pre-trained CLIP encoder to produce context-aware embeddings and a Siamese fusion technique to capture the interactions between text and image. We perform extensive experimentation on seven meme classification tasks across six datasets. We establish a new state of the art on Memotion7k with a 7.25% relative F1-score improvement, and achieve super-human performance on Harm-P with a 13.73% F1-score improvement. Our approach demonstrates the potential of compact meme classification models, enabling accurate and efficient meme monitoring. We share our code at https://github.com/jahuerta92/meme-classification-simclip.
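One way to picture the "CLIP encoder plus Siamese fusion" design is a small head that takes the already-extracted image and text embeddings from a shared CLIP backbone and combines them before classification. The sketch below assumes the embeddings are given as tensors; the fusion via element-wise product and absolute difference and all layer sizes are illustrative guesses, not SimCLIP's published architecture.

```python
import torch
import torch.nn as nn

class SiameseFusionHead(nn.Module):
    """Classify a meme from CLIP image/text embeddings produced by a shared (Siamese) encoder.
    Fusion: concatenate both embeddings with their product and absolute difference."""
    def __init__(self, emb_dim: int = 512, n_classes: int = 2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(4 * emb_dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_emb, txt_emb, img_emb * txt_emb, (img_emb - txt_emb).abs()], dim=-1)
        return self.classifier(fused)

# Toy usage with random embeddings standing in for CLIP outputs.
head = SiameseFusionHead()
logits = head(torch.randn(8, 512), torch.randn(8, 512))
print(logits.shape)  # -> torch.Size([8, 2])
```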
{"title":"A CLIP-based siamese approach for meme classification","authors":"Javier Huertas-Tato, Christos Koutlis, Symeon Papadopoulos, David Camacho, Ioannis Kompatsiaris","doi":"arxiv-2409.05772","DOIUrl":"https://doi.org/arxiv-2409.05772","url":null,"abstract":"Memes are an increasingly prevalent element of online discourse in social\u0000networks, especially among young audiences. They carry ideas and messages that\u0000range from humorous to hateful, and are widely consumed. Their potentially high\u0000impact requires adequate means of control to moderate their use in large scale.\u0000In this work, we propose SimCLIP a deep learning-based architecture for\u0000cross-modal understanding of memes, leveraging a pre-trained CLIP encoder to\u0000produce context-aware embeddings and a Siamese fusion technique to capture the\u0000interactions between text and image. We perform an extensive experimentation on\u0000seven meme classification tasks across six datasets. We establish a new state\u0000of the art in Memotion7k with a 7.25% relative F1-score improvement, and\u0000achieve super-human performance on Harm-P with 13.73% F1-Score improvement. Our\u0000approach demonstrates the potential for compact meme classification models,\u0000enabling accurate and efficient meme monitoring. We share our code at\u0000https://github.com/jahuerta92/meme-classification-simclip","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtual field trips (VFTs) have proven to be valuable learning tools. Such applications are mostly based on 360° technology and are, in technological terms, single-user applications. In contrast, Social VR applications are characterized by multi-user capability and user-specific avatars. From a learning perspective, the concepts of collaborative learning and embodiment have long been proposed as conducive to learning. Both concepts might be supported using Social VR. However, little is currently known about the use of Social VR for VFTs. Accordingly, the research questions are to what extent VFTs can be implemented in Social VR environments and how these Social VR-based VFTs are perceived by learners. This article presents a study on the development and evaluation of a VFT environment using the Social VR platform Mozilla Hubs. It describes the design decisions made to create the environment and the results of a mixed-method evaluation (N=16) using a questionnaire and focus group discussions. The study highlighted the opportunities offered by Social VR-based VFTs but also revealed several challenges that need to be addressed to realize the potential of Social VR-based VFTs for regular use in education.
{"title":"Educational Virtual Field Trips based on Social VR and 360° Spaces","authors":"Surya Kalvakolu, Heinrich Söbke, Jannicke Baalsrud Hauge, Eckhard Kraft","doi":"arxiv-2409.05496","DOIUrl":"https://doi.org/arxiv-2409.05496","url":null,"abstract":"Virtual field trips (VFTs) have proven to be valuable learning tools. Such\u0000applications are mostly based on 360{deg} technology and are to be\u0000characterized as single-user applications in technological terms. In contrast,\u0000Social VR applications are characterized by multi-user capability and\u0000user-specific avatars. From a learning perspective, the concepts of\u0000collaborative learning and embodiment have long been proposed as conducive to\u0000learning. Both concepts might be supported using Social VR. However, little is\u0000currently known about the use of Social VR for VFTs. Accordingly, the research\u0000questions are to what extent VFTs can be implemented in Social VR environments\u0000and how these Social VR-based VFTs are perceived by learners. This article\u0000presents an evaluation study on the development and evaluation of a VFT\u0000environment using the Social VR platform Mozilla Hubs. It describes the design\u0000decisions to create the environment and evaluation results from a mixed-method\u0000study (N=16) using a questionnaire and focus group discussions. The study\u0000highlighted the opportunities offered by Social VR-based VFTs but also revealed\u0000several challenges that need to be addressed to embrace the potential of Social\u0000VR-based VFTs to be utilized regularly in education.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update its weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, generated from the multi-modal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate the lightweight design and efficiency of our method. Our source code is available at https://github.com/Mr-Bigworth/MMCA.
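A rough way to read "weighting coefficients that reorganize the weight update matrices" is a LoRA-style layer whose low-rank update is rescaled by coefficients predicted from the fused multi-modal embedding. The sketch below follows that reading only; the rank, dimensions, and module names are assumptions rather than MMCA's published design.

```python
import torch
import torch.nn as nn

class ConditionallyAdaptedLinear(nn.Module):
    """A frozen linear layer plus a low-rank update whose components are rescaled by
    coefficients predicted from a multi-modal embedding (a LoRA-style reading of MMCA)."""
    def __init__(self, dim: int = 256, mm_dim: int = 512, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():           # stands in for a pretrained, frozen layer
            p.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.coeff_head = nn.Linear(mm_dim, rank)  # weighting coefficients from the fused embedding

    def forward(self, visual_tokens: torch.Tensor, mm_embedding: torch.Tensor) -> torch.Tensor:
        coeff = torch.sigmoid(self.coeff_head(mm_embedding))           # (batch, rank)
        update = self.up(self.down(visual_tokens) * coeff.unsqueeze(1))
        return self.base(visual_tokens) + update

layer = ConditionallyAdaptedLinear()
out = layer(torch.randn(2, 196, 256), torch.randn(2, 512))             # 196 visual tokens per image
print(out.shape)  # -> torch.Size([2, 196, 256])
```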
{"title":"Visual Grounding with Multi-modal Conditional Adaptation","authors":"Ruilin Yao, Shengwu Xiong, Yichen Zhao, Yi Rong","doi":"arxiv-2409.04999","DOIUrl":"https://doi.org/arxiv-2409.04999","url":null,"abstract":"Visual grounding is the task of locating objects specified by natural\u0000language expressions. Existing methods extend generic object detection\u0000frameworks to tackle this task. They typically extract visual and textual\u0000features separately using independent visual and textual encoders, then fuse\u0000these features in a multi-modal decoder for final prediction. However, visual\u0000grounding presents unique challenges. It often involves locating objects with\u0000different text descriptions within the same image. Existing methods struggle\u0000with this task because the independent visual encoder produces identical visual\u0000features for the same image, limiting detection performance. Some recently\u0000approaches propose various language-guided visual encoders to address this\u0000issue, but they mostly rely solely on textual information and require\u0000sophisticated designs. In this paper, we introduce Multi-modal Conditional\u0000Adaptation (MMCA), which enables the visual encoder to adaptively update\u0000weights, directing its focus towards text-relevant regions. Specifically, we\u0000first integrate information from different modalities to obtain multi-modal\u0000embeddings. Then we utilize a set of weighting coefficients, which generated\u0000from the multimodal embeddings, to reorganize the weight update matrices and\u0000apply them to the visual encoder of the visual grounding model. Extensive\u0000experiments on four widely used datasets demonstrate that MMCA achieves\u0000significant improvements and state-of-the-art results. Ablation experiments\u0000further demonstrate the lightweight and efficiency of our method. Our source\u0000code is available at: https://github.com/Mr-Bigworth/MMCA.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we make the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest-perplexity data for training. This approach allowed us to train on a curated 1M dataset and achieve competitive performance. 3) During visual instruction tuning, we used model soup to merge models fine-tuned on different datasets when adding more datasets yielded only marginal improvements. These innovations resulted in a 9B-parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
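The "model soup" step mentioned in the third contribution averages the weights of several fine-tuned checkpoints instead of picking a single one; a minimal uniform-soup sketch follows, where the checkpoint paths and equal weighting are assumptions for illustration.

```python
import torch

def uniform_model_soup(state_dicts):
    """Average several fine-tuned checkpoints parameter-by-parameter (uniform model soup)."""
    soup = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        soup[key] = stacked.mean(dim=0)
    return soup

# Tiny demo with two toy "checkpoints".
sd_a = {"w": torch.tensor([1.0, 2.0])}
sd_b = {"w": torch.tensor([3.0, 4.0])}
print(uniform_model_soup([sd_a, sd_b]))  # {'w': tensor([2., 3.])}

# Hypothetical usage: checkpoints from instruction tuning on different dataset mixes.
# paths = ["ckpt_mix_a.pt", "ckpt_mix_b.pt", "ckpt_mix_c.pt"]
# soup_state = uniform_model_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup_state)
```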
{"title":"POINTS: Improving Your Vision-language Model with Affordable Strategies","authors":"Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou","doi":"arxiv-2409.04828","DOIUrl":"https://doi.org/arxiv-2409.04828","url":null,"abstract":"In recent years, vision-language models have made significant strides,\u0000excelling in tasks like optical character recognition and geometric\u0000problem-solving. However, several critical issues remain: 1) Proprietary models\u0000often lack transparency about their architectures, while open-source models\u0000need more detailed ablations of their training strategies. 2) Pre-training data\u0000in open-source works is under-explored, with datasets added empirically, making\u0000the process cumbersome. 3) Fine-tuning often focuses on adding datasets,\u0000leading to diminishing returns. To address these issues, we propose the\u0000following contributions: 1) We trained a robust baseline model using the latest\u0000advancements in vision-language models, introducing effective improvements and\u0000conducting comprehensive ablation and validation for each technique. 2)\u0000Inspired by recent work on large language models, we filtered pre-training data\u0000using perplexity, selecting the lowest perplexity data for training. This\u0000approach allowed us to train on a curated 1M dataset, achieving competitive\u0000performance. 3) During visual instruction tuning, we used model soup on\u0000different datasets when adding more datasets yielded marginal improvements.\u0000These innovations resulted in a 9B parameter model that performs competitively\u0000with state-of-the-art models. Our strategies are efficient and lightweight,\u0000making them easily adoptable by the community.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}