Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li
Contrastive learning has become one of the most impressive approaches for multi-modal representation learning. However, previous multi-modal work has mainly focused on cross-modal understanding and ignored in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding by joint in-modal and cross-modal contrastive learning. Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality. With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training. Finally, we combine the self-supervised Turbo with supervised multi-modal classification and demonstrate its effectiveness on two audio-text classification tasks, where it achieves state-of-the-art performance on a speech emotion recognition benchmark dataset.
{"title":"Turbo your multi-modal classification with contrastive learning","authors":"Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li","doi":"arxiv-2409.09282","DOIUrl":"https://doi.org/arxiv-2409.09282","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1"},"publicationDate":"2024-09-14","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142269110"}
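The Turbo abstract describes passing each audio-text pair through the model twice with different dropout masks and then forming in-modal and cross-modal contrastive objectives. The sketch below illustrates that idea under loud assumptions: dropout is applied directly to toy feature vectors (standing in for hidden dropout inside real encoders), the objective is a standard InfoNCE loss with in-batch negatives, and the function names, the temperature, and the choice of which view pairs are contrasted are all hypothetical rather than taken from the paper.

```python
import math
import random


def dropout(vec, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    the other items in the batch act as negatives."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def turbo_style_loss(audio, text, p=0.1, seed=0):
    """Two dropout 'views' per modality, then in-modal and cross-modal terms.

    Assumption: dropout on the input features stands in for the hidden
    dropout of a real encoder, and all four cross-modal view pairs are used.
    """
    rng = random.Random(seed)
    a1 = [dropout(v, p, rng) for v in audio]
    a2 = [dropout(v, p, rng) for v in audio]
    t1 = [dropout(v, p, rng) for v in text]
    t2 = [dropout(v, p, rng) for v in text]
    in_modal = info_nce(a1, a2) + info_nce(t1, t2)
    cross_modal = (info_nce(a1, t1) + info_nce(a1, t2)
                   + info_nce(a2, t1) + info_nce(a2, t2))
    return in_modal, cross_modal
```

In a full training setup these two terms would presumably be added, with some weighting, to the supervised classification loss; the abstract does not state the weighting, so it is left out here.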
Lei Yu, Jintao Fei, Xinyi Liu, Yang Yao, Jun Zhao, Guoxin Wang, Xin Li
Video-based physiology, exemplified by remote photoplethysmography (rPPG), extracts physiological signals such as pulse and respiration by analyzing subtle changes in video recordings. This non-contact, real-time monitoring method holds great potential for home settings. Despite the valuable contributions of public benchmark datasets to this technology, there is currently no dataset specifically designed for passive home monitoring. Existing datasets are often limited to close-up, static, frontal recordings and typically include only 1-2 physiological signals. To advance video-based physiology in real home settings, we introduce the MHAD dataset. It comprises 1,440 videos from 40 subjects, capturing 6 typical activities from 3 angles in a real home environment. Additionally, 5 physiological signals were recorded, making it a comprehensive video-based physiology dataset. MHAD is compatible with the rPPG-toolbox and has been validated using several unsupervised and supervised methods. Our dataset is publicly available at https://github.com/jdh-algo/MHAD-Dataset.
{"title":"MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals","authors":"Lei Yu, Jintao Fei, Xinyi Liu, Yang Yao, Jun Zhao, Guoxin Wang, Xin Li","doi":"arxiv-2409.09366","DOIUrl":"https://doi.org/arxiv-2409.09366","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1"},"publicationDate":"2024-09-14","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254050"}
Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang
In this paper, we study the problem of Text-to-Image Person Re-identification (TIReID), which aims to find images of the same identity described by a text sentence from a pool of candidate images. Benefiting from Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), TIReID techniques have achieved remarkable progress recently. However, most existing methods only focus on instance-level matching and ignore identity-level matching, which involves associating multiple images and texts belonging to the same person. We propose a novel prototypical prompting framework (Propot) designed to simultaneously model instance-level and identity-level matching for TIReID. Propot transforms the identity-level matching problem into a prototype learning problem, aiming to learn identity-enriched prototypes. Specifically, Propot works by 'initialize, adapt, enrich, then aggregate'. We first use CLIP to generate high-quality initial prototypes. Then, we propose a domain-conditional prototypical prompting (DPP) module to adapt the prototypes to the TIReID task using task-related information. Further, we propose an instance-conditional prototypical prompting (IPP) module to update prototypes conditioned on intra-modal and inter-modal instances to ensure prototype diversity. Finally, we design an adaptive prototype aggregation module to aggregate these prototypes, generating final identity-enriched prototypes. With the identity-enriched prototypes, we diffuse their rich identity information to instances through a prototype-to-instance contrastive loss to facilitate identity-level matching. Extensive experiments conducted on three benchmarks demonstrate the superiority of Propot compared to existing TIReID methods.
{"title":"Prototypical Prompting for Text-to-image Person Re-identification","authors":"Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang","doi":"arxiv-2409.09427","DOIUrl":"https://doi.org/arxiv-2409.09427","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1"},"publicationDate":"2024-09-14","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142269108"}
Students frequently make mistakes while solving mathematical problems, and traditional error correction methods are both time-consuming and labor-intensive. This paper introduces an innovative Virtual AI Teacher system designed to autonomously analyze and correct student Errors (VATE). Leveraging advanced large language models (LLMs), the system uses student drafts as a primary source for error analysis, which enhances understanding of the student's learning process. It incorporates sophisticated prompt engineering and maintains an error pool to reduce computational overhead. The AI-driven system also features a real-time dialogue component for efficient student interaction. Our approach demonstrates significant advantages over traditional and machine learning-based error correction methods, including reduced educational costs, high scalability, and superior generalizability. The system has been deployed on the Squirrel AI learning platform for elementary mathematics education, where it achieves 78.3% accuracy in error analysis and shows a marked improvement in student learning efficiency. Satisfaction surveys indicate a strong positive reception, highlighting the system's potential to transform educational practices.
{"title":"AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction","authors":"Tianlong Xu, Yi-Fan Zhang, Zhendong Chu, Shen Wang, Qingsong Wen","doi":"arxiv-2409.09403","DOIUrl":"https://doi.org/arxiv-2409.09403","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1"},"publicationDate":"2024-09-14","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254048"}
Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities impact performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to the best of our knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels; moreover, they exhibit a different trend compared to inherently synchronized modalities like lip movements; (3) Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
{"title":"Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?","authors":"Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill","doi":"arxiv-2409.09221","DOIUrl":"https://doi.org/arxiv-2409.09221","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"17 1"},"publicationDate":"2024-09-13","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254051"}
Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesizing an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty is that the diffusion process should not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.
{"title":"Improving Virtual Try-On with Garment-focused Diffusion Models","authors":"Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei","doi":"arxiv-2409.08258","DOIUrl":"https://doi.org/arxiv-2409.08258","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"12 1"},"publicationDate":"2024-09-12","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142224488"}
Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
{"title":"Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations","authors":"Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja","doi":"arxiv-2409.08381","DOIUrl":"https://doi.org/arxiv-2409.08381","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"201 1"},"publicationDate":"2024-09-12","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142254053"}
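The abstract above describes scoring class presence or absence by comparing an image feature against a positive and a negative embedding per class, with only some labels annotated. A minimal sketch of that scoring idea follows. It is an illustration only: the two-way softmax over cosine similarities, the temperature value, the masked binary cross-entropy, and all function names are assumptions, not the paper's actual PositiveCoOp or DualCoOp implementation (which operates on real CLIP features).

```python
import math


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def presence_probability(image_feat, pos_emb, neg_emb, temperature=0.05):
    """Two-way softmax over (positive, negative) similarities for one class.

    Equivalent to sigmoid((s_pos - s_neg) / temperature).
    """
    s_pos = cosine(image_feat, pos_emb) / temperature
    s_neg = cosine(image_feat, neg_emb) / temperature
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


def partial_label_loss(image_feat, pos_embs, neg_embs, labels):
    """Binary cross-entropy averaged over annotated classes only.

    labels[c] is 1 (present), 0 (absent), or None (unannotated, skipped).
    """
    total, count = 0.0, 0
    for pos, neg, y in zip(pos_embs, neg_embs, labels):
        if y is None:
            continue  # partial annotation: no supervision for this class
        p = presence_probability(image_feat, pos, neg)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp before taking the log
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
        count += 1
    return total / count if count else 0.0
```

In this framing, the variants the abstract compares would differ only in where `pos_embs` and `neg_embs` come from: the output of a text encoder applied to learned prompts, or vectors learned directly in the shared feature space.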
Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah
Vision-language models (VLMs) like CLIP have showcased a remarkable ability to extract transferable features for downstream tasks. Nonetheless, the training process of these models is usually based on a coarse-grained contrastive loss between the global embeddings of images and texts, which may lose the compositional structure of these modalities. Many recent studies have shown that VLMs lack compositional understanding, such as attribute binding and identifying object relationships. Although some recent methods have tried to achieve finer-level alignments, they either are not based on extracting meaningful components of proper granularity or don't properly utilize the modalities' correspondence (especially in image-text pairs with more ingredients). Addressing these limitations, we introduce Compositional Alignment (ComAlign), a fine-grained approach to discover more exact correspondence of text and image components using only the weak supervision in the form of image-text pairs. Our methodology emphasizes that the compositional structure (including entities and relations) extracted from the text modality must also be retained in the image modality. To enforce correspondence of fine-grained concepts in image and text modalities, we train a lightweight network lying on top of existing visual and language encoders using a small dataset. The network is trained to align nodes and edges of the structure across the modalities. Experimental results on various VLMs and datasets demonstrate significant improvements in retrieval and compositional benchmarks, affirming the effectiveness of our plugin model.
{"title":"ComAlign: Compositional Alignment in Vision-Language Models","authors":"Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah","doi":"arxiv-2409.08206","DOIUrl":"https://doi.org/arxiv-2409.08206","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1"},"publicationDate":"2024-09-12","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142187517"}
This paper presents MSMF (Multi-Scale Multi-Modal Fusion), a novel approach for enhanced stock market prediction. MSMF addresses key challenges in multi-modal stock analysis by integrating a modality completion encoder, multi-scale feature extraction, and an innovative fusion mechanism. Our model leverages blank learning and progressive fusion to balance complementarity and redundancy across modalities, while multi-scale alignment facilitates direct correlations between heterogeneous data types. We introduce Multi-Granularity Gates and a specialized architecture to optimize the integration of local and global information for different tasks. Additionally, a Task-targeted Prediction layer is employed to preserve both coarse and fine-grained features during fusion. Experimental results demonstrate that MSMF outperforms existing methods, achieving significant improvements in accuracy and reducing prediction errors across various stock market forecasting tasks. This research contributes valuable insights to the field of multi-modal financial analysis and offers a robust framework for enhanced market prediction.
{"title":"MSMF: Multi-Scale Multi-Modal Fusion for Enhanced Stock Market Prediction","authors":"Jiahao Qin","doi":"arxiv-2409.07855","DOIUrl":"https://doi.org/arxiv-2409.07855","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1"},"publicationDate":"2024-09-12","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142187518"}
Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei
Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is non-trivial in two respects: 1) relying solely on a single U-Net to align the text prompt and the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of the desired objects in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
{"title":"Improving Text-guided Object Inpainting with Semantic Pre-inpainting","authors":"Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei","doi":"arxiv-2409.08260","DOIUrl":"https://doi.org/arxiv-2409.08260","url":null,"PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}