ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah
arXiv - CS - Multimedia, 2024-09-12. DOI: arxiv-2409.08206 (https://doi.org/arxiv-2409.08206)
Citations: 0
Abstract
Vision-language models (VLMs) such as CLIP have shown a remarkable ability to extract transferable features for downstream tasks. Nonetheless, these models are usually trained with a coarse-grained contrastive loss between the global embeddings of images and texts, which can discard the compositional structure of the two modalities. Many recent studies have shown that VLMs lack compositional understanding, such as binding attributes to objects and identifying relationships between objects. Although some recent methods attempt finer-grained alignment, they either do not extract meaningful components at the proper granularity or fail to properly exploit the correspondence between the modalities (especially for image-text pairs with many constituents).
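For context, the coarse-grained objective referred to here is the standard CLIP-style symmetric contrastive (InfoNCE) loss over one global embedding per image and per caption. A minimal PyTorch sketch of that baseline loss (illustrative background, not code from the paper):

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired global embeddings.

    image_emb, text_emb: (B, D) tensors, one global vector per image/caption.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) cosine-similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Because each image and caption is collapsed into a single vector before the loss is computed, nothing in this objective forces individual objects, attributes, or relations in the caption to line up with the corresponding image content.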
Addressing these limitations, we introduce Compositional Alignment (ComAlign), a fine-grained approach that discovers more exact correspondences between text and image components using only the weak supervision provided by image-text pairs. Our methodology emphasizes that the compositional structure (including entities and relations) extracted from the text modality must also be preserved in the image modality. To enforce correspondence of fine-grained concepts across the image and text modalities, we train a lightweight network on top of existing visual and language encoders using a small dataset. The network is trained to align the nodes and edges of this structure across the modalities. Experimental results on various VLMs and datasets demonstrate significant improvements on retrieval and compositional benchmarks, affirming the effectiveness of our plug-in model.
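The abstract does not give implementation details, so the sketch below is only a rough illustration of the general idea: a lightweight projection head on top of frozen encoders maps per-component features (image regions on one side, parsed text entities on the other) into a shared space, and an image-text pair is scored by matching each text node to its best-aligned image region. The class name `FineGrainedHead`, the max-over-regions matching score, and the loss shape are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedHead(nn.Module):
    """Hypothetical lightweight head on top of frozen VLM encoders (assumption,
    not the paper's architecture). It projects per-component features into a
    shared space where corresponding components can be compared."""

    def __init__(self, dim, proj_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(dim, proj_dim)   # maps image-region features
        self.txt_proj = nn.Linear(dim, proj_dim)   # maps text entity/relation features

    def pair_score(self, region_feats, text_node_feats):
        """Score one image-text pair from its component embeddings.

        region_feats:    (R, D) features of R image regions (frozen visual encoder).
        text_node_feats: (T, D) features of T parsed text nodes (frozen text encoder).
        """
        img = F.normalize(self.img_proj(region_feats), dim=-1)     # (R, P)
        txt = F.normalize(self.txt_proj(text_node_feats), dim=-1)  # (T, P)
        sim = txt @ img.t()                                        # (T, R)
        # Match each text node to its most similar region; averaging the best
        # matches gives a fine-grained image-text compatibility score.
        return sim.max(dim=1).values.mean()


def fine_grained_contrastive_loss(head, regions, text_nodes, temperature=0.07):
    """InfoNCE over pairwise fine-grained scores for a batch of B pairs.

    regions:    list of B tensors, each (R_i, D).
    text_nodes: list of B tensors, each (T_j, D).
    """
    B = len(regions)
    scores = torch.stack([
        torch.stack([head.pair_score(regions[i], text_nodes[j]) for j in range(B)])
        for i in range(B)
    ])  # (B, B) score matrix; the diagonal holds the matched pairs.
    logits = scores / temperature
    targets = torch.arange(B, device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

This sketch only models node-level (entity) matching; the paper additionally aligns relation (edge) structure across the modalities, which is not captured here.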