An-An Liu, Long Yang, Wenhui Li, Weizhi Nie, Xianzhu Liu, Haipeng Chen
Title: Multi-level semantics probability embedding for image–text matching
Journal: Information Processing & Management, vol. 62, no. 2, Article 103968 (Impact Factor 7.4; JCR Q1, Computer Science, Information Systems)
Published: 2024-11-21
DOI: 10.1016/j.ipm.2024.103968
URL: https://www.sciencedirect.com/science/article/pii/S0306457324003273
Citations: 0
Abstract
Image–text matching aims to retrieve matching images or texts given textual or visual queries. However, it is inherently a many-to-many problem: an image can correspond to multiple levels of visual semantic scenes, which can be described by different texts, and a textual description can likewise be visualized through multiple visual scenes. This leads to ambiguity in the matching between images and texts. To better capture these matching relationships, we employ graph convolutional networks to extract multi-level semantic information for image–text pairs, and construct Gaussian distribution representations for images and texts instead of conventional point representations. Furthermore, we introduce an inter-modal mixture of Gaussian distributions to constrain the matching relationships between image–text pairs, which ensures more precise distribution representations in a shared space and strengthens the cross-modal correlation. Experiments on Flickr30K and MS-COCO, two widely used datasets, demonstrate the superior performance of our approach.
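To make the contrast between point and distribution representations concrete, the sketch below embeds an image and two captions as diagonal Gaussians and compares them. The abstract does not say how the Gaussians are parameterized or compared, so everything here is an assumption for illustration: the diagonal covariance, the closed-form squared 2-Wasserstein distance, the equal mixture weights, and the names `DiagGaussian`, `w2_squared`, `log_density`, and `mixture_log_density`. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagGaussian:
    """Hypothetical embedding: a Gaussian with diagonal covariance."""
    mean: List[float]
    std: List[float]  # per-dimension standard deviation, each > 0


def w2_squared(p: DiagGaussian, q: DiagGaussian) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances this is ||mu_p - mu_q||^2 + ||sigma_p - sigma_q||^2.
    Using it as the matching score is an assumption, not the paper's choice.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(p.mean, q.mean))
    std_term = sum((a - b) ** 2 for a, b in zip(p.std, q.std))
    return mean_term + std_term


def log_density(g: DiagGaussian, x: Sequence[float]) -> float:
    """Log probability density of a diagonal Gaussian at point x."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, g.mean, g.std)
    )


def mixture_log_density(components: Sequence[DiagGaussian],
                        weights: Sequence[float],
                        x: Sequence[float]) -> float:
    """Log density of a Gaussian mixture, computed with log-sum-exp."""
    logs = [math.log(w) + log_density(g, x) for g, w in zip(components, weights)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


# Toy 2-D embeddings (made-up numbers, not data from the paper).
image = DiagGaussian(mean=[0.0, 1.0], std=[1.0, 0.5])
caption_a = DiagGaussian(mean=[0.1, 0.9], std=[0.8, 0.5])   # plausible match
caption_b = DiagGaussian(mean=[3.0, -2.0], std=[0.2, 0.2])  # unrelated caption

print(w2_squared(image, caption_a), w2_squared(image, caption_b))
# An equal-weight mixture over the image and its matching caption.
print(mixture_log_density([image, caption_a], [0.5, 0.5], [0.05, 0.95]))
```

Unlike a point embedding, each item carries a spread as well as a location, so a broad image distribution can sit close to several distinct captions at once, which is one way to model the many-to-many ambiguity described above.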
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.