{"title":"Hierarchical Composition Learning for Composed Query Image Retrieval","authors":"Yahui Xu, Yi Bin, Guoqing Wang, Yang Yang","doi":"10.1145/3469877.3490601","DOIUrl":null,"url":null,"abstract":"Composed query image retrieval is a growing research topic. The object is to retrieve images not only generally resemble the reference image, but differ according to the desired modification text. Existing methods mainly explore composing modification text with global feature or local entity descriptor of reference image. However, they ignore the fact that modification text is indeed diverse and arbitrary. It not only relates to abstractive global feature or concrete local entity transformation, but also often associates with the fine-grained structured visual adjustment. Thus, it is insufficient to emphasize the global or local entity visual for the query composition. In this work, we tackle this task by hierarchical composition learning. Specifically, the proposed method first encodes images into three representations consisting of global, entity and structure level representations. Structure level representation is richly explicable, which explicitly describes entities as well as attributes and relationships in the image with a directed graph. Based on these, we naturally perform hierarchical composition learning by fusing modification text and reference image in the global-entity-structure manner. It can transform the visual feature conditioned on modification text to target image in a coarse-to-fine manner, which takes advantage of the complementary information among three levels. Moreover, we introduce a hybrid space matching to explore global, entity and structure alignments which can get high performance and good interpretability.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Multimedia Asia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3469877.3490601","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
Composed query image retrieval is a growing research topic. The object is to retrieve images not only generally resemble the reference image, but differ according to the desired modification text. Existing methods mainly explore composing modification text with global feature or local entity descriptor of reference image. However, they ignore the fact that modification text is indeed diverse and arbitrary. It not only relates to abstractive global feature or concrete local entity transformation, but also often associates with the fine-grained structured visual adjustment. Thus, it is insufficient to emphasize the global or local entity visual for the query composition. In this work, we tackle this task by hierarchical composition learning. Specifically, the proposed method first encodes images into three representations consisting of global, entity and structure level representations. Structure level representation is richly explicable, which explicitly describes entities as well as attributes and relationships in the image with a directed graph. Based on these, we naturally perform hierarchical composition learning by fusing modification text and reference image in the global-entity-structure manner. It can transform the visual feature conditioned on modification text to target image in a coarse-to-fine manner, which takes advantage of the complementary information among three levels. Moreover, we introduce a hybrid space matching to explore global, entity and structure alignments which can get high performance and good interpretability.