{"title":"Bridging the gap: multi-granularity representation learning for text-based vehicle retrieval","authors":"Xue Bo, Junjie Liu, Di Yang, Wentao Ma","doi":"10.1007/s40747-024-01614-w","DOIUrl":null,"url":null,"abstract":"<p>Text-based cross-modal vehicle retrieval has been widely applied in smart city contexts and other scenarios. The objective of this approach is to identify semantically relevant target vehicles in videos using text descriptions, thereby facilitating the analysis of vehicle spatio-temporal trajectories. Current methodologies predominantly employ a two-tower architecture, where single-granularity features from both visual and textual domains are extracted independently. However, due to the intricate semantic relationships between videos and text, aligning the two modalities effectively using single-granularity feature representation poses a challenge. To address this issue, we introduce a <b>M</b>ulti-<b>G</b>ranularity <b>R</b>epresentation <b>L</b>earning model, termed <b>MGRL</b>, tailored for text-based cross-modal vehicle retrieval. Specifically, the model parses information from the two modalities into three hierarchical levels of feature representation: coarse-granularity, medium-granularity, and fine-granularity. Subsequently, a feature adaptive fusion strategy is devised to automatically determine the optimal pooling mechanism. Finally, a multi-granularity contrastive learning approach is implemented to ensure comprehensive semantic coverage, ranging from coarse to fine levels. Experimental outcomes on public benchmarks show that our method achieves up to a 14.56% improvement in text-to-vehicle retrieval performance, as measured by the Mean Reciprocal Rank (MRR) metric, when compared against 10 state-of-the-art baselines and 6 ablation studies.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":null,"pages":null},"PeriodicalIF":5.0000,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-024-01614-w","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Text-based cross-modal vehicle retrieval has been widely applied in smart city contexts and other scenarios. The objective of this approach is to identify semantically relevant target vehicles in videos using text descriptions, thereby facilitating the analysis of vehicle spatio-temporal trajectories. Current methodologies predominantly employ a two-tower architecture, where single-granularity features from the visual and textual domains are extracted independently. However, due to the intricate semantic relationships between videos and text, aligning the two modalities effectively with a single-granularity feature representation poses a challenge. To address this issue, we introduce a Multi-Granularity Representation Learning model, termed MGRL, tailored for text-based cross-modal vehicle retrieval. Specifically, the model parses information from the two modalities into three hierarchical levels of feature representation: coarse-granularity, medium-granularity, and fine-granularity. Subsequently, a feature adaptive fusion strategy is devised to automatically determine the optimal pooling mechanism. Finally, a multi-granularity contrastive learning approach is implemented to ensure comprehensive semantic coverage, ranging from coarse to fine levels. Experimental outcomes on public benchmarks show that our method achieves up to a 14.56% improvement in text-to-vehicle retrieval performance, as measured by the Mean Reciprocal Rank (MRR) metric, over 10 state-of-the-art baselines; 6 ablation studies further validate the contribution of each component.
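For readers who want a concrete picture of the two training components the abstract names, the sketch below is a minimal, hypothetical PyTorch rendering: a symmetric InfoNCE-style contrastive loss applied independently at each granularity level, and a gated fusion that learns a soft choice between mean- and max-pooling over frame features. All class names, the dict-based interface, and the two-way pooling gate are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityContrastiveLoss(nn.Module):
    """Symmetric InfoNCE computed per granularity level, then averaged.

    Hypothetical sketch: the paper's exact loss weighting is not specified here.
    """

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, text_feats: dict, video_feats: dict) -> torch.Tensor:
        # text_feats / video_feats: dicts of [batch, dim] tensors, one entry
        # per granularity level, e.g. {"coarse": ..., "medium": ..., "fine": ...}.
        total = 0.0
        for level in text_feats:
            t = F.normalize(text_feats[level], dim=-1)
            v = F.normalize(video_feats[level], dim=-1)
            logits = t @ v.T / self.temperature           # [batch, batch] similarities
            targets = torch.arange(t.size(0), device=t.device)
            # Contrast in both directions: text-to-video and video-to-text.
            total = total + 0.5 * (
                F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)
            )
        return total / len(text_feats)


class AdaptiveFusionPool(nn.Module):
    """Learns a per-sample soft mix of mean- and max-pooling over frames.

    One plausible reading of the 'feature adaptive fusion strategy that
    automatically determines the optimal pooling mechanism'.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 2)                     # scores for [mean, max]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [batch, num_frames, dim] per-frame visual features.
        mean_pool = frames.mean(dim=1)
        max_pool = frames.max(dim=1).values
        weights = F.softmax(self.gate(mean_pool), dim=-1) # [batch, 2]
        return weights[:, :1] * mean_pool + weights[:, 1:] * max_pool
```

Under these assumptions, training would pool each video's frame features with AdaptiveFusionPool at every granularity level and then sum the per-level contrastive losses, so that coarse, medium, and fine text-video alignments are optimized jointly.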
About the journal:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.