{"title":"Attribute-Image Similarity Measure for Multimodal Attention Mechanism","authors":"Ali Salehi Najafabadi, A. Ghomsheh","doi":"10.1109/CSICC52343.2021.9420626","DOIUrl":null,"url":null,"abstract":"Multimodal attention mechanisms in computer vision applications enable rich feature extraction by attending to specific image regions, highlighted through a second mode of data regarded as auxiliary information. The correspondence between image regions and auxiliary data can be defined as the similarity between parts of the two modes. In this paper, we propose a similarity measure that maximizes the posterior for matching high-level object attributes with image regions. In contrast to previous methods, we rely on attribute space rather than textual descriptions. We evaluate our results on the CUB dataset. The results show that the proposed method better minimizes the similarity loss function compared to the text-image similarity measurement.","PeriodicalId":374593,"journal":{"name":"2021 26th International Computer Conference, Computer Society of Iran (CSICC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 26th International Computer Conference, Computer Society of Iran (CSICC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CSICC52343.2021.9420626","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multimodal attention mechanisms in computer vision applications enable rich feature extraction by attending to specific image regions, which are highlighted by a second data modality that serves as auxiliary information. The correspondence between image regions and auxiliary data can be defined as the similarity between parts of the two modalities. In this paper, we propose a similarity measure that maximizes the posterior probability of matching high-level object attributes with image regions. In contrast to previous methods, we rely on the attribute space rather than on textual descriptions. We evaluate our results on the CUB dataset. The results show that the proposed method minimizes the similarity loss function better than a text-image similarity measure does.
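The abstract does not give the exact formulation, but the general idea of an attribute-region similarity with a posterior-style matching can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes attribute embeddings and region features share a common dimension, uses cosine similarity, and forms a softmax "posterior" over regions per attribute, with a negative-log-posterior matching loss. The function names, the temperature `gamma`, and the ground-truth `matches` array are all hypothetical.

```python
import numpy as np

def attribute_region_similarity(attributes, regions):
    """Cosine similarity between attribute embeddings (A, d)
    and region features (R, d); returns an (A, R) matrix."""
    a = attributes / np.linalg.norm(attributes, axis=1, keepdims=True)
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    return a @ r.T

def posterior_match(sim, gamma=5.0):
    """Softmax over regions: a posterior-like P(region | attribute).
    gamma is an assumed temperature parameter."""
    e = np.exp(gamma * (sim - sim.max(axis=1, keepdims=True)))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def similarity_loss(posterior, matches):
    """Negative log posterior of the matched region for each attribute;
    minimizing this corresponds to maximizing the posterior of the match."""
    return -np.mean(np.log(posterior[np.arange(len(matches)), matches]))
```

Under this sketch, attention weights for a region are obtained by reading a column of the posterior matrix, so regions that align well with an attribute receive more attention.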