{"title":"Gauging the Limitations of Natural Language Supervised Text-Image Metrics Learning by Iconclass Visual Concepts","authors":"Kai Labusch, Clemens Neudecker","doi":"10.1145/3604951.3605516","DOIUrl":null,"url":null,"abstract":"Identification of images that are close to each other in terms of their iconographical meaning requires an applicable distance measure for text-image or image-image pairs. To obtain such a measure of distance, we finetune a group of contrastive loss based text-to-image similarity models (MS-CLIP) with respect to a large number of Iconclass visual concepts by means of natural language supervised learning. We show that there are certain Iconclass concepts that actually can be learned by the models whereas other visual concepts cannot be learned. We hypothesize that the visual concepts that can be learned more easily are intrinsically different from those that are more difficult to learn and that these qualitative differences can provide a valuable orientation for future research directions in text-to-image similarity learning.","PeriodicalId":375632,"journal":{"name":"Proceedings of the 7th International Workshop on Historical Document Imaging and Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th International Workshop on Historical Document Imaging and Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3604951.3605516","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Identifying images that are close to each other in terms of their iconographic meaning requires a suitable distance measure for text-image or image-image pairs. To obtain such a distance measure, we fine-tune a group of contrastive-loss-based text-to-image similarity models (MS-CLIP) on a large number of Iconclass visual concepts by means of natural language supervised learning. We show that certain Iconclass concepts can in fact be learned by the models, whereas other visual concepts cannot. We hypothesize that the visual concepts that are learned more easily differ intrinsically from those that are more difficult to learn, and that these qualitative differences can provide valuable orientation for future research in text-to-image similarity learning.
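The fine-tuning procedure described in the abstract follows the standard CLIP recipe: image and text encoders are optimized jointly with a symmetric contrastive (InfoNCE) loss over batches of image-caption pairs, where the captions are textual descriptions of Iconclass notations. The sketch below illustrates that recipe using the Hugging Face CLIP implementation as a stand-in; the checkpoint name, the example Iconclass caption, and the hyperparameters are illustrative assumptions, not the authors' actual MS-CLIP setup.

```python
# Minimal sketch (not the authors' code): natural-language-supervised
# fine-tuning of a CLIP-style model on Iconclass concept descriptions.
# Checkpoint, learning rate, and caption format are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, iconclass_texts):
    # iconclass_texts: textual descriptions of Iconclass notations, e.g.
    # "11H(JEROME): the penitent hermit St. Jerome" (hypothetical example)
    inputs = processor(text=iconclass_texts, images=images,
                       return_tensors="pt", padding=True)
    # return_loss=True makes CLIPModel compute the symmetric contrastive
    # (InfoNCE) loss over all image-text pairs in the batch
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

Because the loss is symmetric over the batch, each Iconclass description serves both as the positive caption for its own image and as a negative for every other image in the batch. This is what lets the model induce a general text-image distance measure, rather than a fixed classifier over a closed set of concepts, and it is that distance measure whose per-concept learnability the paper then probes.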