Gauging the Limitations of Natural Language Supervised Text-Image Metrics Learning by Iconclass Visual Concepts

Proceedings of the 7th International Workshop on Historical Document Imaging and Processing. Pub Date: 2023-08-25. DOI: 10.1145/3604951.3605516

Kai Labusch, Clemens Neudecker

Citations: 0

Abstract

Identifying images that are close to each other in iconographical meaning requires a suitable distance measure for text-image or image-image pairs. To obtain such a measure, we fine-tune a group of contrastive-loss-based text-to-image similarity models (MS-CLIP) on a large number of Iconclass visual concepts by means of natural-language-supervised learning. We show that the models can learn certain Iconclass concepts, whereas other visual concepts cannot be learned. We hypothesize that the visual concepts that are learned more easily are intrinsically different from those that are more difficult to learn, and that these qualitative differences can provide valuable orientation for future research in text-to-image similarity learning.
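As a rough illustration of what a contrastive-loss-based text-image distance measure involves, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss and a pairwise cosine distance. It is not the authors' implementation and does not reproduce MS-CLIP; the function names, the temperature value of 0.07, and the use of plain embedding matrices in place of real encoders are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to unit length (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine_distance(a, b):
    """Pairwise cosine distance between the rows of `a` and the rows of `b`."""
    return 1.0 - l2_normalize(a) @ l2_normalize(b).T


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss; row i of `image_emb` is assumed to match row i of `text_emb`.

    The temperature is a hypothetical fixed value chosen for illustration.
    """
    logits = (l2_normalize(image_emb) @ l2_normalize(text_emb).T) / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should rank its own caption highest.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should rank its own image highest.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In a setup like the one the abstract describes, the text side would presumably be embeddings of Iconclass concept descriptions, and the learned cosine distance could then be applied to text-image or image-image pairs; that mapping is an assumption of this sketch, not a detail stated in the source.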


