David Tschirschwitz, Franziska Klemstein, Henning Schmidgen, V. Rodehorst
Drawing the Line: A Dual Evaluation Approach for Shaping Ground Truth in Image Retrieval Using Rich Visual Embeddings of Historical Images
Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, published 2023-08-25
DOI: 10.1145/3604951.3605524 (https://doi.org/10.1145/3604951.3605524)
Abstract
Images contain rich visual information that can be interpreted in multiple ways, each of which may be correct. However, current retrieval systems in computer vision predominantly focus on content-based and instance-based image retrieval, while other facets relevant to the querying person, such as temporal aspects or image syntax, are often neglected. This study addresses this issue by examining a retrieval system in a domain-specific document processing pipeline. A retrieval evaluation dataset, which focuses on the aforementioned tasks, is utilized to compare different promising approaches. Subsequently, a qualitative study is conducted to compare the usability of the retrieval results with their corresponding metrics. This comparison reveals a discrepancy between the best-performing model by performance metrics and the model that provides better results for answering potential research questions. While current models such as DINO and CLIP demonstrate their ability to retrieve images based on their semantics and contents with high reliability, they exhibit limited capabilities in retrieving other facets.
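To make the point about metrics and ground truth concrete, the following is a minimal sketch, not the authors' pipeline. It ranks a toy gallery by cosine similarity to a query embedding and scores the same ranking against two different relevance sets. The embeddings, the image ids, the two relevance sets, and the choice of average precision as the metric are all assumptions made for illustration; the abstract does not name its metrics or data layout.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_gallery(query, gallery):
    """Return gallery ids sorted by descending similarity to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)

def average_precision(ranking, relevant):
    """Mean of precision@i taken at each rank i where a relevant item appears."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Hypothetical 2-D embeddings standing in for model outputs.
gallery = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
    "img_d": [0.7, 0.7],
}
query = [1.0, 0.0]

ranking = rank_gallery(query, gallery)

# Two hypothetical ground truths for the same query:
# one defined by image content, one by another facet (e.g. a temporal one).
relevant_by_content = {"img_a", "img_b"}
relevant_by_facet = {"img_c", "img_d"}

ap_content = average_precision(ranking, relevant_by_content)
ap_facet = average_precision(ranking, relevant_by_facet)

print(ranking)
print(round(ap_content, 3), round(ap_facet, 3))
```

In this toy setup the same ranking scores well under the content-based relevance set and poorly under the facet-based one, which is the kind of gap between metric and usefulness that the abstract describes.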