David Tschirschwitz, Franziska Klemstein, Henning Schmidgen, V. Rodehorst
Images contain rich visual information that can be interpreted in multiple ways, each of which may be correct. However, current retrieval systems in computer vision focus predominantly on content-based and instance-based image retrieval, while other facets relevant to the person querying, such as temporal aspects or image syntax, are often neglected. This study addresses this issue by examining a retrieval system within a domain-specific document processing pipeline. A retrieval evaluation dataset focused on these tasks is used to compare several promising approaches. A qualitative study is then conducted to compare the usability of the retrieval results with their corresponding metrics. This comparison reveals a discrepancy between the model that scores best on performance metrics and the model that provides better results for answering potential research questions. While current models such as DINO and CLIP can retrieve images based on their semantics and content with high reliability, they exhibit limited capabilities in retrieving other facets.
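As an illustration of the content-based retrieval that embedding models such as DINO and CLIP enable, the following sketch ranks a small gallery by cosine similarity to a query embedding. This is not the paper's actual system; the toy 3-d embeddings are invented for illustration.

```python
import numpy as np

def retrieve(query_vec, gallery, k=3):
    """Return indices of the k gallery embeddings most similar to the query,
    ranked best-first by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity of each gallery item
    return np.argsort(-sims)[:k]      # best-first ranking

# Toy 4-image gallery of 3-d "embeddings" (stand-ins for DINO/CLIP features).
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
print(retrieve(query, gallery, k=2))  # nearest two gallery images
```

A system like the one studied would compute the gallery embeddings offline and answer each query with a single matrix-vector product, which is what makes embedding-based retrieval practical at scale.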
"Drawing the Line: A Dual Evaluation Approach for Shaping Ground Truth in Image Retrieval Using Rich Visual Embeddings of Historical Images." David Tschirschwitz, Franziska Klemstein, Henning Schmidgen, V. Rodehorst. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605524
Ancient historical documents, such as Greek papyri, are crucial for understanding human knowledge and history. However, transcribing and translating these documents manually is a difficult and time-consuming process. An automatic algorithm is therefore required to identify and interpret the writing on these documents accurately and dependably. In this work, we introduce PapyTwin, a deep neural network consisting of two subnetworks, the first and second twin, that cooperate to detect Greek letters on ancient papyri. The first twin network normalizes letter size across the images, while the second twin network predicts letter bounding boxes on these size-normalized images. Experimental results show that our proposed approach outperforms the baseline model by a large margin, suggesting that uniform letter size across images is a crucial factor in improving the performance of detection networks on ancient documents such as Greek papyri.
"PapyTwin net: a Twin network for Greek letters detection on ancient Papyri." Manh-Tu Vu, M. Beurton-Aimar. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605522
V. Rezanezhad, Konstantin Baierer, Mike Gerber, Kai Labusch, Clemens Neudecker
The automated yet highly accurate layout analysis (segmentation) of historical document images remains a key challenge for improving Optical Character Recognition (OCR) results. Historical documents exhibit a wide array of features that disturb layout analysis, such as multiple columns, drop capitals, illustrations, skewed or curved text lines, noise, and annotations. We present a document layout analysis (DLA) system for historical documents based on pixel-wise segmentation with convolutional neural networks. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions. Our system detects more layout classes (e.g., initials, marginals) and achieves higher accuracy than competing approaches. We describe the algorithm and the different models, explain how they were trained, and discuss our results in comparison to the state of the art on three historical document datasets.
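A reading-order heuristic of the general kind described can be sketched as follows. This is a generic illustration, not the paper's actual method; the `(x, y, w, h)` box format and the fixed column count are assumptions made for the example.

```python
def reading_order(regions, page_width, n_cols=2):
    """Sort region bounding boxes (x, y, w, h) into a simple reading order:
    left-to-right by column, then top-to-bottom inside each column."""
    col_w = page_width / n_cols
    # Key: (column index of the box's left edge, vertical position).
    return sorted(regions, key=lambda r: (int(r[0] // col_w), r[1]))

boxes = [(520, 80, 400, 200),   # right column, top
         (40, 700, 400, 150),   # left column, bottom
         (40, 100, 400, 300)]   # left column, top
print(reading_order(boxes, page_width=1000))
```

Real systems must additionally infer the number and extent of columns from the segmentation itself and handle regions that span columns, which is where heuristics like those in the paper come in.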
"Document Layout Analysis with Deep Learning and Heuristics." V. Rezanezhad, Konstantin Baierer, Mike Gerber, Kai Labusch, Clemens Neudecker. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605513
We present novel software that processes scans of historical documents to extract their layout information, using a ResNet backbone with a feature pyramid head. Region information is extracted directly into PageXML. For baseline extraction, we use a two-stage processing approach. The software has been applied successfully in several projects, and the results show the feasibility of automatically labeling text lines and regions in historical documents.
"Laypa: A Novel Framework for Applying Segmentation Networks to Historical Documents." Stefan Klut, Rutger van Koert, R. Sluijter. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605520
This paper proposes a deep-learning-based approach to writer retrieval and identification for papyri, with a focus on identifying fragments associated with a specific writer and those corresponding to the same image. We present a novel neural network architecture that combines a residual backbone with a feature mixing stage to improve retrieval performance, and the final descriptor is derived from a projection layer. The methodology is evaluated on two benchmarks: PapyRow, where we achieve a mAP of 26.6 % and 24.9 % on writer and page retrieval, and HisFragIR20, showing state-of-the-art performance (44.0 % and 29.3 % mAP). Furthermore, our network has an accuracy of 28.7 % for writer identification. Additionally, we conduct experiments on the influence of two binarization techniques on fragments and show that binarizing does not enhance performance. Our code and models are available to the community at https://github.com/marco-peer/hip23.
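For reference, the mean average precision (mAP) reported in these benchmarks can be computed from best-first relevance lists. The sketch below uses the common variant that normalizes each query's AP by the number of relevant items found in the ranked list; the toy relevance lists are invented for illustration.

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a best-first list of 0/1 flags."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant rank
    return sum(precisions) / max(hits, 1)

def mean_average_precision(runs):
    """mAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Two toy queries: relevance flags of the returned list, best match first.
runs = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(mean_average_precision(runs))
```

Because AP rewards placing relevant fragments near the top of the ranking, it is a natural metric for writer retrieval, where a scholar typically inspects only the first few returned fragments.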
"Feature Mixing for Writer Retrieval and Identification on Papyri Fragments." Marco Peer, Robert Sablatnig. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605515
Solène Tarride, Tristan Faine, Mélodie Boillet, H. Mouchère, Christopher Kermorvant
In this paper, we explore different ways of training a model for handwritten text recognition when multiple imperfect or noisy transcriptions are available. We consider various training configurations, such as selecting a single transcription, retaining all transcriptions, or computing an aggregated transcription from all available annotations. In addition, we evaluate the impact of quality-based data selection, where samples with low agreement are removed from the training set. Our experiments are carried out on municipal registers of the city of Belfort (France) written between 1790 and 1946. The results show that computing a consensus transcription or training on multiple transcriptions are good alternatives. However, selecting training samples based on the degree of agreement between annotators introduces a bias in the training data and does not improve the results. Our dataset is publicly available on Zenodo.
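One simple way to compute an aggregated transcription from several noisy annotations is a set-median approximation: pick the transcription that minimizes the total edit distance to all the others. This is a generic sketch, not necessarily the consensus method used in the paper; the French sample strings are invented for illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete from a
                                     dp[j - 1] + 1,    # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def consensus(transcriptions):
    """Pick the transcription closest (in total edit distance) to all others."""
    return min(transcriptions,
               key=lambda t: sum(edit_distance(t, u) for u in transcriptions))

votes = ["le quatre mai", "le qatre mai", "le quatre may"]
print(consensus(votes))  # the medoid transcription
```

Unlike character-level alignment schemes (e.g., ROVER-style voting), this medoid approach can only return one of the existing annotations, but it is trivial to implement and robust when one annotator is clearly closer to the group than the others.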
"Handwritten Text Recognition from Crowdsourced Annotations." Solène Tarride, Tristan Faine, Mélodie Boillet, H. Mouchère, Christopher Kermorvant. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605517
Furkan Simsek, Brian Pfitzmann, Hendrik Raetz, Jona Otholt, Haojin Yang, C. Meinel
In this work, we propose DocLangID, a transfer learning approach to identify the language of unlabeled historical documents. We first leverage labeled data from a different but related domain of historical documents. Second, we apply a distance-based few-shot learning approach to adapt a convolutional neural network to the new languages of the unlabeled dataset. By introducing small amounts of manually labeled examples from the set of unlabeled images, our feature extractor adapts better to the new and different data distributions of historical documents. We show that such a model can be effectively fine-tuned for the unlabeled set of images by reusing only the same few-shot examples. We showcase our work across 10 languages that mostly use the Latin script. Our experiments on historical documents demonstrate that our combined approach improves language identification performance, achieving 74% recognition accuracy on the four unseen languages of the unlabeled dataset.
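Distance-based few-shot classification of the kind described is often implemented with class prototypes. This minimal nearest-centroid sketch assigns a sample to the class whose prototype (mean feature vector of its few labeled shots) is closest; the 2-d features and class names are invented for illustration and do not come from the paper.

```python
import numpy as np

def prototypes(features, labels):
    """Mean feature vector (prototype) per class from the few labeled shots."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(x, protos):
    """Assign x to the class whose prototype is nearest in Euclidean distance."""
    return min(protos, key=lambda c: np.linalg.norm(x - protos[c]))

# Toy 2-d "CNN features" for two scripts, three shots each.
feats = np.array([[0.1, 0.9], [0.2, 0.8], [0.0, 1.0],   # class "latin"
                  [0.9, 0.1], [1.0, 0.2], [0.8, 0.0]])  # class "fraktur"
labels = np.array(["latin"] * 3 + ["fraktur"] * 3)
protos = prototypes(feats, labels)
print(classify(np.array([0.15, 0.85]), protos))
```

Because only the prototypes depend on the few labeled examples, adding a new language requires no retraining of the feature extractor, which is what makes this family of methods attractive for adapting to unlabeled collections.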
"DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents." Furkan Simsek, Brian Pfitzmann, Hendrik Raetz, Jona Otholt, Haojin Yang, C. Meinel. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605512
Lars Vögtlin, Anna Scius-Bertrand, Paul Maergner, Andreas Fischer, R. Ingold
Deep learning methods have shown strong performance on historical document image analysis tasks. However, despite current libraries and frameworks, programming and executing an experiment or a set of experiments can be time-consuming. We therefore propose DIVA-DAF, an open-source deep learning framework based on PyTorch Lightning and specifically designed for historical document analysis. Pre-implemented tasks such as segmentation and classification can be used as-is or customized, and new tasks are easy to create with the benefit of powerful modules for loading data, even large datasets, and different forms of ground truth. Our applications demonstrate time savings when programming a document analysis task, as well as in scenarios such as pre-training or changing the architecture. Thanks to its data module, the framework also significantly reduces model training time.
"DIVA-DAF: A Deep Learning Framework for Historical Document Image Analysis." Lars Vögtlin, Anna Scius-Bertrand, Paul Maergner, Andreas Fischer, R. Ingold. In Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951.3605511
Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 2023. https://doi.org/10.1145/3604951