Matthias Boenig, Konstantin Baierer, Volker Hartmann, M. Federbusch, Clemens Neudecker
Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, published 2019-05-08. DOI: 10.1145/3322905.3322916 (https://doi.org/10.1145/3322905.3322916)
Labelling OCR Ground Truth for Usage in Repositories
Over the last decade, rapid developments in deep/machine learning algorithms have largely replaced traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT). OCR of historical documents with high accuracy requires a wide variety and variability of GT in order to create highly specific models for specific document corpora. In this paper we present an XML-based format to exhaustively describe the features of GT for OCR relevant to training, storage and retrieval (GT metadata, GTM), as well as the tools for creating GT. We discuss the OCRD-ZIP format for bundling digitized books, including METS, images, transcription, GT metadata and more. We show how these data formats are used in different repository solutions within the OCR-D framework.