Standoff Annotation for the Ancient Greek and Latin Dependency Treebank
G. Celano
DOI: 10.1145/3322905.3322919
This contribution presents work in progress to convert the Ancient Greek and Latin Dependency Treebank (AGLDT) into standoff annotation using PAULA XML. With an increasing number of annotations of any kind, it becomes ever more urgent that annotations related to the same texts be added in standoff form. Standoff annotation consists of adding any kind of annotation in separate documents, which are ultimately linked to a main text, the so-called "base text," which is meant to be unchangeable. References occur via a graph-based system of IDs, which allows an annotation layer (contained in one file) to be linked to another annotation layer (contained in another file). All the annotations/files form a labeled directed acyclic graph whose root is the base text. Standoff annotation enables easy interoperability and extension, in that single annotation layers can reference other layers of annotation independently, thus overcoming the problem of conflicting hierarchies. Moreover, standoff annotation also allows different annotations of the same kind to be added to the same text (e.g., two different interpretations of the POS tag for a given token). In the present contribution, I show how the annotations of the AGLDT can become standoff using PAULA XML, an open-access format following the LAF principles. More precisely, I present the case study of Caesar's De Bello Civili and detail the PAULA XML files created for its tokenization and sentence splitting, which are prerequisites for adding morphosyntactic annotation.
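To make the standoff principle sketched in this abstract concrete, the following is a minimal sketch that derives a token layer from an immutable base text by recording only character offsets. The element and attribute names, file names and sample text are simplified placeholders, not the actual PAULA XML schema or the AGLDT files.

```python
# Minimal sketch of a standoff token layer: the base text stays untouched and each
# token is identified only by its character span. Element/attribute names and file
# names are simplified placeholders, not the real PAULA XML schema.
import re
import xml.etree.ElementTree as ET

base_text = "Gallia est omnis divisa in partes tres"  # placeholder base text

mark_list = ET.Element("markList", {"base": "caes.bc.text.xml"})  # hypothetical base-text file
for i, m in enumerate(re.finditer(r"\S+", base_text), start=1):
    ET.SubElement(mark_list, "mark", {
        "id": f"tok_{i}",
        "start": str(m.start() + 1),          # 1-based character offset into the base text
        "length": str(len(m.group())),
    })

ET.ElementTree(mark_list).write("caes.bc.tok.xml", encoding="utf-8", xml_declaration=True)
```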
Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability
E. Varthis, M. Poulos, Ilias Yarenis, S. Papavlasopoulos
DOI: 10.1145/3322905.3322913
Searching through large corpora of unstructured text on the Web is not an easy task, and such services are not offered to the common user. One such corpus is the published works of the Eastern Christian Fathers edited by Jacques Paul Migne, known as Patrologia Graeca (PG). In this paper, an application of a databaseless model is presented for extracting information from the unstructured patristic works of PG on the Web. The user queries terms that may exist in PG and retrieves all the paragraph fragments that contain these terms. Retrieval is faster than implementing the same querying system in a common Relational Database Management System (RDBMS). Our proposed system is portable, secure and can be easily maintained by institutions or organizations. The system auto-transforms the PG corpus into a Representational State Transfer (REST) API for retrieving and processing the information, using the JavaScript Object Notation (JSON) format. The User Interface (UI) is completely decoupled from the backbone of the system, which can in turn be easily extended to more complicated queries as well as applied to other corpora. Two kinds of user interface are described: the first is completely static and useful for the average user with a Web browser; the second illustrates the use of simple shell scripting for searching and extracting statistical, syntactic and semantic information in real time. In both cases we try to strip down the complexity in order to accomplish the corpus transformation and searching in a simple, secure and manageable way. Difficulties and key problems are also discussed.
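As an illustration of how a client might consume such a databaseless REST API, here is a minimal sketch; the endpoint URL and the JSON field names are hypothetical, not the interface actually exposed by the project.

```python
# Minimal client sketch for a static JSON search endpoint; the URL and the
# response field names ("hits", "work", "paragraph", "text") are hypothetical.
import json
import urllib.parse
import urllib.request

term = "λόγος"
url = "https://example.org/pg-api/search?" + urllib.parse.urlencode({"q": term})

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

for hit in results.get("hits", []):
    print(hit["work"], hit["paragraph"], hit["text"][:80])
```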
Validating 126 million MARC records
P. Király
DOI: 10.1145/3322905.3322929
The paper describes the method and results of the validation of 14 library catalogues. The format of the catalogue records is MAchine-Readable Cataloging (MARC21), the most popular metadata standard for describing books. The research investigates the structural features of the records and, as a result, finds and classifies different commonly found issues. The most frequent issue types are the use of undocumented schema elements, followed by improper values in places where a value should be taken from a dictionary or should match other strict requirements.
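A minimal sketch of the kind of structural check described here, written against the pymarc library; the set of documented tags and the input file name are illustrative placeholders, not the rule set used in the paper.

```python
# Minimal sketch of a structural MARC21 check; the allowed-tag set and the
# input file are illustrative only, not the rule set used in the paper.
from collections import Counter
from pymarc import MARCReader

DOCUMENTED_TAGS = {"001", "003", "005", "008", "020", "040", "100", "245", "260", "300", "650"}

issues = Counter()
with open("catalogue.mrc", "rb") as fh:          # hypothetical input file
    for record in MARCReader(fh):
        if record is None:                       # pymarc may yield None for unreadable records
            issues["unparseable record"] += 1
            continue
        for field in record.fields:
            if field.tag not in DOCUMENTED_TAGS:
                issues[f"undocumented field {field.tag}"] += 1

for issue, count in issues.most_common(10):
    print(f"{count:8d}  {issue}")
```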
Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project
K. Depuydt, H. Brugman
DOI: 10.1145/3322905.3322923
In this paper, we argue that the exploitation of historical corpus data requires text metadata that the metadata accompanying digital objects from digital libraries, archives or other electronic text collections does not provide. Most text collections describe in their metadata the object (book, newspaper) containing the text. To do research on the style of an author, to study the language of a certain time period, or to trace a phenomenon through time, correct metadata is needed for each word in the text, which leads to a very intricate metadata scheme for some text collections. We focus on the Nederlab corpus. Nederlab is a research environment that gives access to a large diachronic corpus of Dutch texts from the 6th to the 21st century, comprising more than 10 billion words. The corpus has been compiled using existing digitised text material from researchers, research organisations, archives and libraries. The aim of Nederlab is to provide tools and data to enable researchers to trace long-term changes in Dutch language, culture and society. This type of research places high-level requirements on the metadata accompanying the texts. Since the Nederlab corpus consists of different collections, each with their own metadata, the task of adding the appropriate metadata was not straightforward, all the more so because of the difference in perspective between content providers and corpus builders. We describe the desired metadata scheme and how we tried to realize it for a corpus the size of Nederlab.
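To illustrate what metadata "for each word in the text" could amount to, here is a minimal sketch of a word-level metadata record; the field names are illustrative assumptions, not the Nederlab metadata scheme.

```python
# Illustrative word-level metadata record; the fields are assumptions for the
# sake of the example, not the actual Nederlab metadata scheme.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenMetadata:
    token: str
    author: Optional[str]         # author of the text the token belongs to
    text_date: Optional[int]      # year of composition, if known
    witness_date: Optional[int]   # year of the manuscript or print witness
    genre: Optional[str]
    collection: str               # providing collection (library, archive, project)
    source_id: str                # identifier of the digital object the token comes from

example = TokenMetadata(
    token="ende", author=None, text_date=1265, witness_date=1350,
    genre="devotional prose", collection="hypothetical-collection", source_id="obj-00042",
)
print(example)
```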
Labelling OCR Ground Truth for Usage in Repositories
Matthias Boenig, Konstantin Baierer, Volker Hartmann, M. Federbusch, Clemens Neudecker
DOI: 10.1145/3322905.3322916
Over the last decade, rapid developments in deep/machine learning have largely replaced traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT). To OCR historical documents with high accuracy, a wide variety and variability of GT is required to create highly specific models for specific document corpora. In this paper we present an XML-based format to exhaustively describe the features of GT for OCR relevant to training, storage and retrieval (GT metadata, GTM), as well as the tools for creating GT. We discuss the OCRD-ZIP format for bundling digitized books, including METS, images, transcriptions, GT metadata and more. We show how these data formats are used in different repository solutions within the OCR-D framework.
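A minimal sketch of the bundling idea only, under the assumption that a METS file, page images, transcriptions and GT metadata are packed into a single archive; the file layout and names below are placeholders, and this is not the official OCRD-ZIP specification.

```python
# Simplified illustration of bundling a digitized book with its GT and metadata
# into one zip archive; this is NOT the official OCRD-ZIP layout.
import json
import zipfile

files = {
    "data/mets.xml": "<mets/>",                        # placeholder METS document
    "data/OCR-D-IMG/page_0001.tif": b"",               # placeholder page image
    "data/OCR-D-GT/page_0001.xml": "<PcGts/>",         # placeholder PAGE XML transcription
    "metadata/gt_metadata.json": json.dumps({
        "language": "de", "script": "Fraktur", "transcription_level": "diplomatic",
    }),
}

with zipfile.ZipFile("bundle.ocrd.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, content in files.items():
        zf.writestr(name, content)
```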
Towards the Extraction of Statistical Information from Digitised Numerical Tables: The Medical Officer of Health Reports Scoping Study
C. Clausner, A. Antonacopoulos, C. Henshaw, J. Hayes
DOI: 10.1145/3322905.3322932
Numerical data of considerable significance is present in historical documents in tabular form. Due to the challenges involved in extracting this data from the scanned documents, it is not available to researchers in a useful representation that unlocks the underlying statistical information. This paper sets out to create a better understanding of the problem of extracting and representing statistical information from numerical tables, in order to enable the creation of appropriate technical solutions and to help collection holders plan their digitisation projects appropriately so as to better serve their readers. To that effect, after an initial overview of current practices in the digitisation and representation of historical numerical data, the authors' findings are presented from a scoping exercise of the Wellcome Library's high-profile collection of Medical Officer of Health reports. In addition to users' perspectives and a detailed examination of the nature and structure of the data in the reports, a study of the extraction and integration of the data is also described.
Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771-1929: Early Results Using the PIVAJ Software
K. Kettunen, T. Ruokolainen, Erno Liukkonen, Pierrick Tranouez, D. Antelme, T. Paquet
DOI: 10.1145/3322905.3322911
This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been in search of an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy [11--13, 16, 17]. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869-1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. Once the preliminary annotation and experimentation had resulted in a consistent practice, we fixed the annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford [6]. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages under the three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data.
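A minimal sketch of the division into 168 training and 56 evaluation pages, under the assumption that the split is made at the issue level so that the four pages of an issue stay together (the abstract only states the page counts); identifiers and the random seed are arbitrary.

```python
# Illustrative issue-level split of 56 annotated four-page issues into
# 168 training pages and 56 evaluation pages; identifiers and seed are arbitrary.
import random

issues = [f"issue_{i:02d}" for i in range(1, 57)]      # 56 annotated issues

random.seed(42)
random.shuffle(issues)
train_issues, eval_issues = issues[:42], issues[42:]   # 42 + 14 issues

# Each issue has four pages, so the split yields 168 training and 56 evaluation pages.
train_pages = [f"{iss}_page{p}" for iss in train_issues for p in range(1, 5)]
eval_pages = [f"{iss}_page{p}" for iss in eval_issues for p in range(1, 5)]
assert len(train_pages) == 168 and len(eval_pages) == 56
```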
Curation Technologies for Cultural Heritage Archives: Analysing and transforming a heterogeneous data set into an interactive curation workbench
Georg Rehm, Martin Lee, J. Schneider, Peter Bourgonje
DOI: 10.1145/3322905.3322909
We present a platform that enables the semantic analysis, enrichment, visualisation and presentation of a document collection in a way that allows human users to intuitively interact with and explore the collection; in short, a curation platform or workbench. The data set used is the result of a research project, carried out by scholars from South Korea, in which official German government documents on the German re-unification were collected, intellectually curated, analysed, interpreted and published in multiple volumes. The documents we worked with are mostly in German; a small subset, mostly summaries, is in Korean. This paper describes the original research project that generated the data set and focuses upon a description of the platform and the Natural Language Processing (NLP) pipeline adapted and extended for this project (e.g., OCR was added). Our key objective is to develop an interactive curation workbench that enables users to interact with the data set in several different ways that go beyond the current version of the published document collection as a set of PDF documents available online. The paper concludes with suggestions regarding the improvement of the platform and future work.
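As an example of one enrichment step such an NLP pipeline might contain, here is a minimal sketch of named entity annotation for German text; spaCy and its de_core_news_sm model are stand-ins chosen for the sketch, not necessarily the components used in the project.

```python
# Illustrative enrichment step: named entity annotation of a German passage.
# spaCy and the de_core_news_sm model are stand-ins, not necessarily the
# components used in the project's pipeline.
import spacy

nlp = spacy.load("de_core_news_sm")  # requires: python -m spacy download de_core_news_sm

text = "Die Bundesregierung in Bonn verhandelte 1990 über die deutsche Wiedervereinigung."
doc = nlp(text)

# Collect entities as standoff-style annotations (surface form, label, character span).
annotations = [
    {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
]
print(annotations)
```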
Stylometry of literary papyri
Jeremi K. Ochab, Holger Essler
DOI: 10.1145/3322905.3322930
In this paper we present the first results of a stylometric analysis of literary papyri. Specifically, we perform a range of tests for unsupervised clustering of authors. We scrutinise both the best classic distance-based methods and state-of-the-art network community detection techniques. We report on obstacles concerning highly non-uniform distributions of text size and authorial samples, combined with a sparse feature space. We also note how clustering performance depends on the regularisation of spelling by means of querying relevant annotations.
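A minimal sketch of the classic distance-based setup mentioned above: relative frequencies of the most frequent words, z-scored per feature, compared with a Burrows-style Manhattan distance and clustered hierarchically. The toy samples are placeholders, and this is not the authors' experimental pipeline.

```python
# Minimal distance-based stylometric clustering sketch (not the authors' pipeline).
# Features: relative frequencies of the most frequent words, z-scored per feature;
# distance: Manhattan on z-scores (proportional to Burrows's Delta); clustering: average linkage.
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

texts = {                      # placeholder samples; real input would be papyrus transcriptions
    "sample_a": "ο δε λογος του θεου εμεινεν εν αυτω",
    "sample_b": "και ο λογος ην προς τον θεον και θεος ην ο λογος",
    "sample_c": "ανδρα μοι εννεπε μουσα πολυτροπον ος μαλα πολλα",
}

tokenised = {name: text.split() for name, text in texts.items()}
mfw = [w for w, _ in Counter(w for toks in tokenised.values() for w in toks).most_common(20)]

freqs = np.array([[toks.count(w) / len(toks) for w in mfw] for toks in tokenised.values()])
z = (freqs - freqs.mean(axis=0)) / (freqs.std(axis=0) + 1e-12)  # avoid division by zero

distances = pdist(z, metric="cityblock")        # Burrows-style Manhattan distance
clusters = fcluster(linkage(distances, method="average"), t=2, criterion="maxclust")
print(dict(zip(texts, clusters)))
```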
Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector
T. Derrick, Nora McGregor
DOI: 10.1145/3322905.3322907
The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich and unique historical content on an ever-increasing scale. However, particularly for historical material written in non-Latin scripts, enabling enriched full-text discovery and analysis across the digitised output, something which would truly transform access and scholarship, is still out of reach. This is due in part to the commercial text recognition solutions currently on the market having largely been optimised for modern documents and Latin scripts. This paper reports on a series of initiatives undertaken by the British Library to investigate, evaluate and support new research into enhancing text recognition capabilities for two major digitised non-Western language collections: printed Bangla and handwritten Arabic. It seeks to present lessons learned and opportunities gained from cross-disciplinary collaboration between the cultural heritage sector and researchers working at the cutting edge of text recognition, with a view to informing and encouraging future such partnerships.