Standoff Annotation for the Ancient Greek and Latin Dependency Treebank
G. Celano
DOI: 10.1145/3322905.3322919
This contribution presents work in progress to convert the Ancient Greek and Latin Dependency Treebank (AGLDT) into standoff annotation using PAULA XML. With an increasing number of annotations of any kind, it becomes ever more urgent that annotations related to the same texts be added in standoff form. Standoff annotation consists of adding any kind of annotation in separate documents, which are ultimately linked to a main text, the so-called "base text," which is meant to be unchangeable. References occur via a graph-based system of IDs, which allows an annotation layer (contained in one file) to be linked to another annotation layer (contained in another file). All the annotations/files form a labeled directed acyclic graph whose root is the base text. Standoff annotation enables easy interoperability and extension, in that single annotation layers can reference other layers of annotation independently, thus overcoming the problem of conflicting hierarchies. Moreover, standoff annotation also allows different annotations of the same kind to be added to the same text (e.g., two different interpretations of the POS tag for a given token). In the present contribution, I show how the annotations of the AGLDT can become standoff using PAULA XML, an open-access format following the LAF principles. More precisely, I present the case study of Caesar's De Bello Civili and detail the PAULA XML files created for its tokenization and sentence splitting, which are prerequisites for adding morphosyntactic annotation.
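To make the standoff principle sketched in this abstract concrete, the following is a minimal sketch that derives a token layer from an immutable base text by recording only character offsets. The element and attribute names, file names and sample text are simplified placeholders, not the actual PAULA XML schema or the AGLDT files.

```python
# Minimal sketch of a standoff token layer: the base text stays untouched and each
# token is identified only by its character span. Element/attribute names and file
# names are simplified placeholders, not the real PAULA XML schema.
import re
import xml.etree.ElementTree as ET

base_text = "Gallia est omnis divisa in partes tres"  # placeholder base text

mark_list = ET.Element("markList", {"base": "caes.bc.text.xml"})  # hypothetical base-text file
for i, m in enumerate(re.finditer(r"\S+", base_text), start=1):
    ET.SubElement(mark_list, "mark", {
        "id": f"tok_{i}",
        "start": str(m.start() + 1),          # 1-based character offset into the base text
        "length": str(len(m.group())),
    })

ET.ElementTree(mark_list).write("caes.bc.tok.xml", encoding="utf-8", xml_declaration=True)
```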
Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability
E. Varthis, M. Poulos, Ilias Yarenis, S. Papavlasopoulos
DOI: 10.1145/3322905.3322913
Searching through large corpora of unstructured text on the Web is not an easy task, and such services are not offered to the common user. One such corpus is the published works of the Eastern Christian Fathers edited by Jacques Paul Migne, known as Patrologia Graeca (PG). In this paper, an application of a databaseless model is presented for extracting information from the unstructured patristic works of PG on the Web. The user queries terms that may exist in PG and retrieves all the paragraph fragments that contain these terms. Retrieval is faster than implementing the same querying system in a common Relational Database Management System (RDBMS). Our proposed system is portable, secure and can be easily maintained by institutions or organizations. The system auto-transforms the PG corpus into a Representational State Transfer (REST) API for retrieving and processing the information, using the JavaScript Object Notation (JSON) format. The User Interface (UI) is completely decoupled from the backbone of the system, which can in turn be easily extended to more complicated queries as well as applied to other corpora. Two kinds of user interface are described: the first is completely static and useful for the average user with a Web browser; the second illustrates the use of simple shell scripting for searching and extracting statistical, syntactic and semantic information in real time. In both cases we try to strip down the complexity in order to accomplish the corpus transformation and searching in a simple, secure and manageable way. Difficulties and key problems are also discussed.
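As an illustration of how a client might consume such a databaseless REST API, here is a minimal sketch; the endpoint URL and the JSON field names are hypothetical, not the interface actually exposed by the project.

```python
# Minimal client sketch for a static JSON search endpoint; the URL and the
# response field names ("hits", "work", "paragraph", "text") are hypothetical.
import json
import urllib.parse
import urllib.request

term = "λόγος"
url = "https://example.org/pg-api/search?" + urllib.parse.urlencode({"q": term})

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

for hit in results.get("hits", []):
    print(hit["work"], hit["paragraph"], hit["text"][:80])
```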
Validating 126 million MARC records
P. Király
DOI: 10.1145/3322905.3322929
The paper describes the method and results of the validation of 14 library catalogues. The format of the catalogue records is MAchine-Readable Cataloging (MARC21), the most popular metadata standard for describing books. The research investigates the structural features of the records and, as a result, finds and classifies different commonly found issues. The most frequent issue types are the use of undocumented schema elements, followed by improper values in places where a value should be taken from a dictionary or should match other strict requirements.
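A minimal sketch of the kind of structural check described here, written against the pymarc library; the set of documented tags and the input file name are illustrative placeholders, not the rule set used in the paper.

```python
# Minimal sketch of a structural MARC21 check; the allowed-tag set and the
# input file are illustrative only, not the rule set used in the paper.
from collections import Counter
from pymarc import MARCReader

DOCUMENTED_TAGS = {"001", "003", "005", "008", "020", "040", "100", "245", "260", "300", "650"}

issues = Counter()
with open("catalogue.mrc", "rb") as fh:          # hypothetical input file
    for record in MARCReader(fh):
        if record is None:                       # pymarc may yield None for unreadable records
            issues["unparseable record"] += 1
            continue
        for field in record.fields:
            if field.tag not in DOCUMENTED_TAGS:
                issues[f"undocumented field {field.tag}"] += 1

for issue, count in issues.most_common(10):
    print(f"{count:8d}  {issue}")
```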
Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project
K. Depuydt, H. Brugman
DOI: 10.1145/3322905.3322923
In this paper, we argue that the exploitation of historical corpus data requires text metadata that the metadata accompanying digital objects from digital libraries, archives or other electronic text collections does not provide. Most text collections describe in their metadata the object (book, newspaper) containing the text. To do research on the style of an author, to study the language of a certain time period, or to trace a phenomenon through time, correct metadata is needed for each word in the text, which leads to a very intricate metadata scheme for some text collections. We focus on the Nederlab corpus. Nederlab is a research environment that gives access to a large diachronic corpus of Dutch texts from the 6th to the 21st century, comprising more than 10 billion words. The corpus has been compiled using existing digitised text material from researchers, research organisations, archives and libraries. The aim of Nederlab is to provide tools and data to enable researchers to trace long-term changes in Dutch language, culture and society. This type of research places high-level requirements on the metadata accompanying the texts. Since the Nederlab corpus consists of different collections, each with their own metadata, the task of adding the appropriate metadata was not straightforward, all the more so because of the difference in perspective between content providers and corpus builders. We describe the desired metadata scheme and how we tried to realize it for a corpus the size of Nederlab.
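To illustrate what metadata "for each word in the text" could amount to, here is a minimal sketch of a word-level metadata record; the field names are illustrative assumptions, not the Nederlab metadata scheme.

```python
# Illustrative word-level metadata record; the fields are assumptions for the
# sake of the example, not the actual Nederlab metadata scheme.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenMetadata:
    token: str
    author: Optional[str]         # author of the text the token belongs to
    text_date: Optional[int]      # year of composition, if known
    witness_date: Optional[int]   # year of the manuscript or print witness
    genre: Optional[str]
    collection: str               # providing collection (library, archive, project)
    source_id: str                # identifier of the digital object the token comes from

example = TokenMetadata(
    token="ende", author=None, text_date=1265, witness_date=1350,
    genre="devotional prose", collection="hypothetical-collection", source_id="obj-00042",
)
print(example)
```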
Labelling OCR Ground Truth for Usage in Repositories
Matthias Boenig, Konstantin Baierer, Volker Hartmann, M. Federbusch, Clemens Neudecker
DOI: 10.1145/3322905.3322916
Over the last decade, rapid developments in deep/machine learning have largely replaced traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT). To OCR historical documents with high accuracy, a wide variety and variability of GT is required to create highly specific models for specific document corpora. In this paper we present an XML-based format to exhaustively describe the features of GT for OCR relevant to training, storage and retrieval (GT metadata, GTM), as well as the tools for creating GT. We discuss the OCRD-ZIP format for bundling digitized books, including METS, images, transcriptions, GT metadata and more. We show how these data formats are used in different repository solutions within the OCR-D framework.
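A minimal sketch of the bundling idea only, under the assumption that a METS file, page images, transcriptions and GT metadata are packed into a single archive; the file layout and names below are placeholders, and this is not the official OCRD-ZIP specification.

```python
# Simplified illustration of bundling a digitized book with its GT and metadata
# into one zip archive; this is NOT the official OCRD-ZIP layout.
import json
import zipfile

files = {
    "data/mets.xml": "<mets/>",                        # placeholder METS document
    "data/OCR-D-IMG/page_0001.tif": b"",               # placeholder page image
    "data/OCR-D-GT/page_0001.xml": "<PcGts/>",         # placeholder PAGE XML transcription
    "metadata/gt_metadata.json": json.dumps({
        "language": "de", "script": "Fraktur", "transcription_level": "diplomatic",
    }),
}

with zipfile.ZipFile("bundle.ocrd.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, content in files.items():
        zf.writestr(name, content)
```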
Towards the Extraction of Statistical Information from Digitised Numerical Tables: The Medical Officer of Health Reports Scoping Study
C. Clausner, A. Antonacopoulos, C. Henshaw, J. Hayes
DOI: 10.1145/3322905.3322932
Numerical data of considerable significance is present in historical documents in tabular form. Due to the challenges involved in extracting this data from the scanned documents, it is not available to researchers in a useful representation that unlocks the underlying statistical information. This paper sets out to create a better understanding of the problem of extracting and representing statistical information from numerical tables, in order to enable the creation of appropriate technical solutions and to help collection holders plan their digitisation projects appropriately so as to better serve their readers. To that effect, after an initial overview of current practices in the digitisation and representation of historical numerical data, the authors' findings are presented from a scoping exercise of the Wellcome Library's high-profile collection of Medical Officer of Health reports. In addition to users' perspectives and a detailed examination of the nature and structure of the data in the reports, a study of the extraction and integration of the data is also described.
Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771-1929: Early Results Using the PIVAJ Software
K. Kettunen, T. Ruokolainen, Erno Liukkonen, Pierrick Tranouez, D. Antelme, T. Paquet
DOI: 10.1145/3322905.3322911
This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been in search of an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy [11--13, 16, 17]. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869-1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. Once the preliminary annotation and experimentation had resulted in a consistent practice, we fixed the annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford [6]. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages under the three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data.
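A minimal sketch of the division into 168 training and 56 evaluation pages, under the assumption that the split is made at the issue level so that the four pages of an issue stay together (the abstract only states the page counts); identifiers and the random seed are arbitrary.

```python
# Illustrative issue-level split of 56 annotated four-page issues into
# 168 training pages and 56 evaluation pages; identifiers and seed are arbitrary.
import random

issues = [f"issue_{i:02d}" for i in range(1, 57)]      # 56 annotated issues

random.seed(42)
random.shuffle(issues)
train_issues, eval_issues = issues[:42], issues[42:]   # 42 + 14 issues

# Each issue has four pages, so the split yields 168 training and 56 evaluation pages.
train_pages = [f"{iss}_page{p}" for iss in train_issues for p in range(1, 5)]
eval_pages = [f"{iss}_page{p}" for iss in eval_issues for p in range(1, 5)]
assert len(train_pages) == 168 and len(eval_pages) == 56
```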
Curation Technologies for Cultural Heritage Archives: Analysing and transforming a heterogeneous data set into an interactive curation workbench
Georg Rehm, Martin Lee, J. Schneider, Peter Bourgonje
DOI: 10.1145/3322905.3322909
We present a platform that enables the semantic analysis, enrichment, visualisation and presentation of a document collection in a way that allows human users to intuitively interact with and explore the collection; in short, a curation platform or workbench. The data set used is the result of a research project, carried out by scholars from South Korea, in which official German government documents on the German re-unification were collected, intellectually curated, analysed, interpreted and published in multiple volumes. The documents we worked with are mostly in German; a small subset, mostly summaries, is in Korean. This paper describes the original research project that generated the data set and focuses upon a description of the platform and the Natural Language Processing (NLP) pipeline adapted and extended for this project (e.g., OCR was added). Our key objective is to develop an interactive curation workbench that enables users to interact with the data set in several different ways that go beyond the current version of the published document collection as a set of PDF documents available online. The paper concludes with suggestions regarding the improvement of the platform and future work.
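As an example of one enrichment step such an NLP pipeline might contain, here is a minimal sketch of named entity annotation for German text; spaCy and its de_core_news_sm model are stand-ins chosen for the sketch, not necessarily the components used in the project.

```python
# Illustrative enrichment step: named entity annotation of a German passage.
# spaCy and the de_core_news_sm model are stand-ins, not necessarily the
# components used in the project's pipeline.
import spacy

nlp = spacy.load("de_core_news_sm")  # requires: python -m spacy download de_core_news_sm

text = "Die Bundesregierung in Bonn verhandelte 1990 über die deutsche Wiedervereinigung."
doc = nlp(text)

# Collect entities as standoff-style annotations (surface form, label, character span).
annotations = [
    {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
]
print(annotations)
```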
Stylometry of literary papyri
Jeremi K. Ochab, Holger Essler
DOI: 10.1145/3322905.3322930
In this paper we present the first results of a stylometric analysis of literary papyri. Specifically, we perform a range of tests for unsupervised clustering of authors. We scrutinise both the best classic distance-based methods and state-of-the-art network community detection techniques. We report on obstacles concerning highly non-uniform distributions of text size and authorial samples, combined with a sparse feature space. We also note how clustering performance depends on the regularisation of spelling by means of querying relevant annotations.
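A minimal sketch of the classic distance-based setup mentioned above: relative frequencies of the most frequent words, z-scored per feature, compared with a Burrows-style Manhattan distance and clustered hierarchically. The toy samples are placeholders, and this is not the authors' experimental pipeline.

```python
# Minimal distance-based stylometric clustering sketch (not the authors' pipeline).
# Features: relative frequencies of the most frequent words, z-scored per feature;
# distance: Manhattan on z-scores (proportional to Burrows's Delta); clustering: average linkage.
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

texts = {                      # placeholder samples; real input would be papyrus transcriptions
    "sample_a": "ο δε λογος του θεου εμεινεν εν αυτω",
    "sample_b": "και ο λογος ην προς τον θεον και θεος ην ο λογος",
    "sample_c": "ανδρα μοι εννεπε μουσα πολυτροπον ος μαλα πολλα",
}

tokenised = {name: text.split() for name, text in texts.items()}
mfw = [w for w, _ in Counter(w for toks in tokenised.values() for w in toks).most_common(20)]

freqs = np.array([[toks.count(w) / len(toks) for w in mfw] for toks in tokenised.values()])
z = (freqs - freqs.mean(axis=0)) / (freqs.std(axis=0) + 1e-12)  # avoid division by zero

distances = pdist(z, metric="cityblock")        # Burrows-style Manhattan distance
clusters = fcluster(linkage(distances, method="average"), t=2, criterion="maxclust")
print(dict(zip(texts, clusters)))
```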
Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector
T. Derrick, Nora McGregor
DOI: 10.1145/3322905.3322907
The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich and unique historical content on an ever-increasing scale. However, particularly for historical material written in non-Latin scripts, enabling enriched full-text discovery and analysis across the digitised output, something which would truly transform access and scholarship, is still out of reach. This is due in part to the commercial text recognition solutions currently on the market having largely been optimised for modern documents and Latin scripts. This paper reports on a series of initiatives undertaken by the British Library to investigate, evaluate and support new research into enhancing text recognition capabilities for two major digitised non-Western language collections: printed Bangla and handwritten Arabic. It seeks to present lessons learned and opportunities gained from cross-disciplinary collaboration between the cultural heritage sector and researchers working at the cutting edge of text recognition, with a view to informing and encouraging future such partnerships.