Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project

K. Depuydt, H. Brugman
Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, published 2019-05-08
DOI: https://doi.org/10.1145/3322905.3322923
Citations: 1

Abstract

In this paper, we argue that exploiting historical corpus data requires text metadata that the metadata accompanying digital objects from digital libraries, archives or other electronic text collections does not provide. Most text collections describe in their metadata the object (book, newspaper) containing the text. To do research on the style of an author, or to study the language of a certain time period or a phenomenon through time, correct metadata is needed for each word in the text, which leads to a very intricate metadata scheme for some text collections. We focus on the Nederlab corpus. Nederlab is a research environment that gives access to a large diachronic corpus of Dutch texts from the 6th to the 21st century, comprising more than 10 billion words. The corpus has been compiled using existing digitised text material from researchers, research organisations, archives and libraries. The aim of Nederlab is to provide tools and data that enable researchers to trace long-term changes in Dutch language, culture and society. This type of research places high demands on the metadata accompanying the texts. Since the Nederlab corpus consists of different collections, each with its own metadata, the task of adding the appropriate metadata was not straightforward, all the more so because content providers and corpus builders differ in perspective. We describe the desired metadata scheme and how we tried to realise it for a corpus of the size of Nederlab.
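
To illustrate the distinction the abstract draws between metadata about the containing object and metadata about the text itself, here is a minimal Python sketch. The class names, fields and the example anthology are hypothetical and are not taken from the Nederlab metadata scheme; the sketch only shows why a period query needs text-level dating rather than the publication date of the object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextUnit:
    """One independent text inside a container (hypothetical structure)."""
    title: str
    authors: List[str]
    year_from: int  # earliest plausible year the text was written
    year_to: int    # latest plausible year the text was written
    tokens: List[str] = field(default_factory=list)


@dataclass
class ContainerObject:
    """The digitised object (book, newspaper issue) as a collection might describe it."""
    title: str
    publication_year: int
    texts: List[TextUnit] = field(default_factory=list)


def tokens_in_period(objects: List[ContainerObject], start: int, end: int) -> List[str]:
    """Collect tokens from texts whose dating falls entirely within [start, end]."""
    selected: List[str] = []
    for obj in objects:
        for text in obj.texts:
            # Use the dating of the text, not obj.publication_year.
            if start <= text.year_from and text.year_to <= end:
                selected.extend(text.tokens)
    return selected


# Invented example: a 1900 anthology containing a 17th-century poem
# and a preface written by the editor in 1900.
anthology = ContainerObject(
    title="Hypothetical anthology",
    publication_year=1900,
    texts=[
        TextUnit("Poem A", ["Author X"], 1650, 1650, ["een", "oud", "gedicht"]),
        TextUnit("Editor's preface", ["Editor Y"], 1900, 1900, ["een", "nieuw", "voorwoord"]),
    ],
)

seventeenth_century = tokens_in_period([anthology], 1600, 1699)
```

If only the object-level publication year (1900) were available, every word in the anthology would be attributed to 1900, and the poem would be missed by a query for 17th-century language.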