Linguistic annotation of cuneiform texts using treebanks and deep learning

IF 1.1; Region 3 (Literature); JCR 0: HUMANITIES, MULTIDISCIPLINARY; Digital Scholarship in the Humanities; Pub Date: 2024-02-02; DOI: 10.1093/llc/fqae002

Matthew Ong, Shai Gordin


Citations: 0

Abstract

We describe an efficient pipeline for morpho-syntactically annotating an ancient language corpus which takes advantage of bootstrapping techniques. This pipeline is designed for ancient language scholars looking to jump-start their own treebank projects, which can in turn serve further pedagogical research projects in the target language. We situate our work in the field of similar ancient language treebank projects, arguing that our approach shows that individual humanities scholars can leverage current machine-learning tools to produce their own richly annotated corpora. We illustrate this pipeline by producing a new Akkadian-language treebank based on two volumes from the online editions of the State Archives of Assyria project hosted on Oracc, as well as a spaCy language model named AkkParser trained on that treebank. Both of these are made publicly available for annotating other Akkadian corpora. In addition, we discuss linguistic issues particular to the Neo-Assyrian letter corpus and data-encoding complications of cuneiform texts in Oracc. The strategies, language models, and processing scripts we developed to handle both linguistic and data-encoding issues in this project will be of special interest to scholars seeking to develop their own cuneiform treebanks.
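
As a rough illustration of the bootstrapping idea mentioned in the abstract, the sketch below alternates between pre-annotating a batch with the current model and retraining on the corrected output. The abstract does not spell out the procedure, so everything here is an assumption: the function names (`train`, `predict`, `bootstrap`, `oracle`) are hypothetical, a most-frequent-tag lookup stands in for the spaCy model, the `oracle` function simulates a human annotator, and the transliterated word forms and tags are toy data, not drawn from the treebank.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Fit a toy tagger: remember the most frequent tag seen for each word form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for form, tag in sentence:
            counts[form][tag] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def predict(model, forms, default="X"):
    """Pre-annotate a sentence; unseen forms receive the placeholder tag."""
    return [(form, model.get(form, default)) for form in forms]


def bootstrap(seed, unlabeled_batches, correct):
    """Pre-annotate each batch, have it corrected, then retrain on everything so far."""
    training = list(seed)
    model = train(training)
    for batch in unlabeled_batches:
        drafts = [predict(model, forms) for forms in batch]
        training.extend(correct(draft) for draft in drafts)
        model = train(training)
    return model, training


# Toy data (illustrative only, NOT taken from the treebank).
GOLD = {"ana": "ADP", "šarri": "NOUN", "bēlīya": "NOUN", "lū": "PART", "šulmu": "NOUN"}


def oracle(draft):
    """Stand-in for manual correction: replace each draft tag with the toy gold tag."""
    return [(form, GOLD[form]) for form, _ in draft]


seed = [[("ana", "ADP"), ("šarri", "NOUN")]]
batches = [
    [["ana", "šarri", "bēlīya"]],
    [["lū", "šulmu", "ana", "šarri", "bēlīya"]],
]
model, training = bootstrap(seed, batches, oracle)
```

In this toy setup the annotator only has to fix forms the model has not yet seen, which is the saving bootstrapping aims for. A real project would swap the lookup table for a trainable parser and the oracle for manual review.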


Source journal

Digital Scholarship in the Humanities Multiple-

CiteScore: 1.80

Self-citation rate: 25.00%

Journal description: DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the humanities, including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.