From Manuscript to Tagged Corpora: An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East

B. Kindt, Chahan Vidal-Gorène
DOI: https://doi.org/10.30687/arm/9372-8175/2022/01/005
Published: 2022-10-28 (journal article)
Citations: 4

Abstract

Creating a digital corpus enriched with full linguistic annotation traditionally involves several manual steps: acquisition, processing, and data display. Processing presupposes dedicated, specialised analysis tools adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI that enables the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology and the benefits of model specialisation, using digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).