Corpus of the Archival Documents of the Don Cossack Army: Problems of Morphological Analysis

O. Gorban, M. Kosova, E. Sheptukhina, A. Svetlov
Journal: Vestnik Volgogradskogo Gosudarstvennogo Universiteta-Seriya 2-Yazykoznanie (volume 30 1)
Publication date: 2022-12-01
Publication type: Journal Article
DOI: 10.15688/jvolsu2.2022.6.4 (https://doi.org/10.15688/jvolsu2.2022.6.4)

Abstract

The article presents the results of a collective project aimed at compiling a special annotated diachronic corpus of documents of the 18th and 19th centuries from the "Mikhailovsky Stanitsa Ataman" Archive Fund (State Archive of Volgograd Region, Russia). In the course of the work, the authors solved linguistic, technical and software tasks related to meta-marking, morphological tagging and the representation of marked texts in an electronic search environment. The texts are written in 18th-century cursive script using old Cyrillic letters, which have their own spelling conventions. To process them correctly, an add-on to the stemming tool MyStem by I. Segalovich was created. The add-on extends MyStem with support for old Cyrillic symbols, a convenient graphical interface, manual removal of homonymy, and export of marked text to an external data storage and processing system. Morphological analysis of some texts revealed nominal case form variants that are noted neither in M.V. Lomonosov's "Russian Grammar" nor in modern studies of literary texts of the 18th century. These findings point to the effectiveness of automatic tagging, which allows word form correction. The research results justified adjusting the text tagging software to extend the grammatical analysis options offered for homonymous forms, so that homonymy can be identified and removed manually. A quantitative analysis of these variants will allow the authors to evaluate their significance for the regional administrative language. The information obtained confirms the importance of creating the corpus for studying the history of the Russian language.
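The abstract does not describe how the add-on handles old Cyrillic symbols internally. As a rough illustration only, the sketch below shows one common way to prepare pre-reform spelling for a modern morphological analyzer: map old letters to modern ones, drop the word-final hard sign, and keep each original form paired with its normalized form so that the archival spelling is preserved. The mapping table and the names `OLD_TO_MODERN`, `normalize` and `prepare` are assumptions made for this example, not the authors' implementation, and the table is deliberately incomplete.

```python
import re

# Illustrative, incomplete mapping (an assumption, not the authors' table):
# pre-reform letters and their usual modern counterparts.
OLD_TO_MODERN = {
    "ѣ": "е", "Ѣ": "Е",  # yat
    "і": "и", "І": "И",  # decimal i
    "ѳ": "ф", "Ѳ": "Ф",  # fita
    "ѵ": "и", "Ѵ": "И",  # izhitsa
}

# A hard sign at the end of a word was written after a final consonant
# in the old orthography and is dropped in the modern one.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize(word: str) -> str:
    """Return a modern-spelling approximation of an old-orthography word."""
    modern = "".join(OLD_TO_MODERN.get(ch, ch) for ch in word)
    return FINAL_HARD_SIGN.sub("", modern)


def prepare(text: str) -> list[tuple[str, str]]:
    """Split text into words and pair each original form with its normalized form."""
    return [(word, normalize(word)) for word in re.findall(r"\w+", text)]


if __name__ == "__main__":
    for original, modern in prepare("въ станицѣ атаманъ"):
        print(original, "->", modern)
```

Only the normalized form would be passed to the analyzer; keeping the original alongside it means the tagged corpus can still display and search the spelling found in the manuscript.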