Encoding and Transcoding of Transcriptions of the Historical Corpus “Manuscript”

V. Baranov, R. Gnutikov, K. I. Zinatshin
Journal: Social’no-ekonomiceskoe upravlenie: teoria i praktika
Publication type: Journal Article
Published: 2021-12-26
DOI: 10.22213/2618-9763-2021-4-82-89 (https://doi.org/10.22213/2618-9763-2021-4-82-89)
Citations: 0

Abstract

ENCODING AND TRANSCODING OF TRANSCRIPTIONS OF THE HISTORICAL CORPUS “MANUSCRIPT”
The article examines how the Cyrillic blocks of the Unicode Standard can be used to create transcriptions that represent the graphics of medieval Slavonic manuscripts. Particular attention is given to the fact that the Unicode Standard provides variants of Cyrillic letters, which makes it possible to record the graphic features of manuscripts with reasonable accuracy. However, variants of certain letters are still missing, so additional character-encoding conventions are needed, whose code points are placed in special blocks and Private Use Areas rather than in the standard ranges of Unicode. The historical corpus “Manuscript” is an example of a large machine-readable collection of medieval Slavonic manuscripts. It was built on the Oracle DBMS using a specialized system of codes and fonts. Transferring the corpus to other technological platforms, or analyzing its linguistic data with external software (including separate texts, parts of corpora, and selections), is possible only after the downloaded files have been recoded to the Unicode Standard. A comparative analysis of the character blocks used in the corpus and those in the current version (14.0) of the Unicode Standard leads to the conclusion that recoding either loses graphic features or requires a supplementary set of variant characters with code points in the Private Use Areas. Cases in which two or more characters of the Unicode Standard correspond to one recoded character of the “Manuscript” corpus are analyzed. The article also states that numerous ligatures and certain singular graphemes are missing from both the standard blocks and the Private Use Area blocks.
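The recoding problem described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the legacy codes, the mapping table, and the names `LEGACY_TO_UNICODE`, `recode`, and `private_use_chars` are hypothetical, since the abstract does not give the corpus's actual code tables. The sketch assumes each legacy code maps to zero, one, or several Unicode candidates. Codes with no standard counterpart receive a Private Use Area code point, and codes with several candidates are reported so that the choice can be reviewed.

```python
import unicodedata

# Hypothetical mapping from legacy corpus codes to Unicode candidates.
# These entries are illustrative only, not the real "Manuscript" tables.
LEGACY_TO_UNICODE = {
    1: ["\u0430"],            # one standard counterpart
    2: ["\u0454", "\u0435"],  # two candidates: the choice is ambiguous
    3: ["\ua64b"],            # a letter from the Cyrillic Extended-B block
}

PUA_FIRST = 0xE000  # first code point of the BMP Private Use Area
PUA_LAST = 0xF8FF   # last code point of the BMP Private Use Area


def recode(codes, table=LEGACY_TO_UNICODE, pua_base=PUA_FIRST):
    """Recode legacy codes to a Unicode string.

    Returns (text, ambiguous, pua_assigned):
      ambiguous    - legacy codes that have more than one Unicode candidate
      pua_assigned - legacy codes without a standard counterpart, each
                     mapped to a stable Private Use Area character
    """
    out = []
    ambiguous = {}
    pua_assigned = {}
    for code in codes:
        candidates = table.get(code)
        if not candidates:
            # No standard character: keep the distinction in the PUA.
            if code not in pua_assigned:
                code_point = pua_base + len(pua_assigned)
                if code_point > PUA_LAST:
                    raise ValueError("Private Use Area exhausted")
                pua_assigned[code] = chr(code_point)
            out.append(pua_assigned[code])
        else:
            if len(candidates) > 1:
                ambiguous[code] = candidates
            # Default to the first candidate; a real table needs a policy.
            out.append(candidates[0])
    return "".join(out), ambiguous, pua_assigned


def private_use_chars(text):
    """List the characters in text whose general category is Co (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


if __name__ == "__main__":
    text, ambiguous, pua_assigned = recode([1, 2, 99, 3, 99])
    print(ambiguous)                     # legacy codes needing a manual choice
    print(len(private_use_chars(text)))  # characters that need a special font
```

Checking for category `Co` after recoding shows how much of a text still depends on a private agreement and a matching font, which is the trade-off the abstract describes: drop the graphic distinction, or carry it in the Private Use Area.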