Encoding and Transcoding of Transcriptions of the Historical Corpus “Manuscript”

Social’no-ekonomiceskoe upravlenie: teoria i praktika. Pub Date: 2021-12-26. DOI: 10.22213/2618-9763-2021-4-82-89

V. Baranov, R. Gnutikov, K. I. Zinatshin


Citations: 0

Abstract




The article examines how the Cyrillic blocks of the Unicode Standard can be used to create transcriptions that reproduce the graphics of medieval Slavonic manuscripts. Particular attention is given to the variants of Cyrillic letters that the Unicode Standard provides, which make it possible to record the graphic features of manuscripts fairly accurately. However, variants of certain letters are still missing, so additional character-encoding conventions are needed whose code points lie in special blocks and Private Use Areas rather than in the standard ranges of Unicode. The historical corpus “Manuscript” is an example of a large machine-readable collection of medieval Slavonic manuscripts. It was built on the Oracle DBMS using a specialized system of codes and fonts. Transferring the corpus to other technological platforms, or using external software to analyze its linguistic data (including separate texts, parts of corpora, and selections), is possible only after the downloaded files have been recoded to the Unicode Standard. A comparative analysis of the character blocks used in the corpus and in the current version 14.0 of the Unicode Standard shows that recoding either loses graphic features or requires a supplementary set of variant characters with code points in the Private Use Areas. The article analyzes cases in which two or more characters of the Unicode Standard correspond to a single recoded character of the Manuscript corpus. It also notes that numerous ligatures and certain individual graphemes are missing from both the standard blocks and the Private Use Area blocks.
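The recoding trade-off described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the legacy code values, the mapping table, and the function name transcode are hypothetical and are not taken from the Manuscript corpus or the article. The sketch only shows the three outcomes the abstract names: a legacy code with several Unicode candidates, a code with no standard counterpart that is placed in the Private Use Area, and a code whose graphic feature is lost when Private Use code points are not allowed.

```python
import unicodedata

# Hypothetical legacy codes mapped to candidate Unicode characters.
# The codes and pairings are illustrative only, not the corpus's real table.
LEGACY_TO_UNICODE = {
    0x01: ["\u0430"],            # CYRILLIC SMALL LETTER A
    0x02: ["\u0479", "\ua64b"],  # SMALL LETTER UK or MONOGRAPH UK: two candidates
    0x04: [],                    # no standard counterpart, e.g. a ligature
}

PUA_START = 0xE000  # first code point of the BMP Private Use Area


def transcode(codes, allow_pua=True):
    """Recode legacy codes to a Unicode string and report problem cases."""
    out = []
    report = {"ambiguous": [], "pua": [], "lost": []}
    pua_assigned = {}  # legacy code -> Private Use Area character

    def note(kind, code):
        # Record each problematic legacy code once per category.
        if code not in report[kind]:
            report[kind].append(code)

    for code in codes:
        candidates = LEGACY_TO_UNICODE.get(code, [])
        if candidates:
            if len(candidates) > 1:
                note("ambiguous", code)
            out.append(candidates[0])  # arbitrary choice: first candidate
        elif allow_pua:
            if code not in pua_assigned:
                pua_assigned[code] = chr(PUA_START + len(pua_assigned))
            note("pua", code)
            out.append(pua_assigned[code])
        else:
            note("lost", code)
            out.append("\ufffd")  # REPLACEMENT CHARACTER marks the loss
    return "".join(out), report


text, report = transcode([0x01, 0x02, 0x04, 0x04])
print([unicodedata.name(ch, "PRIVATE USE") for ch in text])
print(report)
```

Calling transcode with allow_pua=False shows the other branch of the trade-off: the unmapped code is replaced by U+FFFD and listed under "lost" instead of receiving a Private Use code point. Picking the first candidate for an ambiguous code is a placeholder; a real table would need an explicit editorial decision for each such case.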
