Authors: V. Baranov, R. Gnutikov, K. I. Zinatshin
Journal: Social’no-ekonomiceskoe upravlenie: teoria i praktika
Published: 2021-12-26
DOI: 10.22213/2618-9763-2021-4-82-89 (https://doi.org/10.22213/2618-9763-2021-4-82-89)
ENCODING AND TRANSCODING OF TRANSCRIPTIONS OF THE HISTORICAL CORPUS “MANUSCRIPT”
The article examines how the Cyrillic blocks of the Unicode Standard can be used to create transcriptions that reproduce the graphics of medieval Slavonic manuscripts. The Unicode Standard provides variants of many Cyrillic letters, which makes it possible to record the graphic features of manuscripts fairly accurately. However, variants of some letters are still missing, so additional character-encoding conventions are needed whose code points lie in special blocks and Private Use Areas rather than in the standard Unicode ranges. The historical corpus “Manuscript” is an example of a large machine-readable collection of medieval Slavonic manuscripts. It was built on the Oracle DBMS using a specialized system of codes and fonts. Transferring the corpus to other technological platforms, or analyzing its linguistic data with external software (whether individual texts, parts of the corpus, or selections), is possible only after the downloaded files have been recoded to the Unicode Standard. A comparison of the character blocks used in the corpus with those of the then-current version 14.0 of the Unicode Standard shows that recoding either loses graphic features or requires a supplementary set of variant characters with code points in the Private Use Areas. The article analyzes cases in which two or more Unicode characters correspond to a single recoded character of the Manuscript corpus. It also notes that numerous ligatures and certain individual graphemes are missing from the standard blocks and from the blocks of the Private Use Areas.
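The trade-off described in the abstract (lose a graphic feature, or fall back on a Private Use Area code point) can be illustrated with a minimal Python sketch. Everything in the table is an assumption made for illustration: the legacy code names `OMEGA` and `VE_VARIANT`, the choice of U+E001 as a private-use code point, and the `transcode` and `private_use_chars` helpers are hypothetical and do not reproduce the corpus's actual code system or the authors' recoding tables.

```python
import unicodedata
from typing import NamedTuple, Optional


class Target(NamedTuple):
    standard: Optional[str]  # code point in a standard Unicode block, if one exists
    base: str                # plain letter used when the variant is dropped (lossy)
    pua: Optional[str]       # Private Use Area fallback (hypothetical assignment)


# Hypothetical legacy codes; NOT the real code system of the corpus.
LEGACY_TABLE = {
    # Cyrillic small letter omega has its own standard code point, U+0461.
    "OMEGA": Target(standard="\u0461", base="\u043E", pua=None),
    # An imagined graphic variant of "ve" with no standard code point.
    "VE_VARIANT": Target(standard=None, base="\u0432", pua="\uE001"),
}


def transcode(codes, policy="lossy"):
    """Recode legacy codes to a Unicode string.

    policy="lossy": variants without a standard code point collapse to the base letter.
    policy="pua":   such variants are kept as Private Use Area characters.
    """
    out = []
    for code in codes:
        target = LEGACY_TABLE[code]
        if target.standard is not None:
            out.append(target.standard)
        elif policy == "pua" and target.pua is not None:
            out.append(target.pua)
        else:
            out.append(target.base)
    return "".join(out)


def private_use_chars(text):
    """Return the characters whose Unicode general category is Co (private use)."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


lossy = transcode(["OMEGA", "VE_VARIANT"], policy="lossy")
kept = transcode(["OMEGA", "VE_VARIANT"], policy="pua")
print(private_use_chars(lossy), private_use_chars(kept))
```

Under the lossy policy the output is portable but the variant is indistinguishable from an ordinary "ve"; under the PUA policy the distinction survives, but only for readers who share the same private-use convention and a font that covers it.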