Implementation of an Algorithm for Extracting Information about Structural Elements of Text Documents in ODT Format

A. Berezhkov, G. S. Larionova, V. Martsinkevich, V. Tereshchenko
DOI: 10.17587/it.29.307-315 (https://doi.org/10.17587/it.29.307-315)
Journal: Radioelektronika, Nanosistemy, Informacionnye Tehnologii
Volume: 15 1
Published: 2023-06-16
Publication type: Journal Article
Citations: 0

Abstract

This paper examines how the XML markup of a digital document in ODT format depends on the tool used to create it. The comparison covers not only specialized tools but also tools that do not work with the ODT format directly, in order to identify the most vulnerable spots. It also describes the specifics of extracting data from structural elements of a document, such as tables, lists, and images. An implementation of an algorithm for obtaining style attributes is proposed and described; it is used to build a system for automated normative control of digital documents. The analysis shows that the non-strict ODT standard makes the XML markup dependent on the text editor used to create the document and, as a result, limits the number of tags that can be relied upon when developing document-parsing algorithms. Nevertheless, the task is feasible, as the article demonstrates. The default values and the description of the algorithm for traversing the document by blocks and structural elements form the basis for preparing data for the subsequent creation of a classifier and for automating normative control. Thus, the proposed algorithm and the XML markup analysis are effective tools for building an automated document standard control system, and the algorithm has potential for further improvement.
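As a rough illustration of the kind of traversal the abstract describes, the sketch below walks the top-level blocks of an ODT file's content.xml and attaches style attributes to paragraphs. It is not the authors' algorithm: the function names (extract_blocks, automatic_styles, make_sample_odt) and the output layout are assumptions made for this example, and only Python's standard library is used. It also assumes that paragraph styles live under office:automatic-styles and that table rows are direct children of table:table, which, as the abstract notes, may not hold for every editor.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard ODF namespace URIs.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
    "xlink": "http://www.w3.org/1999/xlink",
}


def qname(prefix, local_name):
    """Build the '{namespace}name' form that ElementTree uses."""
    return "{%s}%s" % (NS[prefix], local_name)


def local(name):
    """Strip the '{namespace}' part of a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def text_of(element):
    """Concatenate all text nested inside an element."""
    return "".join(element.itertext())


def automatic_styles(root):
    """Map each automatic style name to a flat dict of its property attributes."""
    styles = {}
    for st in root.iterfind("office:automatic-styles/style:style", NS):
        props = {}
        for child in st:  # e.g. style:paragraph-properties, style:text-properties
            for key, value in child.attrib.items():
                props[local(key)] = value
        styles[st.get(qname("style", "name"))] = props
    return styles


def extract_blocks(odt_bytes):
    """Return paragraphs, lists, and tables in document order (hypothetical layout)."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    styles = automatic_styles(root)
    blocks = []
    for el in root.find("office:body/office:text", NS):
        kind = local(el.tag)
        if kind in ("p", "h"):
            name = el.get(qname("text", "style-name"))
            images = [img.get(qname("xlink", "href"))
                      for img in el.iterfind(".//draw:image", NS)]
            blocks.append({"kind": "paragraph", "text": text_of(el),
                           "style": styles.get(name, {}), "images": images})
        elif kind == "list":
            items = [text_of(i) for i in el.iterfind("text:list-item", NS)]
            blocks.append({"kind": "list", "items": items})
        elif kind == "table":
            # Assumption: rows are direct children; header-row groups are ignored.
            rows = [[text_of(c) for c in r.iterfind("table:table-cell", NS)]
                    for r in el.iterfind("table:table-row", NS)]
            blocks.append({"kind": "table", "rows": rows})
    return blocks


def make_sample_odt():
    """Build a tiny in-memory ODT-like archive (not schema-valid) for the demo."""
    xmlns = " ".join('xmlns:%s="%s"' % item for item in NS.items())
    content = "".join([
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<office:document-content %s>" % xmlns,
        "<office:automatic-styles>",
        '<style:style style:name="P1" style:family="paragraph">',
        '<style:paragraph-properties fo:text-align="center"/>',
        '<style:text-properties fo:font-size="14pt"/>',
        "</style:style></office:automatic-styles>",
        "<office:body><office:text>",
        '<text:p text:style-name="P1">Title</text:p>',
        "<text:list><text:list-item><text:p>one</text:p></text:list-item>",
        "<text:list-item><text:p>two</text:p></text:list-item></text:list>",
        "<table:table><table:table-row>",
        "<table:table-cell><text:p>a</text:p></table:table-cell>",
        "<table:table-cell><text:p>b</text:p></table:table-cell>",
        "</table:table-row></table:table>",
        '<text:p><draw:frame><draw:image xlink:href="Pictures/fig1.png"/>',
        "</draw:frame></text:p>",
        "</office:text></office:body></office:document-content>",
    ])
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", content)
    return buf.getvalue()


if __name__ == "__main__":
    for block in extract_blocks(make_sample_odt()):
        print(block)
```

To try it on a real document, the bytes of any .odt file can be passed to extract_blocks, for example extract_blocks(open("report.odt", "rb").read()), where report.odt is a placeholder name. Files produced by different editors may name styles differently or define them in styles.xml rather than content.xml, so a missing style simply yields an empty dict here.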
Source journal
Radioelektronika, Nanosistemy, Informacionnye Tehnologii
Subject area: Materials Science (miscellaneous)
CiteScore: 0.60
Self-citation rate: 0.00%
Articles published: 38
About the journal: The journal "Radioelectronics. Nanosystems. Information Technologies" (abbr. RENSIT) publishes original, previously unpublished articles, reviews, and brief reports on topical problems in radioelectronics (including biomedical) and the fundamentals of information, nano-, and biotechnologies and adjacent areas of physics and mathematics. Its authors are academicians, corresponding members, and foreign members of the Russian Academy of Natural Sciences (RANS) and their colleagues, as well as other Russian and foreign authors recommended by members of RANS. An author can obtain this recommendation before submitting an article to the editor, or after submission, from a member of the editorial board or another RANS member who reviewed the article at the editor's request. The editors accept articles in both Russian and English. Articles undergo internal double-blind peer review by members of the Editorial Board; some articles also undergo external review if necessary. The journal is intended for researchers, graduate students, senior physics students, and teachers. It is published twice a year (two issues).