Software extract data from word-based documents situationally-oriented approach

Journal of Applied Mathematics & Informatics (IF 0.3, Q4, MATHEMATICS, APPLIED). Pub Date: 2021-12-24. DOI: 10.37791/2687-0649-2021-16-6-66-83

V. Mironov, A. Gusarenko, N. Yusupova



Abstract

The article discusses the use of a situation-oriented approach to the programmatic processing of word-processing documents. The documents under consideration are prepared by the user in Microsoft Word or a similar word processor and are later used as data sources. The openness of the Office Open XML and Open Document Format standards makes it possible to apply the concept of virtual documents mapped to ZIP archives, which gives programmatic access to the XML components of such documents in a situation-oriented environment. The article substantiates the importance of agreeing in advance on where information is placed in the document, for example by using pre-prepared templates, so that it can later be found and retrieved. For the DOCX and ODT formats, it discusses the use of key phrases, bookmarks, content controls, and custom XML components to organize the extraction of entered data. For each option, tree-like models of access to the extracted data are built, together with the corresponding XPath expressions. The choice of option depends on the functionality and limitations of the word processor, and the options differ in how complex it is to develop a blank template, to enter the data, and to program the extraction. The applied solution is based on entering metadata into the article by means of content controls placed in a stub template and bound to elements of a custom XML component. The developed hierarchical situational model (HSM) provides extraction of an XML component, loading it into a DOM object, and XSLT transformations that produce the resulting data: an error report and JavaScript code for subsequent use of the extracted metadata.
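The access pattern described in the abstract (a DOCX file treated as a ZIP archive whose XML parts are queried with path expressions) can be illustrated with a minimal Python sketch. This is not the authors' HSM implementation. It uses only the Python standard library, and several names are assumptions made for illustration: the part name `customXml/item1.xml`, the tag values `title` and `author`, and the `metadata` root element. The sample archive built here is deliberately incomplete (it has no `[Content_Types].xml`) and is not a document Word could open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in DOCX files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def build_sample_docx() -> bytes:
    """Build a tiny in-memory ZIP that mimics two parts of a DOCX file.

    Hypothetical content: two content controls tagged "title" and "author",
    plus a custom XML part with an invented <metadata> schema.
    """
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:sdt><w:sdtPr><w:tag w:val="title"/></w:sdtPr>'
        "<w:sdtContent><w:p><w:r><w:t>Sample title</w:t></w:r></w:p>"
        "</w:sdtContent></w:sdt>"
        '<w:sdt><w:sdtPr><w:tag w:val="author"/></w:sdtPr>'
        "<w:sdtContent><w:p><w:r><w:t>A. Author</w:t></w:r></w:p>"
        "</w:sdtContent></w:sdt>"
        "</w:body></w:document>"
    )
    custom = (
        "<metadata><title>Sample title</title>"
        "<author>A. Author</author></metadata>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
        archive.writestr("customXml/item1.xml", custom)  # assumed part name
    return buffer.getvalue()


def read_content_controls(docx_bytes: bytes) -> dict:
    """Return {tag: text} for every tagged content control (w:sdt)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    values = {}
    for sdt in root.iterfind(".//w:sdt", NS):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        content = sdt.find("w:sdtContent", NS)
        if tag is None or content is None:
            continue  # untagged or empty controls are skipped
        name = tag.get(f"{{{W_NS}}}val")
        # Concatenate all text runs inside the control.
        values[name] = "".join(t.text or "" for t in content.iterfind(".//w:t", NS))
    return values


def read_custom_xml(docx_bytes: bytes, part: str = "customXml/item1.xml") -> ET.Element:
    """Load a custom XML part from the archive and return its root element."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return ET.fromstring(archive.read(part))


if __name__ == "__main__":
    sample = build_sample_docx()
    print(read_content_controls(sample))
    print(read_custom_xml(sample).findtext("author"))
```

Reading the bound custom XML part is usually the simpler of the two routes, because its element names are chosen by the template designer, whereas content controls must be located inside the word processor's own markup.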


