Combination of Structural and Factual Descriptors for Document Stream Segmentation

2016 12th IAPR Workshop on Document Analysis Systems (DAS). Pub Date: 2016-04-11. DOI: 10.1109/DAS.2016.21

Romain Karpinski, A. Belaïd


Citations: 9

Abstract




This paper extends previous work [4]. With no prior information about where documents are separated in the flow, the system proceeds incrementally, examining successive pairs of pages for descriptors that indicate continuity or rupture. Four document levels are introduced to extract those descriptors more reliably and to reduce ambiguity in their extraction: records, technical documents, fundamental documents, and cases. At each level, structural and factual descriptors are first extracted and then compared between pairs of pages or documents. To make the descriptors more informative and to focus the system on equivalent descriptors within each pair, every descriptor is accompanied by its context. Extracting this context is made easier by first determining the physical and logical structure of the pages. Contextual rules based on these descriptors then decide whether a pair exhibits continuity, rupture, or uncertainty. To cope with a current page that carries too little information, a logbook gathers the descriptors from all previous pages of the record, and a buffer allows the comparison to be delayed. These last two additions to the previous work substantially strengthen the current system, increasing its precision by more than 6%.
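As a rough illustration of the pairwise procedure the abstract describes, the Python sketch below walks a page stream, compares each page against a logbook of descriptors from the open document, and defers the decision in a buffer when the comparison is uncertain. It is not the authors' implementation. The descriptor names, the equality-based rule in `compare`, and the policy of attaching buffered pages to the open document are all assumptions made for the example, and the four document levels and descriptor contexts are not modelled.

```python
from enum import Enum


class Decision(Enum):
    CONTINUITY = "continuity"
    RUPTURE = "rupture"
    UNCERTAIN = "uncertain"


def compare(logbook: dict, page: dict) -> Decision:
    """Toy contextual rule (assumption): compare descriptors shared with the logbook."""
    shared = logbook.keys() & page.keys()
    if not shared:
        # The page carries no descriptor we can compare: defer the decision.
        return Decision.UNCERTAIN
    if all(logbook[name] == page[name] for name in shared):
        return Decision.CONTINUITY
    return Decision.RUPTURE


def segment(pages: list) -> list:
    """Group page indices into documents.

    Each page is a dict mapping a (hypothetical) descriptor name to its value.
    """
    documents = []   # finished documents, each a list of page indices
    current = []     # page indices of the document currently open
    logbook = {}     # descriptors gathered from the pages of the open document
    buffer = []      # indices of pages whose decision has been delayed

    for i, page in enumerate(pages):
        if not current:
            # First page of the stream opens the first document.
            current, logbook = [i], dict(page)
            continue
        decision = compare(logbook, page)
        if decision is Decision.UNCERTAIN:
            buffer.append(i)
            continue
        # Assumption: delayed pages stay with the document that was open.
        current.extend(buffer)
        for j in buffer:
            logbook.update(pages[j])
        buffer = []
        if decision is Decision.CONTINUITY:
            current.append(i)
            logbook.update(page)
        else:
            # Rupture: close the open document and start a new one at page i.
            documents.append(current)
            current, logbook = [i], dict(page)

    if current or buffer:
        documents.append(current + buffer)
    return documents


if __name__ == "__main__":
    stream = [
        {"ref": "A"},
        {},                          # no descriptors: decision is buffered
        {"ref": "A"},
        {"ref": "B"},                # descriptor changes: rupture
        {"ref": "B", "date": "x"},
    ]
    print(segment(stream))
```

In this toy stream the descriptor-free second page is held in the buffer until the third page confirms continuity, which is the role the abstract assigns to the buffer. The logbook is what lets the third page be compared against the first even though its immediate predecessor is empty.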
