Combination of Structural and Factual Descriptors for Document Stream Segmentation

Romain Karpinski, A. Belaïd
DOI: 10.1109/DAS.2016.21 (https://doi.org/10.1109/DAS.2016.21)
Published in: 2016 12th IAPR Workshop on Document Analysis Systems (DAS)
Publication date: 2016-04-11
Citations: 9

Abstract

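This paper extends the earlier work of [4]. Given no prior information about where documents are separated in the stream, the system proceeds incrementally, examining successive pairs of pages for descriptors of continuity or rupture. Four document levels are introduced to improve descriptor extraction and reduce its ambiguity: records, technical documents, fundamental documents, and cases. At each level, structural and factual descriptors are first extracted and then compared between pairs of pages or documents. To strengthen the descriptors and focus the system on equivalent descriptors within each pair, every descriptor is accompanied by its context, whose extraction is facilitated by determining the physical and logical structure of the pages. Contextual rules based on these descriptors decide whether a pair exhibits continuity, rupture, or uncertainty. To cope with a current page that carries too little information, a logbook gathers the descriptors from all previous pages of the record, and a buffer allows the comparison to be delayed. These last two additions to the previous work substantially reinforce the system, increasing its precision by more than 6%.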
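The pairwise decision loop with a logbook and a buffer can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the descriptors, the contextual rules, or how buffered pages are resolved. Everything below is an assumption made for illustration, including the names `Decision`, `compare`, and `segment`, the representation of a page as a plain dict of descriptor names to values, the equality-based rule, and the policy of attaching buffered pages to the earlier record. Only one of the four document levels (records) is modelled.

```python
from enum import Enum


class Decision(Enum):
    CONTINUITY = "continuity"
    RUPTURE = "rupture"
    UNCERTAIN = "uncertain"


def compare(logbook, page):
    """Toy rule (assumed): compare the descriptors a page shares with the logbook."""
    shared = logbook.keys() & page.keys()
    if not shared:
        # Nothing to compare on: the page is "empty" with respect to the logbook.
        return Decision.UNCERTAIN
    if all(logbook[key] == page[key] for key in shared):
        return Decision.CONTINUITY
    return Decision.RUPTURE


def segment(pages):
    """Split a stream of pages (dicts of descriptors) into records."""
    records = []
    current = []  # pages of the record being built
    logbook = {}  # descriptors gathered from the pages of the current record
    buffer = []   # pages whose decision has been delayed

    for page in pages:
        if not current:
            current, logbook = [page], dict(page)
            continue
        decision = compare(logbook, page)
        if decision is Decision.UNCERTAIN:
            buffer.append(page)  # delay the comparison
            continue
        if decision is Decision.CONTINUITY:
            # Buffered pages lie between two pages of the same record.
            for pending in buffer + [page]:
                current.append(pending)
                logbook.update(pending)
        else:
            # Assumed policy: buffered pages stay with the earlier record.
            current.extend(buffer)
            records.append(current)
            current, logbook = [page], dict(page)
        buffer = []

    if current:
        current.extend(buffer)  # trailing undecided pages close the last record
        records.append(current)
    return records


if __name__ == "__main__":
    stream = [
        {"invoice_no": "A1", "customer": "X"},
        {},                    # no descriptors: decision is delayed
        {"invoice_no": "A1"},  # agrees with the logbook
        {"invoice_no": "B2"},  # disagrees: a new record starts
        {"page_label": "2/2"},
    ]
    print([len(record) for record in segment(stream)])
```

Comparing against the logbook rather than only the previous page is what lets the third page in the usage example be matched even though the page just before it carries no descriptors.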