{"title":"基于模型的退化文档图像的表单元检测和内容提取","authors":"Zhixin Shi, S. Setlur, V. Govindaraju","doi":"10.1145/2432553.2432565","DOIUrl":null,"url":null,"abstract":"This paper describes a novel method for detection and extraction of contents of table cells from handwritten document images. Given a model of the table and a document image containing a table, the hand-drawn or pre-printed table is detected and the contents of the table cells are extracted automatically. The algorithms described are designed to handle degraded binary document images. The target images may include a wide variety of noise, ranging from clutter noise, salt-and-pepper noise to non-text objects such as graphics and logos.\n The presented algorithm effectively eliminates extraneous noise and identifies potentially matching table layout candidates by detecting horizontal and vertical table line candidates. A table is represented as a matrix based on the locations of intersections of horizontal and vertical table lines, and a matching algorithm searches for the best table structure that matches the given layout model and using the matching score to eliminate spurious table line candidates. The optimally matched table candidate is then used for cell content extraction.\n This method was tested on a set of document page images containing tables from the challenge set of the DARPA MADCAT Arabic handwritten document image data. Preliminary results indicate that the method is effective and is capable of reliably extracting text from the table cells.","PeriodicalId":410986,"journal":{"name":"DAR '12","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Model based table cell detection and content extraction from degraded document images\",\"authors\":\"Zhixin Shi, S. Setlur, V. Govindaraju\",\"doi\":\"10.1145/2432553.2432565\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes a novel method for detection and extraction of contents of table cells from handwritten document images. Given a model of the table and a document image containing a table, the hand-drawn or pre-printed table is detected and the contents of the table cells are extracted automatically. The algorithms described are designed to handle degraded binary document images. The target images may include a wide variety of noise, ranging from clutter noise, salt-and-pepper noise to non-text objects such as graphics and logos.\\n The presented algorithm effectively eliminates extraneous noise and identifies potentially matching table layout candidates by detecting horizontal and vertical table line candidates. A table is represented as a matrix based on the locations of intersections of horizontal and vertical table lines, and a matching algorithm searches for the best table structure that matches the given layout model and using the matching score to eliminate spurious table line candidates. The optimally matched table candidate is then used for cell content extraction.\\n This method was tested on a set of document page images containing tables from the challenge set of the DARPA MADCAT Arabic handwritten document image data. 
Preliminary results indicate that the method is effective and is capable of reliably extracting text from the table cells.\",\"PeriodicalId\":410986,\"journal\":{\"name\":\"DAR '12\",\"volume\":\"44 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"DAR '12\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2432553.2432565\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"DAR '12","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2432553.2432565","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Model based table cell detection and content extraction from degraded document images
This paper describes a novel method for detecting table cells and extracting their contents from handwritten document images. Given a model of the table and a document image containing a table, the hand-drawn or pre-printed table is detected and the contents of its cells are extracted automatically. The algorithms described are designed to handle degraded binary document images. The target images may contain a wide variety of noise, ranging from clutter and salt-and-pepper noise to non-text objects such as graphics and logos.
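The paper does not spell out the noise-removal step, but a common way to suppress salt-and-pepper specks and small clutter in a binary page image is connected-component size filtering. The sketch below is only an illustration of that general idea under that assumption; the function name, the scipy-based implementation, and the min_size threshold are hypothetical and are not claimed to be the authors' method.

```python
import numpy as np
from scipy import ndimage

def remove_small_components(binary_img: np.ndarray, min_size: int = 20) -> np.ndarray:
    """binary_img: 2-D array with foreground pixels set to 1 (or True)."""
    # Label 8-connected foreground components.
    labels, _ = ndimage.label(binary_img, structure=np.ones((3, 3), dtype=int))
    sizes = np.bincount(labels.ravel())   # sizes[0] counts background pixels
    keep = sizes >= min_size              # keep only sufficiently large components
    keep[0] = False                       # never keep the background "component"
    return keep[labels].astype(binary_img.dtype)
```

In practice the size threshold would have to be tuned so that thin table-line fragments and small diacritics in handwritten Arabic are not discarded along with the noise.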
The presented algorithm effectively eliminates extraneous noise and identifies potential table layouts by detecting horizontal and vertical table line candidates. A table is represented as a matrix based on the locations of the intersections of horizontal and vertical table lines, and a matching algorithm searches for the table structure that best matches the given layout model, using the matching score to eliminate spurious table line candidates. The optimally matched table candidate is then used for cell content extraction.
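As a rough illustration of the matrix representation and model matching described above, the following hypothetical Python sketch builds the intersection grid from horizontal and vertical line candidates and brute-force searches subsets of those candidates for the structure that best fits a model given as expected row and column counts. The scoring function, the brute-force search, and all names are assumptions made for illustration only, not the paper's actual matching algorithm.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

def grid_matrix(h_lines: Sequence[float], v_lines: Sequence[float]) -> List[Tuple[float, float]]:
    """Represent a table candidate by its line-intersection points (y, x)."""
    return [(y, x) for y in sorted(h_lines) for x in sorted(v_lines)]

def match_score(h_lines, v_lines, model_rows: int, model_cols: int) -> float:
    """Crude score: penalize deviation from the model's row/column counts
    and reward evenly spaced lines (spurious lines disturb the spacing)."""
    if len(h_lines) < 2 or len(v_lines) < 2:
        return float("-inf")
    count_term = -abs(len(h_lines) - model_rows) - abs(len(v_lines) - model_cols)

    def spacing_regularity(lines):
        pos = sorted(lines)
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean = sum(gaps) / len(gaps)
        return -sum(abs(g - mean) for g in gaps) / (mean * len(gaps) + 1e-9)

    return count_term + spacing_regularity(h_lines) + spacing_regularity(v_lines)

def best_table(h_candidates, v_candidates, model_rows, model_cols):
    """Exhaustively try subsets of line candidates and keep the best-scoring
    structure; the chosen lines define the cell grid via grid_matrix()."""
    best, best_score = None, float("-inf")
    for r in range(2, len(h_candidates) + 1):
        for hs in combinations(h_candidates, r):
            for c in range(2, len(v_candidates) + 1):
                for vs in combinations(v_candidates, c):
                    s = match_score(hs, vs, model_rows, model_cols)
                    if s > best_score:
                        best, best_score = (list(hs), list(vs)), s
    return best, best_score
```

The exhaustive subset search is only workable for small numbers of line candidates; a real system would prune candidates first or use a greedy or dynamic-programming search guided by the model.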
This method was tested on a set of document page images containing tables from the challenge set of the DARPA MADCAT Arabic handwritten document image data. Preliminary results indicate that the method is effective and reliably extracts text from the table cells.