Chart image understanding and numerical data extraction
Ales Mishchenko, N. Vassilieva
2011 Sixth International Conference on Digital Information Management, 2011-12-01
DOI: 10.1109/ICDIM.2011.6093320 (https://doi.org/10.1109/ICDIM.2011.6093320)
Citations: 28
Abstract
Chart images in digital documents are an important source of valuable information that is largely under-utilized for data indexing and information extraction. We developed a framework that automatically extracts the data carried by charts and converts it to XML format. The proposed algorithm classifies each image by chart type, detects graphical and textual components, and extracts the semantic relations between graphics and text. Classification is performed by a novel model-based method, which was extensively tested against state-of-the-art supervised learning methods and showed high accuracy, comparable to that of the best supervised approaches. The proposed text detection algorithm is applied prior to optical character recognition and leads to a significant improvement in the text recognition rate (up to 20 times better). Analysis of the graphical components and their relations to textual cues allows the chart data to be recovered. For testing purposes, a benchmark set was created with the XML/SWF Chart tool. By comparing the recovered data with the original data used for chart generation, we are able to evaluate our information extraction framework and confirm its validity.
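The evaluation step described above, comparing recovered data with the data used to generate each chart, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The XML layout (`chart`, `series`, `point`), the function names, and the choice of mean relative error as the comparison measure are all assumptions made for illustration; the abstract specifies neither the output schema nor the metric, and this layout is not the XML/SWF Chart format.

```python
import xml.etree.ElementTree as ET


def series_to_xml(chart_type, series):
    """Serialize {series_name: [(label, value), ...]} to XML.

    The element and attribute names are hypothetical, not taken from the paper.
    """
    root = ET.Element("chart", {"type": chart_type})
    for name, points in series.items():
        series_el = ET.SubElement(root, "series", {"name": name})
        for label, value in points:
            point_el = ET.SubElement(series_el, "point", {"label": label})
            point_el.text = repr(float(value))
    return ET.tostring(root, encoding="unicode")


def xml_to_series(xml_text):
    """Parse the hypothetical XML layout back into a dict of series."""
    root = ET.fromstring(xml_text)
    return {
        s.get("name"): [(p.get("label"), float(p.text)) for p in s.findall("point")]
        for s in root.findall("series")
    }


def mean_relative_error(original, recovered):
    """Average |recovered - original| / |original| over points found in both.

    Points missing from `recovered` and points whose original value is zero
    are skipped; this is an assumed convention, not one stated in the source.
    """
    errors = []
    for name, points in original.items():
        recovered_points = dict(recovered.get(name, []))
        for label, value in points:
            if label in recovered_points and value != 0:
                errors.append(abs(recovered_points[label] - value) / abs(value))
    return sum(errors) / len(errors) if errors else float("nan")


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    ground_truth = {"sales": [("Q1", 10.0), ("Q2", 20.0)]}
    extracted = {"sales": [("Q1", 11.0), ("Q2", 20.0)]}
    print(series_to_xml("bar", extracted))
    print(mean_relative_error(ground_truth, extracted))
```

A relative measure is used here so that charts with different value ranges contribute comparably; an absolute error or a per-series tolerance would be equally plausible choices.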