Automatic Extraction of Figures from Scholarly Documents

Sagnik Ray Choudhury, P. Mitra, C. Lee Giles
{"title":"从学术文献中自动提取数字","authors":"Sagnik Ray Choudhury, P. Mitra, C. Lee Giles","doi":"10.1145/2682571.2797085","DOIUrl":null,"url":null,"abstract":"Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple ``figures'' such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.","PeriodicalId":106339,"journal":{"name":"Proceedings of the 2015 ACM Symposium on Document Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":"{\"title\":\"Automatic Extraction of Figures from Scholarly Documents\",\"authors\":\"Sagnik Ray Choudhury, P. Mitra, C. Lee Giles\",\"doi\":\"10.1145/2682571.2797085\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple ``figures'' such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. 
Initial results from our work demonstrate an accuracy greater than 80%.\",\"PeriodicalId\":106339,\"journal\":{\"name\":\"Proceedings of the 2015 ACM Symposium on Document Engineering\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"28\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2015 ACM Symposium on Document Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2682571.2797085\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2682571.2797085","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 28

Abstract

Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images, generated manually to symbolically represent and visually illustrate important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of building a heuristic-independent, trainable model for such an extraction task and of extracting figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
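The abstract names figure-precision, figure-recall, and figure-F1-score but does not spell out how they are computed. The sketch below is a rough illustration only, assuming the standard detection-style formulation in which an extracted figure counts as correct if it matches a manually tagged figure location; the function name `figure_metrics` and the example counts are hypothetical and not taken from the paper.

```python
# Minimal sketch of figure-level precision/recall/F1 under the standard
# detection-style assumption (not necessarily the paper's exact definitions).

def figure_metrics(num_correct, num_detected, num_ground_truth):
    """Return (figure-precision, figure-recall, figure-F1-score).

    num_correct      -- extracted figures that match a ground-truth figure location
    num_detected     -- all figures the extractor reported
    num_ground_truth -- manually tagged figure locations (180 in the paper's dataset)
    """
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with made-up counts: 150 correct extractions out of 170 reported,
# against the 180 tagged figure locations.
p, r, f1 = figure_metrics(150, 170, 180)
print(f"figure-precision={p:.2f}, figure-recall={r:.2f}, figure-F1={f1:.2f}")
```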