TableLab: An Interactive Table Extraction System with Adaptive Deep Learning

26th International Conference on Intelligent User Interfaces - Companion
Pub Date: 2021-02-16
DOI: 10.1145/3397482.3450718

N. Wang, D. Burdick, Yunyao Li

Citations: 8

Abstract

Table extraction from PDF and image documents is a ubiquitous real-world task. Perfect extraction quality is difficult to achieve with a single out-of-the-box model because of (1) the wide variety of table styles, (2) the lack of training data representing this variety, and (3) the inherent ambiguity and subjectivity of table definitions among end users. Meanwhile, building customized models from scratch can be difficult because annotating table data is expensive. We attempt to solve these challenges with TableLab, a system in which users and models work together to quickly customize high-quality extraction models, using a few labelled examples, for the user’s document collection, which contains pages with tables. Document collections often contain tables created from a limited set of templates or similar structures. Given an input document collection, TableLab first detects tables with similar structures (templates) by clustering embeddings from the extraction model. It then selects a few representative table examples already extracted with a pre-trained base deep learning model. Via an easy-to-use user interface, users provide feedback on these selections without necessarily having to identify every single error. TableLab then applies this feedback to fine-tune the pre-trained model and returns the results of the fine-tuned model to the user. The user can repeat this process iteratively until obtaining a customized model with satisfactory performance.
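
The template-detection and example-selection steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm TableLab uses, how its embeddings are shaped, or how representatives are chosen, so everything here is an assumption: plain k-means stands in for the clustering step, the member nearest to each cluster mean stands in for the "representative" table, and the names `cluster_templates` and `select_representatives` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _dist(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def cluster_templates(embeddings: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Assumed stand-in for template detection: plain k-means.

    Uses deterministic farthest-point initialisation so results are
    reproducible. Returns one cluster label per embedding.
    """
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")
    # Start from the first embedding, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [list(embeddings[0])]
    while len(centroids) < k:
        far = max(embeddings, key=lambda e: min(_dist(e, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assignment step: each embedding joins its nearest centroid.
        labels = [min(range(k), key=lambda j: _dist(e, centroids[j]))
                  for e in embeddings]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [e for e, lab in zip(embeddings, labels) if lab == j]
            if members:
                centroids[j] = _mean(members)
    return labels


def select_representatives(embeddings: List[Vector], labels: List[int]) -> Dict[int, int]:
    """For each cluster, return the index of the member nearest the cluster mean."""
    reps: Dict[int, int] = {}
    for j in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        center = _mean([embeddings[i] for i in idx])
        reps[j] = min(idx, key=lambda i: _dist(embeddings[i], center))
    return reps


# Toy 2-D "embeddings": two visually obvious groups of three tables each.
toy = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
toy_labels = cluster_templates(toy, k=2)
toy_reps = select_representatives(toy, toy_labels)
print(toy_labels, toy_reps)
```

In a workflow like the one the abstract describes, the tables at the returned indices would be the ones shown to the user for feedback, and the corrected examples would then be used to fine-tune the base model before the next iteration. The number of clusters `k` is a free parameter in this sketch; the abstract does not state how TableLab decides how many templates a collection contains.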
