基于自适应深度学习的交互式表提取系统

N. Wang, D. Burdick, Yunyao Li
{"title":"基于自适应深度学习的交互式表提取系统","authors":"N. Wang, D. Burdick, Yunyao Li","doi":"10.1145/3397482.3450718","DOIUrl":null,"url":null,"abstract":"Table extraction from PDF and image documents is a ubiquitous task in the real-world. Perfect extraction quality is difficult to achieve with one single out-of-box model due to (1) the wide variety of table styles, (2) the lack of training data representing this variety and (3) the inherent ambiguity and subjectivity of table definitions between end-users. Meanwhile, building customized models from scratch can be difficult due to the expensive nature of annotating table data. We attempt to solve these challenges with TableLab by providing a system where users and models seamlessly work together to quickly customize high-quality extraction models with a few labelled examples for the user’s document collection, which contains pages with tables. Given an input document collection, TableLab first detects tables with similar structures (templates) by clustering embeddings from the extraction model. Document collections often contain tables created with a limited set of templates or similar structures. It then selects a few representative table examples already extracted with a pre-trained base deep learning model. Via an easy-to-use user interface, users provide feedback to these selections without necessarily having to identify every single error. TableLab then applies such feedback to finetune the pre-trained model and returns the results of the finetuned model back to the user. The user can choose to repeat this process iteratively until obtaining a customized model with satisfactory performance.","PeriodicalId":216190,"journal":{"name":"26th International Conference on Intelligent User Interfaces - Companion","volume":"216 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"TableLab: An Interactive Table Extraction System with Adaptive Deep Learning\",\"authors\":\"N. Wang, D. Burdick, Yunyao Li\",\"doi\":\"10.1145/3397482.3450718\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Table extraction from PDF and image documents is a ubiquitous task in the real-world. Perfect extraction quality is difficult to achieve with one single out-of-box model due to (1) the wide variety of table styles, (2) the lack of training data representing this variety and (3) the inherent ambiguity and subjectivity of table definitions between end-users. Meanwhile, building customized models from scratch can be difficult due to the expensive nature of annotating table data. We attempt to solve these challenges with TableLab by providing a system where users and models seamlessly work together to quickly customize high-quality extraction models with a few labelled examples for the user’s document collection, which contains pages with tables. Given an input document collection, TableLab first detects tables with similar structures (templates) by clustering embeddings from the extraction model. Document collections often contain tables created with a limited set of templates or similar structures. It then selects a few representative table examples already extracted with a pre-trained base deep learning model. Via an easy-to-use user interface, users provide feedback to these selections without necessarily having to identify every single error. TableLab then applies such feedback to finetune the pre-trained model and returns the results of the finetuned model back to the user. The user can choose to repeat this process iteratively until obtaining a customized model with satisfactory performance.\",\"PeriodicalId\":216190,\"journal\":{\"name\":\"26th International Conference on Intelligent User Interfaces - Companion\",\"volume\":\"216 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"26th International Conference on Intelligent User Interfaces - Companion\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3397482.3450718\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"26th International Conference on Intelligent User Interfaces - Companion","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3397482.3450718","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

摘要

从PDF和图像文档中提取表格是现实世界中普遍存在的任务。由于(1)表样式的多样性,(2)缺乏代表这种多样性的训练数据,以及(3)最终用户之间表定义固有的模糊性和主观性,单个开箱即用模型很难实现完美的提取质量。同时,从头开始构建定制模型可能很困难,因为注释表数据的成本很高。我们试图通过提供一个系统来解决这些问题,在这个系统中,用户和模型可以无缝地协同工作,快速定制高质量的提取模型,并为用户的文档集合(包含带有表格的页面)提供一些带标签的示例。给定一个输入文档集合,TableLab首先通过从提取模型中聚类嵌入来检测具有相似结构(模板)的表。文档集合通常包含用一组有限的模板或类似结构创建的表。然后,它选择几个有代表性的表示例,这些表示例已经用预训练的基础深度学习模型提取出来。通过一个易于使用的用户界面,用户可以对这些选择提供反馈,而不必识别每一个错误。然后,TableLab应用这些反馈对预训练模型进行微调,并将微调后的模型结果返回给用户。用户可以选择迭代地重复这个过程,直到获得一个性能满意的定制模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
TableLab: An Interactive Table Extraction System with Adaptive Deep Learning
Table extraction from PDF and image documents is a ubiquitous task in the real-world. Perfect extraction quality is difficult to achieve with one single out-of-box model due to (1) the wide variety of table styles, (2) the lack of training data representing this variety and (3) the inherent ambiguity and subjectivity of table definitions between end-users. Meanwhile, building customized models from scratch can be difficult due to the expensive nature of annotating table data. We attempt to solve these challenges with TableLab by providing a system where users and models seamlessly work together to quickly customize high-quality extraction models with a few labelled examples for the user’s document collection, which contains pages with tables. Given an input document collection, TableLab first detects tables with similar structures (templates) by clustering embeddings from the extraction model. Document collections often contain tables created with a limited set of templates or similar structures. It then selects a few representative table examples already extracted with a pre-trained base deep learning model. Via an easy-to-use user interface, users provide feedback to these selections without necessarily having to identify every single error. TableLab then applies such feedback to finetune the pre-trained model and returns the results of the finetuned model back to the user. The user can choose to repeat this process iteratively until obtaining a customized model with satisfactory performance.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Over-sketching Operation to Realize Geometrical and Topological Editing across Multiple Objects in Sketch-based CAD Interface TIEVis: a Visual Analytics Dashboard for Temporal Information Extracted from Clinical Reports SynZ: Enhanced Synthetic Dataset for Training UI Element Detectors User-Controlled Content Translation in Social Media VisRec: A Hands-on Tutorial on Deep Learning for Visual Recommender Systems
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1