DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition

Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, Wolfgang Lehner
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), September 2019. DOI: 10.1109/ICDAR.2019.00207 (https://doi.org/10.1109/ICDAR.2019.00207)
Citations: 16

Abstract

This paper presents DECO (Dresden Enron COrpus), a dataset of spreadsheet files annotated on the basis of layout and contents. It comprises 1,165 files extracted from the Enron corpus. Three different annotators (judges) assigned layout roles (e.g., Header, Data, and Notes) to non-empty cells and marked the borders of tables. Files that do not contain tables were flagged using categories such as Template, Form, and Report. Subsequently, a thorough analysis is performed to uncover the characteristics of the overall dataset and of specific annotations. The results are discussed in this paper, providing several takeaways for future work. Furthermore, this work describes the annotation methodology in detail, going through the individual steps. The dataset, methodology, and tools are made publicly available so that they can be adopted for further studies. DECO is available at: https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/
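
To make the annotation scheme described in the abstract more concrete, the following is a minimal Python sketch of how cell-level layout roles, table borders, and file-level categories might be held in memory. It is an illustration only: the class names (`TableAnnotation`, `SheetAnnotation`), the zero-based inclusive coordinates, and the `LAYOUT_ROLES` set are assumptions made for this example and do not reflect DECO's actual file format or tooling. Only the roles and categories named in the abstract are used.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical: only the roles named in the abstract; the full label set may differ.
LAYOUT_ROLES = {"Header", "Data", "Notes"}


@dataclass
class TableAnnotation:
    """Marked table border as an inclusive, zero-based bounding box (assumed convention)."""
    top: int
    left: int
    bottom: int
    right: int

    def contains(self, row: int, col: int) -> bool:
        return self.top <= row <= self.bottom and self.left <= col <= self.right


@dataclass
class SheetAnnotation:
    """Annotations for one sheet: per-cell roles, table borders, optional category."""
    cell_roles: Dict[Tuple[int, int], str] = field(default_factory=dict)
    tables: List[TableAnnotation] = field(default_factory=list)
    # Hypothetical field for sheets without tables, e.g. "Template", "Form", "Report".
    category: Optional[str] = None

    def label_cell(self, row: int, col: int, role: str) -> None:
        """Assign a layout role to a non-empty cell."""
        if role not in LAYOUT_ROLES:
            raise ValueError(f"unknown layout role: {role}")
        self.cell_roles[(row, col)] = role

    def role_counts(self, table: TableAnnotation) -> Counter:
        """Count the layout roles of labeled cells that fall inside a table border."""
        return Counter(
            role
            for (row, col), role in self.cell_roles.items()
            if table.contains(row, col)
        )


# Usage: a 2x2 table with a header row, plus a note outside the table border.
sheet = SheetAnnotation()
for col in (0, 1):
    sheet.label_cell(0, col, "Header")
    sheet.label_cell(1, col, "Data")
sheet.label_cell(3, 0, "Notes")
table = TableAnnotation(top=0, left=0, bottom=1, right=1)
sheet.tables.append(table)
counts = sheet.role_counts(table)
```

Keeping the per-cell roles separate from the table borders mirrors the two annotation tasks the abstract names, so either kind of label can be analyzed on its own.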