ExtracTable: Extracting Tables from Raw Data Files

Datenbanksysteme für Business, Technologie und Web. Pub Date: 1900-01-01. DOI: 10.18420/BTW2023-20

Leonardo Hübscher, Lan Jiang, Felix Naumann


Abstract

Raw data, especially in text files, comes in many shapes and forms, often tailored toward human readability. Such files include preambles and footnotes, are formatted visually, and in general do not follow CSV guidelines. The ability to easily ingest such files into data systems opens up many opportunities for data analysis and processing. With ExtracTable, we present a system that can automatically ingest a large variety of raw data files, including text files and poorly structured CSV files, by detecting row patterns and thus separating their values into coherent columns. We manually annotated 957 files of a wide variety, containing 1208 tables. We show experimentally that ExtracTable can correctly parse 90% of all lines in structured files and 76% of all lines in files with a visual layout only, significantly outperforming the state of the art.
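To make the idea of "detecting row patterns" concrete, the following is a minimal sketch and not the ExtracTable algorithm itself, whose details the abstract does not give. It assumes a fixed set of candidate delimiters and treats the longest run of consecutive lines with the same field count as the table, which skips preambles and footnotes. The names `CANDIDATE_DELIMITERS`, `longest_consistent_block`, and `extract_table` are hypothetical, and the sketch ignores quoting and purely visual (whitespace-aligned) layouts.

```python
from itertools import groupby

# Assumed candidate set; a real system would likely consider many more.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def longest_consistent_block(lines, delimiter):
    """Return (start, end) of the longest run of consecutive lines that
    split into the same number (> 1) of fields for this delimiter."""
    counts = [len(line.split(delimiter)) for line in lines]
    best = (0, 0)
    index = 0
    for count, run in groupby(counts):
        length = len(list(run))
        if count > 1 and length > best[1] - best[0]:
            best = (index, index + length)
        index += length
    return best


def extract_table(lines):
    """Pick the delimiter with the longest consistent block and split it."""
    best_delim, best_span = None, (0, 0)
    for delim in CANDIDATE_DELIMITERS:
        span = longest_consistent_block(lines, delim)
        if span[1] - span[0] > best_span[1] - best_span[0]:
            best_delim, best_span = delim, span
    if best_delim is None:
        return []  # no line pattern with more than one field was found
    start, end = best_span
    return [[cell.strip() for cell in line.split(best_delim)]
            for line in lines[start:end]]


# Hypothetical input: a preamble, a small table, and a footnote.
raw = [
    "Report generated for internal use",
    "",
    "id;name;score",
    "1;Ada;9.5",
    "2;Linus;8.0",
    "",
    "Source: hypothetical example data",
]
table = extract_table(raw)
```

In this toy input, only the semicolon yields a run of lines with a consistent field count, so the preamble and footnote are dropped and the three table lines are split into columns.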


