Auto-Detect: Data-Driven Error Detection in Tables

Zhipeng Huang, Yeye He
{"title":"Auto-Detect: Data-Driven Error Detection in Tables","authors":"Zhipeng Huang, Yeye He","doi":"10.1145/3183713.3196889","DOIUrl":null,"url":null,"abstract":"Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values inconsistent with others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose \\sj, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach can automatically detect incompatible values, by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors so detected are based on global statistics, which is robust and aligns well with human intuition of errors. We test \\sj on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. While both of these test sets are supposed to be of high-quality, \\sj makes surprising discoveries of over tens of thousands of errors in both cases, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3183713.3196889","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

Abstract

Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values inconsistent with others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose \sj, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach can automatically detect incompatible values, by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors so detected are based on global statistics, which is robust and aligns well with human intuition of errors. We test \sj on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. While both of these test sets are supposed to be of high-quality, \sj makes surprising discoveries of over tens of thousands of errors in both cases, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
自动检测:表中数据驱动的错误检测
给定一列值,现有的方法通常使用类似regex的规则,通过查找与其他值不一致的异常值来检测错误。这种技术仅根据给定输入列中的值做出局部决策,而不考虑从大型干净表语料库推断出的更全局的兼容性概念。我们提出了\sj,这是一种基于统计的技术,它利用来自大型语料库的共现统计数据进行错误检测,这与现有的基于规则的方法有很大的不同。我们的方法可以自动检测不兼容的值,通过利用一组明智选择的泛化语言,每种语言使用不同的泛化,对不同类型的错误敏感。这样检测到的错误是基于全局统计的,这是健壮的,并且与人类对错误的直觉很好地一致。我们在一大组公共维基百科表以及专有企业Excel文件上测试了\sj。虽然这两个测试集都应该是高质量的,但\sj在这两种情况下都令人惊讶地发现了超过数万个错误,这些错误被人工验证为高精度(超过0.98)。我们在维基百科表上的标记基准集被发布用于未来的研究。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Meta-Dataflows: Efficient Exploratory Dataflow Jobs Columnstore and B+ tree - Are Hybrid Physical Designs Important? Demonstration of VerdictDB, the Platform-Independent AQP System Efficient Selection of Geospatial Data on Maps for Interactive and Visualized Exploration Session details: Keynote1
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1