TabReformer: Unsupervised Representation Learning for Erroneous Data Detection

Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader
Journal: ACM/IMS Transactions on Data Science, vol. 2, no. 1, pp. 1-29
Publication type: Journal Article
Published: 2021-05-17
DOI: 10.1145/3447541 (https://doi.org/10.1145/3447541)

Abstract

Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models require either user-defined rules or ample hand-labeled training examples. In this article, we therefore present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows an encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. The model also uses a Gaussian Error Linear Unit (GELU) activation function with the Masked Data Model objective to achieve a deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. The model applies a data augmentation module that generates additional erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can improve recall by 32.95% on average compared with state-of-the-art techniques while reducing manual effort by up to 48.86%.
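
The abstract names three ingredients without showing them: the GELU activation, a Masked Data Model objective over the cells of a tuple, and an augmentation step that synthesizes erroneous examples. The sketch below illustrates each idea in plain Python. It is not the authors' implementation: the `[MASK]` token, the 15% masking rate (borrowed from BERT-style pretraining), and the helper names `mask_tuple` and `inject_typo` are assumptions made for illustration only, and the character-swap corruption is just one possible way to create a synthetic error.

```python
import math
import random

MASK = "[MASK]"  # assumed placeholder token; the abstract does not name one


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mask_tuple(cells, mask_prob=0.15, rng=None):
    """Hide a random subset of cells in one tuple (row).

    Returns (masked_cells, targets), where targets maps each masked
    position to the original value a model would be trained to predict.
    The 0.15 default is an assumption, not a value from the paper.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, value in enumerate(cells):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = value
        else:
            masked.append(value)
    return masked, targets


def inject_typo(value: str, rng=None) -> str:
    """Hypothetical augmentation: swap two adjacent characters.

    Note that swapping two identical characters leaves the value
    unchanged, so a real pipeline would need to check for that.
    """
    rng = rng or random.Random()
    if len(value) < 2:
        return value + "?"
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


if __name__ == "__main__":
    rng = random.Random(0)
    row = ["Alice", "Edmonton", "T6G 2R3", "Canada"]
    masked_row, targets = mask_tuple(row, mask_prob=0.5, rng=rng)
    print(masked_row, targets)
    print(inject_typo("Edmonton", rng))
    print(gelu(1.0))
```

In a real model, the masked tuple would be tokenized and passed through the self-attention encoder, and the loss would compare the predictions at the masked positions against `targets`. The corrupted values from the augmentation step would be labeled as erroneous and added to the fine-tuning set to balance the minority class.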