TabReformer: Unsupervised Representation Learning for Erroneous Data Detection

ACM/IMS Transactions on Data Science, Vol. 2, No. 1, pp. 1-29. Pub Date: 2021-05-17. DOI: 10.1145/3447541

Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader

Citations: 1

Abstract

Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models require either user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows an encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. The model also uses a Gaussian Error Linear Unit (GELU) activation function with the Masked Data Model objective to achieve a deeper probabilistic understanding of the data. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. To represent the minority class, the model applies a data augmentation module that generates additional erroneous examples. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can improve recall by 32.95% on average compared with state-of-the-art techniques while reducing manual effort by up to 48.86%.
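As a rough illustration of two ingredients named in the abstract, the sketch below shows the exact GELU activation and a BERT-style masking step applied to the cells of one tuple, which is one plausible way to set up a masked-data objective. It is not the authors' implementation: the `[MASK]` token, the 15% masking rate, the "at least one masked cell" rule, and the function names `gelu` and `mask_tuple` are all assumptions made for this example.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token; the abstract does not specify one


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mask_tuple(cells, mask_prob=0.15, rng=None):
    """Hide random cells of one tuple (row).

    Returns (masked_cells, targets), where targets maps each masked
    position to the original value a model would be trained to recover.
    mask_prob=0.15 is an assumed, BERT-like default.
    """
    rng = rng or random.Random()
    masked = list(cells)
    targets = {}
    for i, value in enumerate(cells):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = value
    if not targets and cells:
        # Assumed rule: keep at least one prediction target per tuple.
        i = rng.randrange(len(cells))
        masked[i] = MASK
        targets[i] = cells[i]
    return masked, targets


# Usage: mask one row of a toy address table with a fixed seed.
row = ["Edmonton", "AB", "T6G 2R3", "Canada"]
masked_row, targets = mask_tuple(row, rng=random.Random(0))
```

In a real pre-training loop the masked row would be tokenized and fed to the encoder, and the loss would be computed only at the positions stored in `targets`.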


