Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation

The World Wide Web Conference. Pub Date: 2019-05-13. DOI: 10.1145/3308558.3313602

Laure Berti-Équille


Citations: 31

Abstract




Data cleaning and preparation has been a long-standing challenge in data science, needed to avoid the incorrect results and misleading conclusions that dirty data produces. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs of unequal quality. Most current work on data cleaning and automated machine learning, however, focuses on developing either cleaning algorithms or user-guided systems, or argues for relying on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of preprocessing tasks such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
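To make the idea of using Q-Learning to pick a preprocessing sequence more concrete, here is a minimal tabular sketch. It is not the Learn2Clean implementation: the task names, the `toy_quality` scoring rule (a stand-in for training the ML model and measuring its quality metric), and the helper names `actions` and `learn_pipeline` are all assumptions made for illustration. The state is the sequence of tasks applied so far, each action appends one task or stops, and the only reward is the quality score obtained when the agent stops. Because Q-Learning is off-policy, the sketch explores with uniformly random actions and acts greedily only when reading out the final pipeline.

```python
import random

# Hypothetical task set; the abstract does not list the actual tasks.
TASKS = ["impute", "deduplicate", "normalize", "remove_outliers"]
STOP = "stop"


def toy_quality(pipeline):
    """Invented stand-in for 'train the ML model on the prepared data and score it'."""
    score = 0.5
    if "impute" in pipeline:
        score += 0.2
    if "deduplicate" in pipeline:
        score += 0.1
    if "normalize" in pipeline:
        # In this toy rule, normalizing pays off fully only after imputation.
        before = pipeline[:pipeline.index("normalize")]
        score += 0.15 if "impute" in before else 0.05
    # Every applied task carries a small cost.
    return score - 0.01 * len(pipeline)


def actions(state):
    """Tasks not yet applied, plus the option to stop and evaluate."""
    return [t for t in TASKS if t not in state] + [STOP]


def learn_pipeline(episodes=5000, alpha=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = {}  # maps (state, action) to an estimated return; missing entries are 0.0
    for _ in range(episodes):
        state = ()
        while True:
            action = rng.choice(actions(state))  # uniformly random exploration
            if action == STOP:
                # Terminal step: the reward is the quality of the finished pipeline.
                target = toy_quality(state)
            else:
                # Non-terminal step: zero reward, bootstrap from the best next action.
                next_state = state + (action,)
                target = gamma * max(
                    q.get((next_state, a), 0.0) for a in actions(next_state)
                )
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)  # Q-Learning update
            if action == STOP:
                break
            state = next_state

    # Greedy readout: follow the highest-valued action until it says stop.
    state = ()
    while True:
        action = max(actions(state), key=lambda a: q.get((state, a), 0.0))
        if action == STOP:
            return list(state)
        state = state + (action,)


if __name__ == "__main__":
    best = learn_pipeline()
    print(best, round(toy_quality(best), 2))
```

In a real setting, `toy_quality` would be replaced by an evaluation that applies the pipeline to the dataset, fits the chosen ML model, and returns the chosen quality metric, which makes each episode far more expensive than in this toy.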
