Automating Exploratory Data Analysis via Machine Learning: An Overview

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Pub Date: 2020-05-29. DOI: 10.1145/3318464.3383126

T. Milo, Amit Somech


Citations: 45

Abstract

Exploratory Data Analysis (EDA) is an important initial step in any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g., filter, aggregation, and visualization). Because EDA has long been known as a difficult task that requires profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade to facilitate it. In particular, advances in machine learning research have created exciting opportunities not only to better facilitate EDA but also to fully automate the process. In this tutorial, we review recent lines of work on automating EDA, starting from recommender systems that suggest a single exploratory action, moving through kNN-based classifiers and active-learning methods for predicting users' interestingness preferences, and ending with fully automated EDA using state-of-the-art methods such as deep reinforcement learning and sequence-to-sequence models. We conclude the tutorial with a discussion of the main challenges and open questions that must be addressed in order to ultimately reduce the manual effort required for EDA.
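
As a rough illustration of what "a sequence of analysis operations" can look like, the following minimal Python sketch models an EDA session as an ordered list of filter and aggregation steps. Everything in it is an assumption made for illustration: the toy flight-delay rows, the column names, and the helper names `filter_rows`, `aggregate_mean`, and `run_session` are hypothetical and are not taken from the tutorial or from any system it surveys.

```python
from collections import defaultdict

# Hypothetical toy dataset; rows and column names are invented for illustration.
ROWS = [
    {"airline": "AA", "origin": "JFK", "delay": 12},
    {"airline": "AA", "origin": "LAX", "delay": 30},
    {"airline": "DL", "origin": "JFK", "delay": 5},
    {"airline": "DL", "origin": "JFK", "delay": 45},
]


def filter_rows(rows, column, value):
    """FILTER step: keep only rows whose `column` equals `value`."""
    return [r for r in rows if r[column] == value]


def aggregate_mean(rows, group_by, target):
    """AGGREGATION step: mean of `target` for each distinct `group_by` value."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_by]].append(r[target])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def run_session(rows, operations):
    """Apply the operations in order and return every intermediate result."""
    results = []
    current = rows
    for op, args in operations:
        current = op(current, *args)
        results.append(current)
    return results


# A two-step session: restrict to one origin, then compare airlines.
session = [
    (filter_rows, ("origin", "JFK")),
    (aggregate_mean, ("airline", "delay")),
]
steps = run_session(ROWS, session)
```

In this framing, a manual analyst chooses each next step after inspecting the previous result; the automation approaches named in the abstract aim to suggest or generate such steps instead.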



