Reproducible experiments for generating pre-processing pipelines for AutoETL

Information Systems (IF 3.4; CAS Zone 2, Computer Science; JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS). Pub Date: 2024-02-01. Epub Date: 2023-11-02. DOI: 10.1016/j.is.2023.102314

Joseph Giovanelli, Besim Bilalli, Alberto Abelló, Fernando Silva-Coira, Guillermo de Bernardo

Information Systems, Volume 120, Article 102314. https://www.sciencedirect.com/science/article/pii/S0306437923001503


Abstract

This work is a companion reproducibility paper for the experiments and results reported in Giovanelli et al. (2022), where data pre-processing pipelines are evaluated in order to find pipeline prototypes that reduce the classification error of supervised learning algorithms. With the recent shift towards data-centric approaches, in which the dataset rather than the model is systematically changed to improve model performance, data pre-processing is receiving considerable attention. Yet its impact on the final analysis is not widely recognized, primarily because of the lack of publicly available experiments that quantify it. To bridge this gap, this work introduces a set of reproducible experiments on the impact of data pre-processing, providing a detailed reproducibility protocol together with a software tool and a set of extensible datasets that allow all the experiments and results of the aforementioned work to be reproduced. We introduce a set of strongly reproducible experiments based on a collection of intermediate results, and a set of weakly reproducible experiments (Lastra-Díaz, 0000) that allow reproducing the end-to-end optimization process and the evaluation of all the methods reported in our primary paper. The reproducibility protocol is created with Docker and tested on Windows and Linux. In brief, our primary work (i) develops a method for generating effective prototypes, as templates or logical sequences of pre-processing transformations, and (ii) instantiates the prototypes into pipelines, in the form of executable or physical sequences of actual operators that implement the respective transformations. For the first, a set of heuristic rules learned from extensive experiments is used; for the second, techniques from Automated Machine Learning (AutoML) are applied.
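The distinction between a prototype (a logical sequence of transformations) and a pipeline (a physical sequence of operators) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the transformation names, the operator catalogue, the helper functions, and the exhaustive enumeration, which merely stands in for the AutoML search described in the abstract. It is not the authors' tool or method.

```python
from itertools import product

# Hypothetical operators; each maps a column (list of floats or None) to a new column.
def impute_mean(col):
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in col]

def impute_zero(col):
    return [0.0 if v is None else v for v in col]

def scale_minmax(col):
    lo, hi = min(col), max(col)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in col]
    return [(v - lo) / (hi - lo) for v in col]

def scale_none(col):
    return list(col)

# Hypothetical prototype: an ordered, logical sequence of transformations.
PROTOTYPE = ["impute", "normalize"]

# Hypothetical catalogue: candidate operators for each transformation.
OPERATORS = {
    "impute": {"mean": impute_mean, "zero": impute_zero},
    "normalize": {"minmax": scale_minmax, "none": scale_none},
}

def instantiate(prototype, operators):
    """Enumerate every pipeline: one operator name per logical step."""
    names_per_step = [sorted(operators[step]) for step in prototype]
    return list(product(*names_per_step))

def run(pipeline, prototype, operators, column):
    """Apply the chosen operators to the column in prototype order."""
    for step, name in zip(prototype, pipeline):
        column = operators[step][name](column)
    return column

def best_pipeline(prototype, operators, column, loss):
    """Pick the pipeline with the lowest caller-supplied loss.

    Exhaustive enumeration is a stand-in for an AutoML optimizer.
    """
    candidates = instantiate(prototype, operators)
    return min(candidates, key=lambda p: loss(run(p, prototype, operators, column)))

if __name__ == "__main__":
    data = [1.0, None, 3.0]
    for pipeline in instantiate(PROTOTYPE, OPERATORS):
        print(pipeline, run(pipeline, PROTOTYPE, OPERATORS, data))
```

In a realistic setting the loss would be the validation error of a classifier trained on the transformed data, and the enumeration would be replaced by a budgeted search, since the number of candidate pipelines grows multiplicatively with each step of the prototype.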




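Source journal

Information Systems (Engineering and Technology: Computer Science, Information Systems)

CiteScore: 9.40

Self-citation rate: 2.70%

Articles published: 112

Review time: 53 days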

Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models and performance enhancements, and show how those innovations contribute to the goals of the application.