自动机器学习进展如何？ AutoML 工具包的特点和挑战

IF 3.5 2区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Empirical Software Engineering Pub Date : 2024-06-13 DOI:10.1007/s10664-024-10450-y

Md Abdullah Al Alamin, Gias Uddin

{"title":"自动机器学习进展如何？ AutoML 工具包的特点和挑战","authors":"Md Abdullah Al Alamin, Gias Uddin","doi":"10.1007/s10664-024-10450-y","DOIUrl":null,"url":null,"abstract":"<p>Automated Machine Learning aka AutoML toolkits are low/no-code software that aim to democratize ML system application development by ensuring rapid prototyping of ML models and by enabling collaboration across different stakeholders in ML system design (e.g., domain experts, data scientists, etc.). It is thus important to know the state of current AutoML toolkits and the challenges ML practitioners face while using those toolkits. In this paper, we first offer a characterization of currently available AutoML toolits by analyzing 37 top AutoML tools and platforms. We find that the top AutoML platforms are mostly cloud-based. Most of the tools are optimized for the adoption of shallow ML models. Second, we present an empirical study of 14.3K AutoML related posts from Stack Overflow (SO) that we analyzed using topic modelling algorithm LDA (Latent Dirichlet Allocation) to understand the challenges of ML practitioners while using the AutoML toolkits. We find 13 topics in the AutoML related discussions in SO. The 13 topics are grouped into four categories: MLOps (43% of all questions), Model (28% questions), Data (27% questions), and Documentation (2% questions). Most questions are asked during Model training (29%) and Data preparation (25%) phases. AutoML practitioners find the MLOps topic category most challenging. Topics related to the MLOps category are the most prevalent and popular for cloud-based AutoML toolkits. Based on our study findings, we provide 15 recommendations to improve the adoption and development of AutoML toolkits.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"61 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"How far are we with automated machine learning? characterization and challenges of AutoML toolkits\",\"authors\":\"Md Abdullah Al Alamin, Gias Uddin\",\"doi\":\"10.1007/s10664-024-10450-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Automated Machine Learning aka AutoML toolkits are low/no-code software that aim to democratize ML system application development by ensuring rapid prototyping of ML models and by enabling collaboration across different stakeholders in ML system design (e.g., domain experts, data scientists, etc.). It is thus important to know the state of current AutoML toolkits and the challenges ML practitioners face while using those toolkits. In this paper, we first offer a characterization of currently available AutoML toolits by analyzing 37 top AutoML tools and platforms. We find that the top AutoML platforms are mostly cloud-based. Most of the tools are optimized for the adoption of shallow ML models. Second, we present an empirical study of 14.3K AutoML related posts from Stack Overflow (SO) that we analyzed using topic modelling algorithm LDA (Latent Dirichlet Allocation) to understand the challenges of ML practitioners while using the AutoML toolkits. We find 13 topics in the AutoML related discussions in SO. The 13 topics are grouped into four categories: MLOps (43% of all questions), Model (28% questions), Data (27% questions), and Documentation (2% questions). Most questions are asked during Model training (29%) and Data preparation (25%) phases. AutoML practitioners find the MLOps topic category most challenging. Topics related to the MLOps category are the most prevalent and popular for cloud-based AutoML toolkits. Based on our study findings, we provide 15 recommendations to improve the adoption and development of AutoML toolkits.</p>\",\"PeriodicalId\":11525,\"journal\":{\"name\":\"Empirical Software Engineering\",\"volume\":\"61 1\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Empirical Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10664-024-10450-y\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-024-10450-y","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

自动化机器学习（又称 AutoML）工具包是一种低代码/无代码软件，旨在通过确保 ML 模型的快速原型开发以及实现 ML 系统设计中不同利益相关者（如领域专家、数据科学家等）之间的协作，实现 ML 系统应用开发的民主化。因此，了解当前 AutoML 工具包的状况以及 ML 从业人员在使用这些工具包时面临的挑战非常重要。在本文中，我们首先通过分析 37 个顶级 AutoML 工具和平台，对当前可用的 AutoML 工具包进行了描述。我们发现，顶级 AutoML 平台大多基于云。大多数工具都针对浅层 ML 模型的采用进行了优化。其次，我们使用主题建模算法 LDA（潜在德里希特分配）分析了 Stack Overflow (SO) 中 14.3K 篇与 AutoML 相关的帖子，对其进行了实证研究，以了解 ML 从业人员在使用 AutoML 工具包时面临的挑战。我们在 SO 中与 AutoML 相关的讨论中发现了 13 个主题。这 13 个主题分为四类：MLOps（占所有问题的 43%）、模型（占 28%）、数据（占 27%）和文档（占 2%）。大多数问题是在模型培训（29%）和数据准备（25%）阶段提出的。AutoML 从业人员认为 MLOps 主题类别最具挑战性。与 MLOps 类别相关的主题在基于云的 AutoML 工具包中最为普遍和流行。根据研究结果，我们提出了 15 项建议，以改进 AutoML 工具包的采用和开发。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

How far are we with automated machine learning? characterization and challenges of AutoML toolkits

Automated Machine Learning aka AutoML toolkits are low/no-code software that aim to democratize ML system application development by ensuring rapid prototyping of ML models and by enabling collaboration across different stakeholders in ML system design (e.g., domain experts, data scientists, etc.). It is thus important to know the state of current AutoML toolkits and the challenges ML practitioners face while using those toolkits. In this paper, we first offer a characterization of currently available AutoML toolits by analyzing 37 top AutoML tools and platforms. We find that the top AutoML platforms are mostly cloud-based. Most of the tools are optimized for the adoption of shallow ML models. Second, we present an empirical study of 14.3K AutoML related posts from Stack Overflow (SO) that we analyzed using topic modelling algorithm LDA (Latent Dirichlet Allocation) to understand the challenges of ML practitioners while using the AutoML toolkits. We find 13 topics in the AutoML related discussions in SO. The 13 topics are grouped into four categories: MLOps (43% of all questions), Model (28% questions), Data (27% questions), and Documentation (2% questions). Most questions are asked during Model training (29%) and Data preparation (25%) phases. AutoML practitioners find the MLOps topic category most challenging. Topics related to the MLOps category are the most prevalent and popular for cloud-based AutoML toolkits. Based on our study findings, we provide 15 recommendations to improve the adoption and development of AutoML toolkits.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Empirical Software Engineering 工程技术-计算机：软件工程

CiteScore

8.50

自引率

12.20%

发文量

169

审稿时长

>12 weeks

期刊介绍： Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.