An empirical study of challenges in machine learning asset management

IF 3.5 2区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Empirical Software Engineering Pub Date : 2024-06-15 DOI:10.1007/s10664-024-10474-4
Zhimin Zhao, Yihao Chen, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan
{"title":"An empirical study of challenges in machine learning asset management","authors":"Zhimin Zhao, Yihao Chen, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan","doi":"10.1007/s10664-024-10474-4","DOIUrl":null,"url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">\n<i>Context:</i>\n</h3><p>In machine learning (ML) applications, assets include not only the ML models themselves, but also the datasets, algorithms, and deployment tools that are essential in the development, training, and implementation of these models. Efficient management of ML assets is critical to ensure optimal resource utilization, consistent model performance, and a streamlined ML development lifecycle. This practice contributes to faster iterations, adaptability, reduced time from model development to deployment, and the delivery of reliable and timely outputs.</p><h3 data-test=\"abstract-sub-heading\">\n<i>Objective:</i>\n</h3><p>Despite research on ML asset management, there is still a significant knowledge gap on operational challenges, such as model versioning, data traceability, and collaboration issues, faced by asset management tool users. These challenges are crucial because they could directly impact the efficiency, reproducibility, and overall success of machine learning projects. Our study aims to bridge this empirical gap by analyzing user experience, feedback, and needs from Q &amp;A posts, shedding light on the real-world challenges they face and the solutions they have found.</p><h3 data-test=\"abstract-sub-heading\">\n<i>Method:</i>\n</h3><p>We examine 15, 065 Q &amp;A posts from multiple developer discussion platforms, including Stack Overflow, tool-specific forums, and GitHub/GitLab. Using a mixed-method approach, we classify the posts into knowledge inquiries and problem inquiries. We then apply BERTopic to extract challenge topics and compare their prevalence. Finally, we use the open card sorting approach to summarize solutions from solved inquiries, then cluster them with BERTopic, and analyze the relationship between challenges and solutions.</p><h3 data-test=\"abstract-sub-heading\">\n<i>Results:</i>\n</h3><p>We identify 133 distinct topics in ML asset management-related inquiries, grouped into 16 macro-topics, with software environment and dependency, model deployment and service, and model creation and training emerging as the most discussed. Additionally, we identify 79 distinct solution topics, classified under 18 macro-topics, with software environment and dependency, feature and component development, and file and directory management as the most proposed.</p><h3 data-test=\"abstract-sub-heading\">\n<i>Conclusions:</i>\n</h3><p>This study highlights critical areas within ML asset management that need further exploration, particularly around prevalent macro-topics identified as pain points for ML practitioners, emphasizing the need for collaborative efforts between academia, industry, and the broader research community.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":null,"pages":null},"PeriodicalIF":3.5000,"publicationDate":"2024-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-024-10474-4","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Context:

In machine learning (ML) applications, assets include not only the ML models themselves, but also the datasets, algorithms, and deployment tools that are essential in the development, training, and implementation of these models. Efficient management of ML assets is critical to ensure optimal resource utilization, consistent model performance, and a streamlined ML development lifecycle. This practice contributes to faster iterations, adaptability, reduced time from model development to deployment, and the delivery of reliable and timely outputs.

Objective:

Despite research on ML asset management, there is still a significant knowledge gap on operational challenges, such as model versioning, data traceability, and collaboration issues, faced by asset management tool users. These challenges are crucial because they could directly impact the efficiency, reproducibility, and overall success of machine learning projects. Our study aims to bridge this empirical gap by analyzing user experience, feedback, and needs from Q &A posts, shedding light on the real-world challenges they face and the solutions they have found.

Method:

We examine 15, 065 Q &A posts from multiple developer discussion platforms, including Stack Overflow, tool-specific forums, and GitHub/GitLab. Using a mixed-method approach, we classify the posts into knowledge inquiries and problem inquiries. We then apply BERTopic to extract challenge topics and compare their prevalence. Finally, we use the open card sorting approach to summarize solutions from solved inquiries, then cluster them with BERTopic, and analyze the relationship between challenges and solutions.

Results:

We identify 133 distinct topics in ML asset management-related inquiries, grouped into 16 macro-topics, with software environment and dependency, model deployment and service, and model creation and training emerging as the most discussed. Additionally, we identify 79 distinct solution topics, classified under 18 macro-topics, with software environment and dependency, feature and component development, and file and directory management as the most proposed.

Conclusions:

This study highlights critical areas within ML asset management that need further exploration, particularly around prevalent macro-topics identified as pain points for ML practitioners, emphasizing the need for collaborative efforts between academia, industry, and the broader research community.

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
机器学习资产管理挑战的实证研究
背景:在机器学习(ML)应用中,资产不仅包括 ML 模型本身,还包括对这些模型的开发、训练和实施至关重要的数据集、算法和部署工具。高效管理 ML 资产对于确保最佳资源利用、一致的模型性能和简化 ML 开发生命周期至关重要。这种做法有助于加快迭代速度、提高适应性、缩短从模型开发到部署的时间,并及时交付可靠的产出。目标:尽管对 ML 资产管理进行了研究,但在资产管理工具用户所面临的操作挑战(如模型版本、数据可追溯性和协作问题)方面,仍然存在巨大的知识差距。这些挑战至关重要,因为它们会直接影响机器学习项目的效率、可重复性和整体成功。我们的研究旨在通过分析 Q &A 帖子中的用户体验、反馈和需求,揭示他们在现实世界中面临的挑战和找到的解决方案,从而弥补这一经验上的不足。方法:我们研究了来自多个开发人员讨论平台(包括 Stack Overflow、特定工具论坛和 GitHub/GitLab)的 15 065 篇 Q &A 帖子。我们采用混合方法将帖子分为知识咨询和问题咨询。然后,我们应用 BERTopic 提取挑战主题并比较其流行程度。结果:我们在与 ML 资产管理相关的咨询中发现了 133 个不同的主题,并将其归类为 16 个宏观主题,其中软件环境和依赖性、模型部署和服务以及模型创建和训练是讨论最多的主题。结论:本研究强调了智能语言资产管理中需要进一步探索的关键领域,特别是围绕智能语言从业人员痛点的普遍宏观主题,强调了学术界、产业界和更广泛的研究社区之间合作的必要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Empirical Software Engineering
Empirical Software Engineering 工程技术-计算机:软件工程
CiteScore
8.50
自引率
12.20%
发文量
169
审稿时长
>12 weeks
期刊介绍: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.
期刊最新文献
An empirical study on developers’ shared conversations with ChatGPT in GitHub pull requests and issues Quality issues in machine learning software systems An empirical study of token-based micro commits Software product line testing: a systematic literature review Consensus task interaction trace recommender to guide developers’ software navigation
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1