Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!

Sebastian Schelter, Stefan Grafberger
{"title":"Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!","authors":"Sebastian Schelter, Stefan Grafberger","doi":"arxiv-2409.10081","DOIUrl":null,"url":null,"abstract":"Machine learning (ML) applications that learn from data are increasingly used\nto automate impactful decisions. Unfortunately, these applications often fall\nshort of adequately managing critical data and complying with upcoming\nregulations. A technical reason for the persistence of these issues is that the\ndata pipelines in common ML libraries and cloud services lack fundamental\ndeclarative, data-centric abstractions. Recent research has shown how such\nabstractions enable techniques like provenance tracking and automatic\ninspection to help manage ML pipelines. Unfortunately, these approaches lack\nadoption in the real world because they require clean ML pipeline code written\nwith declarative APIs, instead of the messy imperative Python code that data\nscientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their\nestablished development practices. Instead, we propose to circumvent this \"code\nabstraction gap\" by leveraging the code generation capabilities of large\nlanguage models (LLMs). Our idea is to rewrite messy data science code to a\ncustom-tailored declarative pipeline abstraction, which we implement as a\nproof-of-concept in our prototype Lester. We detail its application for a\nchallenging compliance management example involving \"incremental view\nmaintenance\" of deployed ML pipelines. The code rewrites for our running\nexample show the potential of LLMs to make messy data science code declarative,\ne.g., by identifying hand-coded joins in Python and turning them into joins on\ndataframes, or by generating declarative feature encoders from NumPy code.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Databases","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Machine learning (ML) applications that learn from data are increasingly used to automate impactful decisions. Unfortunately, these applications often fall short of adequately managing critical data and complying with upcoming regulations. A technical reason for the persistence of these issues is that the data pipelines in common ML libraries and cloud services lack fundamental declarative, data-centric abstractions. Recent research has shown how such abstractions enable techniques like provenance tracking and automatic inspection to help manage ML pipelines. Unfortunately, these approaches lack adoption in the real world because they require clean ML pipeline code written with declarative APIs, instead of the messy imperative Python code that data scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their established development practices. Instead, we propose to circumvent this "code abstraction gap" by leveraging the code generation capabilities of large language models (LLMs). Our idea is to rewrite messy data science code to a custom-tailored declarative pipeline abstraction, which we implement as a proof-of-concept in our prototype Lester. We detail its application for a challenging compliance management example involving "incremental view maintenance" of deployed ML pipelines. The code rewrites for our running example show the potential of LLMs to make messy data science code declarative, e.g., by identifying hand-coded joins in Python and turning them into joins on dataframes, or by generating declarative feature encoders from NumPy code.
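To make the "code abstraction gap" concrete, the sketch below contrasts the two styles the abstract refers to: an imperative version with a hand-coded join over Python dictionaries and a hand-rolled one-hot encoding in NumPy, and a declarative version that expresses the same logic as a join on dataframes and a declarative feature encoder. The toy customers/orders data and the pandas/scikit-learn calls are illustrative assumptions for this sketch, not Lester's actual pipeline abstraction.

# Hypothetical before/after illustration of the rewrite described in the abstract.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

customers = pd.DataFrame({"customer_id": [1, 2], "country": ["NL", "DE"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})

# --- "Messy" imperative style: a hand-coded join over Python dicts and loops ---
country_by_id = {row.customer_id: row.country for row in customers.itertuples()}
joined_rows = []
for row in orders.itertuples():
    joined_rows.append((row.customer_id, row.amount, country_by_id[row.customer_id]))

# ... and a hand-rolled one-hot encoding with NumPy
countries = sorted({country for _, _, country in joined_rows})
encoded = np.zeros((len(joined_rows), len(countries)))
for i, (_, _, country) in enumerate(joined_rows):
    encoded[i, countries.index(country)] = 1.0

# --- Declarative equivalent that an LLM-based rewrite could target ---
joined = orders.merge(customers, on="customer_id", how="left")  # join on dataframes
encoder = OneHotEncoder(sparse_output=False)  # declarative feature encoder (scikit-learn >= 1.2)
encoded_declarative = encoder.fit_transform(joined[["country"]])

# Both styles compute the same feature matrix, but only the declarative one
# exposes the join and the encoder to provenance tracking and inspection.
assert np.allclose(encoded, encoded_declarative)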