Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!

Sebastian Schelter, Stefan Grafberger
{"title":"Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!","authors":"Sebastian Schelter, Stefan Grafberger","doi":"arxiv-2409.10081","DOIUrl":null,"url":null,"abstract":"Machine learning (ML) applications that learn from data are increasingly used\nto automate impactful decisions. Unfortunately, these applications often fall\nshort of adequately managing critical data and complying with upcoming\nregulations. A technical reason for the persistence of these issues is that the\ndata pipelines in common ML libraries and cloud services lack fundamental\ndeclarative, data-centric abstractions. Recent research has shown how such\nabstractions enable techniques like provenance tracking and automatic\ninspection to help manage ML pipelines. Unfortunately, these approaches lack\nadoption in the real world because they require clean ML pipeline code written\nwith declarative APIs, instead of the messy imperative Python code that data\nscientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their\nestablished development practices. Instead, we propose to circumvent this \"code\nabstraction gap\" by leveraging the code generation capabilities of large\nlanguage models (LLMs). Our idea is to rewrite messy data science code to a\ncustom-tailored declarative pipeline abstraction, which we implement as a\nproof-of-concept in our prototype Lester. We detail its application for a\nchallenging compliance management example involving \"incremental view\nmaintenance\" of deployed ML pipelines. The code rewrites for our running\nexample show the potential of LLMs to make messy data science code declarative,\ne.g., by identifying hand-coded joins in Python and turning them into joins on\ndataframes, or by generating declarative feature encoders from NumPy code.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Databases","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Machine learning (ML) applications that learn from data are increasingly used to automate impactful decisions. Unfortunately, these applications often fall short of adequately managing critical data and complying with upcoming regulations. A technical reason for the persistence of these issues is that the data pipelines in common ML libraries and cloud services lack fundamental declarative, data-centric abstractions. Recent research has shown how such abstractions enable techniques like provenance tracking and automatic inspection to help manage ML pipelines. Unfortunately, these approaches lack adoption in the real world because they require clean ML pipeline code written with declarative APIs, instead of the messy imperative Python code that data scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their established development practices. Instead, we propose to circumvent this "code abstraction gap" by leveraging the code generation capabilities of large language models (LLMs). Our idea is to rewrite messy data science code to a custom-tailored declarative pipeline abstraction, which we implement as a proof-of-concept in our prototype Lester. We detail its application for a challenging compliance management example involving "incremental view maintenance" of deployed ML pipelines. The code rewrites for our running example show the potential of LLMs to make messy data science code declarative, e.g., by identifying hand-coded joins in Python and turning them into joins on dataframes, or by generating declarative feature encoders from NumPy code.
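To make the "code abstraction gap" concrete, the sketch below contrasts the two styles the abstract refers to: an imperative version with a hand-coded join over Python dictionaries and a hand-rolled one-hot encoding in NumPy, and a declarative version that expresses the same logic as a join on dataframes and a declarative feature encoder. The toy customers/orders data and the pandas/scikit-learn calls are illustrative assumptions for this sketch, not Lester's actual pipeline abstraction.

# Hypothetical before/after illustration of the rewrite described in the abstract.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

customers = pd.DataFrame({"customer_id": [1, 2], "country": ["NL", "DE"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})

# --- "Messy" imperative style: a hand-coded join over Python dicts and loops ---
country_by_id = {row.customer_id: row.country for row in customers.itertuples()}
joined_rows = []
for row in orders.itertuples():
    joined_rows.append((row.customer_id, row.amount, country_by_id[row.customer_id]))

# ... and a hand-rolled one-hot encoding with NumPy
countries = sorted({country for _, _, country in joined_rows})
encoded = np.zeros((len(joined_rows), len(countries)))
for i, (_, _, country) in enumerate(joined_rows):
    encoded[i, countries.index(country)] = 1.0

# --- Declarative equivalent that an LLM-based rewrite could target ---
joined = orders.merge(customers, on="customer_id", how="left")  # join on dataframes
encoder = OneHotEncoder(sparse_output=False)  # declarative feature encoder (scikit-learn >= 1.2)
encoded_declarative = encoder.fit_transform(joined[["country"]])

# Both styles compute the same feature matrix, but only the declarative one
# exposes the join and the encoder to provenance tracking and inspection.
assert np.allclose(encoded, encoded_declarative)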