Optimizing Data Pipelines for Machine Learning in Feature Stores

Proc. VLDB Endow. Pub Date: 2023-09-01. DOI: 10.14778/3625054.3625060

Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez

Abstract

Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management are time-consuming. Feature stores have recently emerged as a new "DBMS-for-ML" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry, leaving ample opportunities for database-style optimizations to improve their performance. In this paper, we propose a novel set of optimizations specifically targeted at the point-in-time join, a critical operation in data pipelines. We implement these optimizations on top of Feathr, a widely used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.
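
For readers unfamiliar with the operation, a point-in-time join is commonly understood as attaching to each observation the most recent feature value recorded at or before the observation's timestamp, so that no future information leaks into training data. The following is a minimal sketch of those semantics in plain Python. It is an illustration only: it is not Feathr's implementation and does not reflect the optimizations proposed in the paper, and the function name `point_in_time_join` and the tuple layouts are assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(observations, feature_rows):
    """Hypothetical helper: for each (entity, ts) observation, attach the
    latest feature value whose timestamp is <= ts, or None if none exists.

    observations: iterable of (entity, ts)
    feature_rows: iterable of (entity, ts, value)
    """
    # Group feature rows by entity.
    by_entity = defaultdict(list)
    for entity, ts, value in feature_rows:
        by_entity[entity].append((ts, value))

    # Sort each group by timestamp and split into parallel lists
    # so that timestamps can be binary-searched.
    timestamps, values = {}, {}
    for entity, rows in by_entity.items():
        rows.sort(key=lambda r: r[0])
        timestamps[entity] = [ts for ts, _ in rows]
        values[entity] = [v for _, v in rows]

    joined = []
    for entity, ts in observations:
        ts_list = timestamps.get(entity, [])
        # Position of the first feature timestamp strictly greater than ts.
        idx = bisect_right(ts_list, ts)
        value = values[entity][idx - 1] if idx > 0 else None
        joined.append((entity, ts, value))
    return joined


observations = [("u1", 5), ("u1", 12), ("u2", 3)]
feature_rows = [("u1", 1, 10.0), ("u1", 10, 20.0), ("u2", 4, 7.0)]
result = point_in_time_join(observations, feature_rows)
```

In this sketch the observation `("u2", 3)` receives `None` because the only feature value for `u2` was recorded later, at timestamp 4; returning it would leak future data into the feature vector.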
