Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez
{"title":"Optimizing Data Pipelines for Machine Learning in Feature Stores","authors":"Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez","doi":"10.14778/3625054.3625060","DOIUrl":null,"url":null,"abstract":"Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management is time-consuming. Feature stores have recently emerged as a new \"DBMS-for-ML\" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry---with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr: a widely-used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"24 1","pages":"4230-4239"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14778/3625054.3625060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management is time-consuming. Feature stores have recently emerged as a new "DBMS-for-ML" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry---with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr: a widely-used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.