JoinBoost:仅使用SQL在规范化数据上生长树

Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu
{"title":"JoinBoost:仅使用SQL在规范化数据上生长树","authors":"Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu","doi":"10.48550/arXiv.2307.00422","DOIUrl":null,"url":null,"abstract":"Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL?\n \n We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the\n Y\n variable to the residual in the\n non-materialized join result.\n Although this view update problem is generally ambiguous, we identify\n addition-to-multiplication preserving\n , the key property of variance semi-ring to support\n rmse\n the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).\n","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"JoinBoost: Grow Trees Over Normalized Data Using Only SQL\",\"authors\":\"Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu\",\"doi\":\"10.48550/arXiv.2307.00422\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL?\\n \\n We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the\\n Y\\n variable to the residual in the\\n non-materialized join result.\\n Although this view update problem is generally ambiguous, we identify\\n addition-to-multiplication preserving\\n , the key property of variance semi-ring to support\\n rmse\\n the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).\\n\",\"PeriodicalId\":20467,\"journal\":{\"name\":\"Proc. VLDB Endow.\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proc. VLDB Endow.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2307.00422\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2307.00422","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

摘要

虽然在表格数据中占主导地位,但在规范化数据库(例如LightGBM, XGBoost)上训练树模型的ML库要求将数据反规范化为单个表,具体化并导出。此过程不可扩展,速度慢,并且存在安全风险。In-DB ML旨在训练dbms中的模型,以避免数据移动并提供数据治理。与其修改DBMS来支持In-DB ML,还不如为专门的ML库提供有竞争力的树训练性能……只使用SQL?我们介绍了JoinBoost,一个Python库,它将规范化数据库上的树训练算法重写为纯SQL。它可移植到任何DBMS,提供与专用ML库竞争的性能,并随DBMS的底层功能进行扩展。JoinBoost从算法和系统的角度扩展了先前的工作。在算法上,我们通过将Y变量更新为非物化连接结果中的残差来支持因式梯度提升。虽然这种视图更新问题通常是模棱两可的,但我们确定了方差半环支持rmse的关键属性——保持加法到乘法,这是最广泛使用的准则。在系统方面,我们认为残留更新是性能瓶颈。在列式dbms上,可以通过创建一个新的残值列并将其作为投影添加来最小化这种开销。我们用DuckDB上的两个实现验证了这一点,为了可移植性,对其内部没有或只有很少的修改。我们的实验表明,与LightGBM相比,JoinBoost在随机森林(梯度增强)方面的速度快了3倍(1.1倍),比最先进的In-DB ML系统快了一个数量级。此外,JoinBoost在#特征、数据库大小(TPC-DS SF=1000)和连接图复杂性(星系模式)方面的扩展性远远超过LightGBM。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving , the key property of variance semi-ring to support rmse the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Cryptographically Secure Private Record Linkage Using Locality-Sensitive Hashing Utility-aware Payment Channel Network Rebalance Relational Query Synthesis ⋈ Decision Tree Learning Billion-Scale Bipartite Graph Embedding: A Global-Local Induced Approach Query Refinement for Diversity Constraint Satisfaction
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1