PyFroid: Scaling Data Analysis on a Commodity Workstation

Advances in database technology : proceedings. International Conference on Extending Database Technology. Pub Date: 2023-01-01. DOI: 10.48786/edbt.2024.06

Venkatesh Emani, A. Floratou, C. Curino

Pages: 61-67


Abstract

Almost every organization today promotes data-driven decision making that leverages advances in data science. According to various surveys, data scientists spend up to 80% of their time cleaning and transforming data. Although data management systems have been carefully optimized for such tasks over several decades, they are seldom used by data scientists, who prefer libraries such as Pandas, sacrificing performance and scalability in favor of familiarity and ease of use. As a result, data scientists cannot fully exploit the hardware capabilities of commodity workstations and either end up working on a small sample of their data locally or migrate to more heavyweight frameworks in a cluster environment. In this paper, we present PyFroid, a framework that leverages lightweight relational databases to improve the performance and scalability of Pandas, allowing data scientists to operate on much larger datasets on a commodity workstation. PyFroid has zero learning curve, as it maintains all the Pandas APIs and is fully compatible with the tools that data scientists use (e.g., Python notebooks). We experimentally demonstrate that, compared to Pandas, PyFroid can analyze up to 20X more data on the same machine, provides comparable or better performance for small datasets as well as near-memory data sizes, and consumes far fewer resources.
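The abstract does not describe PyFroid's internals, so the following is only a rough illustration of the general idea of backing a dataframe-style API with a lightweight relational database. It is a minimal sketch using Python's built-in sqlite3 module; the `LazyFrame` class, its `filter` and `groupby_sum` methods, and the `trips` table are hypothetical names invented for this example and are not PyFroid's API or implementation.

```python
import sqlite3


class LazyFrame:
    """Hypothetical dataframe-like wrapper that defers work to a SQL engine.

    This is NOT PyFroid's API; it only illustrates pushing dataframe-style
    operations down to a lightweight relational database.
    """

    _ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}

    def __init__(self, conn, table, predicates=None):
        self.conn = conn
        self.table = table
        # Each predicate is a (sql_fragment, parameter) pair; nothing runs yet.
        self.predicates = predicates or []

    def filter(self, column, op, value):
        """Record a filter instead of materializing rows in Python memory."""
        if op not in self._ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        fragment = f'"{column}" {op} ?'
        return LazyFrame(self.conn, self.table, self.predicates + [(fragment, value)])

    def groupby_sum(self, key, value):
        """Compile the recorded operations into one SQL query and execute it."""
        where = ""
        if self.predicates:
            where = " WHERE " + " AND ".join(frag for frag, _ in self.predicates)
        sql = (
            f'SELECT "{key}", SUM("{value}") FROM "{self.table}"'
            f'{where} GROUP BY "{key}" ORDER BY "{key}"'
        )
        params = [param for _, param in self.predicates]
        return self.conn.execute(sql, params).fetchall()


# Small in-memory dataset standing in for a table too large for Pandas.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("NYC", 10.0), ("NYC", 5.0), ("SF", 7.5), ("SF", 0.5)],
)

trips = LazyFrame(conn, "trips")
# The filter and aggregation run inside the database, not in Python.
result = trips.filter("fare", ">", 1.0).groupby_sum("city", "fare")
print(result)
```

In this sketch only the aggregated rows are returned to Python, so memory use on the Python side depends on the result size rather than the input size, which is the kind of benefit the abstract attributes to using a database engine.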
