Supercharging distributed computing environments for high-performance data engineering

Niranda Perera, A. Sarker, Kaiying Shan, Alex Fetea, Supun Kamburugamuve, Thejaka Amila Kanewala, Chathura Widanage, Mills Staylor, Tianle Zhong, V. Abeykoon, Gregor von Laszewski, Geoffrey Fox
Frontiers in High Performance Computing, published 2024-07-12
DOI: 10.3389/fhpcp.2024.1384619 (https://doi.org/10.3389/fhpcp.2024.1384619)

Abstract

The data engineering and data science community has embraced Python and R dataframes for everyday applications. Driven by the big data revolution and artificial intelligence, these frameworks have become ever more important for processing terabytes of data. Such workloads can easily exceed the capabilities of a single machine, and they also demand significant developer time and effort, even though dataframes are valued for their convenience and their ability to manipulate data through high-level abstractions that can be optimized. It is therefore essential to design scalable dataframe solutions. There have been multiple efforts to tackle this problem as efficiently as possible, the most notable being the dataframe systems built on distributed computing environments such as Dask and Ray. Even though the distributed computing features of Dask and Ray look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (supercharging them!). To achieve this, we integrate Cylon, a high-performance dataframe system originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that, on a pipeline of dataframe operators, CylonFlow achieves 30× better distributed performance than Dask Dataframes. Interestingly, it also delivers superior sequential performance by leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate the high-performance computing and distributed computing ecosystems.
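
For readers unfamiliar with the phrase "pipeline of dataframe operators", the following is a minimal single-machine sketch in pandas of what such a pipeline might look like. The choice of operators (join, group-by aggregation, sort), the column names, and the `run_pipeline` helper are all assumptions made for illustration; the abstract does not specify the benchmark pipeline, and this sketch does not use the CylonFlow, Cylon, Dask, or Ray APIs.

```python
import pandas as pd


def run_pipeline(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline: join two tables, aggregate per key, sort the result."""
    # Inner join keeps only keys present in both tables.
    joined = left.merge(right, on="key", how="inner")
    # Sum the 'value' column for each key.
    totals = joined.groupby("key", as_index=False).agg(total=("value", "sum"))
    # Order the keys from largest to smallest total.
    return totals.sort_values("total", ascending=False).reset_index(drop=True)


# Tiny illustrative inputs (made up for this example).
left = pd.DataFrame({"key": [1, 2, 2, 3], "value": [10.0, 20.0, 30.0, 40.0]})
right = pd.DataFrame({"key": [1, 2, 4], "label": ["a", "b", "d"]})

result = run_pipeline(left, right)
print(result)
```

In a distributed setting each of these operators may need to exchange rows between workers (for example, so that matching join keys end up on the same worker), which is where the execution methodology of the underlying system affects performance.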