High Performance Data Engineering Everywhere

Chathura Widanage, Niranda Perera, V. Abeykoon, Supun Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, P. Wickramasinghe, A. Uyar, Gurhan Gunduz, G. Fox
2020 IEEE International Conference on Smart Data Services (SMDS), published 2020-07-19
DOI: 10.1109/SMDS49396.2020.00022 (https://doi.org/10.1109/SMDS49396.2020.00022)
Citations: 10

Abstract

The advances being made in machine learning and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond what a single node can provide. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-processing of data, communication, and system integration. An important requirement of data analytics tools is the ability to integrate easily with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient, highly distributed, and integrated approach to data processing, yet many of today's popular data analytics tools cannot satisfy all these requirements at the same time. In this paper, we present Cylon, an open-source, high-performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail and show how it can be imported as a library into existing applications or operated as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, including with popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.