High Performance Data Engineering Everywhere

2020 IEEE International Conference on Smart Data Services (SMDS). Pub Date: 2020-07-19. DOI: 10.1109/SMDS49396.2020.00022

Chathura Widanage, Niranda Perera, V. Abeykoon, Supun Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, P. Wickramasinghe, A. Uyar, Gurhan Gunduz, G. Fox


Citation count: 10

Abstract

The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is to be able to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach for data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper, we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail, and reveal how it can be imported as a library to existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, including with popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.


