Accelerating Spark Datasets by Inlining Deserialization

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2017-05-01. DOI: 10.1109/IPDPS.2017.111

Jan Wroblewski, K. Ishizaki, H. Inoue, Moriyoshi Ohara

Citations: 1

Abstract

Apache Spark is a distributed computing framework that supports the map-reduce programming model. Spark's SQL module provides Datasets, i.e., distributed collections of records stored in a serialized, low-level format in a manually managed chunk of memory. However, the functions that users pass to map-reduce computations expect Java objects, so a Dataset must first deserialize each record before invoking the user-provided function, which adds overhead. We tackled this problem by replacing map functions with counterparts that accept the serialized data directly. This allowed us to skip the unnecessary part of deserialization and process data faster.
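The idea in the abstract can be illustrated with a small, self-contained sketch. It is not Spark code and does not reproduce the authors' implementation: a plain java.nio.ByteBuffer stands in for the Dataset's serialized storage, and the record type Point, the three-long row layout, and the method names mapViaObjects and mapInlined are all hypothetical. The baseline path builds a full Java object per row before calling the user function, while the inlined path is a hand-written counterpart of the same function that reads only the fields it needs from the serialized bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.function.ToLongFunction;

public class InlineDeserializationSketch {

    /** Hypothetical record type that a user-provided map function expects. */
    public static final class Point {
        public final long id;
        public final long x;
        public final long y;

        public Point(long id, long x, long y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed toy layout: each row is three consecutive 8-byte longs (id, x, y).
    static final int ROW_BYTES = 3 * Long.BYTES;
    static final int X_OFFSET = Long.BYTES;
    static final int Y_OFFSET = 2 * Long.BYTES;

    /** Packs rows of {id, x, y} into one buffer that stands in for serialized storage. */
    public static ByteBuffer serialize(long[][] rows) {
        ByteBuffer buf = ByteBuffer.allocate(rows.length * ROW_BYTES);
        for (int i = 0; i < rows.length; i++) {
            int base = i * ROW_BYTES;
            buf.putLong(base, rows[i][0]);
            buf.putLong(base + X_OFFSET, rows[i][1]);
            buf.putLong(base + Y_OFFSET, rows[i][2]);
        }
        return buf;
    }

    /** Baseline: deserialize every field into a Point, then call the user function. */
    public static long[] mapViaObjects(ByteBuffer buf, int rowCount, ToLongFunction<Point> f) {
        long[] out = new long[rowCount];
        for (int i = 0; i < rowCount; i++) {
            int base = i * ROW_BYTES;
            Point p = new Point(
                    buf.getLong(base),
                    buf.getLong(base + X_OFFSET),
                    buf.getLong(base + Y_OFFSET));
            out[i] = f.applyAsLong(p);
        }
        return out;
    }

    /**
     * Hand-written counterpart of the user function p -> p.x + p.y.
     * It reads x and y directly from the serialized row; the id field is never
     * decoded and no Point object is allocated.
     */
    public static long[] mapInlined(ByteBuffer buf, int rowCount) {
        long[] out = new long[rowCount];
        for (int i = 0; i < rowCount; i++) {
            int base = i * ROW_BYTES;
            out[i] = buf.getLong(base + X_OFFSET) + buf.getLong(base + Y_OFFSET);
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] rows = {{1L, 10L, 20L}, {2L, 30L, 40L}};
        ByteBuffer buf = serialize(rows);
        long[] viaObjects = mapViaObjects(buf, rows.length, p -> p.x + p.y);
        long[] inlined = mapInlined(buf, rows.length);
        System.out.println(Arrays.toString(viaObjects));
        System.out.println(Arrays.toString(inlined));
    }
}
```

Both paths compute the same result. The difference is only in how much of each row is decoded and whether an intermediate object is created, which is the overhead the abstract describes. In this sketch the counterpart function is written by hand; how such counterparts are produced in practice is not something the abstract specifies.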



