PaPar: A Parallel Data Partitioning Framework for Big Data Applications

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2017-05-01 | DOI: 10.1109/IPDPS.2017.119

Hao Wang, Jing Zhang, Da Zhang, S. Pumma, Wu-chun Feng


Citations: 4

Abstract

Big data applications now generate large-scale data sets at an unprecedented rate, and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle computational skew, skew-resistant mechanisms are difficult to implement efficiently, because the runtime of each partition depends not only on the input data size but also on the algorithms applied to the data. As a result, many research efforts have explored user-defined partitioning methods for different types of applications and algorithms. However, manually writing application-specific partitioning methods requires significant coding effort, and finding the optimal data partitioning strategy is particularly challenging, even for developers with sufficient application knowledge. In this paper, we propose PaPar, a Parallel data Partitioning framework for big data applications, to simplify the implementation of data partitioning algorithms. PaPar provides a set of computational operators and distribution strategies with which programmers describe the desired data partitioning methods. Taking an input data configuration file and a workflow configuration file as input, PaPar automatically generates parallel partitioning code by formalizing the user-defined workflow as a sequence of key-value operations and matrix-vector multiplications, and efficiently mapping it to parallel implementations with MPI and MapReduce. We apply our approach to two applications: muBLAST, an MPI implementation of BLAST algorithms for biological sequence search; and PowerLyra, a computation and partitioning method for skewed graphs. The experimental results show that, compared to the applications' own partitioning methods, the code generated by PaPar produces the same data partitions with comparable or less partitioning time.
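The abstract does not spell out PaPar's operators or configuration format, so the following is only an illustrative sketch of the general idea: a partitioning workflow written as a chain of key-value operations, with the distribution step also expressed as a matrix-vector product. The function names (`sort_by_key`, `distribute_cyclic`, `cyclic_permutation_matrix`, `mat_vec`), the record layout, and the sort-then-cyclic workflow are assumptions made for this example, not PaPar's actual API or generated code.

```python
# Hypothetical sketch: all names below are illustrative, not PaPar's API.

def sort_by_key(records, key):
    """Key-value operator: order records by a key, largest first."""
    return sorted(records, key=key, reverse=True)

def distribute_cyclic(records, num_partitions):
    """Distribution strategy: deal records round-robin into partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions

def cyclic_permutation_matrix(n, num_partitions):
    """Permutation matrix that groups positions 0, p, 2p, ... then 1, p+1, ..."""
    order = [i for p in range(num_partitions) for i in range(p, n, num_partitions)]
    return [[1 if j == src else 0 for j in range(n)] for src in order]

def mat_vec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Records are (id, size) pairs; size stands in for per-record work.
records = [("a", 90), ("b", 10), ("c", 70), ("d", 30), ("e", 50), ("f", 20)]

# Workflow: sort by size, then distribute cyclically over two partitions.
ordered = sort_by_key(records, key=lambda r: r[1])
parts = distribute_cyclic(ordered, 2)
loads = [sum(size for _, size in p) for p in parts]
print(loads)  # [160, 110]

# The same cyclic distribution as a permutation applied to the size vector.
sizes = [size for _, size in ordered]
permuted = mat_vec(cyclic_permutation_matrix(len(sizes), 2), sizes)
print(permuted)  # [90, 50, 20, 70, 30, 10]
```

In the permuted vector, the first three entries are the sizes assigned to partition 0 and the last three those assigned to partition 1, which matches the output of `distribute_cyclic`. A real implementation would run these steps in parallel, for example over MPI or MapReduce as the abstract describes.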



