Scalable bootstrapping for Python

P. Birsinger, R. Xia, A. Fox
{"title":"Scalable bootstrapping for python","authors":"P. Birsinger, R. Xia, A. Fox","doi":"10.1145/2505515.2505630","DOIUrl":null,"url":null,"abstract":"High-level productivity languages such as Python, Matlab, and R are popular choices for scientists doing data analysis. However, for today's increasingly large datasets, applications written in these languages may run too slowly, if at all. In such cases, an experienced programmer must typically rewrite the application in a less-productive performant language such as C or C++, but this work is intricate, tedious, and often non-reusable. To bridge this gap between programmer productivity and performance, we extend an existing framework that uses just-in-time code generation and compilation. This framework uses the SEJITS methodology, (Selective Embedded Just-In-Time Specialization [11]), converting programs written in domain specific embedded languages (DSELs) to programs in languages suitable for high performance or parallel computation. We present a Python DSEL for a recently developed, scalable bootstrapping method; the DSEL executes efficiently in a distributed cluster. In previous work [18, Prasad et al. created a DSEL compiler for the same DSEL (with minor differences) to generate OpenMP or Cilk code. In this work, we create a new DSEL compiler which instead emits code to run on Spark [16], a distributed processing framework. Using two example applications of bootstrapping, we show that the resulting distributed code achieves near-perfect strong scaling from 4 to 32 eight-core computers (32 to 256 cores) on datasets up to hundreds of gigabytes in size. With our DSEL, a data scientist can write a single program in serial Python that can run \"toy\" problems in plain Python, non-toy problems fitting on a single computer in OpenMP or Cilk, and non-toy problems with large datasets on a multi-computer Spark installation.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2505515.2505630","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

High-level productivity languages such as Python, Matlab, and R are popular choices for scientists doing data analysis. However, for today's increasingly large datasets, applications written in these languages may run too slowly, if they run at all. In such cases, an experienced programmer must typically rewrite the application in a less productive but more performant language such as C or C++, but this work is intricate, tedious, and often non-reusable. To bridge this gap between programmer productivity and performance, we extend an existing framework that uses just-in-time code generation and compilation. This framework uses the SEJITS methodology (Selective Embedded Just-In-Time Specialization [11]), converting programs written in domain-specific embedded languages (DSELs) to programs in languages suitable for high-performance or parallel computation. We present a Python DSEL for a recently developed, scalable bootstrapping method; the DSEL executes efficiently on a distributed cluster. In previous work [18], Prasad et al. created a DSEL compiler for the same DSEL (with minor differences) to generate OpenMP or Cilk code. In this work, we create a new DSEL compiler which instead emits code to run on Spark [16], a distributed processing framework. Using two example applications of bootstrapping, we show that the resulting distributed code achieves near-perfect strong scaling from 4 to 32 eight-core computers (32 to 256 cores) on datasets up to hundreds of gigabytes in size. With our DSEL, a data scientist can write a single program in serial Python that can run "toy" problems in plain Python, non-toy problems that fit on a single computer in OpenMP or Cilk, and non-toy problems with large datasets on a multi-computer Spark installation.
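
The abstract describes the computational pattern only in prose; as a point of reference, below is a minimal sketch in serial Python of the kind of bootstrap computation such a DSEL is meant to express. The function name, parameters, and use of NumPy are illustrative assumptions, not the paper's actual DSEL API; the paper's framework would take a program written in roughly this style and retarget it to OpenMP/Cilk or Spark.

    # Illustrative serial-Python bootstrap; NOT the paper's DSEL API.
    # Assumes NumPy is available; the estimator (the mean) and the dataset
    # are placeholders standing in for a real analysis.
    import numpy as np

    def bootstrap_mean_ci(data, num_resamples=1000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the mean of `data`."""
        rng = np.random.default_rng(seed)
        n = len(data)
        stats = np.empty(num_resamples)
        for i in range(num_resamples):
            # Resample the dataset with replacement and re-run the estimator.
            resample = data[rng.integers(0, n, size=n)]
            stats[i] = resample.mean()
        # Take the empirical (alpha/2, 1 - alpha/2) percentiles of the
        # bootstrap distribution as the confidence interval.
        lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return lo, hi

    if __name__ == "__main__":
        data = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=10_000)
        print(bootstrap_mean_ci(data))

As sketched here, the resampling loop is embarrassingly parallel, which is what makes bootstrapping a natural target both for the OpenMP/Cilk backend of the earlier work and for the Spark backend introduced in this paper.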