Scaling Irregular Applications through Data Aggregation and Software Multithreading

2014 IEEE 28th International Parallel and Distributed Processing Symposium. Pub Date: 2014-05-19. DOI: 10.1109/IPDPS.2014.117

Alessandro Morari, Antonino Tumeo, D. Chavarría-Miranda, Oreste Villa, M. Valero


Citations: 24

Abstract

Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets ranging from tens to hundreds of terabytes. Currently, only distributed memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to being large, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: that is, they exhibit irregular behavior. Traditional commodity clusters, by contrast, rely on cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems and require custom, hand-coded optimizations to scale in both performance and size. Two techniques are key to speeding up large-scale irregular applications on commodity clusters: lightweight software multithreading, which tolerates data access latencies by overlapping network communication with computation, and aggregation, which reduces overheads and increases bandwidth utilization by coalescing fine-grained network messages. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
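The aggregation idea in the abstract can be illustrated with a minimal sketch. This is not GMT's API or implementation: the `Aggregator` class, its `put`/`flush`/`flush_all` methods, the `max_batch` threshold and the `send` callback are all hypothetical names chosen for illustration. The sketch assumes fine-grained requests are buffered per destination node and sent as one batch once a size threshold is reached.

```python
from collections import defaultdict


class Aggregator:
    """Toy per-destination message aggregator (hypothetical, not GMT's API)."""

    def __init__(self, send, max_batch=4):
        self.send = send                  # callback: send(dest_node, list_of_requests)
        self.max_batch = max_batch        # flush threshold, counted in requests
        self.buffers = defaultdict(list)  # one pending buffer per destination node

    def put(self, dest_node, request):
        """Queue one fine-grained request; flush when the buffer is full."""
        buf = self.buffers[dest_node]
        buf.append(request)
        if len(buf) >= self.max_batch:
            self.flush(dest_node)

    def flush(self, dest_node):
        """Send everything pending for one destination as a single message."""
        buf = self.buffers.pop(dest_node, [])
        if buf:
            self.send(dest_node, buf)

    def flush_all(self):
        """Send all partially filled buffers, e.g. at a synchronization point."""
        for dest in list(self.buffers):
            self.flush(dest)


sent = []  # records (dest, batch) pairs instead of touching a real network
agg = Aggregator(lambda dest, batch: sent.append((dest, batch)), max_batch=4)
for i in range(10):
    agg.put(i % 2, ("get", i))  # ten fine-grained requests spread over two nodes
agg.flush_all()
print(len(sent), "messages carried", sum(len(b) for _, b in sent), "requests")
```

In this toy run, ten individual requests travel in four batched messages. A real runtime would also need a timeout-based flush so that a partially filled buffer does not delay requests indefinitely; the sketch omits that.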




Source journal

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Self-citation rate

0.00%

Publication count