2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers (Page 8)

O(log N)-Time Complete Visibility for Asynchronous Robots with Lights

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.51

Gokarna Sharma, R. Vaidyanathan, J. Trahan, C. Busch, S. Rai

We consider the distributed setting of N autonomous mobile robots that operate in Look-Compute-Move (LCM) cycles and communicate with other robots using colored lights (the robots with lights model). We study the fundamental problem of repositioning N autonomous robots on a plane so that each robot is visible to all others (the Complete Visibility problem) on this model; a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them. There exists an O(1) time, O(1) color algorithm for this problem in the semi-synchronous setting. In this paper, we provide the first O(log N) time, O(1) color algorithm for this problem in the asynchronous setting. This is a significant improvement over an O(N)-time translation of the semi-synchronous algorithm to the asynchronous setting. The proposed algorithm is collision-free: robots do not share positions and their paths do not cross.
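The visibility rule in the abstract (a third robot strictly between two others on their connecting line blocks their view) can be checked directly. The following is a minimal sketch of such a check, not the paper's algorithm; the function names `blocks` and `complete_visibility` are hypothetical, and it assumes robots are distinct points with integer coordinates so the collinearity test is exact.

```python
from itertools import combinations

def blocks(a, b, c):
    """Return True if point c lies strictly between a and b on segment ab."""
    # Cross product is zero exactly when a, b, c are collinear.
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    if cross != 0:
        return False
    # Projection of ac onto ab must fall strictly inside the segment.
    dot = (c[0] - a[0]) * (b[0] - a[0]) + (c[1] - a[1]) * (b[1] - a[1])
    length_sq = (b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2
    return 0 < dot < length_sq

def complete_visibility(robots):
    """Return True if every pair of robots can see each other."""
    for a, b in combinations(robots, 2):
        if any(blocks(a, b, c) for c in robots if c != a and c != b):
            return False
    return True

print(complete_visibility([(0, 0), (1, 0), (0, 1), (1, 1)]))  # corners of a square
print(complete_visibility([(0, 0), (1, 1), (2, 2)]))          # three collinear robots
```

This brute-force check takes O(N^3) time and only verifies a configuration; it says nothing about how robots move to reach one.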


Citations: 25

Reducing Pagerank Communication via Propagation Blocking

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.112

S. Beamer, K. Asanović, D. Patterson

Reducing communication is an important objective, as it can save energy or improve the performance of a communication-bound application. The graph algorithm PageRank computes the importance of vertices in a graph, and it serves as an important benchmark for graph algorithm performance. If the input graph to PageRank has poor locality, the execution will need to read many cache lines from memory, some of which may not be fully utilized. We present propagation blocking, an optimization to improve spatial locality, and we demonstrate its application to PageRank. In contrast to cache blocking, which partitions the graph, we partition the data transfers between vertices (propagations). If the input graph has poor locality, our approach will reduce communication. Our approach reduces communication more than conventional cache blocking if the input graph is sufficiently sparse or if the number of vertices is sufficiently large relative to the cache size. To evaluate our approach, we use both simple analytic models to gain insights and precise hardware performance counter measurements to compare implementations on a suite of 8 real-world and synthetic graphs. We demonstrate that our parallel implementations substantially outperform prior work in execution time and communication volume. Although we present results for PageRank, propagation blocking could be generalized to SpMV (sparse matrix multiplying dense vector) or other graph programming models.
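The idea of partitioning the propagations rather than the graph can be illustrated with a two-phase PageRank iteration: contributions are first appended to bins keyed by destination range, then each bin is drained into the score sums. The sketch below is an illustration under assumptions, not the authors' implementation: the names `pagerank_step_plain`, `pagerank_step_blocked`, and `bin_width` are hypothetical, vertices without outgoing edges simply contribute nothing, and pure Python cannot show the cache behavior that motivates the technique.

```python
def pagerank_step_plain(out_edges, scores, damping=0.85):
    """One push-style PageRank iteration with scattered writes to sums."""
    n = len(out_edges)
    sums = [0.0] * n
    for u, neighbors in enumerate(out_edges):
        if neighbors:
            contribution = scores[u] / len(neighbors)
            for v in neighbors:
                sums[v] += contribution
    return [(1 - damping) / n + damping * s for s in sums]

def pagerank_step_blocked(out_edges, scores, bin_width, damping=0.85):
    """Same iteration, but propagations are binned by destination range."""
    n = len(out_edges)
    num_bins = (n + bin_width - 1) // bin_width
    bins = [[] for _ in range(num_bins)]
    # Binning phase: append (destination, contribution) pairs sequentially.
    for u, neighbors in enumerate(out_edges):
        if neighbors:
            contribution = scores[u] / len(neighbors)
            for v in neighbors:
                bins[v // bin_width].append((v, contribution))
    # Accumulate phase: each bin touches only bin_width consecutive sums.
    sums = [0.0] * n
    for pairs in bins:
        for v, contribution in pairs:
            sums[v] += contribution
    return [(1 - damping) / n + damping * s for s in sums]

graph = [[1, 2], [2], [0], [0, 2]]   # adjacency lists of out-neighbors
start = [0.25] * 4
print(pagerank_step_plain(graph, start))
print(pagerank_step_blocked(graph, start, bin_width=2))
```

In a compiled implementation, `bin_width` would presumably be chosen so that one bin's slice of `sums` fits in cache; here it only controls how destinations are grouped.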


Citations: 65

Monitoring Properties of Large, Distributed, Dynamic Graphs

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.123

Gal Yehuda, D. Keren, Islam Akaria

The following is a very common question in numerous theoretical and application-related domains: given a graph G, does it satisfy some given property? For example, is G connected? Is its diameter smaller than a given threshold? Is its average degree larger than a certain threshold? Traditionally, algorithms to quickly answer such questions were developed for static and centralized graphs (i.e. G is stored in a central server and the list of its vertices and edges is static and quickly accessible). Later, as dictated by practical considerations, a great deal of attention was given to on-line algorithms for dynamic graphs (where vertices and edges can be added and deleted); the focus of research was to quickly decide whether the new graph still satisfies the given property. Today, a more difficult version of this problem, referred to as the distributed monitoring problem, is becoming increasingly important: large graphs are not only dynamic, but also distributed, that is, G is partitioned between a few servers, none of which "sees" G in its entirety. The question is how to define local conditions, such that as long as they hold on the local graphs, it is guaranteed that the desired property holds for the global G. Such local conditions are crucial for avoiding a huge communication overhead. While defining local conditions for linear properties (e.g. average degree) is relatively easy, they are considerably more difficult to derive for non-linear functions over graphs. We propose a solution and a general definition of solution optimality, and demonstrate how to apply it to two important graph properties – the spectral gap and the number of triangles. We also define an absolute lower bound on the communication overhead for distributed monitoring, and compare our algorithm to it, with excellent results. Last but not least, performance improves as the graph becomes larger and denser – that is, when distributing it is more important.
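For a linear property such as average degree, a simple sufficient local condition follows from arithmetic: if every server's local average degree exceeds the threshold, the global average does too. The sketch below illustrates only this easy linear case and is not the paper's method; it assumes each server can report the degree sum and vertex count of its own partition, and the names `local_condition` and `global_average_degree` are hypothetical.

```python
def local_condition(degree_sum, vertex_count, threshold):
    """Local check: this server's average degree exceeds the threshold."""
    return degree_sum > threshold * vertex_count

def global_average_degree(parts):
    """Average degree over all servers; parts is a list of (degree_sum, vertex_count)."""
    total_degree = sum(d for d, _ in parts)
    total_vertices = sum(n for _, n in parts)
    return total_degree / total_vertices

threshold = 2.5
parts = [(30, 10), (45, 12), (8, 2)]  # (degree_sum, vertex_count) per server

# If all local conditions hold, no communication is needed to know
# that the global average degree is above the threshold.
if all(local_condition(d, n, threshold) for d, n in parts):
    print("global property guaranteed:", global_average_degree(parts) > threshold)
```

The condition is sufficient but not necessary: a server whose local check fails forces communication even though the global property may still hold.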


Citations: 9

Scalable Lock-Free Vector with Combining

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.73

Ivan Walulya, P. Tsigas

Dynamic vectors are among the most commonly used data structures in programming. They provide constant time random access and resizable data storage. Additionally, they provide constant time insertion (pushback) and deletion (popback) at the end of the sequence. However, in a multithreaded system, concurrent pushback and popback operations attempt to update the same shared object, creating a synchronization bottleneck. In this paper, we present a lock-free vector design that efficiently addresses the synchronization bottlenecks by utilizing a combining technique on pushback operations. Typical combining techniques come with the price of blocking. Our design introduces combining without sacrificing lock-freedom. We evaluate the performance of our design on a dual socket NUMA Intel server. The results show that our design performs comparably at low loads, and out-performs prior concurrent blocking and non-blocking vector implementations at high contention, by as much as 2.7x.


Citations: 6

Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.116

H. Dogan, Farrukh Hijaz, Masab Ahmad, B. Kahne, Peter Wilson, O. Khan

Shared memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. As parallel programming is now mainstream, those common programming styles are challenged by emerging applications that communicate often and involve large amounts of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). A set of auxiliary communication models is proposed that utilizes explicit messages to accelerate synchronization primitives and efficiently move computation towards data. The results on a 256-core simulated multicore demonstrate that the proposed communication models improve performance and dynamic energy by an average of 4x and 42% respectively over traditional shared memory.


Citations: 21

RCube: A Power Efficient and Highly Available Network for Data Centers

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.50

Zhenhua Li, Yuanyuan Yang

Designing a cost-effective network for data centers that can deliver sufficient bandwidth and provide high availability has drawn tremendous attention recently. In this paper, we propose a novel server-centric network structure called RCube, which is energy efficient and can deploy a redundancy scheme to improve the availability of data centers. Moreover, RCube shares many good properties with BCube, a well-known server-centric network structure, yet its network size can be adjusted more conveniently. We also present a routing algorithm to find paths in RCube and an algorithm to build multiple parallel paths between any pair of source and destination servers. In addition, we theoretically analyze the power efficiency of the network and the availability of RCube under server failure. Our comprehensive simulations demonstrate that RCube provides higher availability than BCube and more flexibility to make trade-offs among many factors, such as power consumption and aggregate throughput, while delivering similar performance to BCube in many critical metrics, such as average path length, path distribution and graceful degradation, which makes RCube a very promising empirical structure for an enterprise data center network product.


Citations: 4

Towards Highly scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.26

M. Jacquelin, W. A. Jong, E. Bylaska

The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute-intensive application such as AIMD is a good candidate to leverage this processing power. In this paper, we focus on adding thread-level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and non-local pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a test case of 64 water molecules that scales up to all 68 cores of the Knights Landing processor.


Citations: 14

DEFT-Cache: A Cost-Effective and Highly Reliable SSD Cache for RAID Storage

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.54

Ji-guang Wan, Wei Wu, Ling Zhan, Q. Yang, Xiaoyang Qu, C. Xie

This paper proposes a new SSD cache architecture, DEFT-Cache (Delayed Erasing and Fast Taping), that maximizes I/O performance and reliability of RAID storage. First of all, DEFT-Cache exploits the inherent physical properties of flash memory SSD by making use of old data that have been overwritten but still exist in the SSD to minimize the small write penalty of RAID5/6. As data pages are overwritten in the SSD, old data pages are invalidated and become candidates for erasure and garbage collection. Our idea is to selectively delay the erasure of these pages and let the otherwise useless old data in the SSD contribute to I/O performance for parity computations upon write I/Os. Secondly, DEFT-Cache provides inexpensive redundancy to the SSD cache by having one physical SSD and one virtual SSD as a mirror cache. The virtual SSD is implemented on HDD but uses a log-structured data layout, i.e. write data are quickly logged to HDD using sequential writes. The dual and redundant caches provide a cost-effective and highly reliable write-back SSD cache. We have implemented DEFT-Cache on a Linux system. Extensive experiments have been carried out to evaluate the potential benefits of our new techniques. Experimental results on SPC and Microsoft traces have shown that DEFT-Cache improves I/O performance by 26.81% to 56.26% in terms of average user response time. The virtual SSD mirror cache can absorb write I/Os as fast as a physical SSD, providing the same reliability as two physical SSD caches without noticeable performance loss.
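The reason old data helps with the RAID5 small write penalty is the standard read-modify-write parity identity: the new parity equals the old parity XOR the old data XOR the new data, so the other data blocks in the stripe need not be read. The sketch below demonstrates that identity only; it is not DEFT-Cache's implementation, the function names `xor_blocks` and `updated_parity` are hypothetical, and it assumes all blocks have equal length.

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def updated_parity(old_parity, old_data, new_data):
    """RAID5 small-write update: P_new = P_old XOR D_old XOR D_new."""
    return xor_blocks(old_parity, old_data, new_data)

# A stripe with three data blocks and one parity block.
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_blocks(d0, d1, d2)

# Overwrite d1. If the old copy of d1 is still available (for example,
# not yet erased from the cache), only the old parity must be fetched.
new_d1 = b"\xff\x00\xff\x00"
new_parity = updated_parity(parity, d1, new_d1)

print(new_parity == xor_blocks(d0, new_d1, d2))
```

The final comparison confirms that the incremental update matches a full recomputation of the stripe's parity.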


Citations: 19

Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.62

P. Flick, S. Aluru

A suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values (ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p + p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.
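For reference, the classic ANSV problem asks, for every array element, for the nearest strictly smaller element on each side. The sketch below is the textbook sequential stack-based solution in linear time, offered as a baseline for understanding the problem; it is not the generalized formulation or the distributed-memory algorithm from the paper. The function name and the convention of returning -1 when no smaller value exists are assumptions.

```python
def all_nearest_smaller_values(values):
    """Return (left, right): indices of the nearest strictly smaller
    element on each side of every position, or -1 if none exists."""
    n = len(values)
    left = [-1] * n
    right = [-1] * n

    stack = []  # indices whose values are strictly increasing
    for i in range(n):
        while stack and values[stack[-1]] >= values[i]:
            stack.pop()
        left[i] = stack[-1] if stack else -1
        stack.append(i)

    stack = []
    for i in range(n - 1, -1, -1):
        while stack and values[stack[-1]] >= values[i]:
            stack.pop()
        right[i] = stack[-1] if stack else -1
        stack.append(i)

    return left, right

left, right = all_nearest_smaller_values([3, 1, 4, 1, 5, 9, 2, 6])
print(left)   # [-1, -1, 1, -1, 3, 4, 3, 6]
print(right)  # [1, -1, 3, -1, 6, 6, -1, -1]
```

Each index is pushed and popped at most once per pass, which gives the linear sequential bound mentioned in the abstract for this subproblem.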

后缀树是一种基本的、通用的字符串数据结构，经常用于重要的应用领域，如文本处理、信息检索和计算生物学。从序列上看，后缀树的构建需要线性时间，并且只有PRAM模型才存在最优并行算法。最近的工作主要针对低核数共享内存实现，但实现了次优复杂度，而先前的分布式内存并行算法具有二次最坏复杂度。通过求解全最近邻最小值(ANSV)问题，可以从后缀数组和最长公共前缀(LCP)数组构建后缀树。本文提出了一种广义的ANSV问题，并提出了一种分布式内存并行算法，在O(n/p +p)时间内求解该问题。我们的算法最大限度地减少了总体和每个节点的通信量。在此基础上，我们提出了一种用于构建后缀树的分布式表示的并行算法，与以前的分布式内存算法相比，它既具有优越的理论复杂性，又具有更好的实际性能。我们演示了在1024个Intel Xeon内核上构建人类基因组的后缀树和LCP阵列，用时不到2秒。


Citations: 5

Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.94

Benjamin Klenk, H. Fröning, H. Eberle, Larry R. Dennison

Accelerators, such as GPUs, have proven to be highly successful in reducing execution time and power consumption of compute-intensive applications. Even though they are already used pervasively, they are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers as CPUs are handling all communication tasks. However, we observe that accelerators have recently been augmented with peer-to-peer communication capabilities that allow for autonomous traffic sourcing and sinking. While appropriate hardware support is becoming available, it seems that the right communication semantics are yet to be identified. Maintaining the semantics of existing communication models, such as the Message Passing Interface (MPI), seems problematic as they have been designed for the CPU’s execution model, which inherently differs from that of such specialized processors. In this paper, we analyze the compatibility of traditional message passing with massively parallel Single Instruction Multiple Thread (SIMT) architectures, as represented by GPUs, and focus on the message matching problem. We begin with a fully MPI-compliant set of guarantees, including tag and source wildcards and message ordering. Based on an analysis of exascale proxy applications, we start relaxing these guarantees to adapt message passing to the GPU’s execution model. We present suitable algorithms for message matching on GPUs that can yield matching rates of 60M and 500M matches/s, depending on the constraints that are being relaxed. We discuss our experiments and build an understanding of the mismatch between current message passing protocols and the architecture and execution model of SIMT processors.
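To make the message matching problem concrete, here is a minimal sequential sketch of MPI-style matching with source and tag wildcards and ordered matching. It is not the GPU algorithm from the paper. The class name, method names, and the wildcard constants ANY_SOURCE and ANY_TAG (set to -1 here) are hypothetical names chosen for this illustration, not an MPI binding.

```python
ANY_SOURCE = -1  # hypothetical wildcard value for this sketch
ANY_TAG = -1     # hypothetical wildcard value for this sketch


def _matches(want_src, want_tag, msg_src, msg_tag):
    """A receive matches a message if each field is equal or a wildcard."""
    return (want_src in (ANY_SOURCE, msg_src)
            and want_tag in (ANY_TAG, msg_tag))


class MessageMatcher:
    def __init__(self):
        self.posted = []      # unmatched receives, in posting order
        self.unexpected = []  # unmatched messages, in arrival order

    def post_receive(self, source, tag):
        """Match against the earliest queued message, else queue the receive."""
        for idx, (msg_src, msg_tag, payload) in enumerate(self.unexpected):
            if _matches(source, tag, msg_src, msg_tag):
                del self.unexpected[idx]
                return payload
        self.posted.append((source, tag))
        return None

    def arrive(self, source, tag, payload):
        """Match against the earliest posted receive, else queue the message."""
        for idx, (want_src, want_tag) in enumerate(self.posted):
            if _matches(want_src, want_tag, source, tag):
                del self.posted[idx]
                return (want_src, want_tag)
        self.unexpected.append((source, tag, payload))
        return None


m = MessageMatcher()
m.post_receive(ANY_SOURCE, 7)
m.post_receive(2, ANY_TAG)
print(m.arrive(2, 7, "a"))   # (-1, 7): the earliest matching receive wins
print(m.arrive(2, 9, "b"))   # (2, -1)
print(m.arrive(3, 1, "c"))   # None: queued as unexpected
print(m.post_receive(3, 1))  # c
```

The linear scans over both queues are what wildcards and ordering force in this naive form: without wildcards a hash lookup on (source, tag) would suffice, which hints at why relaxing these guarantees can raise matching rates on massively parallel hardware.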



Citations: 21