
Latest publications from the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Reducing Pagerank Communication via Propagation Blocking
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.112
S. Beamer, K. Asanović, D. Patterson
Reducing communication is an important objective, as it can save energy or improve the performance of a communication-bound application. The graph algorithm PageRank computes the importance of vertices in a graph, and it serves as an important benchmark for graph algorithm performance. If the input graph to PageRank has poor locality, the execution will need to read many cache lines from memory, some of which may not be fully utilized. We present propagation blocking, an optimization to improve spatial locality, and we demonstrate its application to PageRank. In contrast to cache blocking, which partitions the graph, we partition the data transfers between vertices (propagations). If the input graph has poor locality, our approach will reduce communication. Our approach reduces communication more than conventional cache blocking if the input graph is sufficiently sparse or if the number of vertices is sufficiently large relative to the cache size. To evaluate our approach, we use both simple analytic models to gain insights and precise hardware performance counter measurements to compare implementations on a suite of 8 real-world and synthetic graphs. We demonstrate that our parallel implementations substantially outperform prior work in execution time and communication volume. Although we present results for PageRank, propagation blocking could be generalized to SpMV (sparse matrix multiplying dense vector) or other graph programming models.
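The core idea can be sketched in a few lines: instead of scattering contributions directly into the destination array, propagations are first binned by destination vertex range and then applied bin by bin. The sketch below is a minimal illustration under assumed data layouts (adjacency lists, a power-of-two bin width), not the paper's implementation:

```python
# Minimal sketch of propagation blocking for one damped PageRank iteration.
# The graph layout and bin width are illustrative assumptions.

def pagerank_step_blocked(out_edges, n, ranks, out_deg, bin_bits=4, d=0.85):
    """One PageRank iteration in two phases: binning (group propagations
    by destination vertex range) and accumulation (apply each bin to a
    contiguous, cache-sized slice of the destination array)."""
    num_bins = (n >> bin_bits) + 1
    bins = [[] for _ in range(num_bins)]
    # Phase 1: instead of scattering directly into new_ranks (poor spatial
    # locality), append each propagation to the bin owning its destination.
    for u in range(n):
        contrib = ranks[u] / out_deg[u] if out_deg[u] else 0.0
        for v in out_edges[u]:
            bins[v >> bin_bits].append((v, contrib))
    # Phase 2: each bin only touches vertices in one range, so the
    # accumulation stays within a small working set.
    new_ranks = [(1.0 - d) / n] * n
    for b in bins:
        for v, c in b:
            new_ranks[v] += d * c
    return new_ranks
```

In a cache-conscious implementation the bins are sized so that the slice of `new_ranks` a bin touches fits in cache; the Python version only shows the control flow.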
Citations: 65
Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code
D. Beckingsale, Olga Pearce, I. Laguna, T. Gamblin
Increasing architectural diversity makes performance portability extremely important for parallel simulation codes. Emerging on-node parallelization frameworks such as Kokkos and RAJA decouple the work done in kernels from the parallelization mechanism, allowing a single source kernel to be tuned for different architectures at compile time. However, computational demands in production applications change at runtime, and performance depends on both the architecture and the input problem; tuning a kernel for one set of inputs may not improve its performance on another. The statically optimized versions must be chosen dynamically to obtain the best performance. Existing auto-tuning approaches can handle slowly evolving applications effectively, but are too slow to tune highly input-dependent kernels. We developed Apollo, an auto-tuning extension for RAJA that uses pre-trained, reusable models to tune input-dependent code at runtime. Apollo is designed for highly dynamic applications; it generates sufficiently low-overhead code to tune parameters each time a kernel runs, making fast decisions. We apply Apollo to two hydrodynamics benchmarks and to a production multi-physics code, and show that it can achieve speedups from 1.2x to 4.8x.
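The runtime's decision loop can be illustrated with a toy sketch; the decision-stump "model" and the policy names below are hypothetical stand-ins for Apollo's pre-trained models, intended only to show what a cheap, per-launch decision might look like:

```python
# Illustrative sketch of the Apollo idea: a cheap pretrained model picks an
# execution policy for each kernel invocation from runtime features.
# The threshold model and the policy names are hypothetical stand-ins.

def make_policy_model(threshold):
    """A 'pretrained model' reduced to a decision stump: small iteration
    counts run sequentially (parallel overhead would dominate), large
    ones run in parallel."""
    def predict(num_iterations):
        return "sequential" if num_iterations < threshold else "parallel"
    return predict

def run_kernel(body, n, model):
    """Consult the model on every launch, then dispatch. In a real runtime
    the two policies would be differently-tuned code variants; here both
    compute the same result."""
    policy = model(n)  # fast decision made at each kernel launch
    return [body(i) for i in range(n)], policy
```

The design point being illustrated is that the model is evaluated on every launch, so its cost must be far below the kernel's own runtime.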
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.38
Citations: 19
Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.116
H. Dogan, Farrukh Hijaz, Masab Ahmad, B. Kahne, Peter Wilson, O. Khan
Shared memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. As parallel programming is now mainstream, those common programming styles are challenged by emerging applications that communicate often and involve large amounts of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). We propose a set of auxiliary communication models that utilize explicit messages to accelerate synchronization primitives and efficiently move computation toward data. The results on a 256-core simulated multicore demonstrate that the proposed communication models improve performance and dynamic energy by an average of 4x and 42%, respectively, over traditional shared memory.
Citations: 21
Towards Highly scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor
M. Jacquelin, W. A. Jong, E. Bylaska
The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute-intensive application such as AIMD is a good candidate to leverage this processing power. In this paper, we focus on adding thread-level parallelism to the plane wave DFT methodology implemented in NWChem. Through careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and non-local pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64-water-molecule test case that scales up to all 68 cores of the Knights Landing processor.
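The Roofline model mentioned above reduces to one formula: attainable performance is the minimum of peak compute throughput and memory bandwidth times arithmetic intensity. A minimal sketch, with illustrative rather than measured machine numbers:

```python
# Sketch of the roofline bound used to assess kernel efficiency.
# The peak and bandwidth figures in the test are illustrative, not KNL data.

def roofline(peak_gflops, bw_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity
    (FLOPs performed per byte moved from memory): memory-bound kernels are
    capped by bandwidth * intensity, compute-bound ones by peak FLOP/s."""
    return min(peak_gflops, bw_gbs * intensity_flops_per_byte)
```

An implementation is "close to the roofline" when its measured FLOP/s approaches this bound at its measured intensity.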
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.26
Citations: 14
Scalable Lock-Free Vector with Combining
Ivan Walulya, P. Tsigas
Dynamic vectors are among the most commonly used data structures in programming. They provide constant-time random access and resizable data storage. Additionally, they provide constant-time insertion (pushback) and deletion (popback) at the end of the sequence. However, in a multithreaded system, concurrent pushback and popback operations attempt to update the same shared object, creating a synchronization bottleneck. In this paper, we present a lock-free vector design that efficiently addresses the synchronization bottleneck by applying a combining technique to pushback operations. Typical combining techniques come at the price of blocking. Our design introduces combining without sacrificing lock-freedom. We evaluate the performance of our design on a dual-socket NUMA Intel server. The results show that our design performs comparably at low loads, and outperforms prior concurrent blocking and non-blocking vector implementations at high contention by as much as 2.7x.
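The combining idea can be illustrated with a deliberately simplified, lock-based sketch: concurrent pushbacks are queued, and a combiner applies a whole batch with a single pass over the shared structure. The paper's design achieves this without locks; the class below demonstrates only the batching, not lock-freedom:

```python
import threading

# Simplified, lock-based sketch of combining on pushback: many queued
# requests are applied in one batch, so the shared tail is updated once
# per batch instead of once per operation. This is NOT lock-free; it
# only illustrates why combining relieves contention on the tail.

class CombiningVector:
    def __init__(self):
        self._data = []      # the vector's storage
        self._pending = []   # queued pushback requests awaiting a combiner
        self._lock = threading.Lock()

    def pushback(self, value):
        # Announce the request instead of fighting over the tail.
        with self._lock:
            self._pending.append(value)

    def combine(self):
        """Combiner: drain every queued pushback and append the whole
        batch at once. Returns the number of operations combined."""
        with self._lock:
            batch, self._pending = self._pending, []
            self._data.extend(batch)
            return len(batch)
```

In the actual design, the combiner role and the request announcements are coordinated with atomic operations rather than a lock, preserving lock-freedom.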
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.73
Citations: 6
RCube: A Power Efficient and Highly Available Network for Data Centers
Zhenhua Li, Yuanyuan Yang
Designing a cost-effective network for data centers that can deliver sufficient bandwidth and provide high availability has drawn tremendous attention recently. In this paper, we propose a novel server-centric network structure called RCube, which is energy efficient and can deploy a redundancy scheme to improve the availability of data centers. Moreover, RCube shares many good properties with BCube, a well-known server-centric network structure, yet its network size can be adjusted more conveniently. We also present a routing algorithm to find paths in RCube and an algorithm to build multiple parallel paths between any pair of source and destination servers. In addition, we theoretically analyze the power efficiency of the network and the availability of RCube under server failure. Our comprehensive simulations demonstrate that RCube provides higher availability and more flexibility to trade off among many factors, such as power consumption and aggregate throughput, than BCube, while delivering performance similar to BCube in many critical metrics, such as average path length, path distribution, and graceful degradation. This makes RCube a very promising empirical structure for an enterprise data center network product.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.50
Citations: 4
O(log N)-Time Complete Visibility for Asynchronous Robots with Lights
Gokarna Sharma, R. Vaidyanathan, J. Trahan, C. Busch, S. Rai
We consider the distributed setting of N autonomous mobile robots that operate in Look-Compute-Move (LCM) cycles and communicate with other robots using colored lights (the robots with lights model). We study the fundamental problem of repositioning N autonomous robots on a plane so that each robot is visible to all others (the Complete Visibility problem) in this model; a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them. There exists an O(1)-time, O(1)-color algorithm for this problem in the semi-synchronous setting. In this paper, we provide the first O(log N)-time, O(1)-color algorithm for this problem in the asynchronous setting. This is a significant improvement over an O(N)-time translation of the semi-synchronous algorithm to the asynchronous setting. The proposed algorithm is collision-free: robots do not share positions and their paths do not cross.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.51
Citations: 25
Transparent Caching for RMA Systems
S. D. Girolamo, Flavio Vella, T. Hoefler
The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes Hut simulation and a Local Clustering Coefficient computation up to a factor of 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.
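The caching layer can be sketched as a memoizing wrapper around an expensive remote get. The toy interface below is an assumption for illustration only; CLaMPI itself layers over MPI-3 RMA calls such as MPI_Get, not over a Python function:

```python
# Sketch of transparent caching for remote reads: repeated accesses to the
# same remote region are served from a local cache, avoiding the network.
# The interface is a hypothetical stand-in for MPI-3 RMA gets.

class CachedWindow:
    def __init__(self, remote_fetch):
        self._fetch = remote_fetch   # the expensive remote access
        self._cache = {}             # (offset, size) -> locally cached data
        self.misses = 0

    def get(self, offset, size):
        key = (offset, size)
        if key not in self._cache:   # miss: pay the network cost once
            self._cache[key] = self._fetch(offset, size)
            self.misses += 1
        return self._cache[key]      # hit: no inter-node communication

    def invalidate(self):
        """Drop cached data when the remote side may have changed it,
        e.g., at a synchronization epoch boundary."""
        self._cache.clear()
```

The hard part a real system must solve, which this sketch elides, is deciding when cached entries are stale under the RMA memory model.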
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.92
Citations: 3
Leader Election in Asymmetric Labeled Unidirectional Rings
K. Altisen, A. Datta, Stéphane Devismes, Anaïs Durand, L. Larmore
We study (deterministic) leader election in unidirectional rings of homonym processes that have no a priori knowledge of the number of processes. In this context, we show that there is no algorithm that solves process-terminating leader election for the class of asymmetric labeled rings. In particular, there is no process-terminating leader election algorithm in rings in which at least one label is unique. However, we show that process-terminating leader election is possible for the subclass of asymmetric rings in which multiplicity is bounded. We confirm these positive results by proposing two algorithms, which achieve the classical trade-off between time and space.
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.23
Citations: 12
An Adaptive Core-Specific Runtime for Energy Efficiency
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.114
Sridutt Bhalachandra, Allan Porterfield, Stephen L. Olivier, J. Prins
Energy efficiency in high performance computing (HPC) will be critical to limiting operating costs and carbon footprints in future supercomputing centers. The energy efficiency of a computation can be improved by reducing time to completion without a substantial increase in power drawn, or by reducing power with only a small increase in time to completion. We present an Adaptive Core-specific Runtime (ACR) that dynamically adapts core frequencies to workload characteristics, and show examples of both reductions in power and improvements in average performance. This improvement in energy efficiency is obtained without changes to the application. The adaptation policy embedded in the runtime uses existing core-specific power controls, such as software-controlled clock modulation and the per-core Dynamic Voltage and Frequency Scaling (DVFS) introduced in Intel Haswell. Experiments on six standard MPI benchmarks and a real-world application show an overall 20% improvement in energy efficiency, with less than a 1% increase in execution time, on 32 nodes (1024 cores) using per-core DVFS. An improvement in energy efficiency of up to 42% is obtained with the real-world application ParaDis through a combination of speedup and power reduction. For one configuration, ParaDis achieves an average speedup of 11% while power is lowered by about 31%. The average performance improvement is a direct result of reduced run-to-run variation and running at turbo frequencies.
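The core adaptation idea — run memory-bound phases at low clocks and compute-bound phases at high clocks — can be sketched as a toy frequency policy. Everything here (the IPC thresholds, frequency bounds, and function name) is a hypothetical illustration, not ACR's actual algorithm:

```python
def choose_frequency(ipc, f_min=1.2, f_max=3.0, ipc_low=0.5, ipc_high=1.5):
    """Map a core's measured instructions-per-cycle (IPC) to a clock in GHz.

    Low IPC suggests the core is stalled on memory, so a lower clock saves
    power at little performance cost; high IPC suggests compute-bound work
    that benefits from the full clock. In between, interpolate linearly."""
    if ipc <= ipc_low:
        return f_min
    if ipc >= ipc_high:
        return f_max
    frac = (ipc - ipc_low) / (ipc_high - ipc_low)
    return f_min + frac * (f_max - f_min)
```

A runtime like ACR would evaluate such a policy periodically for each core and apply the chosen frequency through per-core DVFS or clock modulation.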
高性能计算(HPC)的能源效率对于限制未来超级计算中心的运营成本和碳足迹至关重要。可以通过减少完成时间而不大幅增加功耗或通过减少功耗而稍微增加完成时间来提高计算的能源效率。我们提出了一个特定于核心的自适应运行时(ACR),它可以根据工作负载特征动态地调整核心频率,并展示了功耗降低和平均性能提高的示例。这种能源效率的提高是在不改变应用程序的情况下获得的。在运行时中嵌入的自适应策略使用了现有的特定于核心的功率控制,如英特尔Haswell中引入的软件控制时钟调制和单核动态电压频率缩放(DVFS)。在六个标准MPI基准测试和一个实际应用程序上进行的实验表明,使用每核DVFS在32个节点(1024核)上,能源效率总体提高了20%,执行时间增加了不到1%。通过加速和降低功耗的结合,ParaDis在实际应用中获得了高达42%的能源效率提高。对于一种配置,ParaDis实现了11%的平均加速,而功耗降低了约31%。性能的平均改善是减少运行到运行的变化和在涡轮频率下运行的直接结果。
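Since energy is average power multiplied by execution time, the interplay of the speedup and power figures reported above reduces to one line of arithmetic. The helper below is only an illustration of that relationship (the function name is ours, not from the paper):

```python
def energy_ratio(speedup, power_ratio):
    """Relative energy of a tuned run versus its baseline.

    Energy = average power x execution time, so a run that is `speedup`
    times faster while drawing `power_ratio` of the baseline power uses
    power_ratio / speedup of the baseline energy."""
    return power_ratio / speedup
```

For the configuration quoted above (11% speedup, 31% lower power), `energy_ratio(1.11, 0.69)` is about 0.62, i.e. roughly 38% less energy for that particular run.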
Citations: 25
Journal
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)