QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00040
M. Nejat, M. Pericàs, P. Stenström
Applications that run on multicore systems without performance targets can waste significant energy. This paper presents, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache (LLC) partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of individual applications in multi-programmed workloads on multi-core systems. It does so by exploring the configuration space of LLC partition sizes and DVFS settings at runtime, at negligible overhead. We show that our coordinated RMA saves, on average, 20% energy, compared to 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.
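The abstract does not spell out the exploration step; below is a minimal, hypothetical sketch of how a per-core controller might pick the lowest-energy (LLC ways, DVFS level) configuration that still meets a performance target. The model callbacks and parameter names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: exhaustive configuration-space search per core.
# perf_model / energy_model stand in for the runtime's predictors.
def select_config(perf_model, energy_model, target_perf,
                  llc_way_options, dvfs_levels):
    """Return the lowest-energy (ways, freq) pair that meets target_perf."""
    best = None
    for ways in llc_way_options:
        for freq in dvfs_levels:
            perf = perf_model(ways, freq)        # predicted performance
            if perf < target_perf:               # QoS constraint violated
                continue
            energy = energy_model(ways, freq)    # predicted energy
            if best is None or energy < best[0]:
                best = (energy, ways, freq)
    return best                                  # None if target is infeasible
```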
Network Size Estimation in Small-World Networks Under Byzantine Faults
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00094
Soumyottam Chatterjee, Gopal Pandurangan, Peter Robinson
We study the fundamental problem of counting the number of nodes in a sparse network (of unknown size) in the presence of a large number of Byzantine nodes. We assume the full information model, where the Byzantine nodes have complete knowledge of the entire state of the network at every round (including the random choices made by all the nodes), have unbounded computational power, and can deviate arbitrarily from the protocol. Essentially all known algorithms for fundamental Byzantine problems (e.g., agreement, leader election, sampling) studied in the literature assume knowledge (or at least an estimate) of the size of the network. It is non-trivial to design algorithms for Byzantine problems that work without knowledge of the network size, especially in bounded-degree (expander) networks where the local views of all nodes are (essentially) the same and limited, and Byzantine nodes can quite easily fake the presence/absence of non-existing nodes. To design truly local algorithms that do not rely on any global knowledge (including network size), estimating the size of the network under Byzantine nodes is an important first step. Our main contribution is a randomized distributed algorithm that estimates the size of a network in the presence of a large number of Byzantine nodes. In particular, our algorithm estimates the size of a sparse, "small-world", expander network with up to O(n^(1-Δ)) Byzantine nodes, where n is the (unknown) network size and Δ > 0 can be any arbitrarily small (but fixed) constant. Our algorithm outputs a (fixed) constant-factor estimate of log(n) with high probability; the correct estimate of the network size will be known to a (1 - ε)-fraction of the honest nodes, for any fixed positive constant ε. Our algorithm is fully distributed, lightweight, and simple to implement, runs in O(log^3 n) rounds, requires nodes to send and receive only small-sized messages per round, and keeps any node's local computation cost per round small.
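The Byzantine-resilient algorithm itself is not reproduced in the abstract. As a toy, fault-free illustration of why local randomness alone can reveal log(n), the classic order-statistics estimator below uses the fact that the minimum of n uniform draws concentrates around 1/n, so -log2(min) matches log2(n) up to a small additive constant. This is a standard textbook trick, plainly not the paper's algorithm.

```python
# Toy, fault-free illustration (NOT the paper's Byzantine-resilient
# algorithm): the minimum of n uniform draws is about 1/n, so
# -log2(min) estimates log2(n) up to a small additive bias (~0.83).
import math
import random

def estimate_log_n(n, trials=64):
    estimates = []
    for _ in range(trials):
        smallest = min(random.random() for _ in range(n))
        estimates.append(math.log2(1.0 / smallest))
    return sum(estimates) / len(estimates)

random.seed(0)
print(estimate_log_n(1 << 10))   # roughly 10.8 for n = 1024
```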
Message from the Program Chair and Vice Chairs
DOI: 10.1109/PERCOM.2005.25
Christian Becker, A. Dey, F. Lau, Gergely Záruba (Vice-Chairs)
We are pleased to announce an excellent technical program for the 6th International Conference on Pervasive Computing and Communications. The program covers a broad cross-section of topics in pervasive computing and communications. This year, 160 papers were submitted to the program committee for consideration. The selection process was therefore highly competitive, and the result is a program of high-quality papers.
Tight & Simple Load Balancing
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00080
P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling
We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective numbers of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌊(ℓ_1 + ℓ_2)/2⌋ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.
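The process is simple enough to simulate directly; the sketch below (with assumed parameters) runs the pairwise averaging step and reports the remaining load discrepancy after roughly n(log n + log Δ) steps.

```python
# Direct simulation of the process above: pick a random pair, split their
# combined load into ceil and floor halves. Parameters are illustrative.
import random

def balance(tokens, steps):
    n = len(tokens)
    for _ in range(steps):
        i, j = random.sample(range(n), 2)  # pair chosen uniformly at random
        total = tokens[i] + tokens[j]
        tokens[i] = (total + 1) // 2       # ceil(total / 2)
        tokens[j] = total // 2             # floor(total / 2)
    return max(tokens) - min(tokens)       # remaining discrepancy

random.seed(1)
load = [random.randint(0, 1000) for _ in range(100)]
print(balance(load, 2000))                 # ~ n (log n + log Δ) steps
```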
A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00026
M. Özkaya, A. Benoit, B. Uçar, J. Herrmann, Ümit V. Çatalyürek
When scheduling a directed acyclic graph (DAG) of tasks with communication costs on computational platforms, a good trade-off between load balance and data locality is necessary. List-based scheduling techniques are commonly used greedy approaches for this problem. The downside of list-scheduling heuristics is that they are incapable of making short-term sacrifices for the global efficiency of the schedule. In this work, we describe new list-based scheduling heuristics based on clustering for homogeneous platforms, under the realistic duplex single-port communication model. Our approach uses an acyclic DAG partitioner for clustering. The clustering enhances the data locality of the scheduler by giving it a global view of the graph. Furthermore, since the partition is acyclic, we can schedule each part completely once its input tasks are ready to be executed. We present an extensive experimental evaluation showing the trade-offs between the granularity of clustering and the parallelism, and how this affects the scheduling. Furthermore, we compare our heuristics to the best state-of-the-art list-scheduling and clustering heuristics, and obtain makespans more than three times better in cases with many communications.
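A hedged sketch of the overall flow, not the paper's implementation: the acyclic partitioner is treated as a given task-to-part mapping, the parts form a coarse DAG, and each part is scheduled as one unit on the earliest-free processor once its predecessor parts have finished. Communication costs and the part-cost model are assumed placeholders.

```python
# Sketch of clustering-based scheduling over an acyclic partition.
# task_succ: {task: [successor tasks]}; part_of: task -> part id;
# part_cost: part id -> execution time. Communication costs omitted.
from collections import defaultdict

def schedule_parts(task_succ, part_of, part_cost, num_procs):
    pred = defaultdict(set)                   # part-level predecessor sets
    for t, succs in task_succ.items():
        for s in succs:
            a, b = part_of[t], part_of[s]
            if a != b:
                pred[b].add(a)
    remaining = set(part_of.values())
    finish = {}                               # part -> finish time
    procs = [0.0] * num_procs                 # next free time per processor
    while remaining:                          # partition is acyclic, so this terminates
        # A part is ready when all its predecessor parts have finished.
        ready = [p for p in remaining if pred[p] <= finish.keys()]
        for p in sorted(ready):
            i = min(range(num_procs), key=procs.__getitem__)
            start = max([procs[i]] + [finish[q] for q in pred[p]])
            finish[p] = start + part_cost[p]
            procs[i] = finish[p]
            remaining.remove(p)
    return max(finish.values())               # makespan
```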
Dynamic Memory Management for GPU-Based Training of Deep Neural Networks
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00030
B. ShriramS, Anshuj Garg, Purushottam Kulkarni
Deep learning has been widely adopted for different applications of artificial intelligence: speech recognition, natural language processing, computer vision, etc. The growing size of Deep Neural Networks (DNNs) has compelled researchers to design memory-efficient and performance-optimal algorithms. Apart from algorithmic improvements, specialized hardware like Graphics Processing Units (GPUs) is widely employed to accelerate the training and inference phases of deep networks. However, limited GPU memory capacity bounds the size of the networks that can be offloaded to and trained on GPUs. vDNN addresses this GPU memory bottleneck and enables training of deep networks that are larger than GPU memory. In our work, we characterize and identify multiple bottlenecks in vDNN, such as delayed computation start, high pinned-memory requirements, and GPU memory fragmentation. We present vDNN++, which extends vDNN and resolves the identified issues. Our results show that the performance of vDNN++ is comparable to or better than vDNN (up to 60% relative improvement). We propose different heuristics and orderings for memory allocation, and empirically evaluate the extent of memory fragmentation under each. We also reduce the pinned-memory requirement by up to 60%.
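vDNN's core idea, which vDNN++ builds on, is to keep only a window of layer activations on the GPU, offloading the rest to pinned host memory and prefetching them back before the corresponding backward step. The toy bookkeeping below illustrates that policy only; the budget, layer ids, and synchronous "copies" are stand-ins for real asynchronous GPU transfers.

```python
# Toy bookkeeping for a vDNN-style offload policy (illustration only; real
# systems overlap asynchronous copies with compute on separate streams).
class OffloadManager:
    def __init__(self, budget):
        self.budget = budget      # max layers whose activations stay on GPU
        self.on_gpu = []          # layer ids resident on the GPU
        self.offloaded = set()    # layer ids parked in pinned host memory

    def after_forward(self, layer_id):
        self.on_gpu.append(layer_id)
        if len(self.on_gpu) > self.budget:
            victim = self.on_gpu.pop(0)   # evict the oldest activations
            self.offloaded.add(victim)    # (async device-to-host copy)

    def before_backward(self, layer_id):
        if layer_id in self.offloaded:    # prefetch ahead of the gradient op
            self.offloaded.remove(layer_id)
            self.on_gpu.append(layer_id)  # (async host-to-device copy)
```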
LLC-Guided Data Migration in Hybrid Memory Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00101
E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis
Although 3D-stacked DRAM offers substantially higher bandwidth than commodity DDR DIMMs, it cannot yet provide the necessary capacity to replace the bulk of the memory. A promising alternative is a flat-address-space hybrid memory system of two or more levels, each exhibiting different performance characteristics. One such existing approach employs a near, high-bandwidth 3D-stacked memory, placed on top of the processor die, combined with a far, commodity DDR memory, placed off-chip. Migrating data from the far to the near memory has significant performance potential, but also entails overheads, which may diminish migration benefits or even lead to performance degradation. This paper describes a new data migration scheme for hybrid memory systems that takes these overheads into account and improves migration efficiency and effectiveness. It is based on the observation that migrating memory segments which are (at least partly) present in the Last-Level Cache (LLC) generates less migration traffic. Our approach relies on the state of the LLC cachelines to predict future reuse and select memory segments for migration. Thereby, segments are migrated while present (at least partly) in the LLC, incurring lower cost. Our experiments confirm that our approach outperforms current state-of-the-art migration designs, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.
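The traffic argument can be made concrete with a small cost model: cachelines of a segment that already sit in the LLC do not have to be re-read from far memory during migration. The sketch below is an assumed illustration of that selection rule, not the paper's hardware mechanism; the Segment fields and the traffic budget are hypothetical.

```python
# Assumed illustration of LLC-guided candidate selection: segments with more
# LLC-resident lines cost less far-memory traffic to migrate.
from collections import namedtuple

Segment = namedtuple("Segment", "seg_id total_lines llc_lines")

def migration_cost(seg, line_bytes=64):
    # Only lines absent from the LLC must be fetched from far memory.
    return (seg.total_lines - seg.llc_lines) * line_bytes

def pick_candidates(segments, budget_bytes):
    chosen, spent = [], 0
    for seg in sorted(segments, key=migration_cost):  # cheapest first
        cost = migration_cost(seg)
        if spent + cost > budget_bytes:
            break
        chosen.append(seg)
        spent += cost
    return chosen

demo = [Segment(0, 32, 30), Segment(1, 32, 2), Segment(2, 32, 16)]
print([s.seg_id for s in pick_candidates(demo, budget_bytes=2048)])  # [0, 2]
```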
BigSpa: An Efficient Interprocedural Static Analysis Engine in the Cloud
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00086
Zhiqiang Zuo, Rong Gu, Xi Jiang, Zhaokang Wang, Yihua Huang, Linzhang Wang, Xuandong Li
Static program analysis is widely used in various application areas to solve many practical problems. Although researchers have made significant achievements in static analysis, it remains challenging to perform sophisticated interprocedural analysis on large-scale modern software. The underlying reason is that such interprocedural analysis is highly computation- and memory-intensive, leading to poor scalability. We aim to tackle the scalability problem by proposing a novel big-data solution for sophisticated static analysis. Specifically, we propose a data-parallel algorithm and a join-process-filter computation model for CFL-reachability-based interprocedural analysis, and develop an efficient distributed static analysis engine in the cloud, called BigSpa. Our experiments validate that BigSpa running on a cluster scales well, performs precise interprocedural analyses on millions of lines of code, and runs an order of magnitude or more faster than existing state-of-the-art analysis tools.
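CFL-reachability, the computation BigSpa distributes, closes a labeled edge set under the productions of a context-free grammar. A minimal single-machine worklist version is sketched below under assumed input formats; BigSpa's join-process-filter model partitions and batches this same closure, which the sketch does not attempt to show.

```python
# Minimal single-machine CFL-reachability closure. Assumed input format:
# edges are (u, label, v); productions are (A, B) for A -> B, or
# (A, B, C) for A -> B C.
from collections import defaultdict

def cfl_reachability(edges, productions):
    closed = set(edges)
    work = list(closed)
    out_edges = defaultdict(set)   # u -> {(label, v)} over processed edges
    in_edges = defaultdict(set)    # v -> {(label, u)} over processed edges

    def add(edge):
        if edge not in closed:
            closed.add(edge)
            work.append(edge)

    while work:
        u, b, v = work.pop()
        out_edges[u].add((b, v))
        in_edges[v].add((b, u))
        for prod in productions:
            if len(prod) == 2:                       # A -> B
                a, lhs = prod
                if lhs == b:
                    add((u, a, v))
            else:                                    # A -> B C
                a, left, right = prod
                if left == b:                        # edge plays the B role
                    for lbl, w in out_edges[v]:
                        if lbl == right:
                            add((u, a, w))
                if right == b:                       # edge plays the C role
                    for lbl, t in in_edges[u]:
                        if lbl == left:
                            add((t, a, v))
    return closed

# Tiny demo: transitive closure expressed as the grammar T -> e | T T.
print(cfl_reachability({("x", "e", "y"), ("y", "e", "z")},
                       [("T", "e"), ("T", "T", "T")]))
```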
SAC Goes Cluster: Fully Implicit Distributed Computing
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00107
Thomas Macht, C. Grelck
SAC (Single Assignment C) is a purely functional, data-parallel array programming language that predominantly targets compute-intensive applications. Thus, clusters of workstations, or distributed memory architectures in general, form highly relevant compilation targets. Notwithstanding, SAC as of today only supports shared-memory architectures, graphics accelerators, and heterogeneous combinations thereof. In our current work we aim at closing this gap. At the same time, we are determined to uphold SAC's promise of entirely compiler-directed exploitation of concurrency, no matter what the target architecture is. Distributed memory architectures are going to make this promise a particular challenge. Despite SAC's functional semantics, it is generally far from straightforward to infer exact communication patterns from architecture-agnostic code. Therefore, we intend to capitalise on recent advances in network technology, namely the closing of the gap between memory bandwidth and network bandwidth. We aim at a solution based on a custom-designed software distributed shared memory (S-DSM) and large per-node software-managed cache memories. To this effect, the functional nature of SAC, with its write-once/read-only arrays, provides a strategic advantage that we thoroughly exploit. Throughout the paper we further motivate our approach, sketch out our implementation strategy, show preliminary results, and discuss the pros and cons of our approach.
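The strategic advantage of write-once/read-only arrays is easy to see in miniature: a node-local software cache for such arrays never needs an invalidation protocol, because a fetched copy can never become stale. The class below is an assumed toy model of that property, not SAC's actual S-DSM runtime; fetch_remote is a hypothetical callback.

```python
# Toy model (assumed, not SAC's runtime): caching blocks of write-once
# arrays needs no coherence traffic, since cached copies never go stale.
class NodeCache:
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote   # (array_id, block) -> bytes
        self.blocks = {}                   # (array_id, block) -> bytes

    def read(self, array_id, block):
        key = (array_id, block)
        if key not in self.blocks:         # miss: fetch once over the network
            self.blocks[key] = self.fetch_remote(array_id, block)
        return self.blocks[key]            # hit: purely local, always valid
```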
The Path to Delivering Programmable Exascale Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00081
L. DeRose
The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of designing and developing the software developer environment deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems create a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability while not losing sight of performance portability. In this talk, I will discuss the current trends in computer architecture and their implications for application development, and will present Cray's high-level parallel programming environment for performance and programmability on current and future supercomputers. I will also discuss some of the challenges and open research problems that need to be addressed in order to build a software developer environment for extreme-scale systems that helps users solve multi-disciplinary and multi-scale problems with high levels of performance, programmability, and scalability.