
Latest publications: 2014 IEEE 28th International Parallel and Distributed Processing Symposium

A Framework for Lattice QCD Calculations on GPUs
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.112
F. Winter, M. Clark, R. Edwards, B. Joó
Computing platforms equipped with accelerators like GPUs have proven to provide great computational power. However, exploiting such platforms for existing scientific applications is not a trivial task. Current GPU programming frameworks such as CUDA C/C++ require low-level programming from the developer in order to achieve high-performance code. As a result, porting of applications to GPUs is typically limited to time-dominant algorithms and routines, leaving the remainder unaccelerated, which can open a serious Amdahl's law issue. The Lattice QCD application Chroma allows us to explore a different porting strategy. The layered structure of the software architecture logically separates the data-parallel layer from the application layer. The QCD Data-Parallel software layer provides data types and expressions with stencil-like operations suitable for lattice field theory. Chroma implements algorithms in terms of this high-level interface. Thus, by porting the low-level layer, one effectively ports the whole application layer in one sweep. The QDP-JIT/PTX library, our reimplementation of the low-level layer, provides a framework for Lattice QCD calculations on the CUDA architecture. The complete software interface is supported, and thus applications can run unaltered on GPU-based parallel computers. This reimplementation was made possible by the availability of a JIT compiler which translates an assembly language (PTX) to GPU code. The existing expression templates enabled us to employ compile-time computations to build code generators and to automate memory management for CUDA. Our implementation has allowed us to deploy the full Chroma gauge-generation program on large-scale GPU-based machines such as Titan and Blue Waters and to accelerate the calculation by more than an order of magnitude.
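The key idea behind the expression-template approach the abstract describes is that operator overloading builds an expression tree instead of evaluating eagerly, so a whole statement can later be compiled into a single fused kernel. The following is a minimal illustrative sketch in Python (not the QDP++/QDP-JIT API; all class and function names here are invented for illustration), where `evaluate` plays the role the PTX code generator plays in the real library:

```python
# Hypothetical sketch of deferred (expression-template-style) evaluation:
# a*x + y builds a tree, and evaluate() makes ONE pass over all sites --
# the point at which a JIT backend would instead emit a single GPU kernel.
class Expr:
    def __add__(self, other): return Add(self, other)
    def __mul__(self, other): return Mul(self, other)

class Field(Expr):
    """A lattice-wide field; at(i) yields the value at site i."""
    def __init__(self, data): self.data = list(data)
    def at(self, i): return self.data[i]
    def __len__(self): return len(self.data)

class Add(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def at(self, i): return self.l.at(i) + self.r.at(i)
    def __len__(self): return len(self.l)

class Mul(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def at(self, i): return self.l.at(i) * self.r.at(i)
    def __len__(self): return len(self.l)

def evaluate(expr):
    # Single fused traversal over all lattice sites.
    return Field(expr.at(i) for i in range(len(expr)))

a = Field([1, 2, 3]); x = Field([4, 5, 6]); y = Field([7, 8, 9])
r = evaluate(a * x + y)   # builds Add(Mul(a, x), y), then one loop
print(r.data)             # [11, 18, 27]
```

Because the whole expression is visible before any evaluation happens, a backend can generate one kernel per statement rather than one per operator, which is what makes porting the low-level layer sufficient to port the whole application.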
Citations: 34
High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.72
Akshay Venkatesh, S. Potluri, R. Rajachandrasekar, Miao Luo, Khaled Hamidouche, D. Panda
Intel's Many-Integrated-Core (MIC) architecture aims to provide teraflop throughput (through high degrees of parallelism) with a high FLOP/watt ratio and x86 compatibility. However, this two-fold approach to solving the power and programmability challenges of exascale computing is constrained by certain architectural idiosyncrasies. MIC coprocessors have a memory-constrained environment, and their processors operate at slower clock rates. Also, being PCI devices, the communication characteristics of MIC coprocessors differ from the communication behavior seen in homogeneous environments. For instance, sending data from the MIC memory to a remote node's memory through message-passing routines has 3x-6x higher latency than sending from the host processor memory. Hence, communication libraries that do not consider these architectural subtleties are likely to nullify performance benefits, or even cause degradation, in applications that intend to use MICs and rely heavily on communication routines. The performance of Message Passing Interface (MPI) operations, especially dense collective operations like All-to-all and Allgather, strongly affects the performance of many distributed parallel applications. In this paper, we revisit state-of-the-art algorithms commonly used to implement All-to-all collectives and propose adaptations and optimizations to alleviate architectural bottlenecks on MIC clusters. We also propose a few novel designs to improve the communication latency of these operations. Through micro-benchmarks and applications, we substantiate the benefits of incorporating the proposed adaptations into the All-to-all collective operations. At the micro-benchmark level, the proposed designs show up to 79% improvement for the Allgather operation and up to 70% improvement for All-to-all; with the P3DFFT application, an improvement of 38% is seen in overall execution time.
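For readers unfamiliar with the collective being optimized: in an All-to-all (personalized) exchange, every process sends a distinct block to every other process. A classic schedule, which papers in this area adapt and tune, is the pairwise XOR exchange. The sketch below simulates that schedule in plain Python (it is an illustration of the communication pattern, not the paper's MVAPICH-level design; `alltoall_pairwise` is an invented name), assuming the process count is a power of two:

```python
# Hedged sketch of the pairwise-exchange All-to-all schedule.
def alltoall_pairwise(send, nprocs):
    """send[p][q] = block process p sends to q; returns recv[p][q]."""
    recv = [[None] * nprocs for _ in range(nprocs)]
    for p in range(nprocs):
        recv[p][p] = send[p][p]          # local copy, no network traffic
    for step in range(1, nprocs):        # nprocs - 1 communication steps
        for p in range(nprocs):
            partner = p ^ step           # XOR schedule (nprocs power of 2)
            recv[p][partner] = send[partner][p]
    return recv

send = [[f"{src}->{dst}" for dst in range(4)] for src in range(4)]
recv = alltoall_pairwise(send, 4)
print(recv[2])  # blocks received by process 2: ['0->2', '1->2', '2->2', '3->2']
```

Each of the nprocs - 1 steps pairs every process with exactly one partner, which is why the per-step transfer latency (much higher when the buffer lives in MIC memory, as the abstract notes) multiplies directly into the collective's total cost.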
Citations: 6
Balancing On-Chip Network Latency in Multi-application Mapping for Chip-Multiprocessors
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.94
Di Zhu, Lizhong Chen, Siyu Yue, T. Pinkston, Massoud Pedram
As the number of cores continues to grow in chip multiprocessors (CMPs), application-to-core mapping algorithms that leverage the non-uniform on-chip resource access time have been receiving increasing attention. However, existing mapping methods for reducing overall packet latency cannot meet the requirement of balanced on-chip latency when multiple applications are present. In this paper, we address the looming issue of balancing minimized on-chip packet latency with performance-awareness in the multi-application mapping of CMPs. Specifically, the proposed mapping problem is formulated, its NP-completeness is proven, and an efficient heuristic-based algorithm for solving the problem is presented. Simulation results show that the proposed algorithm is able to reduce the maximum average packet latency by 10.42% and the standard deviation of packet latency by 99.65% among concurrently running applications and, at the same time, incur little degradation in the overall performance.
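To make the two metrics the abstract reports concrete (maximum average packet latency, and the standard deviation of latency across applications), here is a small illustrative calculation, assuming a 2D-mesh CMP where hop count (Manhattan distance) serves as the latency proxy. This is not the paper's algorithm, only the balance objective it optimizes:

```python
# Per-application average on-chip latency proxy on a 2D mesh:
# Manhattan hop distance averaged over all core pairs of that application.
from itertools import combinations
from statistics import pstdev

def hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def avg_latency(core_coords):
    pairs = list(combinations(core_coords, 2))
    return sum(hops(a, b) for a, b in pairs) / len(pairs)

# Two apps mapped onto a 4x4 mesh: a compact 2x2 block vs. a scattered set.
compact   = [(0, 0), (0, 1), (1, 0), (1, 1)]
scattered = [(0, 0), (0, 3), (3, 0), (3, 3)]
lat = [avg_latency(compact), avg_latency(scattered)]
print(max(lat), round(pstdev(lat), 2))  # balancing aims to shrink both numbers
```

A mapping algorithm like the one proposed would trade a little compactness of the well-placed application for a large latency reduction of the scattered one, lowering both the maximum and the spread.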
Citations: 11
Complex Network Analysis Using Parallel Approximate Motif Counting
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.50
George M. Slota, Kamesh Madduri
Subgraph counting forms the basis of many complex network analysis metrics, including motif and anti-motif finding, relative graphlet frequency distance, and graphlet degree distribution agreement. Determining exact subgraph counts is computationally very expensive. In recent work, we presented FASCIA, a shared-memory parallel algorithm and implementation for approximate subgraph counting. FASCIA uses a dynamic programming-based approach and is significantly faster than exhaustive enumeration, while generating high-quality approximations of subgraph counts. However, the memory usage of the dynamic programming step prohibits us from applying FASCIA to very large graphs. In this paper, we introduce a distributed-memory parallelization of FASCIA by partitioning the graph and the dynamic programming table. We discuss a new collective communication scheme to make the dynamic programming step memory-efficient. These optimizations enable scaling to much larger networks than before. We also present a simple parallelization strategy for distributed subgraph counting on smaller networks. The new additions let us use subgraph counts as graph signatures for a large network collection, and we analyze this collection using various subgraph count-based graph analytics.
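To see why exact subgraph counting is expensive, and what approximate methods like FASCIA avoid, consider the simplest case: counting one 3-vertex motif by exhaustive enumeration. This toy sketch (not FASCIA's color-coding dynamic program) already visits every vertex triple, i.e. O(n^3) work, and the cost grows combinatorially with motif size:

```python
# Brute-force exact count of a 3-vertex motif (the triangle) by
# enumerating all vertex triples -- the baseline approximate counting beats.
from itertools import combinations

def count_triangles(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    n = len(adj)
    return sum(1 for a, b, c in combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

# A 4-clique: every 3-subset of its vertices is a triangle, C(4,3) = 4.
adj = {i: {j for j in range(4) if j != i} for i in range(4)}
print(count_triangles(adj))  # 4
```

For tree-shaped motifs of k vertices on graphs with millions of edges, this kind of enumeration is infeasible, which is what motivates the dynamic-programming approximation and, in this paper, its distributed-memory partitioning.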
Citations: 32
Cost-Optimal Execution of Boolean Query Trees with Shared Streams
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.13
H. Casanova, Lipyeow Lim, Y. Robert, F. Vivien, Dounia Zaidouni
The processing of queries expressed as trees of Boolean operators applied to predicates on sensor data streams has several applications in mobile computing. Sensor data must be retrieved from the sensors, which incurs a cost, e.g., an energy expense that depletes the battery of a mobile query-processing device. The objective is to determine the order in which predicates should be evaluated so as to shortcut part of the query evaluation and minimize the expected cost. This problem has been studied under the assumption that each data stream occurs at a single predicate. In this work, we remove this assumption since it does not necessarily hold in practice. Our main results are an optimal algorithm for single-level trees and a proof of NP-completeness for DNF trees. For DNF trees, however, we show that there is an optimal predicate evaluation order that corresponds to a depth-first traversal. This result provides inspiration for a class of heuristics. We show that one of these heuristics largely outperforms other sensible heuristics, including a heuristic proposed in previous work.
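The single-level case the abstract mentions has a well-known shape when streams are not shared and predicates are independent: for an AND node, evaluating predicates in nondecreasing order of cost over failure probability, c_i / (1 - p_i), minimizes expected cost, since evaluation stops at the first false predicate. A small sketch of that classical rule (illustrative values; this is the baseline the paper generalizes, not its shared-stream algorithm):

```python
# Expected-cost model for short-circuit evaluation of an AND of
# independent predicates: predicate i costs c_i and is true w.p. p_i.
def expected_cost(order, cost, prob):
    total, p_all_true = 0.0, 1.0
    for i in order:
        total += p_all_true * cost[i]   # pay c_i only if all before were true
        p_all_true *= prob[i]           # probability we keep evaluating
    return total

cost = {"a": 4.0, "b": 1.0, "c": 2.0}
prob = {"a": 0.9, "b": 0.5, "c": 0.5}

# Classical optimal order: ascending c_i / (1 - p_i).
best = sorted(cost, key=lambda i: cost[i] / (1.0 - prob[i]))
print(best, expected_cost(best, cost, prob))  # ['b', 'c', 'a'] 3.0
```

Cheap, likely-to-fail predicates go first; the expensive, almost-always-true predicate "a" is deferred until the cheaper ones have had their chance to short-circuit the query.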
Citations: 6
Collaborative Network Configuration in Hybrid Electrical/Optical Data Center Networks
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.92
Zhiyang Guo, Yuanyuan Yang
Recently, there has been much effort to introduce optical fiber communication into data center networks (DCNs) because of its significant advantages in bandwidth capacity and power efficiency. However, due to limitations of optical switching technologies, optical networking alone has not yet been able to accommodate the volatile data center traffic. As a result, hybrid packet/circuit (Hypac) switched DCNs, which augment the electrical packet-switched (EPS) network with an optical circuit-switched (OCS) network, have been proposed to combine the strengths of both types of networks. However, one problem with current Hypac DCNs is that the EPS network is shared in a best-effort fashion and is largely oblivious to the accompanying OCS network, which results in severe drawbacks such as degraded network predictability and deficiencies in handling correlated traffic. Since the OCS and EPS networks have unique strengths and weaknesses, and are best suited to different traffic patterns, coordinating the configuration of both networks is critical to reaching the full potential of Hypac DCNs, which motivates the study in this paper. First, we present a network model that accurately abstracts the essential characteristics of the EPS/OCS networks. Second, considering recent advances in network control technology, we propose a time-efficient algorithm called Collaborative Bandwidth Allocation (CBA) that configures both networks in a complementary manner. Finally, we conduct comprehensive simulations, which demonstrate that CBA significantly improves the performance of Hypac DCNs in many aspects.
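A common baseline policy in hybrid electrical/optical DCNs, which collaborative schemes like CBA improve upon, is to offload the largest (elephant) flows onto the limited optical circuits and leave the remaining mice flows to the packet-switched network. The sketch below illustrates only that baseline split (it is not the paper's CBA algorithm; `split_flows` and the flow values are invented for illustration):

```python
# Baseline Hypac flow assignment: biggest flows get the optical circuits,
# everything else stays on the electrical packet-switched network.
def split_flows(flows, n_circuits):
    """flows: {(src, dst): bytes}; returns (optical, electrical) flow sets."""
    ranked = sorted(flows, key=flows.get, reverse=True)
    optical = set(ranked[:n_circuits])       # one circuit per large flow
    electrical = set(ranked[n_circuits:])    # best-effort EPS traffic
    return optical, electrical

flows = {("a", "b"): 10_000, ("a", "c"): 200, ("b", "c"): 5_000}
opt, eps = split_flows(flows, n_circuits=1)
print(opt, eps)
```

The drawback the abstract points out is visible even here: the EPS side receives whatever is left over with no awareness of the OCS assignment, which is the gap a collaborative, jointly computed configuration closes.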
Citations: 4
Finding Motifs in Biological Sequences Using the Micron Automata Processor
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.51
Indranil Roy, S. Aluru
Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-hard, and the largest solved instance reported to date is (26, 11). We propose a novel algorithm for the (l, d) motif search problem using streaming execution over a large set of Non-deterministic Finite Automata (NFA). This solution is designed to take advantage of the Micron Automata Processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We estimate the run-time for the (39, 18) and (40, 17) problem instances using the resources available within a single Automata Processor board. In addition to solving larger instances of the (l, d) motif search problem, the paper serves as a useful guide to solving problems using this new accelerator technology.
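The (l, d) membership test at the core of the problem statement is easy to state in code: a candidate motif of length l "occurs" in a sequence if some length-l window is within Hamming distance d of it, and the motif must occur in at least q of the n sequences. The brute-force checker below captures that definition (the hardware NFA approach in the paper evaluates many such candidates in parallel; this sketch is just the scoring predicate, with invented example data):

```python
# (l, d) motif check: motif occurs in seq if some window of length l
# differs from it in at most d positions; a motif must hit >= q sequences.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def occurs(motif, seq, d):
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def is_motif(motif, seqs, q, d):
    return sum(occurs(motif, s, d) for s in seqs) >= q

seqs = ["ACGTACGT", "AAGTTCGT", "ACCTACTT", "GGGGGGGG"]
print(is_motif("ACGT", seqs, q=3, d=1))  # True: hits the first 3 sequences
```

The difficulty of instances like (39, 18) comes from the candidate space: every length-l string within distance d of any window is a potential motif, which is why massively parallel NFA execution is attractive.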
Citations: 72
Heterogeneity-Aware Workload Placement and Migration in Distributed Sustainable Datacenters
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.41
Dazhao Cheng, Changjun Jiang, Xiaobo Zhou
While major cloud service operators have taken various initiatives to operate their sustainable datacenters with green energy, it is challenging to utilize green energy effectively since its generation depends on dynamic natural conditions. Fortunately, the geographical distribution of datacenters provides an opportunity to optimize system performance by distributing cloud workloads. In this paper, we propose a holistic heterogeneity-aware cloud workload placement and migration approach, sCloud, that aims to maximize system goodput in distributed self-sustainable datacenters. sCloud adaptively places the transactional workload to distributed datacenters, allocates the available resource to heterogeneous workloads in each datacenter, and migrates batch jobs across datacenters, while taking into account the green power availability and QoS requirements. We formulate the transactional workload placement as a constrained optimization problem that can be solved by nonlinear programming. Then, we propose a batch job migration algorithm to further improve system goodput when the green power supply varies widely across locations. We have implemented sCloud in a university cloud testbed with real-world weather conditions and workload traces. Experimental results demonstrate that sCloud can achieve near-optimal system performance while remaining resilient to dynamic power availability. It outperforms a heterogeneity-oblivious approach by 26% in improving system goodput and by 29% in reducing QoS violations.
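The batch-job side of the problem has an intuitive greedy baseline: push deferrable jobs toward whichever datacenter currently has the most spare green power. The sketch below shows only that baseline intuition (it is not sCloud's nonlinear-programming formulation; function name, job names, and wattages are invented for illustration):

```python
# Greedy baseline for green-aware batch placement: largest jobs first,
# each to the datacenter with the most spare green power remaining.
def place_batch_jobs(jobs, green_watts):
    """jobs: {job: watts}; green_watts: {dc: available green power}."""
    spare = dict(green_watts)
    placement = {}
    for job, w in sorted(jobs.items(), key=lambda kv: -kv[1]):
        dc = max(spare, key=spare.get)      # most spare green power wins
        placement[job] = dc
        spare[dc] -= w                      # may go negative if oversubscribed
    return placement

jobs = {"analytics": 300, "backup": 200, "indexing": 100}
green = {"east": 350, "west": 500}
print(place_batch_jobs(jobs, green))
```

A holistic approach like the one proposed improves on this by solving placement and migration jointly under QoS constraints, rather than job by job, which matters when green supply swings at every site simultaneously.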
{"title":"Heterogeneity-Aware Workload Placement and Migration in Distributed Sustainable Datacenters","authors":"Dazhao Cheng, Changjun Jiang, Xiaobo Zhou","doi":"10.1109/IPDPS.2014.41","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.41","url":null,"abstract":"While major cloud service operators have taken various initiatives to operate their sustainable data enters with green energy, it is challenging to effectively utilize the green energy since its generation depends on dynamic natural conditions. Fortunately, the geographical distribution of data enters provides an opportunity for optimizing the system performance by distributing cloud workloads. In this paper, we propose a holistic heterogeneity-aware cloud workload placement and migration approach, sCloud, that aims to maximize the system good put in distributed self-sustainable data enters. sCloud adaptively places the transactional workload to distributed data enters, allocates the available resource to heterogeneous workloads in each data enter, and migrates batch jobs across data enters, while taking into account the green power availability and QoS requirements. We formulate the transactional workload placement as a constrained optimization problem that can be solved by nonlinear programming. Then, we propose a batch job migration algorithm to further improve the system good put when the green power supply varies widely at different locations. We have implemented sCloud in a university cloud test bed with real-world weather conditions and workload traces. Experimental results demonstrate sCloud can achieve near-to-optimal system performance while being resilient to dynamic power availability. 
It outperforms a heterogeneity-oblivious approach by 26% in improving system good put and 29% in reducing QoS violations.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114785220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 42
Communication-Efficient Distributed Variance Monitoring and Outlier Detection for Multivariate Time Series
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.16
Moshe Gabel, A. Schuster, D. Keren
Modern scale-out services are composed of thousands of individual machines, which must be continuously monitored for unexpected failures. One recent approach to monitoring is latent fault detection, an adaptive statistical framework for scale-out, load-balanced systems. By periodically measuring hundreds of performance metrics and looking for outlier machines, it attempts to detect subtle problems such as misconfigurations, bugs, and malfunctioning hardware, before they manifest as machine failures. Previous work on a large, real-world Web service has shown that many failures are indeed preceded by such latent faults. Latent fault detection is an offline framework with large bandwidth and processing requirements. Each machine must send all its measurements to a centralized location, which is prohibitive in some settings and requires data-parallel processing infrastructure. In this work we adapt the latent fault detector to provide an online, communication- and computation-reduced version. We utilize stream processing techniques to trade accuracy for communication and computation. We first describe a novel communication-efficient online distributed variance monitoring algorithm that provides a continuous estimate of the global variance within guaranteed approximation bounds. Using the variance monitor, we provide an online distributed outlier detection framework for non-stationary multivariate time series common in scale-out systems. The adapted framework reduces data size and central processing cost by processing the data in situ, making it usable in wider settings. Like the original framework, our adaptation admits different comparison functions, supports non-stationary data, and provides statistical guarantees on the rate of false positives. Simulations on logs from a production system show that we are able to reduce bandwidth by an order of magnitude, with below 1% error compared to the original algorithm.
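A building block behind this style of monitoring is that variance is mergeable: each node can keep a compact summary (count, mean, M2 = sum of squared deviations from the mean) and a coordinator can combine summaries exactly, without seeing the raw measurements. The sketch below shows the standard parallel-variance merge (Welford/Chan-style updates); it does not show the paper's contribution, which is continuously bounding the approximation so nodes rarely need to communicate at all.

```python
def summarize(xs: list[float]) -> tuple[int, float, float]:
    """Local pass on one node: return (count, mean, M2) via Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a: tuple[int, float, float], b: tuple[int, float, float]) -> tuple[int, float, float]:
    """Combine two nodes' summaries into one, exactly (Chan et al. formula)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

def variance(summary: tuple[int, float, float]) -> float:
    """Population variance of the combined data."""
    n, _, m2 = summary
    return m2 / n
```

Because the summaries are tiny and composable in any order, a coordinator can maintain the global variance from periodic per-node updates instead of full measurement streams — the communication-reduction idea the abstract describes.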
{"title":"Communication-Efficient Distributed Variance Monitoring and Outlier Detection for Multivariate Time Series","authors":"Moshe Gabel, A. Schuster, D. Keren","doi":"10.1109/IPDPS.2014.16","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.16","url":null,"abstract":"Modern scale-out services are comprised of thousands of individual machines, which must be continuously monitored for unexpected failures. One recent approach to monitoring is latent fault detection, an adaptive statistical framework for scale-out, load-balanced systems. By periodically measuring hundreds of performance metrics and looking for outlier machines, it attempts to detect subtle problems such as misconfigurations, bugs, and malfunctioning hardware, before they manifest as machine failures. Previous work on a large, real-world Web service has shown that many failures are indeed preceded by such latent faults. Latent fault detection is an offline framework with large bandwidth and processing requirements. Each machine must send all its measurements to a centralized location, which is prohibitive in some settings and requires data-parallel processing infrastructure. In this work we adapt the latent fault detector to provide an online, communication- and computation-reduced version. We utilize stream processing techniques to trade accuracy for communication and computation. We first describe a novel communication-efficient online distributed variance monitoring algorithm that provides a continuous estimate of the global variance within guaranteed approximation bounds. Using the variance monitor, we provide an online distributed outlier detection framework for non-stationary multivariate time series common in scale-out systems. The adapted framework reduces data size and central processing cost by processing the data in situ, making it usable in wider settings. 
Like the original framework, our adaptation admits different comparison functions, supports non-stationary data, and provides statistical guarantees on the rate of false positives. Simulations on logs from a production system show that we are able to reduce bandwidth by an order of magnitude, with below 1% error compared to the original algorithm.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117180604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Exploiting Geometric Partitioning in Task Mapping for Parallel Computers
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.15
Mehmet Deveci, S. Rajamanickam, V. Leung, K. Pedretti, Stephen L. Olivier, David P. Bunde, Ümit V. Çatalyürek, K. Devine
We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that communication and execution time are reduced. We consider the case of sparse node allocation within a parallel machine, where the nodes assigned to a job are not necessarily located within a contiguous block nor within close proximity to each other in the network. The goal is to assign tasks to cores so that interdependent tasks are performed by "nearby" cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We show that, for the structured finite difference mini-app MiniGhost, our mapping method reduced execution time by 34% on average on 65,536 cores of a Cray XE6. In a molecular dynamics mini-app, MiniMD, our mapping method reduced communication time by 26% on average on 6144 cores. We also compare our mapping with graph-based mappings from the LibTopoMap library and show that our mappings reduced the communication time on average by 15% in MiniGhost and 10% in MiniMD.
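The core idea — partition tasks and processors with the same geometric method, then pair the parts — can be sketched with a toy recursive coordinate bisection. This version assumes 2-D coordinates, a power-of-two part count, and equal part sizes; it is an illustration of the technique, not the paper's implementation.

```python
def bisect(points, parts):
    """Recursively split indexed points into `parts` equal halves along the
    widest coordinate axis. `points` is a list of (index, coords) pairs;
    returns a list of index lists. Assumes parts is a power of two."""
    if parts == 1:
        return [[i for i, _ in points]]
    dims = len(points[0][1])
    # split along the axis with the largest spread
    axis = max(range(dims),
               key=lambda d: max(p[1][d] for p in points) - min(p[1][d] for p in points))
    ordered = sorted(points, key=lambda p: p[1][axis])
    mid = len(ordered) // 2
    return bisect(ordered[:mid], parts // 2) + bisect(ordered[mid:], parts // 2)

def map_tasks(task_coords, proc_coords, parts):
    """Partition tasks and processors identically, then pair the parts,
    so geometrically nearby tasks land on geometrically nearby cores.
    Returns {task_index: processor_index}."""
    task_parts = bisect(list(enumerate(task_coords)), parts)
    proc_parts = bisect(list(enumerate(proc_coords)), parts)
    mapping = {}
    for tpart, ppart in zip(task_parts, proc_parts):
        for t, p in zip(tpart, ppart):
            mapping[t] = p
    return mapping
```

Because both point sets are cut by the same rule, the k-th task part and the k-th processor part occupy corresponding regions of their respective geometries, which is what keeps interdependent tasks on nearby cores even under sparse, non-contiguous node allocations.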
{"title":"Exploiting Geometric Partitioning in Task Mapping for Parallel Computers","authors":"Mehmet Deveci, S. Rajamanickam, V. Leung, K. Pedretti, Stephen L. Olivier, David P. Bunde, Ümit V. Çatalyürek, K. Devine","doi":"10.1109/IPDPS.2014.15","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.15","url":null,"abstract":"We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that communication and execution time are reduced. We consider the case of sparse node allocation within a parallel machine, where the nodes assigned to a job are not necessarily located within a contiguous block nor within close proximity to each other in the network. The goal is to assign tasks to cores so that interdependent tasks are performed by \"nearby\" cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We show that, for the structured finite difference mini-app Mini Ghost, our mapping method reduced execution time 34% on average on 65,536 cores of a Cray XE6. In a molecular dynamics mini-app, Mini MD, our mapping method reduced communication time by 26% on average on 6144 cores. 
We also compare our mapping with graph-based mappings from the LibTopoMap library and show that our mappings reduced the communication time on average by 15% in MiniGhost and 10% in MiniMD.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115076130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 69
Journal: 2014 IEEE 28th International Parallel and Distributed Processing Symposium