
Latest publications from the 2014 IEEE 28th International Parallel and Distributed Processing Symposium

Efficient Multi-GPU Computation of All-Pairs Shortest Paths
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.46
H. Djidjev, S. Thulasidasan, Guillaume Chapuis, R. Andonov, D. Lavenier
We describe a new algorithm for solving the all-pairs shortest-path (APSP) problem for planar graphs and graphs with small separators that exploits the massive on-chip parallelism available in today's Graphics Processing Units (GPUs). Our algorithm, based on the Floyd-Warshall algorithm, has near optimal complexity in terms of the total number of operations, while its matrix-based structure is regular enough to allow for efficient parallel implementation on the GPUs. By applying a divide-and-conquer approach, we are able to make use of multi-node GPU clusters, resulting in more than an order of magnitude speedup over the fastest known Dijkstra-based GPU implementation and a two-fold speedup over a parallel Dijkstra-based CPU implementation.
Citations: 37
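The recurrence the paper builds on can be sketched in a few lines. This is the classic serial Floyd-Warshall relax step, not the paper's blocked multi-GPU variant; the function name is illustrative:

```python
def floyd_warshall(dist):
    """In-place all-pairs shortest paths on an adjacency matrix.

    dist[i][j] is the edge weight (float("inf") if absent, 0 on the diagonal).
    The k-loop must be outermost: after iteration k, dist[i][j] holds the
    shortest i->j path using only intermediate vertices 0..k.
    """
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

The divide-and-conquer approach described above applies this same relax step at the granularity of matrix blocks, which is what maps so well onto GPU matrix kernels.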
Skywalk: A Topology for HPC Networks with Low-Delay Switches
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.37
I. Fujiwara, M. Koibuchi, Hiroki Matsutani, H. Casanova
With low-delay switches on the horizon, end-to-end latency in large-scale High Performance Computing (HPC) interconnects will be dominated by cable delays. In this context we define a new network topology, Skywalk, for deploying low-latency interconnects in upcoming HPC systems. Skywalk uses randomness to achieve low latency, but does so in a way that accounts for the physical layout of the topology so as to lead to further cable length and thus latency reductions. Via graph analysis and discrete-event simulation we show that Skywalk compares favorably (in terms of latency, cable length, and throughput) to traditional low-degree torus and moderate-degree hypercube topologies, to high-degree fully-connected Dragonfly topologies, to the HyperX topology, and to recently proposed fully random topologies.
Citations: 23
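The latency benefit of randomness that Skywalk exploits can be illustrated with a toy experiment (this is not the actual Skywalk construction): on a 64-node ring, a handful of seeded random chords sharply reduces the average BFS hop count.

```python
import random
from collections import deque

def avg_hops(n, edges):
    """Mean shortest-path hop count over all ordered vertex pairs (BFS per source)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(dist.values())
    return total / (n * (n - 1))

ring = [(i, (i + 1) % 64) for i in range(64)]
rng = random.Random(1)
# Random chords stand in for the topology's shortcut links.
chords = [(rng.randrange(64), rng.randrange(64)) for _ in range(32)]
```

Skywalk's contribution is to keep this hop-count benefit while choosing the random links with the physical cable layout in mind, so that cable delay shrinks too.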
Efficient Data Race Detection for C/C++ Programs Using Dynamic Granularity
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.76
Y. Song, Yann-Hang Lee
To detect races precisely without false alarms, vector clock based race detectors can be applied if the overhead in time and space can be contained. This is indeed the case for applications developed in object-oriented programming languages, where objects can be used as detection units. On the other hand, embedded applications, often written in C/C++, necessitate the use of fine-grained detection approaches that lead to significant execution overhead. In this paper, we present a dynamic granularity algorithm for vector clock based data race detectors. The algorithm exploits the fact that neighboring memory locations tend to be accessed together and can share the same vector clock, achieving dynamic granularity of detection. The algorithm is implemented on top of FastTrack and uses the Intel Pin tool for dynamic binary instrumentation. Experimental results on benchmarks show that, on average, the race detection tool using the dynamic granularity algorithm is 43% faster than FastTrack with byte granularity and uses 60% less memory. Comparison with existing industrial tools, Valgrind DRD and Intel Inspector XE, also suggests that the proposed dynamic granularity approach is very viable.
Citations: 21
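The vector-clock test such detectors build on can be sketched as follows (a bare happens-before check, without FastTrack's epoch optimization or the paper's dynamic-granularity units; the names are illustrative):

```python
def hb_leq(a, b):
    """True if vector clock a happened-before (or equals) vector clock b."""
    return all(x <= y for x, y in zip(a, b))

def races(acc1, acc2):
    """Each access is (vector_clock, is_write). Two accesses to the same
    location race iff their clocks are concurrent (neither ordered before
    the other) and at least one access is a write."""
    (c1, w1), (c2, w2) = acc1, acc2
    concurrent = not hb_leq(c1, c2) and not hb_leq(c2, c1)
    return concurrent and (w1 or w2)
```

Coarsening the granularity means many neighboring addresses share one such clock, trading a little precision for much less metadata; managing that trade-off dynamically is the point of the paper's algorithm.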
Bipartite Matching Heuristics with Quality Guarantees on Shared Memory Parallel Computers
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.63
F. Dufossé, K. Kaya, B. Uçar
We propose two heuristics for the bipartite matching problem that are amenable to shared-memory parallelization. The first heuristic is very intriguing from a parallelization perspective. It has no significant algorithmic synchronization overhead, and no conflict resolution is needed across threads. We show that this heuristic has an approximation ratio of around 0.632. The second heuristic is designed to obtain a larger matching by employing the well-known Karp-Sipser heuristic on a judiciously chosen subgraph of the original graph. We show that the Karp-Sipser heuristic always finds a maximum cardinality matching in the chosen subgraph. Although the Karp-Sipser heuristic is hard to parallelize for general graphs, we exploit the structure of the selected subgraphs to propose a specialized implementation which demonstrates very good scalability. Based on our experiments and theoretical evidence, we conjecture that this second heuristic obtains matchings with cardinality of at least 0.866 of the maximum. We discuss parallel implementations of the proposed heuristics on shared memory systems. Experimental results demonstrating speed-ups and verifying the theoretical results in practice are provided.
Citations: 3
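The Karp-Sipser rule at the heart of the second heuristic is simple to state: while a degree-one vertex exists, matching it with its only neighbor is always a safe move; otherwise pick an arbitrary remaining edge. A serial sketch (illustrative only, not the paper's parallel implementation):

```python
def karp_sipser(adj):
    """Greedy matching on adj = {vertex: set(neighbors)}: prefer degree-1
    vertices (their only edge is always in some maximum matching),
    fall back to an arbitrary edge when none exists."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    matching = []
    while any(adj.values()):
        u = next((v for v in adj if len(adj[v]) == 1), None)  # degree-1 rule
        if u is None:
            u = next(v for v in adj if adj[v])                # arbitrary edge
        w = next(iter(adj[u]))
        matching.append((u, w))
        for x in (u, w):                 # delete both matched endpoints
            for y in adj.pop(x):
                adj.get(y, set()).discard(x)
    return matching
```

On the judiciously chosen subgraphs described above, the arbitrary-edge fallback never costs optimality, which is why the heuristic finds a maximum matching there.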
UPC++: A PGAS Extension for C++
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.115
Yili Zheng, A. Kamil, Michael B. Driscoll, H. Shan, K. Yelick
Partitioned Global Address Space (PGAS) languages are convenient for expressing algorithms with large, random-access data, and they have proven to provide high performance and scalability through lightweight one-sided communication and locality control. While very convenient for moving data around the system, PGAS languages have taken different views on the model of computation, with the static Single Program Multiple Data (SPMD) model providing the best scalability. In this paper we present UPC++, a PGAS extension for C++ that has three main objectives: 1) to provide an object-oriented PGAS programming model in the context of the popular C++ language, 2) to add useful parallel programming idioms unavailable in UPC, such as asynchronous remote function invocation and multidimensional arrays, to support complex scientific applications, 3) to offer an easy on-ramp to PGAS programming through interoperability with other existing parallel programming systems (e.g., MPI, OpenMP, CUDA). We implement UPC++ with a "compiler-free" approach using C++ templates and runtime libraries. We borrow heavily from previous PGAS languages and describe the design decisions that led to this particular set of language features, providing significantly more expressiveness than UPC with very similar performance characteristics. We evaluate the programmability and performance of UPC++ using five benchmarks on two representative supercomputers, demonstrating that UPC++ can deliver excellent performance at large scale up to 32K cores while offering PGAS productivity features to C++ applications.
Citations: 175
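The asynchronous remote function invocation idiom mentioned above returns a future that is later waited on. As a rough conceptual analogue only (UPC++ itself is C++ and this is not its API; worker threads stand in for remote places):

```python
from concurrent.futures import ThreadPoolExecutor

# Worker threads play the role of remote "places" (ranks), purely for illustration.
executor = ThreadPoolExecutor(max_workers=4)

def async_invoke(fn, *args):
    """Mimic the async(fn, args...) -> future idiom: launch fn on another
    execution context and return a handle to its eventual result."""
    return executor.submit(fn, *args)

fut = async_invoke(pow, 2, 10)
result = fut.result()  # block until the "remote" invocation completes
executor.shutdown()
```

The appeal of the idiom is that the caller overlaps useful work with the remote invocation and pays the synchronization cost only at `result()`.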
Energy Efficient HPC on Embedded SoCs: Optimization Techniques for Mali GPU
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.24
Ivan Grasso, Petar Radojkovic, Nikola Rajovic, Isaac Gelado, Alex Ramírez
A lot of effort from academia and industry has been invested in exploring the suitability of low-power embedded technologies for HPC. Although state-of-the-art embedded systems-on-chip (SoCs) inherently contain GPUs that could be used for HPC, their performance and energy capabilities have never been evaluated. Two reasons contribute to the above. First, embedded GPUs have, until now, not supported 64-bit floating-point arithmetic - a requirement for HPC. Second, embedded GPUs did not provide support for parallel programming languages such as OpenCL and CUDA. However, the situation is changing, and the latest GPUs integrated in embedded SoCs do support 64-bit floating-point precision and parallel programming models. In this paper, we analyze the performance and energy advantages of embedded GPUs for HPC. In particular, we analyze the ARM Mali-T604 GPU - the first embedded GPU with OpenCL Full Profile support. We identify, implement and evaluate software optimization techniques for efficient utilization of the ARM Mali GPU Compute Architecture. Our results show that HPC benchmarks running on the ARM Mali-T604 GPU integrated into the Exynos 5250 SoC achieve, on average, a speed-up of 8.7X over a single Cortex-A15 core while consuming only 32% of the energy. Overall, the results show that embedded GPUs have performance and energy qualities that make them candidates for future HPC systems.
Citations: 57
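The reported figures are worth unpacking: an 8.7X speed-up at 32% of the energy means roughly 3.1X less energy for the same work, and about a 27X improvement in energy-delay product. A quick arithmetic check (both inputs taken from the abstract):

```python
speedup = 8.7        # GPU speed-up over a single Cortex-A15 core (reported)
energy_ratio = 0.32  # GPU energy as a fraction of the CPU run's energy (reported)

energy_saving = 1 / energy_ratio   # same work done with ~3.1x less energy
edp_gain = speedup / energy_ratio  # energy-delay product improves ~27x
```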
Optimizing Bandwidth Allocation in Flex-Grid Optical Networks with Application to Scheduling
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.93
H. Shachnai, A. Voloshin, S. Zaks
All-optical networks have been widely investigated due to their high data transmission rates. In the traditional Wavelength-Division Multiplexing (WDM) technology, the spectrum of light that can be transmitted through the optical fiber is divided into frequency intervals of fixed width, with a gap of unused frequencies between them. Recently, an alternative emerging architecture was suggested which moves away from the rigid Dense WDM (DWDM) model towards a flexible model, where usable frequency intervals are of variable width (even within the same link). Each light path has to be assigned a frequency interval (sub-spectrum), which remains fixed through all of the links it traverses. Two different light paths using the same link must be assigned disjoint sub-spectra. This technology is termed flex-grid (or flex-spectrum), as opposed to the current fixed-grid (or fixed-spectrum) technology. In this work we study a problem of optimal bandwidth allocation arising in the flex-grid technology. In this setting, each light path has a lower and an upper bound on the width of its frequency interval, as well as an associated profit, and we want to find a bandwidth assignment that maximizes the total profit. This problem is known to be NP-complete. We observe that, in fact, the problem is inapproximable within any constant ratio even on a path network. We further derive NP-hardness results and present approximation algorithms for several special cases of path and ring networks, which are of practical interest. Finally, while in general our problem is hard to approximate, we show that an optimal solution can be obtained by allowing resource augmentation. Our study also has applications in real-time scheduling.
Citations: 11
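To see the flavor of the problem: on a single link, if each light path requested one fixed sub-spectrum, choosing disjoint intervals of maximum total profit is classic weighted interval scheduling, solvable by dynamic programming. A sketch of that much simpler special case (the paper's variable-width, multi-link setting is NP-complete; intervals here are half-open so touching endpoints are compatible):

```python
import bisect

def max_profit(intervals):
    """Weighted interval scheduling on one link.

    intervals = [(start, end, profit)], half-open [start, end); returns the
    maximum total profit of a pairwise-disjoint subset of intervals.
    """
    intervals = sorted(intervals, key=lambda t: t[1])
    ends = [e for _, e, _ in intervals]
    best = [0] * (len(intervals) + 1)  # best[i]: optimum over first i intervals
    for i, (s, e, p) in enumerate(intervals, 1):
        # j = number of earlier intervals ending at or before s (all compatible)
        j = bisect.bisect_right(ends, s, 0, i - 1)
        best[i] = max(best[i - 1], best[j] + p)
    return best[-1]
```

The hardness results above show that once interval widths become flexible within per-path bounds, no such polynomial-time exact algorithm should be expected.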
Cost-Efficient and Resilient Job Life-Cycle Management on Hybrid Clouds
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.43
H. Chu, Yogesh L. Simmhan
Cloud infrastructure offers democratized access to on-demand computing resources for scaling applications beyond captive local servers. While on-demand, fixed-price Virtual Machines (VMs) are popular, the availability of cheaper, but less reliable, spot VMs from cloud providers presents an opportunity to reduce the cost of hosting cloud applications. Our work addresses the issue of effective and economic use of hybrid cloud resources for planning job executions with deadline constraints. We propose strategies to manage a job's life-cycle on spot and on-demand VMs to minimize the total dollar cost while assuring completion. Built on a foundation of stochastic optimization, our reusable table-based algorithm (RTBA) decides when to instantiate VMs, at what bid prices, when to use local machines, and when to checkpoint and migrate the job between these resources, with the goal of completing the job on time and at minimum cost. In addition, three simpler heuristics are proposed for comparison. Our evaluation using historical spot prices for the Amazon EC2 market shows that RTBA on average reduces the cost by 72% compared to running only on on-demand VMs. It is also robust to fluctuations in spot prices. The heuristic H3 often approaches RTBA in performance and may prove adequate for ad hoc jobs due to its simplicity.
Citations: 22
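The core trade-off can be seen in a toy cost model (hypothetical prices, and a drastic simplification of RTBA's stochastic optimization): the job is billed the spot price only in hours where that price is at or below the bid, evictions cost only elapsed time (perfect checkpointing assumed), and the result is compared against a fixed on-demand rate.

```python
def spot_cost(prices, bid, work_hours):
    """Dollar cost and elapsed hours to finish work_hours of work on a spot VM.

    Each entry of prices is the spot price for one hour; the VM runs (and is
    billed) only in hours where price <= bid. Returns None if the job cannot
    finish within the price trace.
    """
    cost, done = 0.0, 0
    for hour, price in enumerate(prices, 1):
        if price <= bid:
            cost += price
            done += 1
            if done == work_hours:
                return cost, hour
    return None

trace = [0.10, 0.50, 0.12, 0.08, 0.09]  # hypothetical $/hour spot prices
on_demand_rate = 0.25                   # hypothetical fixed $/hour
```

With a bid of 0.15, a 3-hour job here costs 0.10 + 0.12 + 0.08 = $0.30 over 4 elapsed hours, versus $0.75 purely on-demand; it is the deadline constraint that sometimes forces an algorithm like RTBA to pay the on-demand premium anyway.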
Mitigating the Mismatch between the Coherence Protocol and Conflict Detection in Hardware Transactional Memory
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.69
Lihang Zhao, Lizhong Chen, J. Draper
Hardware Transactional Memory (HTM) usually piggybacks onto the cache coherence protocol to detect data access conflicts between transactions. We identify an intrinsic mismatch between the typical coherence scheme and transaction execution, which causes a sizable amount of unnecessary transaction aborts. This pathological behavior is called false aborting and increases the amount of wasted computation and on-chip communication. For the TM applications we studied, 41% of the transactional write requests incur false aborting. To combat false aborting, we propose Predictive Unicast and Notification (PUNO), a novel hardware mechanism to 1) replace the inefficient coherence multicast with a unicast scheme to prevent transactions from being disrupted unnecessarily and 2) restrain transaction polling through proactive notification. PUNO reduces transaction aborts by 61% and network traffic by 32% in workloads representative of future TM applications with a VLSI implementation area overhead of 0.41%.
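The false-aborting pathology and the benefit of predictive unicast can be mimicked with a small software model. This is an illustrative simulation only — the sharer count, predictor accuracy, and fall-back-to-multicast policy are assumptions for the sketch, not the paper's PUNO hardware:

```python
import random

def simulate(num_sharers=8, predictor_accuracy=0.9,
             writes=10_000, seed=1):
    """Count how many reader transactions each scheme disturbs over a
    stream of transactional writes to a shared cache line."""
    random.seed(seed)
    disturbed_multicast = disturbed_unicast = 0
    for _ in range(writes):
        # Coherence multicast: the write probes every sharer, so any of
        # them may be aborted even when its speculative read was
        # harmless -- the "false aborting" pathology.
        disturbed_multicast += num_sharers
        # Predictive unicast: probe only the sharer predicted to
        # conflict; on a misprediction, fall back to multicast so
        # correctness is preserved.
        if random.random() < predictor_accuracy:
            disturbed_unicast += 1
        else:
            disturbed_unicast += num_sharers
    return disturbed_multicast, disturbed_unicast
```

Even with a modest predictor, the unicast scheme disturbs far fewer sharers per write, which is the intuition behind PUNO's reported reductions in aborts and network traffic.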
{"title":"Mitigating the Mismatch between the Coherence Protocol and Conflict Detection in Hardware Transactional Memory","authors":"Lihang Zhao, Lizhong Chen, J. Draper","doi":"10.1109/IPDPS.2014.69","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.69","url":null,"abstract":"Hardware Transactional Memory (HTM) usually piggybacks onto the cache coherence protocol to detect data access conflicts between transactions. We identify an intrinsic mismatch between the typical coherence scheme and transaction execution, which causes a sizable amount of unnecessary transaction aborts. This pathological behavior is called false aborting and increases the amount of wasted computation and on-chip communication. For the TM applications we studied, 41% of the transactional write requests incur false aborting. To combat false aborting, we propose Predictive Unicast and Notification (PUNO), a novel hardware mechanism to 1) replace the inefficient coherence multicast with a unicast scheme to prevent transactions from being disrupted unnecessarily and 2) restrain transaction polling through proactive notification. PUNO reduces transaction aborts by 61% and network traffic by 32% in workloads representative of future TM applications with a VLSI implementation area overhead of 0.41%.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131738576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.55
Keun Soo YIM
In N-body programs, trajectories of simulated particles have chaotic patterns if errors are in the initial conditions or occur during some computation steps. It was believed that the global properties (e.g., total energy) of simulated particles are unlikely to be affected by a small number of such errors. In this paper, we present a quantitative analysis of the impact of transient faults in GPU devices on a global property of simulated particles. We experimentally show that a single-bit error in non-control data can change the final total energy of a large-scale N-body program with ~2.1% probability. We also find that the corrupted total energy values have certain biases (e.g., the values are not a normal distribution), which can be used to reduce the expected number of re-executions. In this paper, we also present a data error detection technique for N-body programs by utilizing two types of properties that hold in simulated physical models. The presented technique and an existing redundancy-based technique together cover many data errors (e.g., >97.5%) with a small performance overhead (e.g., 2.3%).
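The paper's detection idea — using a physical property of the simulated model to flag corrupted data — can be sketched with a generic energy-conservation check. The sketch below is an assumption-laden illustration, not the paper's exact technique: `flip_bit` mimics a single-bit transient fault in a float64, and the tolerance and energy value are made up.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of a float64, mimicking a single-bit transient fault."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def energy_jump_detected(prev_energy: float, curr_energy: float,
                         rel_tol: float = 1e-6) -> bool:
    """Flag a likely data-corruption error: in a closed N-body system the
    total energy should stay nearly constant between time steps, so a
    jump beyond a relative tolerance is suspicious."""
    return abs(curr_energy - prev_energy) > rel_tol * abs(prev_energy)

total_energy = -1.2345678901234  # made-up pre-fault total energy
# A flip in the exponent field (e.g., bit 52) halves the value and is
# caught by the check; a flip in the lowest mantissa bit (bit 0) changes
# the value by ~2**-52 relatively and stays far below the tolerance --
# matching the paper's observation that not every bit error is visible
# in the global property.
```

This also illustrates the bias the paper exploits: corrupted energy values are not arbitrary but cluster at a few bit-position-dependent magnitudes.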
{"title":"Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units","authors":"Keun Soo YIM","doi":"10.1109/IPDPS.2014.55","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.55","url":null,"abstract":"In N-body programs, trajectories of simulated particles have chaotic patterns if errors are in the initial conditions or occur during some computation steps. It was believed that the global properties (e.g., total energy) of simulated particles are unlikely to be affected by a small number of such errors. In this paper, we present a quantitative analysis of the impact of transient faults in GPU devices on a global property of simulated particles. We experimentally show that a single-bit error in non-control data can change the final total energy of a large-scale N-body program with ~2.1% probability. We also find that the corrupted total energy values have certain biases (e.g., the values are not a normal distribution), which can be used to reduce the expected number of re-executions. In this paper, we also present a data error detection technique for N-body programs by utilizing two types of properties that hold in simulated physical models. 
The presented technique and an existing redundancy-based technique together cover many data errors (e.g., >97.5%) with a small performance overhead (e.g., 2.3%).","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133495161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
2014 IEEE 28th International Parallel and Distributed Processing Symposium