
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations
Shun Yao, Dantong Yu
The Secure Sockets Layer (SSL) is the main protocol used to secure Internet traffic and cloud computing. It relies on computation-intensive RSA cryptography, which is the primary limit on the throughput of the handshake process. In this paper, we design and implement an OpenSSL library, termed PhiOpenSSL, which targets the Intel Xeon Phi (KNC) coprocessor and utilizes the Phi's SIMD and multi-threading capability to reduce SSL computation latency. In particular, PhiOpenSSL vectorizes all big-integer multiplications and Montgomery operations involved in RSA calculations, and employs the Chinese Remainder Theorem and fixed-window exponentiation in its customized library. In an experiment computing Montgomery exponentiations, PhiOpenSSL was as much as 15.3 times faster than the two other reference libcrypto libraries, one from the Intel Many-core Platform Software Stack (MPSS) and the other from the default OpenSSL. Our RSA private-key cryptography routines in PhiOpenSSL are 1.6-5.7 times faster than those in these two reference systems.
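
A minimal sketch of one of the classical optimizations the abstract names, fixed-window exponentiation, using toy 64-bit integers rather than PhiOpenSSL's vectorized multi-limb big numbers; the function names and window width are our illustration, not the library's API:

#include <cstdint>
#include <cstdio>

// Modular multiplication through a 128-bit intermediate (assumes a 64-bit host compiler).
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m);
}

// Fixed-window (2^w-ary) modular exponentiation: precompute base^0 .. base^(2^w - 1),
// then scan the exponent w bits at a time, squaring w times per digit.
static uint64_t powmod_fixed_window(uint64_t base, uint64_t exp, uint64_t mod, int w) {
    const int table_size = 1 << w;
    uint64_t table[64];                         // large enough for w <= 6
    table[0] = 1 % mod;
    for (int i = 1; i < table_size; ++i)
        table[i] = mulmod(table[i - 1], base, mod);

    uint64_t result = 1 % mod;
    for (int shift = (63 / w) * w; shift >= 0; shift -= w) {
        for (int s = 0; s < w; ++s)             // result = result^(2^w) mod mod
            result = mulmod(result, result, mod);
        result = mulmod(result, table[(exp >> shift) & (table_size - 1)], mod);
    }
    return result;
}

int main() {
    printf("%llu\n", (unsigned long long)powmod_fixed_window(2, 1000, 1000003, 4));
    return 0;
}

A fixed window is attractive on SIMD hardware because the square-and-multiply schedule is identical for every exponent digit, so vector lanes never diverge.
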
{"title":"PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations","authors":"Shun Yao, Dantong Yu","doi":"10.1109/IPDPS.2017.32","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.32","url":null,"abstract":"The Secure Sockets Layer (SSL) is the main protocol used to secure Internet traffic and cloud computing. It relies on the computation-intensive RSA cryptography, which primarily limits the throughput of the handshake process. In this paper, we design and implement an OpenSSL library, termed PhiOpenSSL, which targets the Intel Xeon Phi (KNC) coprocessor, and utilizes Intel Phi's SIMD and multi-threading capability to reduce the SSL computation latency. In particular, PhiOpenSSL vectorizes all big integer multiplications and Montgomery operations involved in RSA calculations and employs theChinese Remainder Theorem and fixed-window exponentiation in its customized library. In an experiment involving the computation of Montgomery exponentiation, PhiOpenSSL was as much as 15.3 times faster than the two other reference libcrypto libraries, one from the Intel Many-core Platform Software Stack (MPSS) and the other from the default OpenSSL. Our RSA private key cryptography routines in PhiOpenSSL are 1.6-5.7 times faster than those in these two reference systems.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128274218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Automatic Collapsing of Non-Rectangular Loops
P. Clauss, Ervin Altintas, M. Kuhn
Loop collapsing is a well-known loop transformation that combines perfectly nested loops into one single loop. It exposes the full parallelism exhibited by the collapsed loops and provides a perfect load balance of iterations among the parallel threads. However, in current implementations of this loop optimization, such as the one in the OpenMP language, automatic loop collapsing is limited to loops with constant bounds that define rectangular iteration spaces, even though load imbalance is a particularly crucial issue for non-rectangular loops. The OpenMP language addresses load balance mostly through dynamic runtime scheduling of the parallel threads. Nevertheless, this runtime scheduling introduces some unavoidable execution-time overhead, while preventing the entire parallelism of all the parallel loops from being exploited. In this paper, we propose a technique to automatically collapse any perfectly nested loops defining non-rectangular iteration spaces whose bounds are linear functions of the loop iterators. Such spaces may be triangular, tetrahedral, trapezoidal, rhomboidal or parallelepiped. Our solution is based on original mathematical results addressing the inversion of a multivariate polynomial that defines a ranking of the integer points contained in a convex polyhedron. We show, on a set of non-rectangular loop nests, that our technique generates parallel OpenMP code that outperforms the original parallel loop nests, parallelized using either the "static" or "dynamic" option of the OpenMP schedule clause.
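
As a concrete illustration of the idea (our sketch, not the code the paper's tool generates): for the triangular nest for(i=0..n-1) for(j=0..i), the rank of iteration (i,j) is the polynomial r = i(i+1)/2 + j, and inverting that polynomial inside a single flat OpenMP loop recovers (i,j) from r, giving a perfectly balanced static schedule over the whole triangle:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1000;
    const long total = n * (n + 1) / 2;           // iteration count of the triangle
    std::vector<double> a(total, 0.0);

    #pragma omp parallel for schedule(static)
    for (long r = 0; r < total; ++r) {
        // Invert r = i*(i+1)/2 + j: i is the largest integer with i*(i+1)/2 <= r.
        long i = (long)((std::sqrt(8.0 * (double)r + 1.0) - 1.0) / 2.0);
        if (i * (i + 1) / 2 > r) --i;             // guard against floating-point rounding
        if ((i + 1) * (i + 2) / 2 <= r) ++i;
        long j = r - i * (i + 1) / 2;
        a[r] = (double)(i + j);                   // stand-in for the real loop body
    }
    printf("a[total-1] = %.0f\n", a[total - 1]);  // last iteration: i = j = n-1
    return 0;
}

Compile with -fopenmp. Unlike OpenMP's built-in collapse clause, this flat loop needs no rectangular bounds, which is precisely the restriction the paper's technique lifts automatically.
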
{"title":"Automatic Collapsing of Non-Rectangular Loops","authors":"P. Clauss, Ervin Altintas, M. Kuhn","doi":"10.1109/IPDPS.2017.34","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.34","url":null,"abstract":"Loop collapsing is a well-known loop transformation which combines some loops that are perfectly nested into one single loop. It allows to take advantage of the whole amount of parallelism exhibited by the collapsed loops, and provides a perfect load balancing of iterations among the parallel threads. However, in the current implementations of this loop optimization, as the ones of the OpenMP language, automatic loop collapsing is limited to loops with constant loop bounds that define rectangular iteration spaces, although load imbalance is a particularly crucial issue with non-rectangular loops. The OpenMP language addresses load balance mostly through dynamic runtime scheduling of the parallel threads. Nevertheless, this runtime schedule introduces some unavoidable executiontime overhead, while preventing to exploit the entire parallelism of all the parallel loops. In this paper, we propose a technique to automatically collapse any perfectly nested loops defining non-rectangular iteration spaces, whose bounds are linear functions of the loop iterators. Such spaces may be triangular, tetrahedral, trapezoidal, rhomboidal or parallelepiped. Our solution is based on original mathematical results addressing the inversion of a multi-variate polynomial that defines a ranking of the integer points contained in a convex polyhedron. We show on a set of non-rectangular loop nests that our technique allows to generate parallel OpenMP codes that outperform the original parallel loop nests, parallelized either by using options “static” or “dynamic” of the OpenMPschedule clause.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130328646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Elastic Data Compression with Improved Performance and Space Efficiency for Flash-Based Storage Systems
Bo Mao, Hong Jiang, Suzhen Wu, Yaodong Yang, Zaifa Xi
Data compression has become a commodity feature for space efficiency and reliability in flash-based storage systems, reducing write traffic and space capacity demand. However, it introduces noticeable processing overheads on the critical I/O path, which degrades system performance significantly. Existing data compression schemes for flash-based storage systems use a fixed compression algorithm for all incoming write data, failing to recognize and exploit the significant diversity in the compressibility and access patterns of the data, and missing an opportunity to improve the system performance, the space efficiency, or both. To achieve a reasonable trade-off between these two important design objectives, in this paper we introduce an Elastic Data Compression scheme, called EDC, which exploits data compressibility and access-intensity characteristics by judiciously matching data of different compressibility with different compression algorithms while leveraging access idleness. Specifically, for compressible data blocks EDC exploits the compression diversity of the workload, employing algorithms with higher compression ratios in periods of lower system utilization and algorithms with lower compression ratios in periods of higher system utilization. Non-compressible (or barely compressible) data blocks are written through to the flash storage directly, without any compression. The experiments conducted on our lightweight prototype implementation of the EDC system show that EDC saves storage space by up to 38.7%, with an average of 33.7%. In addition, it significantly outperforms the fixed compression schemes in I/O performance by up to 61.4%, with an average of 36.7%.
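
A toy sketch of the selection policy described above; all identifiers are ours, not EDC's interface, and the real scheme estimates compressibility and utilization online rather than taking them as arguments:

#include <cstdio>

enum class Codec { None, Fast, Strong };   // bypass / low-latency / high-ratio compressor

// Pick a compressor from an estimated compressed-to-original size ratio for the
// block and the current system utilization in [0, 1].
Codec choose_codec(double est_size_ratio, double utilization) {
    if (est_size_ratio > 0.95) return Codec::None;   // barely compressible: write through
    if (utilization < 0.5)     return Codec::Strong; // idle periods afford a higher ratio
    return Codec::Fast;                              // busy periods favor low latency
}

int main() {
    printf("%d\n", (int)choose_codec(0.4, 0.2));     // prints 2 (Strong)
    return 0;
}
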
{"title":"Elastic Data Compression with Improved Performance and Space Efficiency for Flash-Based Storage Systems","authors":"Bo Mao, Hong Jiang, Suzhen Wu, Yaodong Yang, Zaifa Xi","doi":"10.1109/IPDPS.2017.64","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.64","url":null,"abstract":"Data compression has become a commodity feature for space efficiency and reliability in flash-based storage systems by reducing write traffic and space capacity demand. However, it introduces noticeable processing overheads on the critical I/O path, which degrades the system performance significantly. Existing data compression schemes for flash-based storage systems use fixed compression algorithms for all the incoming write data, failing to recognize and exploit the significant diversity in compressibility and access patterns of data and missing an opportunity to improve the system performance, the space efficiency or both. To achieve a reasonable trade-off between these two important design objectives, in this paper we introduce an Elastic Data Compression scheme, called EDC, which exploits the data compressibility and access intensity characteristics by judiciously matching data of different compressibility with different compression algorithms while leveraging the access idleness. Specifically, for compressible data blocks EDC exploits the compression diversity of the workload, and employs algorithms of higher compression rate in periods of lower system utilization and algorithms of lower compression rate in periods of higher system utilization. For non-compressible (or very lowly compressible) data blocks, it will write them through to the flash storage directly without any compression. The experiments conducted on our lightweight prototype implementation of the EDC system show that EDC saves storage space by up to 38.7%, with an average of 33.7%. In addition, it significantly outperforms the fixed compression schemes in the I/O performance measure by up to 61.4%, with an average of 36.7%.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133859746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Directive-Based Partitioning and Pipelining for Graphics Processing Units
Xuewen Cui, T. Scogland, B. Supinski, Wu-chun Feng
The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC, and OpenCL can efficiently offload compute-intensive workloads to these devices. By default, these models naively offload computation without overlapping it with communication (copying data to or from the device). Achieving good performance can require extensive refactoring and hand-tuning to apply optimizations such as pipelining. Further, users must manually partition the dataset whenever its size is larger than device memory, which can be especially difficult when the device memory size is not exposed to the user. We propose a directive-based partitioning and pipelining extension for accelerators appropriate for either OpenMP or OpenACC. Its interface supports overlap of data transfers and kernel computation without explicit user splitting of data. It can map data to a pre-allocated device buffer and automate memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation with four different applications. The experimental results show that our approach can reduce memory usage by 52% to 97% while delivering a 1.41× to 1.65× speedup over the naive offload model.
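
The hand-written pipelining that such a directive would automate might look like the following sketch (ours, using standard OpenMP 4.5 target tasks rather than the paper's proposed extension): transfers and kernels are chunked and chained with depend clauses so chunk k's copy can overlap chunk k-1's compute.

#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 20, CHUNK = 1 << 18, NCHUNKS = N / CHUNK;
    std::vector<float> x(N, 1.0f);
    float* p = x.data();

    #pragma omp target enter data map(alloc: p[0:N])
    for (int c = 0; c < NCHUNKS; ++c) {
        const int off = c * CHUNK;
        // Async host-to-device copy of this chunk; the kernel below waits on it.
        #pragma omp target update to(p[off:CHUNK]) depend(out: p[off]) nowait
        #pragma omp target teams distribute parallel for depend(inout: p[off]) nowait
        for (int i = off; i < off + CHUNK; ++i)
            p[i] = 2.0f * p[i];                      // stand-in compute kernel
        #pragma omp target update from(p[off:CHUNK]) depend(in: p[off]) nowait
    }
    #pragma omp taskwait
    #pragma omp target exit data map(delete: p[0:N])
    printf("x[0] = %.1f\n", x[0]);                   // 2.0 once the pipeline drains
    return 0;
}

Chunks carry distinct depend addresses (p[off]), so the runtime serializes copy-compute-copy within a chunk while letting different chunks overlap.
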
{"title":"Directive-Based Partitioning and Pipelining for Graphics Processing Units","authors":"Xuewen Cui, T. Scogland, B. Supinski, Wu-chun Feng","doi":"10.1109/IPDPS.2017.96","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.96","url":null,"abstract":"The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC and OpenCL can efficiently offload compute-intensive workloads to these devices. By default these models naively offload computation without overlapping it with communication (copying data to or from the device). Achieving performance can require extensive refactoring and hand-tuning to apply optimizations such as pipelining. Further, users must manually partition the dataset whenever its size is larger than device memory, which can be especially difficult when the device memory size is not exposed to the user. We propose a directive-based partitioning and pipelining extension for accelerators appropriate for either OpenMP or OpenACC. Its interface supports overlap of data transfers and kernel computation without explicit user splitting of data. It can map data to a pre-allocated device buffer and automate memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation with four different applications. The experimental results show that our approach can reduce memory usage by 52% to 97% while delivering a 1.41× to 1.65× speedup over the naive offload model.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134339175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling
V. Bonifaci, Gianlorenzo D'angelo, A. Marchetti-Spaccamela
We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.
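
For orientation, the fractional relaxation of the classical unrelated-machines problem that this model generalizes has the following shape (our rendering; the paper's compact ILP additionally indexes assignments by machine sets S, with processing times p_{j,S} depending on the set a job migrates among):

\[
\min\; C \quad \text{s.t.} \quad \sum_{i=1}^{m} x_{ij} = 1 \;\;\forall j, \qquad \sum_{j} p_{ij}\, x_{ij} \le C \;\;\forall i, \qquad x_{ij} \ge 0.
\]

Rounding an optimal fractional solution of such a relaxation, in the spirit of Lenstra-Shmoys-Tardos, is the standard route to a polynomial-time 2-approximation for makespan minimization.
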
{"title":"Algorithms for Hierarchical and Semi-Partitioned Parallel Scheduling","authors":"V. Bonifaci, Gianlorenzo D'angelo, A. Marchetti-Spaccamela","doi":"10.1109/IPDPS.2017.22","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.22","url":null,"abstract":"We propose a model for scheduling jobs in a parallel machine setting that takes into account the cost of migrations by assuming that the processing time of a job may depend on the specific set of machines among which the job is migrated. For the makespan minimization objective, the model generalizes classical scheduling problems such as unrelated parallel machine scheduling, as well as novel ones such as semi-partitioned and clustered scheduling. In the case of a hierarchical family of machines, we derive a compact integer linear programming formulation of the problem and leverage its fractional relaxation to obtain a polynomial-time 2-approximation algorithm. Extensions that incorporate memory capacity constraints are also discussed.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134110015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms
Jie Wang, Xinfeng Xie, J. Cong
Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provide two methods for inter-thread communication: shared memory and global memory. However, a warp shuffle instruction has been available since the Kepler architecture on Nvidia GPUs, enabling threads within the same warp to exchange data directly in registers. This brings new performance-optimization opportunities for algorithms with intensive inter-thread communication. In this work, we deploy register shuffle in the application domain of sequence alignment (or, similarly, string matching), and conduct a quantitative analysis of the opportunities and limitations of using register shuffle. We select two sequence alignment algorithms, Smith-Waterman (SW) and Pairwise-Hidden-Markov-Model (PairHMM), from the widely used Genome Analysis Toolkit (GATK) as case studies. Compared to implementations using shared memory, we obtain significant speedups of 1.2× and 2.1× by using shuffle instructions for SW and PairHMM, respectively. Furthermore, we develop a performance model for analyzing kernel performance based on the shuffle latency measured from a suite of microbenchmarks. Our model provides CUDA programmers with valuable insights into how to best use shuffle instructions for performance optimization.
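
A self-contained CUDA sketch of the register-to-register pattern in question (ours, not the paper's SW/PairHMM kernels): an inclusive warp prefix sum built entirely from shuffles, the same lane-to-lane dependency shape that arises along a Smith-Waterman anti-diagonal. Modern CUDA spells the instruction __shfl_up_sync; on Kepler it was the unsynchronized __shfl_up.

#include <cstdio>

__global__ void shuffle_scan_demo(int* out) {
    int lane = threadIdx.x & 31;
    int v = lane + 1;                       // stand-in for a per-cell score
    // Kogge-Stone inclusive warp prefix sum using only register shuffles.
    for (int d = 1; d < 32; d <<= 1) {
        int left = __shfl_up_sync(0xffffffffu, v, d);
        if (lane >= d) v += left;
    }
    out[lane] = v;                          // lane k now holds (k+1)(k+2)/2
}

int main() {
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    shuffle_scan_demo<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 31 sum = %d (expect 528)\n", h_out[31]);
    cudaFree(d_out);
    return 0;
}
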
{"title":"Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms","authors":"Jie Wang, Xinfeng Xie, J. Cong","doi":"10.1109/IPDPS.2017.79","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.79","url":null,"abstract":"Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, it was the case that there is limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced since the Kepler architecture on Nvidia GPUs, which enables threads within the same warp to directly exchange data in registers. This brought new performance optimization opportunities for algorithms with intensive inter-thread communication. In this work, we deploy register shuffle in the application domain of sequence alignment (or similarly, string matching), and conduct a quantitative analysis of the opportunities and limitations of using register shuffle. We select two sequence alignment algorithms, Smith-Waterman (SW) and Pairwise-Hidden-Markov-Model (PairHMM), from the widely used Genome Analysis Toolkit (GATK) as case studies. Compared to implementations using shared memory, we obtain a significant speed-up of 1.2× and 2.1× by using shuffle instructions for SW and PairHMM. Furthermore, we develop a performance model for analyzing the kernel performance based on the measured shuffle latency from a suite of microbenchmarks. Our model provides valuable insights for CUDA programmers into how to best use shuffle instructions for performance optimization.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130045694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Power Efficient Sharing-Aware GPU Data Management
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.106
Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian
The power consumed by the memory system in GPUs is a significant fraction of the total chip power. As thread-level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy spreads these CTAs to different streaming multiprocessors (SMs) in a round-robin fashion. Since each SM has a private L1 cache, the data shared among CTAs are replicated across the L1 caches of different SMs. Data replication reduces the effective L1 cache size, which in turn increases data movement and power consumption. The goal of this paper is to reduce data movement and increase effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs that share data to the same SM, to reduce redundant storage of data in private L1 caches across SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. Essentially, this approach increases the lifetime of shared and private blocks at different cache levels. The experimental results show that the proposed scheme reduces off-chip traffic by 19%, which translates to an average DRAM power reduction of 10% and a performance improvement of 7%.
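
A toy model of the scheduling change (ours; the real policy lives in the hardware CTA scheduler, which the paper modifies): round-robin spreads consecutive CTAs across SMs, while a sharing-aware mapping co-locates each cluster of k neighboring CTAs so their shared data stays in one private L1.

#include <cstdio>

int round_robin_sm(int cta, int num_sms)          { return cta % num_sms; }
int sharing_aware_sm(int cta, int num_sms, int k) { return (cta / k) % num_sms; }

int main() {
    const int num_sms = 4, k = 2;                 // k = neighboring CTAs that share data
    for (int cta = 0; cta < 8; ++cta)
        printf("CTA %d: round-robin -> SM %d, sharing-aware -> SM %d\n",
               cta, round_robin_sm(cta, num_sms), sharing_aware_sm(cta, num_sms, k));
    return 0;
}
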
{"title":"Power Efficient Sharing-Aware GPU Data Management","authors":"Abdulaziz Tabbakh, M. Annavaram, Xuehai Qian","doi":"10.1109/IPDPS.2017.106","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.106","url":null,"abstract":"The power consumed by memory system in GPUs is a significant fraction of the total chip power. As thread level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share considerable amount of data. However, the default GPU scheduling policy spreads these CTAs to different streaming multiprocessor cores (SM) in a round-robin fashion. Since each SM has a private L1 cache, the shared data among CTAs are replicated across L1 caches of different SMs. Data replication reduces the effective L1 cache size which in turn increases the data movement and power consumption. The goal of this paper is to reduce data movement and increase effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs with data sharing to the same SM to reduce redundant storage of data in private L1 caches across SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in L1 cache, and shared blocks are given higher priority to stay longer in L2 cache. Essentially, this approach increases the lifetime of shared blocks and private blocks in different cache levels. The experimental results show that the proposed scheme reduces the off-chip traffic by 19% which translates to an average DRAM power reduction of 10% and performance improvement of 7%.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130197682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Distributed Vehicle Routing Approximation
A. Krishnan, Mikhail Markov, Borzoo Bonakdarpour
The classic vehicle routing problem (VRP) is generally concerned with the optimal design of routes by a fleet of vehicles to service a set of customers, minimizing the overall cost, usually the travel distance, of the whole set of routes. Although the problem has been extensively studied in the context of operations research and optimization, there is little research on solving the VRP in settings where distributed vehicles need to compute their respective routes in a decentralized fashion. Our first contribution is a synchronous distributed approximation algorithm that solves the VRP. Using the duality theorem of linear programming, we show that the approximation ratio of our algorithm is O(n · ρ^(1/n) · log(n + m)), where ρ is the maximum cost of travel or service in the input VRP instance, n is the size of the graph, and m is the number of vehicles. We report results of simulations and discuss an implementation of our algorithm on a real fleet of unmanned aerial systems (UASs) that carry out a set of tasks.
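
Written out, the guarantee stated above reads:

\[
\frac{\mathrm{ALG}}{\mathrm{OPT}} = O\!\left( n \cdot \rho^{1/n} \cdot \log(n+m) \right),
\]

with ρ the maximum travel or service cost in the instance, n the graph size, and m the number of vehicles; per the abstract, the bound is established via LP duality.
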
{"title":"Distributed Vehicle Routing Approximation","authors":"A. Krishnan, Mikhail Markov, Borzoo Bonakdarpour","doi":"10.1109/IPDPS.2017.90","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.90","url":null,"abstract":"The classic vehicle routing problem (VRP) is generally concerned with the optimal design of routes by a fleet of vehicles to service a set of customers by minimizing the overall cost, usually the travel distance for the whole set of routes. Although the problem has been extensively studied in the context of operations research and optimization, there is little research on solving the VRP, where distributed vehicles need to compute their respective routes in a decentralized fashion. Our first contribution is a synchronous distributed approximation algorithm that solves the VRP. Using the duality theorem of linear programming, we show that the approximation ratio of our algorithm is O(n · (ρ)1/n log(n + m)), where ρ is the maximum cost of travel or service in the input VRP instance, n is the size of the graph, and m is the number of vehicles. We report results of simulations and discuss implementation of our algorithm on a real fleet of unmanned aerial systems (UASs) that carry out a set of tasks.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129178154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight
Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, H. Fu, Fangfang Liu, L. Gan, Ping Xu, Wenjing Ma
Stencil computation arises in a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to its memory-bound nature, it is a challenging task to optimize stencil computation kernels on modern supercomputers, which have relatively high computing throughput but relatively low data-moving capability. This work demonstrates the details of the algorithms, implementations and optimizations of a real-world stencil computation in 3D nonhydrostatic atmospheric modeling on the newly announced Sunway TaihuLight supercomputer. At the algorithm level, we present a computation-communication overlapping technique to reduce the inter-process communication overhead, a locality-aware blocking method to fully exploit on-chip parallelism with enhanced data locality, and a collaborative data-accessing scheme for sharing data among different threads. In addition, a variety of effective hardware-specific implementation and optimization strategies at both the process and thread level, from fine-grained data management to data layout transformation, are developed to further improve the performance. Our experiments demonstrate that a single-process many-core speedup of as high as 170x can be achieved by using the proposed algorithm and optimization strategies. The code scales well to millions of cores in terms of strong scalability. For the weak-scaling tests, the code scales in a nearly ideal way to the full system scale of more than 10 million cores, sustaining 25.96 PFLOPS in double precision, which is 20% of the peak performance.
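
A generic sketch of the computation-communication overlap technique (ours, in flat MPI on a 1D stencil; the paper's implementation is specialized for its 3D atmospheric model and the Sunway architecture): post non-blocking halo exchanges, update the interior points that need no remote data while the messages are in flight, then finish the boundary points after the wait.

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;                       // local points; u[0] and u[N+1] are ghosts
    std::vector<double> u(N + 2, (double)rank), v(N + 2, 0.0);
    int left = (rank + size - 1) % size, right = (rank + 1) % size;

    MPI_Request req[4];
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    for (int i = 2; i <= N - 1; ++i)          // interior: needs no ghost data
        v[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE); // halos have arrived
    v[1] = 0.5 * (u[0] + u[2]);               // boundary points last
    v[N] = 0.5 * (u[N - 1] + u[N + 1]);

    if (rank == 0) printf("v[1] = %f\n", v[1]);
    MPI_Finalize();
    return 0;
}
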
{"title":"26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight","authors":"Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, H. Fu, Fangfang Liu, L. Gan, Ping Xu, Wenjing Ma","doi":"10.1109/IPDPS.2017.9","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.9","url":null,"abstract":"Stencil computation arises from a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to the memory bound nature, it is a challenging task to opti- mize stencil computation kernels on modern supercomputers with relatively high computing throughput whilst relatively low data-moving capability. This work serves as a demon- stration on the details of the algorithms, implementations and optimizations of a real-world stencil computation in 3D nonhydrostatic atmospheric modeling on the newly announced Sunway TaihuLight supercomputer. At the algorithm level, we present a computation-communication overlapping technique to reduce the inter-process communication overhead, a locality- aware blocking method to fully exploit on-chip parallelism with enhanced data locality, and a collaborative data accessing scheme for sharing data among different threads. In addition, a variety of effective hardware specific implementation and optimization strategies on both the process- and thread-level, from the fine-grained data management to the data layout transformation, are developed to further improve the per- formance. Our experiments demonstrate that a single-process many-core speedup of as high as 170x can be achieved by using the proposed algorithm and optimization strategies. The code scales well to millions of cores in terms of strong scalability. And for the weak-scaling tests, the code can scale in a nearly ideal way to the full system scale of more than 10 million cores, sustaining 25.96 PFLOPS in double precision, which is 20% of the peak performance.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127954560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Community Detection on the GPU
M. Naim, F. Manne, M. Halappanavar, Antonino Tumeo
We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine-tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups of up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared-memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.
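
An edge-parallel sketch in CUDA (ours, not the paper's code) of the core idea: threads index individual CSR edges and recover the owning vertex by binary search on the offsets array, so a vertex of degree d automatically receives d threads. Shown here for the weighted-degree sums that Louvain's modularity computation needs:

#include <cstdio>

__global__ void edge_parallel_degree(const int* offsets, int n_vertices,
                                     int n_edges, const float* w, float* deg) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_edges) return;
    // Owner of edge e: the largest vertex v with offsets[v] <= e.
    int lo = 0, hi = n_vertices - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) >> 1;
        if (offsets[mid] <= e) lo = mid; else hi = mid - 1;
    }
    atomicAdd(&deg[lo], w[e]);    // accumulate the weighted degree of the owner
}

int main() {
    // Tiny CSR graph: vertex 0 owns edges 0-2, vertex 1 owns edge 3, vertex 2 owns edges 4-5.
    const int n_vertices = 3, n_edges = 6;
    int h_offsets[] = {0, 3, 4};
    float h_w[] = {1, 1, 1, 1, 1, 1};
    int* d_offsets; float *d_w, *d_deg;
    cudaMalloc(&d_offsets, sizeof(h_offsets));
    cudaMalloc(&d_w, sizeof(h_w));
    cudaMalloc(&d_deg, n_vertices * sizeof(float));
    cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, h_w, sizeof(h_w), cudaMemcpyHostToDevice);
    cudaMemset(d_deg, 0, n_vertices * sizeof(float));
    edge_parallel_degree<<<1, 64>>>(d_offsets, n_vertices, n_edges, d_w, d_deg);
    float h_deg[3];
    cudaMemcpy(h_deg, d_deg, sizeof(h_deg), cudaMemcpyDeviceToHost);
    printf("deg = %.0f %.0f %.0f\n", h_deg[0], h_deg[1], h_deg[2]);  // 3 1 2
    return 0;
}
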
{"title":"Community Detection on the GPU","authors":"M. Naim, F. Manne, M. Halappanavar, Antonino Tumeo","doi":"10.1109/IPDPS.2017.16","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.16","url":null,"abstract":"We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared memory implementationsand is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129981241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23