Overcoming the Limitations Posed by TCR-beta Repertoire Modeling through a GPU-Based In-Silico DNA Recombination Algorithm (doi: 10.1109/IPDPS.2014.34)
Gregory M. Striemer, Harsha Krovi, A. Akoglu, B. Vincent, Benjamin Hopson, J. Frelinger, Adam Buntzman
The DNA recombination process known as V(D)J recombination is the central mechanism for generating diversity among antigen receptors such as T-cell receptors (TCRs). This diversity is crucial for the development of the adaptive immune system. However, modeling all of the αβ TCR sequences is encumbered by the enormity of the potential repertoire, which has been predicted to exceed 10^15 sequences. Prior modeling efforts have therefore been limited to extrapolations based on the analysis of minor subsets of the overall TCRβ repertoire. In this study, we map the recombination process completely onto the graphics processing unit (GPU) hardware architecture using the CUDA programming environment to circumvent prior limitations. For the first time, we present a model of the mouse TCRβ repertoire to an extent that enabled us to evaluate the Convergent Recombination Hypothesis (CRH) comprehensively at the petascale level on a single GPU.
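To make the combinatorics behind these numbers concrete, the following toy Python sketch enumerates junctions and counts convergent recombination events. The segment strings, the simplistic trimming model, and all names are our own illustrative assumptions, not the paper's GPU implementation.

    from itertools import product

    def enumerate_junctions(v_segments, d_segments, j_segments, max_trim=1):
        """Toy V(D)J enumeration: every (V, D, J) choice combined with a few
        trimming depths yields one candidate junction sequence. Counting how
        many distinct recombination events produce the same final sequence is
        the quantity behind the Convergent Recombination Hypothesis."""
        counts = {}
        for v, d, j in product(v_segments, d_segments, j_segments):
            for tv, td in product(range(max_trim + 1), repeat=2):
                seq = v[:len(v) - tv] + d[td:] + j  # trim V's 3' end, D's 5' end
                counts[seq] = counts.get(seq, 0) + 1
        return counts

    if __name__ == "__main__":
        junctions = enumerate_junctions(["CASS", "CASR"], ["GG", "AGG"], ["EQY"])
        convergent = {s: c for s, c in junctions.items() if c > 1}
        print(convergent)  # sequences reachable by more than one recombination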
Traversing Trillions of Edges in Real Time: Graph Exploration on Large-Scale Parallel Machines (doi: 10.1109/IPDPS.2014.52)
Fabio Checconi, F. Petrini
The world of Big Data is changing dramatically right before our eyes: from the amount of data being produced to the way in which it is structured and used. The trend of "big data growth" presents enormous challenges, but it also presents incredible scientific and business opportunities. Together with the data explosion, we are also witnessing a dramatic increase in data processing capabilities, thanks to new powerful parallel computer architectures and more sophisticated algorithms. In this paper we describe the algorithmic design and the optimization techniques that led to the unprecedented processing rate of 15.3 trillion edges per second on 64 thousand Blue Gene/Q nodes, which allowed the in-memory exploration of a petabyte-scale graph in just a few seconds. This paper provides insight into our parallelization and optimization techniques. We believe that these techniques can be successfully applied to a broader class of graph algorithms.
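The Blue Gene/Q-specific optimizations are the subject of the paper itself; as a hedged point of reference, the sketch below shows the level-synchronous breadth-first exploration pattern that large-scale parallel traversals distribute across nodes. The graph, the names, and the single-process form are our simplifications.

    from collections import defaultdict

    def level_synchronous_bfs(edges, root):
        """Minimal level-synchronous BFS: the frontier is processed one level
        at a time, mirroring the bulk-synchronous structure that distributed
        implementations partition across processors."""
        adj = defaultdict(list)
        for u, v in edges:          # build an undirected adjacency list
            adj[u].append(v)
            adj[v].append(u)
        depth = {root: 0}
        frontier = [root]
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in adj[u]:
                    if v not in depth:          # first visit fixes the depth
                        depth[v] = depth[u] + 1
                        next_frontier.append(v)
            frontier = next_frontier            # barrier between levels
        return depth

    if __name__ == "__main__":
        g = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
        print(level_synchronous_bfs(g, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}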
POD: Performance Oriented I/O Deduplication for Primary Storage Systems in the Cloud (doi: 10.1109/IPDPS.2014.84)
Bo Mao, Hong Jiang, Suzhen Wu, Lei Tian
Recent studies have shown that moderate to high data redundancy clearly exists in primary storage systems in the Cloud. Our experimental studies reveal that data redundancy exhibits a much higher level of intensity on the I/O path than on disks, due to the relatively high temporal access locality associated with small I/O requests to redundant data. On the other hand, we also observe that directly applying data deduplication to primary storage systems in the Cloud will likely cause space contention in memory and data fragmentation on disks. Based on these observations, we propose a Performance-Oriented I/O Deduplication approach, called POD, rather than a capacity-oriented one, represented by iDedup, to improve the I/O performance of primary storage systems in the Cloud without sacrificing the latter's capacity savings. The salient feature of POD is its focus not only on the capacity-sensitive large writes and files, as in iDedup, but also on the performance-sensitive yet capacity-insensitive small writes and files. Experiments conducted on our lightweight prototype implementation of POD show that it significantly outperforms iDedup in I/O performance, by up to 87.9% with an average of 58.8%. Moreover, our evaluation results also show that POD achieves comparable or better capacity savings than iDedup.
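As a hedged illustration of the write-path mechanism (not POD's actual design; the class and all names below are ours), inline deduplication detects a redundant write by content fingerprint and satisfies it without a disk I/O:

    import hashlib

    class InlineDedupCache:
        """Toy write-path deduplicator: a write whose content fingerprint has
        been seen before is remapped to the existing block, avoiding the disk
        I/O entirely. A performance-oriented scheme also applies this to
        small writes, since small redundant requests are the latency-critical
        case on the I/O path."""

        def __init__(self):
            self.fingerprint_to_block = {}     # content hash -> physical block
            self.next_block = 0

        def write(self, data):
            fp = hashlib.sha1(data).digest()
            if fp in self.fingerprint_to_block:
                return self.fingerprint_to_block[fp], True    # dedup hit
            block = self.next_block                           # unique data:
            self.next_block += 1                              # allocate and
            self.fingerprint_to_block[fp] = block             # write to disk
            return block, False

    if __name__ == "__main__":
        cache = InlineDedupCache()
        print(cache.write(b"4KB of user data"))   # (0, False): written
        print(cache.write(b"4KB of user data"))   # (0, True): redundant, skipped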
Scaling Irregular Applications through Data Aggregation and Software Multithreading (doi: 10.1109/IPDPS.2014.117)
Alessandro Morari, Antonino Tumeo, D. Chavarría-Miranda, Oreste Villa, M. Valero
Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed-memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to their large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: i.e., they exhibit irregular behavior. Traditional commodity clusters, instead, exploit cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems, and require custom, hand-coded optimizations to provide scaling in both performance and size. Lightweight software multithreading, which enables tolerating data access latencies by overlapping network communication with computation, and aggregation, which reduces overheads and increases bandwidth utilization by coalescing fine-grained network messages, are key techniques that can speed up the performance of large-scale irregular applications on commodity clusters. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation together with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
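A toy sketch of the aggregation half of this design (our own names and simplifications, not GMT's API): fine-grained operations bound for the same node are buffered and shipped as one bulk message.

    class AggregatingChannel:
        """Toy message aggregator: fine-grained remote operations destined for
        the same node are coalesced into a per-destination buffer and sent as
        one bulk message, trading a little latency for far fewer and larger
        network messages."""

        def __init__(self, send_fn, max_batch=4):
            self.send_fn = send_fn
            self.max_batch = max_batch
            self.buffers = {}                  # destination -> pending ops

        def put(self, dest, op):
            batch = self.buffers.setdefault(dest, [])
            batch.append(op)
            if len(batch) >= self.max_batch:   # buffer full: ship it
                self.flush(dest)

        def flush(self, dest):
            batch = self.buffers.pop(dest, [])
            if batch:
                self.send_fn(dest, batch)      # one message carries many ops

    if __name__ == "__main__":
        chan = AggregatingChannel(lambda d, b: print("to node", d, ":", b))
        for i in range(9):
            chan.put(i % 2, ("remote_write", i))
        for dest in list(chan.buffers):        # drain what is still buffered
            chan.flush(dest)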
An Efficient Method for Stream Semantics over RDMA (doi: 10.1109/IPDPS.2014.91)
Patrick MacArthur, R. Russell
Most network applications today are written to use TCP/IP via sockets. Remote Direct Memory Access (RDMA) is gaining popularity because its zero-copy, kernel-bypass features provide a high-throughput, low-latency reliable transport. Unlike TCP, which is a stream-oriented protocol, RDMA is a message-oriented protocol, and the OFA verbs library for writing RDMA application programs is more complex than the TCP sockets interface. UNH EXS is one of several libraries designed to give applications more convenient, high-level access to RDMA features. Recent work has shown that RDMA is viable both in the data center and over distance. One potential bottleneck in libraries that use RDMA is the requirement to wait for message advertisements in order to send large zero-copy messages. By sending messages first to an internal, hidden buffer and copying them later, latency can be reduced at the expense of higher CPU usage at the receiver. This paper presents a communication algorithm, implemented in the UNH EXS stream-oriented mode, that dynamically switches between sending transfers directly to user memory and sending them indirectly via an internal, hidden buffer, depending on the state of the sender and receiver. Preliminary results show that this algorithm performs well under a variety of application requirements.
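The sketch below models the switching policy in simplified form; it is our illustration, not the UNH EXS interface, and all names are assumptions.

    from collections import deque

    class AdaptiveSender:
        """Toy model of the switching policy: a large transfer goes directly
        into user memory only when the receiver has already advertised a
        buffer for it; otherwise the data is sent immediately to a hidden
        internal buffer and copied later, trading receiver CPU time for
        lower latency."""

        def __init__(self):
            self.advertisements = deque()          # receiver-posted buffers

        def receiver_advertises(self, buf_id):
            self.advertisements.append(buf_id)

        def send(self, message):
            if self.advertisements:
                target = self.advertisements.popleft()
                return ("zero-copy", target)       # write straight to user memory
            return ("buffered", "bounce-buffer")   # eager send; copy at receiver

    if __name__ == "__main__":
        s = AdaptiveSender()
        print(s.send(b"large message"))            # no advertisement: buffered
        s.receiver_advertises("user-buf-0")
        print(s.send(b"large message"))            # advertisement ready: zero-copy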
Pipelined Compaction for the LSM-Tree (doi: 10.1109/IPDPS.2014.85)
Zigang Zhang, Yinliang Yue, Bingsheng He, Jin Xiong, Mingyu Chen, Lixin Zhang, Ninghui Sun
Write-optimized data structures like the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in key-value storage systems like Bigtable and Cassandra. Due to deferral and batching, LSM-tree based storage systems need background compactions to merge key-value entries and keep them sorted for future queries and scans. Background compactions play a key role in the performance of LSM-tree based storage systems. Existing studies of background compaction focus on decreasing the compaction frequency, reducing I/Os, or confining compactions to hot key-ranges; they pay little attention to the computation time within the compaction. However, the computation time is no longer negligible: it takes more than 60% of the total compaction time in storage systems using flash-based SSDs. Therefore, an alternative way to speed up compaction is to make good use of the parallelism of the underlying hardware, including CPUs and I/O devices. In this paper, we analyze the compaction procedure, identify the performance bottleneck, and propose the Pipelined Compaction Procedure (PCP) to better utilize the parallelism of CPUs and I/O devices. Theoretical analysis proves that PCP can improve the compaction bandwidth. Furthermore, we implement PCP in a real system and conduct extensive experiments. The experimental results show that the pipelined compaction procedure increases the compaction bandwidth and storage system throughput by 77% and 62%, respectively.
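A minimal sketch of the pipelining idea, under our own simplifying assumption of a read / merge / write staging (PCP itself operates on real storage with streaming k-way merges; all names are ours):

    import queue
    import threading

    def pipelined_compaction(tables):
        """Toy three-stage compaction pipeline: reading input runs (I/O),
        merging entries (CPU), and writing output (I/O) run in separate
        threads connected by bounded queues, so the CPU-heavy merge overlaps
        with both I/O stages instead of running after them."""
        read_q = queue.Queue(maxsize=4)
        write_q = queue.Queue(maxsize=4)
        out = []

        def reader():                             # stage 1: fetch sorted runs
            for t in tables:
                read_q.put(sorted(t))
            read_q.put(None)                      # end-of-stream marker

        def merger():                             # stage 2: CPU-bound merge
            merged = []
            while (run := read_q.get()) is not None:
                merged = sorted(merged + run)     # a real compaction streams a k-way merge
            write_q.put(merged)
            write_q.put(None)

        def writer():                             # stage 3: persist the output
            while (block := write_q.get()) is not None:
                out.extend(block)

        threads = [threading.Thread(target=f) for f in (reader, merger, writer)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        return out

    if __name__ == "__main__":
        print(pipelined_compaction([[3, 1], [4, 2]]))  # [1, 2, 3, 4]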
Petascale General Solver for Semidefinite Programming Problems with Over Two Million Constraints (doi: 10.1109/IPDPS.2014.121)
K. Fujisawa, Toshio Endo, Yuichiro Yasui, Hitoshi Sato, Naoki Matsuzawa, S. Matsuoka, Hayato Waki
The semidefinite programming (SDP) problem is one of the central problems in mathematical optimization. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems, and many research groups have employed it in software packages. However, two well-known major bottlenecks, i.e., the generation of the Schur complement matrix (SCM) and its Cholesky factorization, exist in the algorithmic framework of the PDIPM. We have developed a new version of the semidefinite programming algorithm parallel version (SDPARA), a parallel implementation on multiple CPUs and GPUs for solving extremely large-scale SDP problems with over a million constraints. SDPARA can automatically extract the unique characteristics of an SDP problem and identify its bottleneck. When the generation of the SCM becomes the bottleneck, SDPARA attains high scalability using a large number of CPU cores together with processor affinity and memory interleaving techniques. When an SDP problem has over two million constraints and Cholesky factorization constitutes the bottleneck, SDPARA can also perform parallel Cholesky factorization using thousands of GPUs, overlapping computation and communication. Through numerical experiments on the TSUBAME 2.5 supercomputer, we demonstrate that SDPARA is a high-performance general solver for SDPs in various application fields; we solved the largest SDP problem (with over 2.33 million constraints), setting a new world record. Our implementation also achieved 1.713 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.
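The abstract names Cholesky factorization as one of the two bottlenecks; for reference, here is a minimal unblocked Python sketch of that computation (SDPARA runs a blocked, distributed, GPU-accelerated version, which this does not attempt to reproduce):

    import math

    def cholesky(a):
        """Unblocked Cholesky factorization A = L * L^T of a symmetric
        positive-definite matrix, returning the lower-triangular factor L.
        The distributed, blocked form of exactly this computation is the
        GPU bottleneck the paper overlaps with communication."""
        n = len(a)
        l = [[0.0] * n for _ in range(n)]
        for j in range(n):
            l[j][j] = math.sqrt(a[j][j] - sum(l[j][k] ** 2 for k in range(j)))
            for i in range(j + 1, n):
                l[i][j] = (a[i][j]
                           - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
        return l

    if __name__ == "__main__":
        for row in cholesky([[4.0, 2.0], [2.0, 3.0]]):
            print(row)   # L = [[2, 0], [1, sqrt(2)]]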
Auto-Tuning Dedispersion for Many-Core Accelerators (doi: 10.1109/IPDPS.2014.101)
A. Sclocco, H. Bal, J. Hessels, J. V. Leeuwen, R. V. Nieuwpoort
Dedispersion is a basic algorithm used to reconstruct impulsive astrophysical signals. It is used in high sampling-rate radio astronomy to counteract temporal smearing caused by the intervening interstellar medium. To counteract this smearing, the received signal train must be dedispersed for thousands of trial distances, after which the transformed signals are further analyzed. This process is expensive in both computation and data handling. The challenge is exacerbated in future, and even some current, radio telescopes, which routinely produce hundreds of such data streams in parallel; there, the compute requirements for dedispersion are high (petascale), while the data intensity is extreme. Yet the dedispersion algorithm remains a basic component of every radio telescope, and a fundamental step in searching the sky for radio pulsars and other transient astrophysical objects. In this paper, we study the parallelization of the dedispersion algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. An important contribution is the computational analysis of the algorithm, from which we conclude that dedispersion is inherently memory-bound in any realistic scenario, in contrast to earlier reports. We also provide empirical proof that, even in unrealistic scenarios, hardware limitations keep the arithmetic intensity low, thus limiting performance. We exploit auto-tuning to adapt the algorithm not only to different accelerators, but also to different observations, and even telescopes. Our experiments show how the algorithm is tuned automatically for different scenarios and how it exploits and highlights the underlying specificities of the hardware: in some observations, the tuner automatically optimizes device occupancy, while in others it optimizes memory bandwidth. We quantitatively analyze the problem space and, by comparing optimal auto-tuned versions against the best-performing fixed codes, we show the impact that auto-tuning has on performance and conclude that it is statistically relevant.
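For readers unfamiliar with the kernel being tuned, here is a brute-force Python sketch of dedispersion with a toy linear delay model; a real delay table depends on the channel frequencies and the trial dispersion measure, and all names below are ours.

    def dedisperse(data, dm_trials, delay_in_samples):
        """Brute-force dedispersion: for every trial dispersion measure (DM),
        shift each frequency channel by its dispersion delay and sum across
        channels. data[c][t] is the power in channel c at time sample t;
        delay_in_samples(dm, c) returns the integer shift for channel c."""
        n_chan, n_samp = len(data), len(data[0])
        out = []
        for dm in dm_trials:
            max_delay = max(delay_in_samples(dm, c) for c in range(n_chan))
            series = [sum(data[c][t + delay_in_samples(dm, c)]
                          for c in range(n_chan))
                      for t in range(n_samp - max_delay)]
            out.append(series)
        return out

    if __name__ == "__main__":
        # 3 channels, 8 time samples: a pulse smeared by one sample per channel
        data = [[1, 0, 0, 0, 0, 0, 0, 0],
                [0, 1, 0, 0, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0, 0, 0]]
        # toy model: delay grows linearly with channel index and trial DM
        for dm, series in zip([0, 1], dedisperse(data, [0, 1],
                                                 lambda dm, c: dm * c)):
            print(dm, series)   # the dm=1 trial re-aligns the pulse (peak of 3)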
Large-Scale Hydrodynamic Brownian Simulations on Multicore and Manycore Architectures (doi: 10.1109/IPDPS.2014.65)
Xing Liu, Edmond Chow
Conventional Brownian dynamics (BD) simulations with hydrodynamic interactions use 3n×3n dense mobility matrices, where n is the number of simulated particles. This limits the size of BD simulations, particularly on accelerators with low memory capacities. In this paper, we formulate a matrix-free algorithm for BD simulations, allowing us to scale to very large numbers of particles while remaining efficient for small numbers of particles. We discuss the implementation of this method for multicore and manycore architectures, as well as a hybrid implementation that splits the workload between CPUs and Intel Xeon Phi coprocessors. For 10,000 particles, the limit of the conventional algorithm on a 32 GB system, the matrix-free algorithm is 35 times faster than the conventional matrix-based algorithm. We show numerical tests for the matrix-free algorithm on up to 500,000 particles. For large systems, our hybrid implementation using two Intel Xeon Phi coprocessors achieves a speedup of over 3.5x compared to the CPU-only case. Our optimizations also make the matrix-free algorithm faster than the conventional dense-matrix algorithm on as few as 1,000 particles.
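A hedged sketch of the matrix-free idea: apply the mobility operator by recomputing coefficients on the fly instead of storing the dense matrix. The scalar kernel below stands in for the 3×3 hydrodynamic tensor blocks (commonly Rotne-Prager-Yamakawa, our assumption, not stated in the abstract) that an actual BD code would evaluate; all names are ours.

    def matrix_free_matvec(positions, forces, kernel):
        """Matrix-free application of a mobility-like operator: each pairwise
        coefficient kernel(ri, rj) is recomputed on the fly rather than read
        from a stored dense matrix, cutting memory from O(n^2) to O(n)."""
        n = len(positions)
        out = [0.0] * n
        for i in range(n):
            for j in range(n):
                out[i] += kernel(positions[i], positions[j]) * forces[j]
        return out

    if __name__ == "__main__":
        pos = [0.0, 1.0, 2.5, 4.0]          # 1-D toy particle coordinates
        f = [1.0, 0.0, -1.0, 0.5]           # forces on each particle
        # toy kernel: unit self-mobility, 1/r decay between distinct particles
        k = lambda a, b: 1.0 if a == b else 1.0 / abs(a - b)
        print(matrix_free_matvec(pos, f, k))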
Power-Efficient Multiple Producer-Consumer (doi: 10.1109/IPDPS.2014.75)
R. Medhat, Borzoo Bonakdarpour, S. Fischmeister
Power efficiency has been one of the main objectives of hardware design in the last two decades. However, with the recent explosion of mobile computing and the increasing demand for green data centers, software power efficiency has risen to be an equally important factor. We argue that most classic concurrency control algorithms were designed in an era when power efficiency was not an important dimension in algorithm design. Such algorithms are applied to a wide range of problems, from kernel-level primitives in operating systems to networking devices and web services. These primitives and services are constantly and heavily invoked in any computer system, and at even larger scale in networking devices and data centers. Thus, even a small change in their power spectrum can make a huge impact on overall power consumption over long periods of time. This paper focuses on the classic producer-consumer problem. First, we study the power efficiency of different existing implementations of the producer-consumer problem. In particular, we present evidence that these implementations behave drastically differently with respect to power consumption. Second, we present a dynamic algorithm for the multiple producer-consumer problem, in which consumers in a multicore system use learning mechanisms to predict the rate of production and use this prediction to latch onto previously scheduled CPU wake-ups. Such group latching minimizes the overall number of CPU wake-ups and, in effect, power consumption. We enable consumers to dynamically reserve more pre-allocated memory when the production rate is too high; consumers may compete for the extra space and dynamically release it when it is no longer needed. Our experiments show that our algorithm provides up to a 40% decrease in the number of CPU wake-ups and a 30% decrease in power consumption. We validate the scalability of our algorithm with an increasing number of consumers.
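A toy Python model of the wake-up latching effect, under our own assumptions about polling periods and the shared tick size (this is not the paper's learning algorithm):

    def simulate_wakeups(periods, horizon, align, tick=4):
        """Counts distinct CPU wake-up instants for consumers polling at their
        predicted periods. With align=True each wake-up is rounded up to a
        shared tick, so consumers piggyback on already-scheduled wake-ups and
        the total number of idle-state exits shrinks."""
        events = set()
        for period in periods:
            t = period
            while t <= horizon:
                events.add(-(-t // tick) * tick if align else t)  # ceil to tick
                t += period
        return len(events)

    if __name__ == "__main__":
        periods = [3, 5, 7]    # predicted production intervals per consumer
        print(simulate_wakeups(periods, 100, align=False))  # scattered wake-ups
        print(simulate_wakeups(periods, 100, align=True))   # fewer, batched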