
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

Transactional Forwarding: Supporting Highly-Concurrent STM in Asynchronous Distributed Systems
Mohamed M. Saad, B. Ravindran
Distributed software transactional memory (or DTM) is an emerging and promising model for distributed concurrency control, as it avoids the problems with locks (e.g., distributed deadlocks) while retaining the programming simplicity of coarse-grained locking. We consider DTM in Herlihy and Sun's data-flow distributed execution model, where transactions are immobile and objects dynamically migrate to invoking transactions. To support DTM in this model and ensure transactional properties including atomicity, consistency, and isolation, we develop an algorithm called the Transactional Forwarding Algorithm (or TFA). TFA guarantees a consistent view of shared objects between distributed transactions, provides atomicity for object operations, and transparently handles object relocation and versioning using an asynchronous version clock-based validation algorithm. We show that TFA is opaque (its correctness property) and permits strong progressiveness (its progress property). We implement TFA in a Java DTM framework and conduct experimental studies on a 120-node system, executing over 4 million transactions, with more than 1000 active concurrent transactions. Our implementation reveals that TFA outperforms competing distributed concurrency control models, including Java RMI with spin locks, distributed shared memory, and directory-based DTM, by as much as 13x (for read-dominant transactions), and competitor DTM implementations by as much as 4x.
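The forwarding step can be pictured with a small sketch. The C fragment below is an illustrative toy, not the paper's implementation (which is written in Java, and whose identifiers are unknown here): a transaction snapshots a version clock at begin, and when it contacts a node whose clock is ahead, it revalidates its read-set and, if still consistent, advances ("forwards") its snapshot instead of aborting.

```c
/* Toy sketch of version-clock validation with "forwarding"; all names and
 * structures are illustrative assumptions, not TFA's actual code. */
#include <stdbool.h>

#define MAX_READS 16

typedef struct { int value; unsigned version; } object_t;

typedef struct {
    unsigned start_clock;              /* clock snapshot taken at begin */
    object_t *read_set[MAX_READS];     /* objects read so far */
    unsigned read_versions[MAX_READS]; /* versions observed at read time */
    int n_reads;
} txn_t;

static unsigned node_clock = 0;        /* this node's version clock */

static void txn_begin(txn_t *t) {
    t->start_clock = node_clock;
    t->n_reads = 0;
}

static bool txn_revalidate(const txn_t *t) {
    /* every object read must still carry the version we recorded */
    for (int i = 0; i < t->n_reads; i++)
        if (t->read_set[i]->version != t->read_versions[i])
            return false;
    return true;
}

/* On contact with a node whose clock is ahead, advance the transaction's
 * snapshot rather than aborting outright, but only if every earlier read
 * is still valid; otherwise the transaction must abort and retry. */
static bool txn_forward(txn_t *t, unsigned remote_clock) {
    if (remote_clock <= t->start_clock)
        return true;                   /* nothing to do */
    if (!txn_revalidate(t))
        return false;                  /* stale read-set: abort */
    t->start_clock = remote_clock;     /* forward the snapshot */
    return true;
}
```

The point of the forwarding test is that a transaction pays an abort only when one of its reads has actually been overwritten, not merely because clocks on different nodes have drifted apart.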
Cited by: 16
FusedOS: Fusing LWK Performance with FWK Functionality in a Heterogeneous Environment
Yoonho Park, E. V. Hensbergen, Marius Hillenbrand, T. Inglett, Bryan S. Rosenburg, K. D. Ryu, R. Wisniewski
Traditionally, there have been two approaches to providing an operating environment for high performance computing (HPC). A Full-Weight Kernel (FWK) approach starts with a general-purpose operating system and strips it down to better scale up across more cores and out across larger clusters. A Light-Weight Kernel (LWK) approach starts with a new thin kernel code base and extends its functionality by adding more system services needed by applications. In both cases, the goal is to provide end-users with a scalable HPC operating environment with the functionality and services needed to reliably run their applications. To achieve this goal, we propose a new approach, called Fused OS, that combines the FWK and LWK approaches. Fused OS provides an infrastructure capable of partitioning the resources of a multicore heterogeneous system and collaboratively running different operating environments on subsets of the cores and memory, without the use of a virtual machine monitor. With Fused OS, HPC applications can enjoy both the performance characteristics of an LWK and the rich functionality of an FWK through cross-core system service delegation. This paper presents the Fused OS architecture and a prototype implementation on Blue Gene/Q. The Fused OS prototype leverages Linux with small modifications as an FWK and implements a user-level LWK called Compute Library (CL) by leveraging CNK. We present CL performance results demonstrating low noise and show micro-benchmarks running with performance commensurate with that provided by CNK.
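The cross-core service delegation mentioned above can be sketched as a shared request queue between an LWK (CL) core and an FWK (Linux) core. The queue layout and field names below are assumptions made for illustration; they are not the Blue Gene/Q prototype's actual interface.

```c
/* Illustrative sketch of cross-core system-service delegation: a compute
 * (LWK) core posts a request that a Linux (FWK) core services, so the
 * compute core never executes general-purpose kernel code itself. All
 * structures here are assumptions, not Fused OS's code. */
#include <stdatomic.h>
#include <stdint.h>

#define QDEPTH 64

typedef struct {
    int        syscall_no;  /* service requested by the LWK side */
    uint64_t   args[6];
    int64_t    result;      /* filled in by the FWK core */
    atomic_int done;        /* set by the FWK core when serviced */
} delegation_req_t;

typedef struct {
    delegation_req_t slot[QDEPTH];
    atomic_uint tail;       /* producer index (LWK cores) */
    atomic_uint head;       /* consumer index (FWK service thread, not shown) */
} delegation_queue_t;

/* LWK side: post a request and wait for the FWK core to complete it. */
static int64_t delegate(delegation_queue_t *q, int no, uint64_t a0) {
    unsigned t = atomic_fetch_add(&q->tail, 1) % QDEPTH;
    delegation_req_t *r = &q->slot[t];
    r->syscall_no = no;
    r->args[0]    = a0;
    atomic_store(&r->done, 0);
    while (!atomic_load(&r->done))
        ;                   /* spin; the FWK core sets done */
    return r->result;
}
```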
Cited by: 53
Exploiting Phase-Change Memory in Cooperative Caches
Luiz E. Ramos, R. Bianchini
Modern servers require large main memories, which so far have been enabled by improvements in DRAM density. However, the scalability of DRAM is approaching its limit, so Phase-Change Memory (PCM) is being considered as an alternative technology. PCM is denser, more scalable, and consumes lower idle power than DRAM, while exhibiting byte-addressability and access times in the nanosecond range. Unfortunately, PCM is also slower than DRAM and has limited endurance. These characteristics prompted the study of hybrid memory systems, combining a small amount of DRAM and a large amount of PCM. In this paper, we leverage hybrid memories to improve the performance of cooperative memory caches in server clusters. Our approach entails a novel policy that exploits popularity information in placing objects across servers and memory technologies. Our results show that (1) DRAM-only and PCM-only memory systems do not perform well in all cases, and (2) when managed properly, hybrid memories always exhibit the best or close-to-best performance, with significant gains in many cases, without increasing energy consumption.
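The core idea of a popularity-based placement policy (put hot objects in the small, fast DRAM tier and colder objects in the denser PCM tier) can be sketched in a few lines. The threshold, counters, and aging step below are illustrative assumptions, not the paper's actual policy.

```c
/* Toy sketch of popularity-driven placement across a hybrid memory; the
 * threshold and structures are assumptions made for illustration. */
#include <stddef.h>

typedef enum { PLACE_DRAM, PLACE_PCM } placement_t;

typedef struct {
    size_t   size;
    unsigned hits;   /* popularity: accesses observed this epoch */
} cached_object_t;

/* Hot objects go to the small, fast DRAM partition; colder objects go to
 * the large PCM partition, which also steers writes away from PCM. */
static placement_t place(const cached_object_t *o, unsigned hot_threshold) {
    return (o->hits >= hot_threshold) ? PLACE_DRAM : PLACE_PCM;
}

/* Age counters between epochs so popularity tracks recent behavior. */
static void end_epoch(cached_object_t *objs, size_t n) {
    for (size_t i = 0; i < n; i++)
        objs[i].hits /= 2;
}
```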
Cited by: 11
Sparse Fast Fourier Transform on GPUs and Multi-core CPUs
Jiaxi Hu, Zhaosen Wang, Qiyuan Qiu, Weijun Xiao, D. Lilja
Given an N-point sequence, finding its k largest components in the frequency domain is a problem of great interest. This problem, usually referred to as a sparse Fourier Transform, was recently brought back on stage by a newly proposed algorithm called the sFFT. In this paper, we present a parallel implementation of sFFT on both multi-core CPUs and GPUs, using a human voice signal as a case study. Using this example, we estimate k at the 3 dB cutoff points through concrete experiments. In addition, three optimization strategies are presented in this paper. We demonstrate that the multi-core-based sFFT achieves speedups of up to three times over a single-threaded sFFT, while a GPU-based version achieves up to ten times speedup. For large-scale cases, the GPU-based sFFT also shows considerable advantages, with about a 40x speedup compared to the latest out-of-card FFT implementations [2].
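One plausible reading of the 3 dB estimate of k is to count the spectrum bins whose magnitude lies within 3 dB of the strongest bin; that interpretation, and the code below, are assumptions for illustration rather than the paper's actual procedure.

```c
/* Hedged sketch: estimate the sparsity k as the number of frequency bins
 * within 3 dB (a factor of 1/sqrt(2) in amplitude) of the peak bin. This
 * interpretation of the paper's estimate is an assumption. */
#include <math.h>
#include <stddef.h>

static size_t estimate_k(const double *mag, size_t n_bins) {
    double peak = 0.0;
    for (size_t i = 0; i < n_bins; i++)
        if (mag[i] > peak)
            peak = mag[i];

    double cutoff = peak / sqrt(2.0);  /* -3 dB relative to the peak */
    size_t k = 0;
    for (size_t i = 0; i < n_bins; i++)
        if (mag[i] >= cutoff)
            k++;
    return k;
}
```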
Cited by: 13
Efficiently Handling Memory Accesses to Improve QoS in Multicore Systems under Real-Time Constraints
José Luis March, S. Petit, J. Sahuquillo, H. Hassan, J. Duato
Chip multiprocessors (CMPs) are becoming the common choice for implementing embedded systems because they achieve a good tradeoff between performance and power. For manufacturability reasons, CMPs usually implement one or several memory controllers, each one shared by a set of cores. Thus, memory requests from distinct cores compete with each other when accessing memory. This means that the memory access latency can vary widely depending on the co-runners and the memory controller scheduling policy, thus yielding unpredictable behavior. This work focuses on the design of a memory controller to support workloads with real-time constraints, both hard real-time (HRT) and soft real-time (SRT) applications. These systems must guarantee the execution of HRT applications while improving the performance of the SRT applications. In this paper we propose two memory controller policies for multicore embedded systems: HR-first and ATR-first. The former prioritizes memory requests of HRT tasks, achieving significant energy savings but poor performance for SRT applications. The latter gives priority to those HRT requests that are critical to guarantee schedulability. Results show that the ATR-first policy presents energy consumption similar to the HR-first policy while reducing the number of SRT deadline misses by around 49% on average, and meets all deadlines in some scenarios.
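The two policies can be contrasted with a small arbiter sketch. The request fields and selection loops below are illustrative assumptions, not the hardware studied in the paper.

```c
/* Minimal sketch contrasting HR-first and ATR-first request selection;
 * the fields and tie-breaking rules are assumptions for illustration. */
#include <stdbool.h>

typedef struct {
    bool hrt;        /* issued by a hard real-time task */
    bool critical;   /* HRT request whose delay would threaten its deadline */
    int  age;        /* cycles spent waiting in the queue */
} mem_req_t;

/* HR-first: any HRT request beats every SRT request; oldest wins otherwise. */
static int pick_hr_first(const mem_req_t *q, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (best < 0) { best = i; continue; }
        if (q[i].hrt != q[best].hrt) {
            if (q[i].hrt) best = i;
        } else if (q[i].age > q[best].age) {
            best = i;                  /* FCFS within a priority class */
        }
    }
    return best;
}

/* ATR-first: only HRT requests flagged critical for schedulability are
 * prioritized, letting SRT requests through the rest of the time. */
static int pick_atr_first(const mem_req_t *q, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (best < 0) { best = i; continue; }
        bool ci = q[i].hrt && q[i].critical;
        bool cb = q[best].hrt && q[best].critical;
        if (ci != cb) {
            if (ci) best = i;
        } else if (q[i].age > q[best].age) {
            best = i;
        }
    }
    return best;
}
```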
Cited by: 1
Energy-Performance Tradeoffs in Software Transactional Memory
A. Baldassin, J. P. L. Carvalho, L. A. G. Garcia, R. Azevedo
Transactional memory (TM) is a new synchronization mechanism devised to simplify parallel programming, thereby helping programmers to unleash the power of current multicore processors. Although software implementations of TM (STM) have been extensively analyzed in terms of runtime performance, little attention has been paid to an equally important constraint faced by nearly all computer systems: energy consumption. In this work we conduct a comprehensive study of energy and runtime tradeoffs in software transactional memory systems. We characterize the behavior of three state-of-the-art lock-based STM algorithms, along with three different conflict resolution schemes. As a result of this characterization, we propose a DVFS-based technique that can be integrated into the resolution policies so as to improve the energy-delay product (EDP). Experimental results show that our DVFS-enhanced policies are indeed beneficial for applications with high contention levels. Improvements of up to 59% in EDP can be observed in this scenario, with an average EDP reduction of 16% across the STAMP workloads.
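The metric being optimized is the energy-delay product, the product of energy and execution time. The toy calculation below (with made-up numbers) shows how a DVFS policy can trade a small slowdown for a larger energy saving and still win on EDP.

```c
/* EDP = energy * execution_time. Illustrative numbers only: slowing a
 * contended thread wastes less energy on doomed transactions, at the cost
 * of a small time penalty. */
#include <stdio.h>

int main(void) {
    /* baseline: full frequency, many aborted (wasted) transactions */
    double e_base = 100.0, t_base = 10.0;   /* joules, seconds (made up) */
    /* DVFS-enhanced policy: lower frequency under contention */
    double e_dvfs = 60.0,  t_dvfs = 11.0;

    double edp_base = e_base * t_base;      /* 1000 J*s */
    double edp_dvfs = e_dvfs * t_dvfs;      /*  660 J*s */
    printf("EDP improvement: %.0f%%\n",
           100.0 * (edp_base - edp_dvfs) / edp_base);   /* prints 34% */
    return 0;
}
```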
Cited by: 9
Global Data Re-allocation via Communication Aggregation in Chapel
Alberto Sanz, R. Asenjo, Juan López, R. Larrosa, A. Navarro, V. Litvinov, Sung-Eun Choi, B. Chamberlain
Chapel is a parallel programming language designed to improve the productivity and ease of use of conventional and parallel computers. This language currently delivers suboptimal performance when executing codes that perform global data re-allocation operations on distributed memory architectures. This is mainly due to data communication that is done without aggregation (one message for each remote array element). In this work, we analyze Chapel's standard Block and Cyclic distribution modules and optimize the communication routines for array assignments by performing aggregation. Thanks to the expressive power of Chapel, the compiler and runtime have enough information to do communication aggregation without user intervention. The runtime relies on the low-level GASNet networking layer, whose one-sided bulk put/get routines with stride support are particularly useful for us. Experimental results conducted on Hector (a Cray XE6) and Jaguar (a Cray XK6) reveal that the implemented techniques can lead to significant reductions in communication time.
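The effect of aggregation can be illustrated by contrasting per-element puts with a single strided bulk put. In the sketch below, remote_put() and remote_put_strided() are hypothetical stand-ins for GASNet-level primitives (given local stub bodies so the fragment is self-contained); they are not Chapel's or GASNet's actual entry points.

```c
/* Sketch of the aggregation idea: instead of one message per remote array
 * element, coalesce a whole strided slice into a single bulk transfer. */
#include <stddef.h>
#include <string.h>

static void remote_put(int node, void *dst, const void *src, size_t nbytes) {
    (void)node;                  /* a real runtime would send one message */
    memcpy(dst, src, nbytes);
}

static void remote_put_strided(int node, void *dst, size_t dst_stride,
                               const void *src, size_t src_stride,
                               size_t elem_size, size_t count) {
    (void)node;                  /* one network operation for all elements */
    for (size_t i = 0; i < count; i++)
        memcpy((char *)dst + i * dst_stride,
               (const char *)src + i * src_stride, elem_size);
}

/* Naive assignment of a distributed matrix column: one message per element. */
static void copy_column_naive(int node, double *dst, size_t ld_dst,
                              const double *src, size_t ld_src, size_t rows) {
    for (size_t i = 0; i < rows; i++)
        remote_put(node, &dst[i * ld_dst], &src[i * ld_src], sizeof(double));
}

/* Aggregated assignment: a single strided bulk put for the whole column. */
static void copy_column_aggregated(int node, double *dst, size_t ld_dst,
                                   const double *src, size_t ld_src,
                                   size_t rows) {
    remote_put_strided(node, dst, ld_dst * sizeof(double),
                       src, ld_src * sizeof(double), sizeof(double), rows);
}
```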
Cited by: 21
Network Endpoints for Clusters of SMPs
Ilie Gabriel Tanase, G. Almási, Hanhong Xue, C. Archer
Modern large-scale parallel machines feature an increasingly deep hierarchy of interconnections. Individual processing cores employ simultaneous multithreading (SMT) to better exploit functional units, multiple coherent processors are collocated in a node to better exploit links to cache, memory, and network (SMP), and multiple nodes are interconnected by specialized low-latency/high-speed networks. Current trends indicate ever wider SMP nodes in the future. To service these nodes, modern high performance network devices (including Infiniband and all of IBM's recent offerings) offer the ability to subdivide the network devices' resources among the processing threads. System software, however, lags in exploiting these capabilities, leaving users of, e.g., MPI [14] and UPC [19] in a bind, requiring complex and fragile workarounds in user programs. In this paper we discuss our implementation of endpoints, the software paradigm central to the IBM PAMI messaging library [3]. A PAMI endpoint is an expression in software of a slice of the network device. System software can service endpoints without serializing the many threads on an SMP by forcing them through a critical section. In the paper we describe the basic guarantees offered by PAMI to the programmer, and how these can be used to enable efficient implementations of high-level libraries and programming languages like UPC. We evaluate the efficiency of our implementation on a novel P7IH system with up to 4096 cores, running micro-benchmarks designed to find performance deficiencies in the endpoints implementation of both point-to-point and collective functions.
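The endpoint idea (each thread owns a private slice of the device, so posting work needs no shared critical section) can be sketched generically. The types below are illustrative assumptions and deliberately avoid PAMI's actual API.

```c
/* Generic sketch of network endpoints: one private send queue per thread,
 * so threads on an SMP post work without serializing through one lock.
 * These structures are illustrative, not PAMI's API. */
#include <stdatomic.h>

#define N_ENDPOINTS 8

typedef struct {
    atomic_uint sq_tail;   /* private send-queue tail: no cross-thread contention */
    /* ... doorbell register, completion queue, per-endpoint buffers ... */
} endpoint_t;

static endpoint_t endpoints[N_ENDPOINTS];

/* Each thread posts to its own endpoint; with a single shared queue, every
 * post would have to pass through one critical section instead. */
static void post_send(int thread_id, unsigned descriptor) {
    endpoint_t *ep = &endpoints[thread_id % N_ENDPOINTS];
    unsigned slot = atomic_fetch_add_explicit(&ep->sq_tail, 1,
                                              memory_order_relaxed);
    (void)slot;
    (void)descriptor;      /* a real device would write the descriptor into
                              the slot and ring the doorbell here */
}
```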
Cited by: 5
The Network Adapter: The Missing Link between MPI Applications and Network Performance
G. Rodríguez, C. Minkenberg, R. Luijten, R. Beivide, P. Geoffray, J. Labarta, M. Valero, Steve Poole
Network design aspects that influence cost and performance can be classified, according to their distance from the applications, into issues concerning topology, switch technology, link technology, network adapter, and communication library. The network adapter has a privileged position to make decisions with more global information than any other component in the network. It receives feedback from the switches and requests from the communication libraries and applications. Also, compared to a network switch, an adapter has access to significantly more memory (host memory and on-chip memory) and memory bandwidth (which typically exceeds network bandwidth). The potential of the adapter to improve global network performance has not yet been fully exploited. In this work we show a series of noticeable performance improvements (of at least 10% to 15%) for medium-sized message exchanges in typical HPC communication patterns, obtained by optimizing message segmentation and packet injection policies that can be implemented inexpensively in an adapter's firmware. We also show that implementing equivalent solutions in the switch (as opposed to the adapter) leads to only marginal performance improvements compared to the ones obtained by controlling the segmentation and injection policy at the adapter, while involving significantly more cost. In addition, enhancing the adapter will lead to less hardware complexity in the switches, thus reducing cost and energy consumption.
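A segmentation-plus-injection policy of the kind argued for above can be sketched as firmware-style C. The MTU, burst size, and throttling hook are illustrative parameters, not the paper's tuned values.

```c
/* Sketch of adapter-side segmentation with paced injection: split a message
 * into packets and yield after each burst so other flows and downstream
 * switch buffers are not overwhelmed. Parameters are illustrative. */
#include <stddef.h>

typedef void (*inject_fn)(const char *pkt, size_t len);
typedef void (*throttle_fn)(void);

static void segment_and_inject(const char *msg, size_t len,
                               size_t mtu, size_t burst,
                               inject_fn inject, throttle_fn throttle) {
    size_t sent = 0, in_burst = 0;
    while (sent < len) {
        size_t chunk = (len - sent < mtu) ? (len - sent) : mtu;
        inject(msg + sent, chunk);     /* hand one packet to the link */
        sent += chunk;
        if (++in_burst == burst) {     /* pacing point */
            throttle();
            in_burst = 0;
        }
    }
}
```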
Cited by: 1
Level-3 BLAS on the TI C6678 Multi-core DSP
Murtaza Ali, E. Stotzer, Francisco D. Igual, R. V. D. Geijn
Digital Signal Processors (DSPs) are commonly employed in embedded systems. The increase of processing needs in cellular base-stations, radio controllers, and industrial/medical imaging systems has led to the development of multi-core DSPs as well as the inclusion of floating-point operations while maintaining low power dissipation. The eight-core DSP from Texas Instruments, codenamed TMS320C6678, provides a peak performance of 128 GFLOPS (single precision) and an effective 32 GFLOPS (double precision) for only 10 watts. In this paper, we present the first complete implementation of the Level-3 Basic Linear Algebra Subprograms (BLAS) routines for this DSP and report their performance. These routines are first optimized for a single core and then parallelized over the different cores using OpenMP constructs. The results show that we can achieve about 8 single-precision GFLOPS/watt and 2.2 double-precision GFLOPS/watt for General Matrix-Matrix multiplication (GEMM). The performance of the rest of the Level-3 BLAS routines is within 90% of the corresponding GEMM routines.
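The parallelization strategy described (optimize a single-core kernel, then split work across cores with OpenMP) has the shape sketched below. This is a minimal illustrative GEMM, not TI's optimized kernel, which would add blocking for on-chip memory and DMA double-buffering.

```c
/* Minimal GEMM sketch whose outer loop is split across cores with OpenMP;
 * compile with the toolchain's OpenMP flag (e.g., -fopenmp with GCC).
 * C = A*B + C, row-major, A is m x k, B is k x n, C is m x n. */
void sgemm_simple(int m, int n, int k,
                  const float *A, const float *B, float *C) {
    #pragma omp parallel for           /* each core takes a slab of rows */
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++) {
            float a = A[i * k + p];    /* reuse one A element per inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[p * n + j];
        }
}
```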
Cited by: 32