
Latest publications from the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Message from the WAMCA 2020 General Chair
WAMCA was created as an associated workshop of SBAC-PAD in 2009. The aim was to provide a dedicated channel for contributions and discussions on multi-core applications. Since then, it has been held every year in conjunction with the corresponding SBAC-PAD. The initial topic has been extended to cover all topics related to shared-memory parallelism and accelerators. This adaptation was necessary because most of the accelerators that have emerged so far follow a shared-memory model, even if the original data typically come from a remote main memory. This year, we received 23 submissions and accepted 10, with an average of 3 reviews per paper, for an acceptance rate of 43%. We thank the authors of all submitted papers for their consideration, and we hope to remain attractive to an increasingly large community.
DOI: 10.1109/sbac-pad49847.2020.00053
Citations: 0
Analyzing the Loop Scheduling Mechanisms on Julia Multithreading
Diana A. Barros, C. Bentes
Julia is a fairly recent dynamic language proposed to tackle the trade-off between productivity and efficiency. The idea is to provide the usability of languages such as Python or MATLAB side by side with the performance of C and C++. Support for multithreaded programming in Julia was only released last year, and therefore still requires performance studies. In this work, we focus on parallel loops and, more specifically, on the available mechanisms for assigning loop iterations to threads. We analyse the performance of the macros @spawn and @threads, used for loop parallelization. Our results show that there is no best-fit solution for all cases. The use of @spawn provides better load balance for unbalanced loops with reasonably heavy iterations, but incurs high work-stealing overhead. @threads, in contrast, has low overhead and works well for loops with good balance among iterations.
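The trade-off the abstract describes can be illustrated outside Julia as well. The sketch below contrasts the two scheduling strategies in Python: static chunking (analogous to @threads) versus one task per iteration (analogous to @spawn). The workload function and thread count are illustrative assumptions, not taken from the paper:

```python
import time
from concurrent.futures import ThreadPoolExecutor

N, THREADS = 32, 4

def work(i):
    # simulated unbalanced loop body: later iterations are heavier
    time.sleep(0.001 * i)
    return i * i

# Static chunking (analogous to Julia's @threads): each worker gets one
# contiguous block of iterations, with no load balancing at run time.
def static_chunks():
    chunk = N // THREADS
    with ThreadPoolExecutor(max_workers=THREADS) as ex:
        parts = ex.map(lambda t: [work(i) for i in range(t * chunk, (t + 1) * chunk)],
                       range(THREADS))
        return [y for part in parts for y in part]

# Per-iteration tasks (analogous to Julia's @spawn): idle workers pick up
# remaining iterations, at the cost of per-task scheduling overhead.
def per_iteration_tasks():
    with ThreadPoolExecutor(max_workers=THREADS) as ex:
        return list(ex.map(work, range(N)))

assert static_chunks() == per_iteration_tasks() == [i * i for i in range(N)]
```

With this unbalanced workload, the last static chunk dominates the run time, while the per-iteration version keeps all workers busy but schedules N tasks instead of THREADS.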
DOI: 10.1109/SBAC-PAD49847.2020.00043
Citations: 2
High-Performance Low-Memory Lowering: GEMM-based Algorithms for DNN Convolution
Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg
Deep Neural Network convolution is often implemented with general matrix multiplication (GEMM) using the well-known im2col algorithm. This algorithm constructs a Toeplitz matrix from the input feature maps and multiplies it by the convolutional kernel. With input feature map dimensions C × H × W and kernel dimensions M × C × K^2, im2col requires O(K^2CHW) additional space. Although this approach is very popular, there has been little study of the associated design space. We show that the im2col algorithm is just one point in a regular design space of algorithms which translate convolution to GEMM. We enumerate this design space and experimentally evaluate each algorithmic variant. Our evaluation yields several novel low-memory algorithms which match the performance of the best-known approaches despite requiring only a small fraction of the additional memory.
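The im2col lowering described above can be sketched in a few lines of NumPy (stride 1, no padding; an illustrative reference implementation, not one of the paper's optimized variants):

```python
import numpy as np

def conv_im2col(x, w):
    """Convolve x (C,H,W) with kernels w (M,C,K,K) via im2col + GEMM.
    Stride 1, no padding."""
    C, H, W = x.shape
    M, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # im2col: one column of length C*K*K per output pixel,
    # hence the O(K^2 * C * H * W) additional space the abstract mentions
    cols = np.empty((C * K * K, Ho * Wo))
    idx = 0
    for i in range(Ho):
        for j in range(Wo):
            cols[:, idx] = x[:, i:i + K, j:j + K].ravel()
            idx += 1
    # GEMM: (M, C*K*K) x (C*K*K, Ho*Wo) -> (M, Ho*Wo)
    out = w.reshape(M, -1) @ cols
    return out.reshape(M, Ho, Wo)

# cross-check against a direct (naive) convolution
x = np.random.default_rng(0).random((3, 8, 8))
w = np.random.default_rng(1).random((4, 3, 3, 3))
ref = np.zeros((4, 6, 6))
for m in range(4):
    for i in range(6):
        for j in range(6):
            ref[m, i, j] = np.sum(x[:, i:i + 3, j:j + 3] * w[m])
assert np.allclose(conv_im2col(x, w), ref)
```

The paper's contribution is precisely the design space around this baseline: alternative patch-matrix layouts that trade the K^2 space blow-up for extra GEMM calls.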
DOI: 10.1109/SBAC-PAD49847.2020.00024
Citations: 13
XPySom: High-Performance Self-Organizing Maps
Riccardo Mancini, Antonio Ritacco, Giacomo Lanciano, T. Cucinotta
In this paper, we introduce XPySom, a new open-source Python implementation of the well-known Self-Organizing Map (SOM) technique. It is designed to achieve high performance on a single node, exploiting widely available Python libraries for vector processing on multi-core CPUs and GP-GPUs. We present results from an extensive experimental evaluation of XPySom in comparison to widely used open-source SOM implementations, showing that it outperforms the other available alternatives. Indeed, our experimentation with the Extended MNIST open data set shows speed-ups of about 7x and 100x over the best open-source implementations we could find, using multi-core and GP-GPU acceleration respectively, while achieving the same accuracy levels in terms of quantization error.
DOI: 10.1109/SBAC-PAD49847.2020.00037
Citations: 6
Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead
Hongyang Sun, Ana Gainaru, Manu Shantharam, P. Raghavan
The increasing scale and complexity of today's high-performance computing (HPC) systems demand a renewed focus on enhancing the resilience of long-running scientific applications in the presence of faults. Many of these applications are iterative in nature as they operate on sparse matrices that concern the simulation of partial differential equations (PDEs) which numerically capture the physical properties on discretized spatial domains. While these applications currently benefit from many application-agnostic resilience techniques at the system level, such as checkpointing and replication, there is significant overhead in deploying these techniques. In this paper, we seek to develop application-aware resilience techniques that leverage an iterative application's intrinsic resiliency to faults and selectively protect certain elements, thereby reducing the resilience overhead. Specifically, we investigate the impact of soft errors on the widely used Preconditioned Conjugate Gradient (PCG) method, whose reliability depends heavily on the error propagation through the sparse matrix-vector multiplication (SpMV) operation. By characterizing the performance of PCG in correlation with a numerical property of the underlying sparse matrix, we propose a selective protection scheme that protects only certain critical elements of the operation based on an analytical model. An experimental evaluation using 20 sparse matrices from the SuiteSparse Matrix Collection shows that our proposed scheme is able to reduce the resilience overhead by as much as 70.2% and an average of 32.6% compared to the baseline techniques with full-protection or zero-protection.
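The solver under study can be summarized in code. Below is a minimal Jacobi-preconditioned Conjugate Gradient in NumPy/SciPy, where the `A @ p` line is the sparse matrix-vector multiplication (SpMV) whose soft-error propagation the paper analyzes; the solver itself is textbook PCG, not the paper's protection scheme:

```python
import numpy as np
import scipy.sparse as sp

def pcg(A, b, tol=1e-10, maxit=1000):
    """Textbook Jacobi-preconditioned Conjugate Gradient."""
    Minv = 1.0 / A.diagonal()              # Jacobi (diagonal) preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for it in range(maxit):
        Ap = A @ p                         # SpMV: the operation to protect
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x

# SPD test system: 1D Poisson matrix
n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = pcg(A, b)
assert np.linalg.norm(A @ x - b) < 1e-8
```

A single flipped bit in one entry of `Ap` silently corrupts `x`, `r`, and `p` in all later iterations, which is why the paper targets the SpMV for selective protection.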
DOI: 10.1109/SBAC-PAD49847.2020.00029
Citations: 2
Towards Profile-Guided Optimization for Safe and Efficient Parallel Stream Processing in Rust
Stefan Sydow, Mohannad Nabelsee, S. Glesner, Paula Herber
The efficient mapping of stream processing applications to parallel hardware architectures is a difficult problem. While parallelization is often highly desirable as it reduces the overall execution time, its advantages must be carefully weighed against the overhead it introduces in complexity and communication costs. This paper presents a novel profile-guided optimization for parallel stream processing based on the multi-paradigm system programming language Rust. Our approach's key idea is to systematically balance the performance gain that can be achieved from parallelization against the communication overhead. To achieve this, we 1) use profiling to gain tight estimates of task execution times, 2) evaluate the cost of the fundamental concurrency constructs in Rust with synthetic benchmarks, and exploit this information to estimate the communication overhead introduced by various degrees of parallelism, and 3) present a novel optimization algorithm that exploits both estimates to fine-tune the degree of parallelism and train processing in a given application. Overall, our approach enables us to map parallel stream processing applications to parallel hardware efficiently. The safety concepts anchored in Rust ensure the reliability of the resulting implementation. We demonstrate our approach's practical applicability with two case studies: the word count problem and aircraft telemetry decoding.
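The balance between parallel speedup and communication overhead can be captured by a toy analytical model: profiled work divided across n workers, plus a per-worker communication cost. The cost function below is an illustrative assumption, not the paper's actual model:

```python
def best_degree(total_work_us, per_worker_comm_us, max_threads):
    """Pick the degree of parallelism minimizing estimated run time:
    profiled work is split across n workers, while each worker adds a
    measured communication cost (an assumed, simplified cost model)."""
    def estimate(n):
        return total_work_us / n + per_worker_comm_us * n
    return min(range(1, max_threads + 1), key=estimate)

# cheap communication -> use all available threads
assert best_degree(10_000, 10, 16) == 16
# expensive communication -> a low degree of parallelism wins
assert best_degree(10_000, 1_000, 16) == 3
```

The paper's steps 1) and 2) supply exactly the two inputs of such a model: profiled task times and benchmarked per-construct communication costs.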
DOI: 10.1109/SBAC-PAD49847.2020.00047
Citations: 3
On-chip Parallel Photonic Reservoir Computing using Multiple Delay Lines
S. Hasnain, R. Mahapatra
Silicon-photonics architectures have enabled high-speed hardware implementations of Reservoir Computing (RC). With a delayed feedback reservoir (DFR) model, only one non-linear node is needed to perform RC. However, the delay is often provided by off-chip fiber optics, which not only consumes space but also becomes an architectural bottleneck and hinders scalability. In this paper, we propose a completely on-chip photonic RC architecture for high-performance computing, employing multiple electronically tunable delay lines and a micro-ring resonator (MRR) switch for multi-tasking. The proposed architecture yields 84% less error than the state-of-the-art standalone architecture in [8] when executing the NARMA task. For multi-tasking, the proposed architecture shows 80% better performance than [8], and it outperforms all other proposed architectures as well. The on-chip area and power overheads of the proposed architecture due to the delay lines and the MRR switch are 0.0184 mm^2 and 26 mW, respectively.
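A delayed feedback reservoir of the kind mentioned above can be simulated numerically: a single nonlinear node whose delay line is read out as a set of "virtual nodes". The update rule and parameter values below are illustrative assumptions, not the photonic architecture's values:

```python
import numpy as np

def dfr_states(u, n_virtual=50, eta=0.5, gamma=0.05, seed=0):
    """Delayed feedback reservoir: one nonlinear node; the n_virtual taps
    of its delay line form the reservoir state read out by a linear model."""
    rng = np.random.default_rng(seed)
    mask = rng.uniform(-1, 1, n_virtual)   # fixed input mask over virtual nodes
    state = np.zeros(n_virtual)            # current contents of the delay line
    states = []
    for u_t in u:
        # each virtual node mixes its delayed value with the masked input
        state = np.tanh(eta * state + gamma * mask * u_t)
        states.append(state)
    return np.array(states)

X = dfr_states(np.sin(np.linspace(0, 8 * np.pi, 200)))
assert X.shape == (200, 50)
```

Benchmarks such as NARMA, cited in the abstract, train a linear readout on these states to predict a target time series.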
DOI: 10.1109/SBAC-PAD49847.2020.00015
Citations: 2
Optimizing Green Energy Consumption of Fog Computing Architectures
A. Gougeon, Benjamin Camus, Anne-Cécile Orgerie
The Cloud already represents an important part of global energy consumption, and this consumption keeps increasing. Many solutions have been investigated to increase its energy efficiency and to reduce its environmental impact. However, with the introduction of new requirements, notably in terms of latency, an architecture complementary to the Cloud is emerging: the Fog. The Fog computing paradigm represents a distributed architecture closer to the end-user. Its necessity and feasibility have been repeatedly demonstrated in recent works. However, its impact on energy consumption is often neglected, and the integration of renewable energy has not been considered yet. The goal of this work is to exhibit an energy-efficient Fog architecture considering the integration of renewable energy. We explore three resource allocation algorithms and three consolidation policies. Our simulation results, based on real traces, show that the intrinsically low computing capability of the nodes in a Fog context makes it harder to exploit renewable energy. In addition, the share of the consumption from the communication network between the computing resources increases in this context, and the communication devices are even harder to power through renewable sources.
DOI: 10.1109/SBAC-PAD49847.2020.00021
Citations: 5
A Fast and Concise Parallel Implementation of the 8x8 2D IDCT using Halide
Martin J. Johnson, D. Playne
The Inverse Discrete Cosine Transform (IDCT) is commonly used for image and video decoding. Due to the ubiquitous nature of this application area, very efficient implementations of the IDCT are of great importance and have led to the development of highly optimized libraries. The popular libjpeg-turbo library contains thousands of lines of handwritten assembly code utilizing SIMD instruction sets for a variety of architectures. We present an alternative approach, implementing the 8x8 2D IDCT in the image processing language Halide - a high-level, functional language that allows for concise, portable, parallel and very efficient code. We show how fewer than 100 lines of Halide can replace over 1000 lines of code for each architecture in the libjpeg-turbo library to perform JPEG decoding. The Halide implementation is compared for the ARMv8 and x86-64 SIMD extensions and shows a 5-25 percent performance improvement over the SIMD code in libjpeg-turbo, while also being much easier to maintain and port to new architectures.
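The 2D IDCT is separable into 1D transforms over rows and then columns, which is precisely what makes compact SIMD (and Halide) implementations possible. A minimal cross-check of this separability using SciPy:

```python
import numpy as np
from scipy.fft import idct, idctn

def idct2_8x8(block):
    """8x8 2D IDCT computed separably: a 1D IDCT over columns followed by
    a 1D IDCT over rows (the order is interchangeable)."""
    return idct(idct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

coeffs = np.random.default_rng(0).random((8, 8))
# the separable result matches SciPy's full 2D inverse transform
assert np.allclose(idct2_8x8(coeffs), idctn(coeffs, norm="ortho"))
```

Production decoders typically combine this separability with fixed-point arithmetic and SIMD vectorization across the eight rows or columns of each block.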
DOI: 10.1109/SBAC-PAD49847.2020.00032
Cited by: 0
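The paper's Halide implementation itself is not reproduced here, but the algorithm it optimizes is the standard separable 2D IDCT: apply the 8-point 1D inverse transform to the rows of the coefficient block, then to the columns. A minimal NumPy sketch of that algorithm (the basis construction and names are chosen here for illustration, not taken from the paper):

```python
import numpy as np

# Build the orthonormal 8-point DCT-II basis matrix; with this scaling,
# the inverse transform is simply the transpose.
N = 8
C = np.zeros((N, N))
for k in range(N):
    for n in range(N):
        scale = np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)
        C[k, n] = scale * np.cos(np.pi * (2 * n + 1) * k / (2 * N))

def idct_8x8(coeffs):
    """2D inverse DCT of an 8x8 coefficient block via two separable
    1D passes: C.T applied on the left transforms the columns,
    C on the right transforms the rows."""
    return C.T @ coeffs @ C

# Round-trip check: forward 2D DCT followed by the inverse
# recovers the original block.
x = np.arange(64, dtype=float).reshape(8, 8)
coeffs = C @ x @ C.T          # forward 2D DCT
restored = idct_8x8(coeffs)   # inverse 2D DCT
assert np.allclose(restored, x)
```

The separability shown here (two matrix multiplications instead of one 64x64 transform) is what makes the kernel amenable to the SIMD vectorization that both libjpeg-turbo's assembly and the Halide schedule exploit.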
Online Sharing-Aware Thread Mapping in Software Transactional Memory
Douglas Pereira Pasqualin, M. Diener, A. R. D. Bois, M. Pilla
Software Transactional Memory (STM) is an alternative abstraction to synchronize processes in parallel programming. One advantage is simplicity since it is possible to replace the use of explicit locks with atomic blocks. Regarding STM performance, many studies already have been made focusing on reducing the number of aborts. However, in current multicore architectures with complex memory hierarchies, it is also important to consider where the memory of a program is allocated and how it is accessed. This paper proposes the use of a technique called sharing-aware mapping, which maps threads to cores of an application based on their memory access behavior, to achieve better performance in STM systems. We introduce STMap, an online, low overhead mechanism to detect the sharing behavior and perform the mapping directly inside the STM library, by tracking and analyzing how threads perform STM operations. In experiments with the STAMP benchmark suite and synthetic benchmarks, STMap shows performance gains of up to 77% on a Xeon system (17.5% on average) and 85% on an Opteron system (9.1% on average), compared to the Linux scheduler.
{"title":"Online Sharing-Aware Thread Mapping in Software Transactional Memory","authors":"Douglas Pereira Pasqualin, M. Diener, A. R. D. Bois, M. Pilla","doi":"10.1109/SBAC-PAD49847.2020.00016","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00016","url":null,"abstract":"Software Transactional Memory (STM) is an alternative abstraction to synchronize processes in parallel programming. One advantage is simplicity since it is possible to replace the use of explicit locks with atomic blocks. Regarding STM performance, many studies already have been made focusing on reducing the number of aborts. However, in current multicore architectures with complex memory hierarchies, it is also important to consider where the memory of a program is allocated and how it is accessed. This paper proposes the use of a technique called sharing-aware mapping, which maps threads to cores of an application based on their memory access behavior, to achieve better performance in STM systems. We introduce STMap, an online, low overhead mechanism to detect the sharing behavior and perform the mapping directly inside the STM library, by tracking and analyzing how threads perform STM operations. 
In experiments with the STAMP benchmark suite and synthetic benchmarks, STMap shows performance gains of up to 77% on a Xeon system (17.5% on average) and 85% on an Opteron system (9.1% on average), compared to the Linux scheduler.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127867442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
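STMap itself runs inside the STM library and observes real transactional read/write sets; as a rough illustration of the sharing-aware mapping idea, the hypothetical sketch below builds a thread-by-thread sharing matrix from sampled (thread, memory block) accesses and greedily pairs the threads that share the most data, so each pair can be bound to cores that share a cache level. The names and the greedy pairing heuristic are assumptions for illustration, not STMap's actual algorithm:

```python
from collections import defaultdict
from itertools import combinations

def sharing_matrix(accesses, n_threads):
    """accesses: iterable of (thread_id, memory_block) pairs sampled at
    runtime. Returns a symmetric matrix m where m[i][j] counts memory
    blocks touched by both thread i and thread j."""
    touched = defaultdict(set)
    for tid, block in accesses:
        touched[block].add(tid)
    m = [[0] * n_threads for _ in range(n_threads)]
    for tids in touched.values():
        for i, j in combinations(sorted(tids), 2):
            m[i][j] += 1
            m[j][i] += 1
    return m

def greedy_pairing(m):
    """Repeatedly pair the two unpaired threads with the highest
    sharing count; each pair is a candidate for neighboring cores."""
    unpaired = set(range(len(m)))
    pairs = []
    while len(unpaired) > 1:
        i, j = max(((a, b) for a in unpaired for b in unpaired if a < b),
                   key=lambda p: m[p[0]][p[1]])
        pairs.append((i, j))
        unpaired -= {i, j}
    return pairs

# Four threads: 0 and 2 share two blocks, 1 and 3 share one.
acc = [(0, 'a'), (2, 'a'), (0, 'b'), (2, 'b'), (1, 'c'), (3, 'c')]
print(greedy_pairing(sharing_matrix(acc, 4)))  # → [(0, 2), (1, 3)]
```

The interesting part of the paper is doing this online with low overhead; the sketch only shows why grouping heavily-sharing threads onto nearby cores reduces cross-cache traffic.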
Journal
2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)