
2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT): Latest Publications

Architecting a Novel Hybrid Cache with Low Energy
Jiacong He, Joseph Callenes-Sloan
To address the memory wall problem and keep pace with the high processing speed of multicore processors, there is significant demand for large cache capacity in the future. A 3D die-stacked DRAM cache, with its high density, can serve as a large cache compared with a conventional SRAM cache. However, energy becomes an inevitable challenge as the DRAM cache grows in size. STT-RAM, with near-zero leakage, can be integrated with the DRAM cache as a hybrid cache to reduce static energy, but the high write energy of STT-RAM brings another energy challenge. We observe that volatile STT-RAM can be utilized in the hybrid cache as a buffer to balance the high static energy of DRAM and the high dynamic energy of non-volatile STT-RAM.
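The balancing idea can be conveyed with a small energy-accounting sketch. This is a toy model built on assumptions: the energy constants, the buffer_hit_rate parameter, and the hybrid_energy helper are invented for illustration and do not come from the paper.

```python
# Illustrative sketch (not the paper's design): a hybrid cache model in which
# write-heavy blocks go to DRAM ways and read-mostly blocks to non-volatile
# STT-RAM ways, with a small volatile-STT-RAM write buffer in front.
# All energy constants are made-up placeholders.

DRAM_LEAK_PER_CYCLE  = 5e-12   # high static (refresh/leakage) energy, J/cycle
NVSTT_LEAK_PER_CYCLE = 1e-13   # near-zero leakage
NVSTT_WRITE_ENERGY   = 2e-10   # expensive non-volatile write, J
DRAM_WRITE_ENERGY    = 5e-11
VSTT_WRITE_ENERGY    = 6e-11   # volatile STT-RAM: relaxed retention -> cheaper writes

def hybrid_energy(accesses, cycles, buffer_hit_rate=0.6):
    """accesses: list of ('r'|'w', region) with region in {'dram', 'nvstt'}."""
    energy = cycles * (DRAM_LEAK_PER_CYCLE + NVSTT_LEAK_PER_CYCLE)
    for op, region in accesses:
        if op == 'w' and region == 'nvstt':
            # A fraction of writes are absorbed by the volatile STT-RAM buffer
            # instead of paying the full non-volatile write energy.
            energy += (buffer_hit_rate * VSTT_WRITE_ENERGY +
                       (1 - buffer_hit_rate) * NVSTT_WRITE_ENERGY)
        elif op == 'w':
            energy += DRAM_WRITE_ENERGY
    return energy

trace = [('w', 'nvstt')] * 1000 + [('r', 'dram')] * 4000
print(f"estimated energy: {hybrid_energy(trace, cycles=100_000):.3e} J")
```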
Citations: 0
Introspective Computing
Karl Taht, R. Balasubramonian
We live in an era of specialized tasks, ranging from graphics to networking, graph processing, machine learning, and more. While hardware accelerators cater to mainstream demands, general-purpose units will always be challenged to run new software. Introspective Computing focuses on building a feedback mechanism to tune dynamic hardware features in real time. Unlike most prior work, our study is done entirely on a real system, using hardware resources that are tunable in most modern Intel processors.
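As a rough illustration of such a feedback loop, the following sketch hill-climbs a single hardware knob using a sampled performance signal. The read_ipc stub, the knob levels, and the epoch count are hypothetical; on real hardware the signal would come from performance counters.

```python
# A minimal sketch of the feedback idea (not the authors' controller):
# periodically sample a performance signal and hill-climb over a discrete
# hardware knob (e.g., a prefetcher aggressiveness level).
import random

def read_ipc(knob):
    # Stub: pretend IPC peaks at knob level 2, with measurement noise.
    return 1.0 + 0.2 * (2 - abs(knob - 2)) + random.gauss(0, 0.01)

def introspective_loop(levels=range(4), epochs=20):
    knob = 0
    for _ in range(epochs):
        base = read_ipc(knob)
        # Try neighbouring settings and keep whichever improves the signal.
        for cand in (knob - 1, knob + 1):
            if cand in levels and read_ipc(cand) > base:
                knob = cand
                break
    return knob

print("settled on knob level", introspective_loop())
```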
Citations: 0
SuperGraph-SLP Auto-Vectorization
Vasileios Porpodas
SIMD vectors help improve the performance of certain applications. Code gets vectorized into SIMD form either by hand or automatically with auto-vectorizing compilers. The Superword-Level Parallelism (SLP) vectorization algorithm is a widely used algorithm for vectorizing straight-line code and is part of most industrial compilers. The algorithm attempts to pack scalar instructions into vectors, starting from specific seed instructions, in a bottom-up way. This approach, however, suffers from two main problems: (i) the algorithm may not reach instructions that could have been vectorized, and (ii) operating atomically on individual SLP graphs suffers from cost overestimation when consecutive SLP graphs share data. Both issues lead to missed vectorization opportunities even in simple code. In this work we propose SuperGraph-SLP (SG-SLP), an improved vectorization algorithm that overcomes these limitations of the existing algorithm. SG-SLP operates on a larger region, called the SuperGraph. This allows it to reach and successfully vectorize code that was previously unreachable. Moreover, the new region helps eliminate inaccuracies in the cost calculation, as it allows a more holistic view of the code. Our experiments show that SG-SLP improves vectorization coverage and outperforms the state-of-the-art SLP across a number of kernels by 36% on average, without affecting compilation time.
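A toy example of bottom-up, seed-driven SLP-style packing helps make the process concrete. This sketch rests on assumed data structures (the pack and defs helpers and the instruction tuples are invented for illustration) and is not the SG-SLP algorithm itself, which additionally merges consecutive SLP graphs into a SuperGraph.

```python
# A toy sketch of bottom-up SLP-style packing (greatly simplified).
# Instructions are (dest, op, operands); we start from a seed pair and try to
# pack the operand-producing instructions when the two scalars share an opcode.
def defs(code):
    return {d: (op, ops) for d, op, ops in code if d is not None}

def pack(code, seed_pair):
    d = defs(code)
    packs, work = [], [seed_pair]
    while work:
        a, b = work.pop()
        if a == b or a not in d or b not in d:
            continue
        (op_a, ops_a), (op_b, ops_b) = d[a], d[b]
        if op_a != op_b or len(ops_a) != len(ops_b):
            continue                       # not isomorphic -> cannot vectorize
        packs.append((a, b))
        work.extend(zip(ops_a, ops_b))     # follow operands bottom-up
    return packs

code = [
    ("t0", "load", ["A0"]), ("t1", "load", ["A1"]),
    ("t2", "mul",  ["t0", "c"]), ("t3", "mul", ["t1", "c"]),
    (None, "store", ["B0", "t2"]), (None, "store", ["B1", "t3"]),
]
print(pack(code, ("t2", "t3")))   # -> [('t2', 't3'), ('t0', 't1')]
```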
Citations: 14
POSTER: Cutting the Fat: Speeding Up RBM for Fast Deep Learning Through Generalized Redundancy Elimination
Lin Ning, Randall Pittman, Xipeng Shen
The Restricted Boltzmann Machine (RBM) is the building block of Deep Belief Nets and other deep learning tools. Fast learning and prediction are both essential for practical usage of RBM-based machine learning techniques. This paper presents a concept named generalized redundancy elimination to avoid most of the computations required in RBM learning and prediction without changing the results. It consists of two optimization techniques. The first is bounds-based filtering, which, through the triangle inequality, replaces expensive calculations of many vector dot products with fast bounds calculations. The second is the delta product, which effectively detects and avoids many repeated calculations in the core operation of RBM, Gibbs sampling. The optimizations are applicable to both the standard contrastive divergence learning algorithm and its variations. In addition, the paper presents how to address some of the complexities these optimizations create when they are used together and implemented efficiently on massively parallel processors. Results show that the optimizations can produce several-fold speedups (up to 3X for training and 5.3X for prediction).
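Both ideas can be sketched in a few lines of NumPy. Everything here is illustrative: the layer sizes, the saturation cutoff of 6.0, and the use of a Cauchy-Schwarz-style bound (standing in for the paper's triangle-inequality bound) are assumptions, not the paper's implementation.

```python
# Delta product: between Gibbs steps the visible vector v flips in only a few
# positions, so hidden pre-activations can be updated incrementally instead of
# recomputing the full matrix-vector product.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))          # hidden x visible weights
b = rng.normal(size=64)
v_old = rng.integers(0, 2, size=256).astype(float)
pre_old = W @ v_old + b                 # full product once

v_new = v_old.copy()
flipped = rng.choice(256, size=5, replace=False)
v_new[flipped] = 1.0 - v_new[flipped]

# Incremental update: only the flipped columns contribute.
pre_new = pre_old + W[:, flipped] @ (v_new[flipped] - v_old[flipped])
assert np.allclose(pre_new, W @ v_new + b)

# Bounds-based filtering: a cheap bound on how much each pre-activation can move
# tells us which hidden units stay saturated, so their sigmoids need not be
# recomputed exactly.
delta_norm = np.linalg.norm(v_new[flipped] - v_old[flipped])
row_norms = np.linalg.norm(W[:, flipped], axis=1)
bound = row_norms * delta_norm
still_saturated = (pre_old - bound > 6.0) | (pre_old + bound < -6.0)
print("hidden units whose activation can be skipped:", int(still_saturated.sum()))
```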
Citations: 1
POSTER: Elastic Reconfiguration for Heterogeneous NoCs with BiNoCHS
Amirhossein Mirhosseini, Mohammad Sadrosadati, Behnaz Soltani, H. Sarbazi-Azad, T. Wenisch
CPU-GPU heterogeneous systems are emerging as architectures of choice for high-performance, energy-efficient computing. Designing on-chip interconnects for such systems is challenging: CPUs typically benefit greatly from optimizations that reduce latency, but rarely saturate bandwidth or queueing resources. In contrast, GPUs generate intense traffic that produces local congestion, harming CPU performance. Congestion-optimized interconnects can mitigate this problem through larger virtual and physical channel resources. However, when there is little traffic, such networks become suboptimal due to higher unloaded packet latencies and critical path delays. We argue for a reconfigurable network that can activate additional channels under high load or congestion and shut them off when the network is unloaded. However, these additional resources consume more power, making it difficult to statically provision a power budget for the network. We propose Elastic Network Reconfiguration, wherein we aggressively reduce voltage to free power budget to activate additional channels. Our key observation is that, under high load, the reduced queueing due to additional channels more than compensates for the increase in per-hop latency from the reduced clock frequency. We introduce BiNoCHS, a voltage-scalable NoC that specifically targets CPU-GPU heterogeneous systems and employs elastic network reconfiguration to maintain a constant power budget while adapting between latency- and congestion-optimized modes.
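A toy controller conveys the trade-off: under a fixed power budget, lowering voltage and frequency buys extra channels when congestion is high. The mode table, the power model, and the thresholds below are placeholder assumptions, not BiNoCHS parameters.

```python
# An illustrative mode-selection sketch (assumed behaviour, not the BiNoCHS
# design): either run few channels at full voltage/frequency (latency-optimized)
# or drop voltage and frequency to afford more channels (congestion-optimized).
POWER_BUDGET = 2.0   # watts (placeholder)

MODES = {
    # mode: (channels, relative frequency, relative voltage)
    "latency":    (2, 1.0, 1.0),
    "congestion": (4, 0.7, 0.8),
}

def mode_power(channels, freq, volt):
    return channels * freq * volt ** 2        # ~ C * f * V^2 per channel

def pick_mode(avg_queue_occupancy, threshold=0.5):
    mode = "congestion" if avg_queue_occupancy > threshold else "latency"
    ch, f, v = MODES[mode]
    assert mode_power(ch, f, v) <= POWER_BUDGET + 1e-9   # stays within budget
    return mode

print(pick_mode(0.2), pick_mode(0.8))   # latency congestion
```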
Citations: 4
MultiGraph: Efficient Graph Processing on GPUs
Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, P. Sadayappan
High-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance. Hence, several high-level frameworks for graph processing on GPUs have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, depending on the fraction of active graph vertices. A two-phase edge processing approach trades off extra data movement for improved load balancing across GPU threads by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvements over current state-of-the-art GPU graph processing frameworks for many benchmark programs and data sets.
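The dense-versus-sparse frontier strategy is the same idea as direction-switching BFS; a small CPU-side sketch is shown below (the paper's implementation targets GPUs, and the 5% switching threshold here is an arbitrary illustrative choice).

```python
# Use a sparse "push" step when few vertices are active and a dense "pull" step
# when the frontier covers a large fraction of the (undirected) graph.
def bfs_hybrid(adj, src, switch_fraction=0.05):
    n = len(adj)
    dist = [-1] * n
    dist[src] = 0
    frontier, level = [src], 0
    while frontier:
        nxt = []
        if len(frontier) < switch_fraction * n:          # sparse: push from frontier
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level + 1
                        nxt.append(v)
        else:                                            # dense: pull into unvisited
            in_frontier = set(frontier)
            for v in range(n):
                if dist[v] == -1 and any(u in in_frontier for u in adj[v]):
                    dist[v] = level + 1
                    nxt.append(v)
        frontier, level = nxt, level + 1
    return dist

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(bfs_hybrid(adj, 0))   # [0, 1, 1, 2, 3]
```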
Citations: 34
POSTER: NUMA-Aware Power Management for Chip Multiprocessors
Changmin Ahn, Camilo A. Celis Guzman, Bernhard Egger
Traditional approaches for cache-coherent shared-memory architectures running symmetric multiprocessing (SMP) operating systems are not adequate for future manycore chips, where power management presents one of the most important challenges. In this work, we present a power management framework for many-core systems that does not require coherent shared memory and supports multiple-voltage/multiple-frequency (MVMF) architectures. A hierarchical NUMA-aware power management technique combines dynamic voltage and frequency scaling (DVFS) with workload migration. The conflicting goals of grouping workloads with similar utilization patterns and placing workloads as close as possible to their data are balanced by a greedy placement algorithm. Implemented in software and evaluated on existing hardware, the proposed technique achieves 30 and 8 percent improvements in performance-per-watt compared to DVFS-only and NUMA-unaware power management, respectively.
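The greedy placement trade-off can be sketched as a scoring function that weighs utilization similarity against data locality. The place helper, its scoring weights, and the example workloads are invented for illustration and are not the paper's algorithm.

```python
# A toy greedy-placement sketch: each workload prefers the NUMA node holding its
# data, but workloads with similar utilization are also grouped so a whole node
# can be slowed down (DVFS) or emptied by migration together.
def place(workloads, nodes, capacity, locality_weight=2.0):
    """workloads: list of (name, utilization, home_node). Returns {node: [names]}."""
    placement = {n: [] for n in nodes}
    util = {n: [] for n in nodes}
    for name, u, home in sorted(workloads, key=lambda w: -w[1]):
        def score(n):
            # Similarity to the node's current average utilization, plus a bonus
            # for the node that holds the workload's data.
            sim = -abs(u - sum(util[n]) / len(util[n])) if util[n] else 0.0
            return sim + (locality_weight if n == home else 0.0)
        candidates = [n for n in nodes if len(placement[n]) < capacity]
        best = max(candidates, key=score)
        placement[best].append(name)
        util[best].append(u)
    return placement

wl = [("db", 0.9, 0), ("web", 0.8, 1), ("batch", 0.2, 0), ("idle", 0.1, 1)]
print(place(wl, nodes=[0, 1], capacity=2))
```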
Citations: 1
Exploiting Asymmetric SIMD Register Configurations in ARM-to-x86 Dynamic Binary Translation
Yu-Ping Liu, Ding-Yong Hong, Jan-Jan Wu, Sheng-Yu Fu, W. Hsu
Processor manufacturers have adopted SIMD for decades because of its superior performance and power efficiency. The configurations of SIMD registers (i.e., their number and width) have evolved and diverged rapidly through various ISA extensions on different architectures. However, migrating legacy or proprietary applications optimized for one guest ISA to another host ISA that has fewer but longer SIMD registers through binary translation raises the issue of asymmetric SIMD register configurations. To date, this issue has been overlooked. As a result, only a small fraction of the potential performance gain is realized, due to underutilization of the host's SIMD parallelism and register capacity. In this paper, we present a novel dynamic binary translation technique called spill-aware SLP (saSLP), which combines short ARMv8 NEON instructions and registers in the guest binary loops to fully utilize the x86 AVX host's parallelism and minimize register spilling. Our experimental results show that saSLP improves performance by 1.6X (2.3X) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 NEON to x86 AVX2 (AVX-512) translation.
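The pairing idea can be illustrated by fusing adjacent 128-bit guest vector operations into 256-bit host operations. The fuse_neon_pairs helper and its operation tuples are hypothetical stand-ins for a translator's IR; the real saSLP works on actual NEON/AVX instruction semantics and register allocation.

```python
# Illustrative pairing sketch (assumptions, not the saSLP algorithm): scan the
# translated loop body for pairs of 128-bit guest NEON adds on adjacent memory
# and fuse each pair into one 256-bit host AVX operation, so host register
# capacity is used fully and fewer host registers stay live (less spilling).
def fuse_neon_pairs(ops, lane_bytes=16):
    """ops: list of ('vadd', base_symbol, offset_in_bytes). Returns fused host ops."""
    fused, used = [], set()
    by_key = {(op, base, off): i for i, (op, base, off) in enumerate(ops)}
    for i, (op, base, off) in enumerate(ops):
        if i in used or op != 'vadd':
            continue
        partner = by_key.get((op, base, off + lane_bytes))
        if partner is not None and partner not in used:
            fused.append(('vadd_256', base, off))        # one AVX op covers both
            used.update({i, partner})
        else:
            fused.append(('vadd_128', base, off))
    return fused

guest = [('vadd', 'A', 0), ('vadd', 'A', 16), ('vadd', 'B', 0)]
print(fuse_neon_pairs(guest))   # [('vadd_256', 'A', 0), ('vadd_128', 'B', 0)]
```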
Citations: 8
POSTER: BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads
Yuxi Liu, Xia Zhao, Zhibin Yu, Zhenlin Wang, Xiaolin Wang, Yingwei Luo, L. Eeckhout
General-purpose workloads running on modern graphics processing units (GPGPUs) rely on hardware-based barriers to synchronize warps within a thread block (TB). However, imbalance may exist before reaching a barrier if a GPGPU workload contains irregular memory accesses, i.e., some warps may be critical while others may not. Ideally, cache space should be reserved for the critical warps. Unfortunately, current cache management policies are unaware of the existence of barriers and critical warps, which significantly limits the performance of irregular memory-intensive GPGPU workloads. In this work, we propose Barrier-Aware Cache Management (BACM), which is built on top of two underlying policies: a greedy policy and a friendly policy. The greedy policy does not allow non-critical warps to allocate cache lines in the L1 data cache; only critical warps can. The friendly policy allows non-critical warps to allocate cache lines, but only over invalid or lower-priority cache lines. Based on the L1 data cache hit rate of the non-critical warps, BACM dynamically chooses between the greedy and friendly policies. By doing so, BACM reserves more cache space to accelerate critical warps, thereby improving overall performance. Experimental results show that BACM achieves average performance improvements of 24% and 20% compared to the GTO and BAWS policies, respectively. BACM's hardware cost is limited to 96 bytes per streaming multiprocessor.
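A simplified policy sketch captures the greedy/friendly distinction and the hit-rate-driven mode switch. The choose_mode threshold and the may_allocate helper are illustrative assumptions, not BACM's hardware logic.

```python
# Critical warps (those lagging toward the next barrier) may always allocate L1
# lines; non-critical warps may allocate only in "friendly" mode, and even then
# only over invalid or lower-priority lines. The mode is chosen from the
# non-critical warps' recent hit rate.
def choose_mode(noncritical_hit_rate, threshold=0.3):
    return "friendly" if noncritical_hit_rate >= threshold else "greedy"

def may_allocate(warp_is_critical, victim_state, mode):
    """victim_state: 'invalid', 'noncritical', or 'critical' (owner of the victim line)."""
    if warp_is_critical:
        return True
    if mode == "greedy":
        return False
    return victim_state in ("invalid", "noncritical")

mode = choose_mode(noncritical_hit_rate=0.1)            # low hit rate -> greedy
print(mode, may_allocate(False, "noncritical", mode))   # greedy False
```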
Citations: 0
SAM: Optimizing Multithreaded Cores for Speculative Parallelism
Maleen Abeydeera, Suvinay Subramanian, M. C. Jeffrey, J. Emer, Daniel Sánchez
This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently. We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial for speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33x on average, while a speculation-oblivious policy yields a 1.85x speedup. SAM also reduces wasted work by 52%.
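The coordination of dispatch and conflict-resolution priorities can be sketched with a simple timestamp-ordered model of speculative tasks. The dispatch_order and resolve_conflict helpers below are assumptions for illustration, not the SAM microarchitecture.

```python
# Each SMT thread runs a speculative task with a timestamp; both instruction
# dispatch and conflict resolution favour the thread whose task is earliest in
# speculative order, since it is the most likely to commit.
def dispatch_order(threads):
    """threads: list of (thread_id, task_timestamp). Earlier timestamp wins."""
    return [tid for tid, _ in sorted(threads, key=lambda t: t[1])]

def resolve_conflict(a, b):
    """Return the (tid, ts) pair that keeps running; the later one aborts."""
    return a if a[1] <= b[1] else b

smt = [(0, 42), (1, 17), (2, 99), (3, 30)]
print(dispatch_order(smt))                  # [1, 3, 0, 2]
print(resolve_conflict((0, 42), (1, 17)))   # (1, 17)
```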
Citations: 11