Locality-Aware Mapping of Nested Parallel Patterns on GPUs
HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun
Recent work has explored using higher-level languages to improve programmer productivity on GPUs. These languages often utilize high-level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x, respectively.
{"title":"Locality-Aware Mapping of Nested Parallel Patterns on GPUs","authors":"HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun","doi":"10.1109/MICRO.2014.23","DOIUrl":"https://doi.org/10.1109/MICRO.2014.23","url":null,"abstract":"Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"83 1","pages":"63-74"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80209442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dodec: Random-Link, Low-Radix On-Chip Networks
Haofan Yang, J. Tripathi, Natalie D. Enright Jerger, Dan Gibson
Network topology plays a vital role in chip design: it largely determines network cost (power and area) and significantly impacts communication performance in many-core architectures. Conventional topologies such as a 2D mesh have drawbacks including high diameter as the network scales and poor load balancing for the center nodes. We propose a methodology to design random topologies for on-chip networks. Random topologies provide better scalability in terms of network diameter and provide inherent load balancing. As a proof-of-concept for random on-chip topologies, we explore a novel set of networks -- dodecs -- and illustrate how they reduce network diameter with randomized low-radix router connections. While a 4 × 4 mesh has a diameter of 6, our dodec has a diameter of 4 with lower cost. By introducing randomness, dodec networks exhibit more uniform message latency. By using low-radix routers, dodec networks simplify the router microarchitecture and attain 20% area and 22% power reduction compared to mesh routers while delivering the same overall application performance for PARSEC.
{"title":"Dodec: Random-Link, Low-Radix On-Chip Networks","authors":"Haofan Yang, J. Tripathi, Natalie D. Enright Jerger, Dan Gibson","doi":"10.1109/MICRO.2014.19","DOIUrl":"https://doi.org/10.1109/MICRO.2014.19","url":null,"abstract":"Network topology plays a vital role in chip design, it largely determines network cost (power and area) and significantly impacts communication performance in many-core architectures. Conventional topologies such as a 2D mesh have drawbacks including high diameter as the network scales and poor load balancing for the center nodes. We propose a methodology to design random topologies for on-chip networks. Random topologies provide better scalability in terms of network diameter and provide inherent load balancing. As a proof-of-concept for random on-chip topologies, we explore a novel set of networks -- do decs -- and illustrate how they reduce network diameter with randomized low-radix router connections. While a 4 × 4 mesh has a diameter of 6, our dodec has a diameter of 4 with lower cost. By introducing randomness, dodec networks exhibit more uniform message latency. By using low-radix routers, dodec networks simplify the router micro architecture and attain 20% area and 22% power reduction compared to mesh routers while delivering the same overall application performance for PARSEC.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"38 1","pages":"496-508"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78739313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random Fill Cache Architecture
Fangfei Liu, R. Lee
Correctly functioning caches have been shown to leak critical secrets like encryption keys through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side-channel attacks: contention-based attacks and reuse-based attacks. Previous secure cache designs target only contention-based attacks, and we show that they cannot defend against reuse-based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse-based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse-based attacks.
MICRO 2014, pp. 203-215. DOI: 10.1109/MICRO.2014.28
Skewed Compressed Caches
S. Sardashti, André Seznec, D. Wood
Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super-blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC achieves this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.
MICRO 2014, pp. 331-342. DOI: 10.1109/MICRO.2014.41
Harnessing Soft Computations for Low-Budget Fault Tolerance
D. Khudia, S. Mahlke
A growing number of applications from various domains such as multimedia, machine learning and computer vision are inherently fault tolerant. However, for these soft workloads, not all computations are fault tolerant (e.g., a loop trip count). In this paper, we propose a compiler-based approach that takes advantage of soft computations inherent in the aforementioned class of workloads to bring down the cost of software-only transient fault detection. The technique works by identifying a small subset of critical variables that are necessary for correct macro-operation of the program. Traditional duplication and comparison are used to protect these variables. For the remaining variables and temporaries that only affect the micro-operation of the program, strategic expected-value checks are inserted into the code. Intuitively, a computation-chain result near the expected value is either correct or close enough to the correct result that it does not matter for non-critical variables. Overall, the proposed solution has, on average, only 19.5% performance overhead and reduces the number of silent data corruptions from 15% down to 7.3% and user-visible silent data corruptions from 3.4% down to 1.2% in comparison to an unmodified application. This reduced silent data corruption rate is even lower than that of a traditional full duplication scheme, which has, on average, 57% overhead.
MICRO 2014, pp. 319-330. DOI: 10.1109/MICRO.2014.33
Wormhole: Wisely Predicting Multidimensional Branches
Jorge Albericio, Joshua San Miguel, Natalie D. Enright Jerger, Andreas Moshovos
Improving branch prediction accuracy is essential in enabling high-performance processors to find more concurrency and to improve energy efficiency by reducing wrong-path instruction execution, a paramount concern in today's power-constrained computing landscape. Branch prediction traditionally considers past branch outcomes as a linear, continuous bit stream through which it searches for patterns and correlations. The state-of-the-art TAGE predictor and its variants follow this approach while varying the length of the global history fragments they consider. This work identifies a construct, inherent to several applications, that challenges existing linear-history-based branch prediction strategies. It finds that applications have branches that exhibit multidimensional correlations. These are branches with the following two attributes: 1) they are enclosed within nested loops, and 2) they exhibit correlation across iterations of the outer loops. Folding the branch history and interpreting it as a multidimensional piece of information exposes these cross-iteration correlations, allowing predictors to search for more complex correlations in the history space at lower cost. We present Wormhole, a new side-predictor that exploits these multidimensional histories. Wormhole is integrated alongside ISL-TAGE and leverages information from its existing side-predictors. Experiments show that the Wormhole predictor improves accuracy more than existing side-predictors, some of which are commercially available, at a similar hardware cost. Considering 40 diverse application traces, the Wormhole predictor reduces MPKI by an average of 2.53% and 3.15% on top of 4KB and 32KB ISL-TAGE predictors, respectively. When considering the top four workloads that exhibit multidimensional history correlations, Wormhole achieves 22% and 20% average MPKI reductions over 4KB and 32KB ISL-TAGE.
{"title":"Wormhole: Wisely Predicting Multidimensional Branches","authors":"Jorge Albericio, Joshua San Miguel, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1109/MICRO.2014.40","DOIUrl":"https://doi.org/10.1109/MICRO.2014.40","url":null,"abstract":"Improving branch prediction accuracy is essential in enabling high-performance processors to find more concurrency and to improve energy efficiency by reducing wrong path instruction execution, a paramount concern in today's power-constrained computing landscape. Branch prediction traditionally considers past branch outcomes as a linear, continuous bit stream through which it searches for patterns and correlations. The state-of-the-art TAGE predictor and its variants follow this approach while varying the length of the global history fragments they consider. This work identifies a construct, inherent to several applications that challenges existing, linear history based branch prediction strategies. It finds that applications have branches that exhibit multi-dimensional correlations. These are branches with the following two attributes: 1) they are enclosed within nested loops, and 2) they exhibit correlation across iterations of the outer loops. Folding the branch history and interpreting it as a multidimensional piece of information, exposes these cross-iteration correlations allowing predictors to search for more complex correlations in the history space with lower cost. We present wormhole, a new side-predictor that exploits these multidimensional histories. Wormhole is integrated alongside ISL-TAGE and leverages information from its existing side-predictors. Experiments show that the wormhole predictor improves accuracy more than existing side-predictors, some of which are commercially available, with a similar hardware cost. Considering 40 diverse application traces, the wormhole predictor reduces MPKI by an average of 2.53% and 3.15% on top of 4KB and 32KB ISL-TAGE predictors respectively. When considering the top four workloads that exhibit multi-dimensional history correlations, Wormhole achieves 22% and 20% MPKI average reductions over 4KB and 32KB ISL-TAGE.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"8 1","pages":"509-520"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83515456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution
Ankit Sethia, S. Mahlke
GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode.
{"title":"Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution","authors":"Ankit Sethia, S. Mahlke","doi":"10.1109/MICRO.2014.16","DOIUrl":"https://doi.org/10.1109/MICRO.2014.16","url":null,"abstract":"GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system, that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system, that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode. Throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. 
Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"25 1","pages":"647-658"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73474568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
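A schematic of one Equalizer-style tuning epoch, with invented thresholds and frequency bounds: classify the kernel's bottleneck from hardware counters, then throttle the underutilized resource in energy mode or boost the saturated one in performance mode. (The real system also modulates the number of resident thread blocks, which this sketch omits.)

```python
def equalizer_epoch(counters, state, mode="energy"):
    """One tuning decision (thresholds and bounds are invented): detect
    whether the kernel is memory-bound, then throttle the idle resource
    (energy mode) or boost the saturated one (performance mode)."""
    mem_bound = counters["mem_stall_frac"] > 0.5
    lo, hi, step = 0.7, 1.3, 0.1
    if mode == "energy":
        key = "core_freq" if mem_bound else "mem_freq"   # slow the idle side
        state[key] = max(lo, state[key] - step)
    else:
        key = "mem_freq" if mem_bound else "core_freq"   # speed the busy side
        state[key] = min(hi, state[key] + step)
    return state

state = {"core_freq": 1.0, "mem_freq": 1.0}
print(equalizer_epoch({"mem_stall_frac": 0.8}, state, mode="performance"))
# {'core_freq': 1.0, 'mem_freq': 1.1}
```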
Compiler Support for Optimizing Memory Bank-Level Parallelism
W. Ding, D. Guttman, M. Kandemir
Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one part of the overall performance of applications running on emerging multicores and manycores. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where the last-level cache misses will happen in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, two tile scheduling algorithms are proposed to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed scheme improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) takes our average improvement in memory access latency to 22.2%.
MICRO 2014, pp. 571-582. DOI: 10.1109/MICRO.2014.34
Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration
Supreet Jeloka, R. Das, R. Dreslinski, T. Mudge, D. Blaauw
This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical, composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least-recently-granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit-wide, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction.
{"title":"Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration","authors":"Supreet Jeloka, R. Das, R. Dreslinski, T. Mudge, D. Blaauw","doi":"10.1109/MICRO.2014.45","DOIUrl":"https://doi.org/10.1109/MICRO.2014.45","url":null,"abstract":"This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer to layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric, and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least recently granted arbitration. The 3D switch is evaluated for different radices, number of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit width, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% energy per transaction reduction.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"29 1","pages":"471-483"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85299126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DaDianNao: A Machine-Learning Supercomputer
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, O. Temam
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNN and DNN memory footprint, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
MICRO 2014, pp. 609-622. DOI: 10.1109/MICRO.2014.58