Locality-Aware Mapping of Nested Parallel Patterns on GPUs
HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun
Recent work has explored using higher-level languages to improve programmer productivity on GPUs. These languages often utilize high-level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x, respectively.
{"title":"Locality-Aware Mapping of Nested Parallel Patterns on GPUs","authors":"HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun","doi":"10.1109/MICRO.2014.23","DOIUrl":"https://doi.org/10.1109/MICRO.2014.23","url":null,"abstract":"Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"83 1","pages":"63-74"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80209442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dodec: Random-Link, Low-Radix On-Chip Networks
Haofan Yang, J. Tripathi, Natalie D. Enright Jerger, Dan Gibson
Network topology plays a vital role in chip design: it largely determines network cost (power and area) and significantly impacts communication performance in many-core architectures. Conventional topologies such as a 2D mesh have drawbacks including high diameter as the network scales and poor load balancing for the center nodes. We propose a methodology to design random topologies for on-chip networks. Random topologies provide better scalability in terms of network diameter and provide inherent load balancing. As a proof-of-concept for random on-chip topologies, we explore a novel set of networks -- dodecs -- and illustrate how they reduce network diameter with randomized low-radix router connections. While a 4 × 4 mesh has a diameter of 6, our dodec has a diameter of 4 with lower cost. By introducing randomness, dodec networks exhibit more uniform message latency. By using low-radix routers, dodec networks simplify the router microarchitecture and attain 20% area and 22% power reduction compared to mesh routers while delivering the same overall application performance for PARSEC.
{"title":"Dodec: Random-Link, Low-Radix On-Chip Networks","authors":"Haofan Yang, J. Tripathi, Natalie D. Enright Jerger, Dan Gibson","doi":"10.1109/MICRO.2014.19","DOIUrl":"https://doi.org/10.1109/MICRO.2014.19","url":null,"abstract":"Network topology plays a vital role in chip design, it largely determines network cost (power and area) and significantly impacts communication performance in many-core architectures. Conventional topologies such as a 2D mesh have drawbacks including high diameter as the network scales and poor load balancing for the center nodes. We propose a methodology to design random topologies for on-chip networks. Random topologies provide better scalability in terms of network diameter and provide inherent load balancing. As a proof-of-concept for random on-chip topologies, we explore a novel set of networks -- do decs -- and illustrate how they reduce network diameter with randomized low-radix router connections. While a 4 × 4 mesh has a diameter of 6, our dodec has a diameter of 4 with lower cost. By introducing randomness, dodec networks exhibit more uniform message latency. By using low-radix routers, dodec networks simplify the router micro architecture and attain 20% area and 22% power reduction compared to mesh routers while delivering the same overall application performance for PARSEC.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"38 1","pages":"496-508"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78739313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random Fill Cache Architecture
Fangfei Liu, R. Lee
Correctly functioning caches have been shown to leak critical secrets like encryption keys through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side-channel attacks: contention-based attacks and reuse-based attacks. Previous secure cache designs target only contention-based attacks, and we show that they cannot defend against reuse-based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse-based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse-based attacks.
MICRO 2014, pp. 203-215. DOI: 10.1109/MICRO.2014.28
Skewed Compressed Caches
S. Sardashti, André Seznec, D. Wood
Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super-blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC achieves this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.
MICRO 2014, pp. 331-342. DOI: 10.1109/MICRO.2014.41
Harnessing Soft Computations for Low-Budget Fault Tolerance
D. Khudia, S. Mahlke
A growing number of applications from various domains such as multimedia, machine learning and computer vision are inherently fault tolerant. However, for these soft workloads, not all computations are fault tolerant (e.g., a loop trip count). In this paper, we propose a compiler-based approach that takes advantage of soft computations inherent in the aforementioned class of workloads to bring down the cost of software-only transient fault detection. The technique works by identifying a small subset of critical variables that are necessary for correct macro-operation of the program. Traditional duplication and comparison are used to protect these variables. For the remaining variables and temporaries that only affect the micro-operation of the program, strategic expected-value checks are inserted into the code. Intuitively, a computation-chain result near the expected value is either correct or close enough to the correct result that it does not matter for non-critical variables. Overall, the proposed solution has, on average, only 19.5% performance overhead and reduces the number of silent data corruptions from 15% down to 7.3% and user-visible silent data corruptions from 3.4% down to 1.2% in comparison to an unmodified application. This reduced silent data corruption rate is even lower than that of a traditional full duplication scheme, which has, on average, 57% overhead.
MICRO 2014, pp. 319-330. DOI: 10.1109/MICRO.2014.33
Wormhole: Wisely Predicting Multidimensional Branches
Jorge Albericio, Joshua San Miguel, Natalie D. Enright Jerger, Andreas Moshovos
Improving branch prediction accuracy is essential in enabling high-performance processors to find more concurrency and to improve energy efficiency by reducing wrong-path instruction execution, a paramount concern in today's power-constrained computing landscape. Branch prediction traditionally considers past branch outcomes as a linear, continuous bit stream through which it searches for patterns and correlations. The state-of-the-art TAGE predictor and its variants follow this approach while varying the length of the global history fragments they consider. This work identifies a construct, inherent to several applications, that challenges existing linear-history-based branch prediction strategies. It finds that applications have branches that exhibit multidimensional correlations. These are branches with the following two attributes: 1) they are enclosed within nested loops, and 2) they exhibit correlation across iterations of the outer loops. Folding the branch history and interpreting it as a multidimensional piece of information exposes these cross-iteration correlations, allowing predictors to search for more complex correlations in the history space at lower cost. We present Wormhole, a new side-predictor that exploits these multidimensional histories. Wormhole is integrated alongside ISL-TAGE and leverages information from its existing side-predictors. Experiments show that the Wormhole predictor improves accuracy more than existing side-predictors, some of which are commercially available, at a similar hardware cost. Considering 40 diverse application traces, the Wormhole predictor reduces MPKI by an average of 2.53% and 3.15% on top of 4KB and 32KB ISL-TAGE predictors, respectively. When considering the top four workloads that exhibit multidimensional history correlations, Wormhole achieves 22% and 20% average MPKI reductions over 4KB and 32KB ISL-TAGE.
{"title":"Wormhole: Wisely Predicting Multidimensional Branches","authors":"Jorge Albericio, Joshua San Miguel, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1109/MICRO.2014.40","DOIUrl":"https://doi.org/10.1109/MICRO.2014.40","url":null,"abstract":"Improving branch prediction accuracy is essential in enabling high-performance processors to find more concurrency and to improve energy efficiency by reducing wrong path instruction execution, a paramount concern in today's power-constrained computing landscape. Branch prediction traditionally considers past branch outcomes as a linear, continuous bit stream through which it searches for patterns and correlations. The state-of-the-art TAGE predictor and its variants follow this approach while varying the length of the global history fragments they consider. This work identifies a construct, inherent to several applications that challenges existing, linear history based branch prediction strategies. It finds that applications have branches that exhibit multi-dimensional correlations. These are branches with the following two attributes: 1) they are enclosed within nested loops, and 2) they exhibit correlation across iterations of the outer loops. Folding the branch history and interpreting it as a multidimensional piece of information, exposes these cross-iteration correlations allowing predictors to search for more complex correlations in the history space with lower cost. We present wormhole, a new side-predictor that exploits these multidimensional histories. Wormhole is integrated alongside ISL-TAGE and leverages information from its existing side-predictors. Experiments show that the wormhole predictor improves accuracy more than existing side-predictors, some of which are commercially available, with a similar hardware cost. Considering 40 diverse application traces, the wormhole predictor reduces MPKI by an average of 2.53% and 3.15% on top of 4KB and 32KB ISL-TAGE predictors respectively. When considering the top four workloads that exhibit multi-dimensional history correlations, Wormhole achieves 22% and 20% MPKI average reductions over 4KB and 32KB ISL-TAGE.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"8 1","pages":"509-520"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83515456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution
Ankit Sethia, S. Mahlke
GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode.
{"title":"Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution","authors":"Ankit Sethia, S. Mahlke","doi":"10.1109/MICRO.2014.16","DOIUrl":"https://doi.org/10.1109/MICRO.2014.16","url":null,"abstract":"GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system, that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by GPUs use thousands of threads to provide high performance and efficiency. In general, if one thread of a kernel uses one of the resources (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource due to the large number of identical concurrent threads. This contention will eventually saturate the performance of the kernel due to contention for the bottleneck resource, while at the same time leaving other resources underutilized. To overcome this problem, a runtime system that can tune the hardware to match the characteristics of a kernel can effectively mitigate the imbalance between resource requirements of kernels and the hardware resources present on the GPU. We propose Equalizer, a low overhead hardware runtime system, that dynamically monitors the resource requirements of a kernel and manages the amount of on-chip concurrency, core frequency and memory frequency to adapt the hardware to best match the needs of the running kernel. Equalizer provides efficiency in two modes. Firstly, it can save energy without significant performance degradation by throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode. Throttling under-utilized resources. Secondly, it can boost bottleneck resources to reduce contention and provide higher performance without significant energy increase. 
Across a spectrum of 27 kernels, Equalizer achieves 15% savings in energy mode and 22% speedup in performance mode.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"25 1","pages":"647-658"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73474568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
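A schematic of one Equalizer-style tuning epoch, with invented thresholds and frequency bounds: classify the kernel's bottleneck from hardware counters, then throttle the underutilized resource in energy mode or boost the saturated one in performance mode. (The real system also modulates the number of resident thread blocks, which this sketch omits.)

```python
def equalizer_epoch(counters, state, mode="energy"):
    """One tuning decision (thresholds and bounds are invented): detect
    whether the kernel is memory-bound, then throttle the idle resource
    (energy mode) or boost the saturated one (performance mode)."""
    mem_bound = counters["mem_stall_frac"] > 0.5
    lo, hi, step = 0.7, 1.3, 0.1
    if mode == "energy":
        key = "core_freq" if mem_bound else "mem_freq"   # slow the idle side
        state[key] = max(lo, state[key] - step)
    else:
        key = "mem_freq" if mem_bound else "core_freq"   # speed the busy side
        state[key] = min(hi, state[key] + step)
    return state

state = {"core_freq": 1.0, "mem_freq": 1.0}
print(equalizer_epoch({"mem_stall_frac": 0.8}, state, mode="performance"))
# {'core_freq': 1.0, 'mem_freq': 1.1}
```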
Compiler Support for Optimizing Memory Bank-Level Parallelism
W. Ding, D. Guttman, M. Kandemir
Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one part of the overall performance of applications running on emerging multicores and manycores. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where the last-level cache misses will happen in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, two tile scheduling algorithms are proposed to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed scheme improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) takes our average improvement in memory access latency to 22.2%.
MICRO 2014, pp. 571-582. DOI: 10.1109/MICRO.2014.34
Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration
Supreet Jeloka, R. Das, R. Dreslinski, T. Mudge, D. Blaauw
This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical, composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least-recently-granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit-wide, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction.
{"title":"Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration","authors":"Supreet Jeloka, R. Das, R. Dreslinski, T. Mudge, D. Blaauw","doi":"10.1109/MICRO.2014.45","DOIUrl":"https://doi.org/10.1109/MICRO.2014.45","url":null,"abstract":"This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer to layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric, and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least recently granted arbitration. The 3D switch is evaluated for different radices, number of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit width, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% energy per transaction reduction.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"29 1","pages":"471-483"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85299126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DaDianNao: A Machine-Learning Supercomputer
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, O. Temam
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNN and DNN memory footprint, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
MICRO 2014, pp. 609-622. DOI: 10.1109/MICRO.2014.58