Efficient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures (DOI: 10.1145/2925426.2926270)
Yuan Yao, Guanhua Wang, Zhiguo Ge, T. Mitra, Wenzhi Chen, Naxin Zhang

As we enter the many-core era, providing the shared-memory abstraction through cache coherence has become progressively more difficult. The de-facto standard, directory-based cache coherence, has been studied extensively, but it does not scale well with increasing core count. Recently introduced timestamp-based hardware coherence protocols offer an attractive alternative. In this paper, we propose a timestamp-based coherence protocol, called TC-Release++, that addresses the scalability issues of efficiently supporting cache coherence in large-scale systems. Our approach is inspired by TC-Weak, a recently proposed timestamp-based coherence protocol targeting GPU architectures. We first design TC-Release coherence in an attempt to port TC-Weak straightforwardly to general-purpose many-cores. However, re-purposing TC-Weak for general-purpose many-core architectures is challenging due to significant differences in both the architecture and the programming model; indeed, TC-Release turns out to perform worse than conventional directory coherence protocols. We overcome the limitations and overheads of TC-Release by introducing simple hardware support that eliminates frequent memory stalls, and an optimized lifetime-prediction mechanism that improves cache performance. The resulting optimized coherence protocol, TC-Release++, is highly scalable (the coherence overhead per last-level cache line scales logarithmically with core count, as opposed to linearly for directory coherence) and shows better execution time (by 3.0%) and comparable network traffic (within 1.3%) relative to the baseline MESI directory coherence protocol.
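To make the timestamp idea concrete, the sketch below models the basic lease-based self-invalidation that TC-Weak-style protocols build on: each cached line carries a predicted expiry timestamp, and a read is serviced locally only while that lease is live. This is a minimal illustration of the general mechanism, not the TC-Release++ protocol itself; the flat global counter, the field names, and the lifetime value are assumptions.

```cuda
// Minimal sketch of timestamp (lease) based self-invalidation, the idea
// underlying TC-Weak/TC-Release-style protocols. The flat global counter and
// all names are illustrative assumptions, not the paper's design.
#include <cstdint>
#include <cstdio>

struct CacheLine {
    uint64_t tag   = 0;
    uint64_t lease = 0;        // predicted "valid until" timestamp
    bool     valid = false;
    uint32_t data  = 0;
};

uint64_t global_time = 0;      // logically shared, monotonically increasing

// A read hit is usable only while the lease has not expired; otherwise the
// line self-invalidates and must be re-fetched from the shared level,
// removing the need for a directory to track sharers.
bool read_hit(CacheLine& line, uint64_t tag) {
    if (line.valid && line.tag == tag && global_time <= line.lease)
        return true;           // timestamp still live: safe to use locally
    line.valid = false;        // expired: self-invalidate, go to the shared level
    return false;
}

// On a refill, the shared level grants a lease predicted from the line's
// expected lifetime; longer leases mean fewer refills but delayed visibility
// of remote writes, which must account for outstanding leases.
void refill(CacheLine& line, uint64_t tag, uint32_t value, uint64_t predicted_lifetime) {
    line.tag   = tag;
    line.data  = value;
    line.lease = global_time + predicted_lifetime;
    line.valid = true;
}

int main() {
    CacheLine l;
    refill(l, 0x40, 7, /*predicted_lifetime=*/100);
    global_time = 50;
    printf("hit at t=50: %d\n", read_hit(l, 0x40));   // 1: lease still valid
    global_time = 200;
    printf("hit at t=200: %d\n", read_hit(l, 0x40));  // 0: self-invalidated
    return 0;
}
```

The protocol-level contributions of the paper sit on top of this mechanism: predicting lifetimes well and eliminating the memory stalls that expired leases would otherwise cause.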
{"title":"Efficient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures","authors":"Yuan Yao, Guanhua Wang, Zhiguo Ge, T. Mitra, Wenzhi Chen, Naxin Zhang","doi":"10.1145/2925426.2926270","DOIUrl":"https://doi.org/10.1145/2925426.2926270","url":null,"abstract":"As we enter the era of many-core, providing the shared memory abstraction through cache coherence has become progressively difficult. The de-facto standard directory-based cache coherence has been extensively studied; but it does not scale well with increasing core count. Timestamp-based hardware coherence protocols introduced recently offer an attractive alternative solution. In this paper, we propose a timestamp-based coherence protocol, called TC-Release++, that addresses the scalability issues of efficiently supporting cache coherence in large-scale systems. Our approach is inspired by TC-Weak, a recently proposed timestamp-based coherence protocol targeting GPU architectures. We first design TC-Release coherence in an attempt to straightforwardly port TC-Weak to general-purpose many-cores. But re-purposing TC-Weak for general-purpose many-core architectures is challenging due to significant differences both in architecture and the programming model. Indeed the performance of TC-Release turns out to be worse than conventional directory coherence protocols. We overcome the limitations and overheads of TC-Release by introducing simple hardware support to eliminate frequent memory stalls, and an optimized life-time prediction mechanism to improve cache performance. The resulting optimized coherence protocol TC-Release++ is highly scalable (overhead for coherence per last-level cache line scales logarithmically with core count as opposed to linearly for directory coherence) and shows better execution time (3.0%) and comparable network traffic (within 1.3%) relative to the baseline MESI directory coherence protocol.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"277 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123065437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Patrick Judd, Jorge Albericio, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos
This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced-precision numerical representations and, specifically, their recently demonstrated ability to tolerate representations of different precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations that use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint needed by DNNs, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory, be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost and low-energy conversion unit. We evaluate Proteus as an extension to a state-of-the-art accelerator [7] that uses a uniform 16-bit fixed-point representation. On five popular DNNs, Proteus reduces data traffic among layers by 43% on average while maintaining accuracy within 1%, even when compared to a single-precision floating-point implementation. As a result, Proteus improves energy by 15% with no performance loss. Proteus also reduces the data footprint by at least 38% and hence the amount of on-chip buffering needed, resulting in an implementation that requires 20% less area overall. These area savings can be used to reduce cost by building smaller chips, to process larger DNNs for the same on-chip area, or to incorporate an additional three execution engines, increasing peak performance bandwidth by 18%.
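The storage-side conversion can be pictured as narrowing each value from the engine's uniform 16-bit fixed point to a per-layer width on a store, and widening it back on a load. The sketch below shows that round trip; the specific bit-widths, the Q8.8 compute format, and the clamping policy are illustrative assumptions rather than Proteus's actual packing scheme.

```cuda
// Illustrative per-layer precision conversion in the spirit of Proteus:
// values are computed in a uniform 16-bit fixed-point format but stored in a
// narrower, layer-specific fixed-point format. Bit-widths and rounding policy
// here are assumptions for illustration only.
#include <cstdint>
#include <cstdio>

struct LayerFormat {
    int bits;        // storage width chosen per layer (e.g., by profiling)
    int frac_bits;   // position of the binary point for this layer
};

// Compute-side value: Q8.8, i.e. 16-bit fixed point with 8 fractional bits.
constexpr int COMPUTE_FRAC_BITS = 8;

// Narrow a 16-bit compute value to the layer's storage format (with clamping).
int32_t to_storage(int16_t v, const LayerFormat& f) {
    int32_t shifted = (int32_t)v >> (COMPUTE_FRAC_BITS - f.frac_bits);
    int32_t lo = -(1 << (f.bits - 1)), hi = (1 << (f.bits - 1)) - 1;
    return shifted < lo ? lo : (shifted > hi ? hi : shifted);
}

// Expand a stored value back to the uniform 16-bit compute format.
int16_t to_compute(int32_t s, const LayerFormat& f) {
    return (int16_t)(s << (COMPUTE_FRAC_BITS - f.frac_bits));
}

int main() {
    LayerFormat conv1{10, 6};                 // hypothetical per-layer choice
    int16_t x = (int16_t)(3.25f * (1 << COMPUTE_FRAC_BITS));
    int32_t packed = to_storage(x, conv1);    // 10 bits stored instead of 16
    float back = to_compute(packed, conv1) / float(1 << COMPUTE_FRAC_BITS);
    printf("stored in %d bits, recovered %.4f (original 3.25)\n", conv1.bits, back);
    return 0;
}
```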
{"title":"Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks","authors":"Patrick Judd, Jorge Albericio, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1145/2925426.2926294","DOIUrl":"https://doi.org/10.1145/2925426.2926294","url":null,"abstract":"This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced precision numerical representations and specifically, their recently demonstrated ability to tolerate representations of different precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations that use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint needed by DNNs, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost and low-energy conversion unit. We evaluate Proteus as an extension to a state-of-the-art accelerator [7] which uses a uniform 16-bit fixed-point representation. On five popular DNNs Proteus reduces data traffic among layers by 43% on average while maintaining accuracy within 1% even when compared to a single precision floating-point implementation. As a result, Proteus improves energy by 15% with no performance loss. Proteus also reduces the data footprint by at least 38% and hence the amount of on-chip buffering needed resulting in an implementation that requires 20% less area overall. This area savings can be used to improve cost by building smaller chips, to process larger DNNs for the same on-chip area, or to incorporate an additional three execution engines increasing peak performance bandwidth by 18%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122685406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuehai Qian, Koushik Sen, Paul H. Hargrove, Costin Iancu
Replay of parallel execution is required by HPC debuggers and resilience mechanisms. To date, there is no deterministic replay solution for one-sided communication. The essential problem is that the readers of updated data have no information about which remote threads produced the updates, and conventional happens-before-based order-tracking techniques are difficult to scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (the sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. It maintains scalability through a combination of local logging and approximate event-order tracking within the sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x when running on 1,024 cores and tracking up to 16 threads.
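The data-replay half of such a scheme can be pictured as logging, at record time, every value a tracked thread observes through a one-sided GET, so that the sub-group can later be re-executed against the log with no remote threads present. The sketch below illustrates only that idea; the types and the remote_get wrapper are hypothetical, and SReplay's real mechanism additionally tracks an approximate event order within the sub-group.

```cuda
// Sketch of the data-replay half of sub-group record/replay for one-sided
// communication: during recording, every value a tracked thread obtains from
// a remote GET is appended to a local log; during replay the log is consumed
// instead of performing the communication. All structures and names here are
// hypothetical stand-ins, not SReplay's implementation.
#include <cstdint>
#include <cstring>
#include <vector>

enum class Mode { Record, Replay };

struct GetLog {
    std::vector<std::vector<uint8_t>> entries;  // one payload per remote GET
    size_t cursor = 0;
};

void remote_get(void* dst, const void* remote_src, size_t n,
                Mode mode, GetLog& log) {
    if (mode == Mode::Record) {
        // Perform the real one-sided read (placeholder: a local memcpy here),
        // then log the bytes that were observed.
        std::memcpy(dst, remote_src, n);
        log.entries.emplace_back((const uint8_t*)dst, (const uint8_t*)dst + n);
    } else {
        // Replay: feed back the recorded bytes; no remote thread is needed.
        const auto& e = log.entries[log.cursor++];
        std::memcpy(dst, e.data(), n);
    }
}

int main() {
    GetLog log;
    int shared = 42, local = 0;
    remote_get(&local, &shared, sizeof(int), Mode::Record, log);  // records 42
    shared = 0;                                                   // "remote" side gone
    local = -1;
    remote_get(&local, &shared, sizeof(int), Mode::Replay, log);  // restores 42
    return local == 42 ? 0 : 1;
}
```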
{"title":"SReplay: Deterministic Sub-Group Replay for One-Sided Communication","authors":"Xuehai Qian, Koushik Sen, Paul H. Hargrove, Costin Iancu","doi":"10.1145/2925426.2926264","DOIUrl":"https://doi.org/10.1145/2925426.2926264","url":null,"abstract":"Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-to-date, there is no existing deterministic replay solution for one-sided communication. The essential problem is that the readers of updated data do not have any information on which remote threads produced the updates, the conventional happens-before based ordering tracking techniques are challenging to work at scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. SReplay maintains scalability by a combination of local logging and approximative event order tracking within sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x, when running on 1,024 cores and tracking up to 16 threads.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126769754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Mahadik, Christopher Wright, Jinyi Zhang, Milind Kulkarni, S. Bagchi, S. Chaterji
Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad-hoc manner and require extensive programmer effort to develop and optimize. Often, such tools are written with the currently available data sizes in mind, and soon start to underperform due to the exponential growth in data. Furthermore, to obtain high performance, these tools require parallel implementations, adding to the development complexity. This paper makes the observation that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes. To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications---BLAST, MUMmer, E-MEM, SPAdes, and SGA---using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show speedups of 2.4X, 2.5X, and 2.1X, respectively, compared to hand-optimized implementations when run on a single node, while SPAdes and SGA match the performance of hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.
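As an example of the kind of recurring kernel such a DSL can expose as a language construct, the sketch below implements k-merization (extracting and counting all length-k substrings of a sequence), a building block shared by seed-based aligners and assemblers. It is a standalone sketch for illustration only, not Sarvavid syntax, and the choice of k-merization as the example is ours rather than a list taken from the abstract.

```cuda
// A k-merization routine of the kind that recurs across genomics tools
// (seeding in aligners, de Bruijn graph construction in assemblers) and that
// a DSL like Sarvavid can expose as a single construct and then optimize.
// Plain host code for illustration; not Sarvavid syntax.
#include <string>
#include <unordered_map>
#include <cstdio>

// Count every length-k substring (k-mer) of a sequence.
std::unordered_map<std::string, int> kmerize(const std::string& seq, size_t k) {
    std::unordered_map<std::string, int> counts;
    if (seq.size() >= k)
        for (size_t i = 0; i + k <= seq.size(); ++i)
            ++counts[seq.substr(i, k)];
    return counts;
}

int main() {
    auto counts = kmerize("ACGTACGTGG", 4);
    for (const auto& kv : counts)
        printf("%s : %d\n", kv.first.c_str(), kv.second);
    return 0;
}
```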
{"title":"SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications","authors":"K. Mahadik, Christopher Wright, Jinyi Zhang, Milind Kulkarni, S. Bagchi, S. Chaterji","doi":"10.1145/2925426.2926283","DOIUrl":"https://doi.org/10.1145/2925426.2926283","url":null,"abstract":"Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad-hoc manner, and require extensive programmer effort to develop and optimize them. Often, such tools are written with the currently available data sizes in mind, and soon start to under perform due to the exponential growth in data. Furthermore, to obtain high-performance, these tools require parallel implementations, adding to the development complexity. This paper makes an observation that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity, and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations, which are beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes. To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications---BLAST, MUMmer, E-MEM, SPAdes, and SGA---using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show a speedup of 2.4X, 2.5X, and 2.1X respectively compared to hand-optimized implementations when run on a single node, while SPAdes and SGA show the same performance as hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126412427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ang Li, S. Song, M. Wijtvliet, Akash Kumar, H. Corporaal
Approximate computing, a technique that sacrifices a certain amount of accuracy in exchange for substantial performance gains or power reduction, is one of the most promising solutions for enabling power control and performance scaling towards exascale. Although most existing approximation designs target emerging data-intensive applications that are comparatively error-tolerant, there is still high demand for accelerating traditional scientific applications (e.g., weather and nuclear simulation), which often comprise intensive transcendental function calls and are very sensitive to accuracy loss. To address this challenge, we focus on a very important but long-ignored approximation unit on today's commercial GPUs---the special-function unit (SFU)---and clarify its unique role in accelerating accuracy-sensitive applications in the context of approximate computing. To better understand its features, we conduct a thorough empirical analysis on three generations of NVIDIA GPU architectures to evaluate all the single-precision and double-precision numeric transcendental functions that can be accelerated by SFUs, in terms of their performance, accuracy, and power consumption. Based on the insights from this evaluation, we propose a transparent, tractable, and portable design framework for SFU-driven approximate acceleration on GPUs. Our design is software-based and requires no hardware or application modifications. Experimental results on three NVIDIA GPU platforms demonstrate that the proposed framework provides fine-grained tuning of the performance-accuracy trade-off, enabling applications to achieve the maximum performance under given accuracy constraints.
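On NVIDIA GPUs the SFU path is reachable from CUDA through single-precision intrinsics such as __sinf, __expf, and __logf (or globally via nvcc's --use_fast_math). The kernels below contrast the precise sinf path with the SFU-approximated __sinf path and measure the resulting error; this small harness is our own illustration of the accuracy/performance trade-off being exploited, not the paper's tuning framework.

```cuda
// Precise math-library path vs. SFU intrinsic path on the GPU. __sinf maps to
// the special-function unit and trades accuracy for throughput; sinf uses the
// accurate software sequence. Error measurement here is purely illustrative.
#include <cstdio>
#include <cmath>

__global__ void sin_precise(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sinf(x[i]);        // accurate software sequence on the CUDA cores
}

__global__ void sin_sfu(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = __sinf(x[i]);      // single SFU instruction, lower accuracy
}

int main() {
    const int n = 1 << 20;
    float *x, *y0, *y1;
    cudaMallocManaged(&x,  n * sizeof(float));
    cudaMallocManaged(&y0, n * sizeof(float));
    cudaMallocManaged(&y1, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 6.2831853f * i / n;

    sin_precise<<<(n + 255) / 256, 256>>>(x, y0, n);
    sin_sfu    <<<(n + 255) / 256, 256>>>(x, y1, n);
    cudaDeviceSynchronize();

    double max_err = 0.0;
    for (int i = 0; i < n; ++i)
        max_err = fmax(max_err, fabs((double)y0[i] - (double)y1[i]));
    printf("max |precise - SFU| = %g\n", max_err);

    cudaFree(x); cudaFree(y0); cudaFree(y1);
    return 0;
}
```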
{"title":"SFU-Driven Transparent Approximation Acceleration on GPUs","authors":"Ang Li, S. Song, M. Wijtvliet, Akash Kumar, H. Corporaal","doi":"10.1145/2925426.2926255","DOIUrl":"https://doi.org/10.1145/2925426.2926255","url":null,"abstract":"Approximate computing, the technique that sacrifices certain amount of accuracy in exchange for substantial performance boost or power reduction, is one of the most promising solutions to enable power control and performance scaling towards exascale. Although most existing approximation designs target the emerging data-intensive applications that are comparatively more error-tolerable, there is still high demand for the acceleration of traditional scientific applications (e.g., weather and nuclear simulation), which often comprise intensive transcendental function calls and are very sensitive to accuracy loss. To address this challenge, we focus on a very important but long ignored approximation unit on today's commercial GPUs --- the special-function unit (SFU), and clarify its unique role in performance acceleration of accuracy-sensitive applications in the context of approximate computing. To better understand its features, we conduct a thorough empirical analysis on three generations of NVIDIA GPU architectures to evaluate all the single-precision and double-precision numeric transcendental functions that can be accelerated by SFUs, in terms of their performance, accuracy and power consumption. Based on the insights from the evaluation, we propose a transparent, tractable and portable design framework for SFU-driven approximate acceleration on GPUs. Our design is software-based and requires no hardware or application modifications. Experimental results on three NVIDIA GPU platforms demonstrate that our proposed framework can provide fine-grained tuning for performance and accuracy trade-offs, thus facilitating applications to achieve the maximum performance under certain accuracy constraints.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"251 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125711365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searches on large graphs are heavily memory-latency bound, as a result of many high-latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads. This leads to the CPU stalling for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, with many traversals falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate an application's traversals and bring in the appropriate data. This paper presents the design of an explicitly configured prefetcher that improves performance for breadth-first searches and sequential iteration on the efficient and commonly used compressed sparse row (CSR) graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x, with little impact on memory bandwidth requirements.
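The traversal the prefetcher is configured for is ordinary breadth-first search over the CSR arrays, whose chain of dependent loads (work queue, then row offsets, then edge list, then the visited/distance array) is predictable in structure yet hostile to conventional prefetchers. The sketch below is a plain software BFS included only to make that access pattern explicit; it contains none of the proposed hardware.

```cuda
// Breadth-first search over a compressed sparse row (CSR) graph. The chain of
// dependent loads -- queue -> row_offsets[v..v+1] -> col_indices[e] ->
// dist[dst] -- is exactly the well-defined but cache-unfriendly pattern the
// explicitly configured prefetcher is told to run ahead of.
#include <cstdint>
#include <vector>
#include <queue>

std::vector<int32_t> bfs_csr(const std::vector<int64_t>& row_offsets,
                             const std::vector<int32_t>& col_indices,
                             int32_t source) {
    std::vector<int32_t> dist(row_offsets.size() - 1, -1);
    std::queue<int32_t> work;
    dist[source] = 0;
    work.push(source);
    while (!work.empty()) {
        int32_t v = work.front(); work.pop();                            // 1: queue read
        for (int64_t e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {  // 2: row offsets
            int32_t dst = col_indices[e];                                // 3: edge list
            if (dist[dst] == -1) {                                       // 4: random access
                dist[dst] = dist[v] + 1;
                work.push(dst);
            }
        }
    }
    return dist;
}

int main() {
    // Tiny 4-vertex example: edges 0->1, 0->2, 1->3, 2->3.
    std::vector<int64_t> row_offsets = {0, 2, 3, 4, 4};
    std::vector<int32_t> col_indices = {1, 2, 3, 3};
    auto dist = bfs_csr(row_offsets, col_indices, 0);
    return dist[3] == 2 ? 0 : 1;   // vertex 3 is two hops from the source
}
```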
{"title":"Graph Prefetching Using Data Structure Knowledge","authors":"S. Ainsworth, Timothy M. Jones","doi":"10.1145/2925426.2926254","DOIUrl":"https://doi.org/10.1145/2925426.2926254","url":null,"abstract":"Searches on large graphs are heavily memory latency bound, as a result of many high latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads. This leads to CPU stalling for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, many falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate applications' traversals to bring in the appropriate data. This paper presents a design of an explicitly configured prefetcher to improve performance for breadth-first searches and sequential iteration on the efficient and commonly-used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x, with little impact on memory bandwidth requirements.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124695686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunlong Xu, Rui Wang, Tao Li, Mingcong Song, Lan Gao, Zhongzhi Luan, D. Qian
Due to the cost-effective, massive computational power of graphics processing units (GPUs), there is growing interest in utilizing GPUs in real-time systems. For example, GPUs have been applied to automotive systems to enable new advanced and intelligent driver-assistance technologies, accelerating the path to self-driving cars. In such systems, GPUs are shared among tasks with mixed timing constraints: real-time (RT) tasks that have to be completed before specified deadlines, and non-real-time, best-effort (BE) tasks. In this paper, (1) we propose resource-aware non-uniform slack distribution to enhance the schedulability of RT tasks (the total amount of RT work whose deadlines can be satisfied on a given amount of resources) in GPU-enabled systems; and (2) we propose deadline-aware dynamic GPU partitioning to allow RT and BE tasks to run on a GPU simultaneously, so that BE tasks are not blocked for long periods. We evaluate the effectiveness of the proposed approaches using both synthetic benchmarks and a real-world workload consisting of a set of emerging automotive tasks. Experimental results show that the proposed approaches yield significant schedulability improvements for RT tasks and turnaround-time reductions for BE tasks. Moreover, the analysis of two driving scenarios shows that these improvements can significantly enhance driving safety and experience. For example, with resource-aware non-uniform slack distribution, the distance a car travels between the moment a traffic sign (pedestrian) is "seen" and the moment it is "recognized" decreases from 44.4m to 22.2m (from 4.4m to 2.2m); with deadline-aware dynamic GPU partitioning, the distance the car travels before a drowsy driver is woken up is reduced from 56.2m to 29.2m.
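The first idea can be illustrated with a small routine that splits an RT task's end-to-end slack across its GPU kernels in proportion to each kernel's resource demand, giving heavyweight kernels looser local deadlines. The proportional weighting and the data-structure fields below are our simplification for illustration; they are not the paper's exact slack-distribution policy.

```cuda
// Non-uniform slack distribution sketch: instead of splitting an RT task's
// slack evenly across its kernels, give each kernel a share weighted by the
// GPU resources it needs. The proportional-to-demand weighting is an assumed
// simplification of the resource-aware policy, not the paper's formula.
#include <vector>
#include <cstdio>

struct Kernel {
    double wcet;        // worst-case execution time when run alone (ms)
    double resource;    // e.g., fraction of SMs/threads the kernel occupies
    double local_deadline = 0.0;
};

void distribute_slack(std::vector<Kernel>& kernels, double task_deadline) {
    double total_wcet = 0.0, total_res = 0.0;
    for (const auto& k : kernels) { total_wcet += k.wcet; total_res += k.resource; }
    double slack = task_deadline - total_wcet;       // assumed non-negative here
    double t = 0.0;
    for (auto& k : kernels) {
        t += k.wcet + slack * (k.resource / total_res);  // resource-weighted share
        k.local_deadline = t;
    }
}

int main() {
    std::vector<Kernel> kernels = {{2.0, 0.25}, {6.0, 0.75}};
    distribute_slack(kernels, 12.0);                 // 4 ms of slack to spread
    for (size_t i = 0; i < kernels.size(); ++i)
        printf("kernel %zu local deadline: %.2f ms\n", i, kernels[i].local_deadline);
    return 0;
}
```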
{"title":"Scheduling Tasks with Mixed Timing Constraints in GPU-Powered Real-Time Systems","authors":"Yunlong Xu, Rui Wang, Tao Li, Mingcong Song, Lan Gao, Zhongzhi Luan, D. Qian","doi":"10.1145/2925426.2926265","DOIUrl":"https://doi.org/10.1145/2925426.2926265","url":null,"abstract":"Due to the cost-effective, massive computational power of graphics processing units (GPUs), there is a growing interest of utilizing GPUs in real-time systems. For example GPUs have been applied to automotive systems to enable new advanced and intelligent driver assistance technologies, accelerating the path to self-driving cars. In such systems, GPUs are shared among tasks with mixed timing constraints: real-time (RT) tasks that have to be accomplished before specified deadlines, and non-real-time, best-effort (BE) tasks. In this paper, (1) we propose resource-aware non-uniform slack distribution to enhance the schedulability of RT tasks (the total amount of work of RT tasks whose deadlines can be satisfied on a given amount of resources) in GPU-enabled systems; (2) we propose deadline-aware dynamic GPU partitioning to allow RT and BE tasks to run on a GPU simultaneously, such that BE tasks are not blocked for a long time. We evaluate the effectiveness of the proposed approaches by using both synthetic benchmarks and a real-world workload that consists of a set of emerging automotive tasks. Experimental results show that the proposed approaches yield significant schedulability improvement for RT tasks and turnaround time decrement for BE tasks. Moreover, the analysis of two driving scenarios shows that such schedulability improvement and turnaround time decrement can significantly enhance the driving safety and experience. For example, when the resource-aware non-uniform slack distribution approach is used, the distance that a car travels during the time between a traffic sign (pedestrian) is \"seen and recognized\" is decreased from 44.4m to 22.2m (from 4.4m to 2.2m); when the deadline-aware dynamic GPU partitioning approach is used, the distance that the car has traveled before a drowsy driver is woken up is reduced from 56.2m to 29.2m.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131952484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern GPUs employ caches to improve memory-system efficiency. However, a large amount of cache space is underutilized due to the irregular memory accesses and poor spatial locality commonly exhibited by GPU applications. Our experiments show that using smaller cache lines could improve cache space utilization, but doing so frequently incurs significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports multiple cache line replacements in one cycle. TSC can also automatically adjust its storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications, and it also significantly outperforms previous techniques.
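One way to picture fine-grained storage with a split tag, consistent with the abstract, is a line frame that keeps a single coarse (high-order) tag shared by several small sectors plus a short fine tag per sector, so unrelated sectors can co-reside in one frame. The sketch below encodes that reading; the field widths, sector size, and replacement behavior are assumptions, not the paper's actual TSC organization.

```cuda
// One plausible organization consistent with the abstract, for illustration
// only: each frame stores one coarse tag shared by all of its sectors plus a
// short fine tag per sector, so several 32B sectors can share one 128B frame.
// Field widths and the lookup/eviction policy are assumptions, not TSC itself.
#include <cstdint>
#include <cstdio>

constexpr int SECTORS_PER_FRAME = 4;           // 128B frame split into 32B sectors

struct TagSplitFrame {
    uint64_t coarse_tag = 0;                   // shared upper tag bits
    uint8_t  fine_tag[SECTORS_PER_FRAME] = {}; // per-sector lower tag bits
    bool     sector_valid[SECTORS_PER_FRAME] = {};
};

// A request hits only if both halves of the split tag match for its sector.
bool lookup(const TagSplitFrame& f, uint64_t coarse, uint8_t fine, int sector) {
    return f.sector_valid[sector] && f.coarse_tag == coarse && f.fine_tag[sector] == fine;
}

void fill(TagSplitFrame& f, uint64_t coarse, uint8_t fine, int sector) {
    if (f.coarse_tag != coarse) {                  // frame repurposed:
        for (bool& v : f.sector_valid) v = false;  // all sectors replaced at once
        f.coarse_tag = coarse;
    }
    f.fine_tag[sector] = fine;
    f.sector_valid[sector] = true;
}

int main() {
    TagSplitFrame f;
    fill(f, /*coarse=*/0xABC, /*fine=*/3, /*sector=*/1);
    printf("hit: %d\n", lookup(f, 0xABC, 3, 1));   // 1
    printf("hit: %d\n", lookup(f, 0xABC, 3, 2));   // 0: different sector not present
    return 0;
}
```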
{"title":"Tag-Split Cache for Efficient GPGPU Cache Utilization","authors":"Lingda Li, Ari B. Hayes, S. Song, E. Zhang","doi":"10.1145/2925426.2926253","DOIUrl":"https://doi.org/10.1145/2925426.2926253","url":null,"abstract":"Modern GPUs employ cache to improve memory system efficiency. However, large amount of cache space is underutilized due to irregular memory accesses and poor spatial locality which exhibited commonly in GPU applications. Our experiments show that using smaller cache lines could improve cache space utilization, but it also frequently suffers from significant performance loss by introducing large amount of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address the problem of cache space underutilization while keeping memory request number unchanged. TSC divides tag into two parts to reduce storage overhead, and it supports multiple cache line replacement in one cycle. TSC can also automatically adjust cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves the baseline cache performance by 17.2% on average across a wide range of applications. It also out-performs other previous techniques significantly.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125554093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Abdel-Majeed, Daniel Wong, Justin Kuang, M. Annavaram
Graphics processing units (GPUs) are increasingly used to run a wide range of general-purpose applications. Due to wide variation in application parallelism and inherent application-level inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policy or workload. We propose to convert these bubbles into energy-saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, warps are split into two half-warps that are issued in succession. Warp Folding leaves half of the execution lanes idle, which is then exploited to improve energy efficiency through power gating. The Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques, Origami can save 49% and 46% of the leakage energy in the integer and floating-point pipelines, respectively. These savings are better than, or at least on par with, Warped-Gates, a prior technique that power gates entire clusters of execution lanes. Unlike Warped-Gates, however, Origami does not rely on forcing idleness on execution lanes, which causes performance loss; hence, Origami achieves these energy savings with virtually no performance overhead.
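Warp Folding itself can be modeled at the issue stage: a 32-thread warp's active mask is split into two 16-lane halves that issue back-to-back on the lower lanes, leaving the upper lanes idle and eligible for power gating. The behavioral model below illustrates that trade (extra issue slots in exchange for long lane idle periods); it is our illustration of the mechanism described above, not the hardware design.

```cuda
// Behavioral model of Warp Folding at the issue stage: a 32-thread warp is
// issued as two consecutive 16-lane half-warps on the lower lanes, so lanes
// 16-31 stay idle and can be power gated. Illustrative model only.
#include <cstdint>
#include <cstdio>

constexpr int WARP_SIZE = 32;
constexpr int HALF      = WARP_SIZE / 2;

// Issue one warp instruction; returns the number of issue slots consumed.
int issue(uint32_t active_mask, bool folding) {
    if (!folding) {
        printf("full warp  mask=%08x  all %d lanes powered\n",
               (unsigned)active_mask, WARP_SIZE);
        return 1;
    }
    // Folded: each half executes on the low 16 lanes in successive cycles,
    // so the upper 16 lanes can sleep for the whole instruction.
    uint32_t lo =  active_mask          & 0xFFFFu;
    uint32_t hi = (active_mask >> HALF) & 0xFFFFu;
    int slots = 0;
    if (lo) { printf("lower half mask=%04x  lanes 16-31 gated\n", (unsigned)lo); ++slots; }
    if (hi) { printf("upper half mask=%04x  lanes 16-31 gated\n", (unsigned)hi); ++slots; }
    return slots ? slots : 1;   // a fully inactive warp still occupies a slot
}

int main() {
    issue(0xFFFFFFFFu, false);  // baseline: 1 slot, 32 active lanes
    issue(0xFFFFFFFFu, true);   // folded: 2 slots, but half the datapath idle
    issue(0x0000FFFFu, true);   // half-empty warp: folding costs no extra slot
    return 0;
}
```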
{"title":"Origami: Folding Warps for Energy Efficient GPUs","authors":"Mohammad Abdel-Majeed, Daniel Wong, Justin Kuang, M. Annavaram","doi":"10.1145/2925426.2926281","DOIUrl":"https://doi.org/10.1145/2925426.2926281","url":null,"abstract":"Graphical processing units (GPUs) are increasingly used to run a wide range of general purpose applications. Due to wide variation in application parallelism and inherent application level inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policies or workloads. We propose to convert these bubbles into energy saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, warps are split into two half-warps which are issued in succession. Warp Folding leaves half of the execution lanes idle, which is then exploited to improve energy efficiency through power gating. Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques Origami can save 49% and 46% of the leakage energy in the integer and floating point pipelines, respectively. These savings are better than or at least on-par with Warped-Gates, a prior power gating technique that power gates the entire cluster of execution lanes. But Origami achieves these energy savings without relying on forcing idleness on execution lanes, which leads to performance losses, as has been proposed in Warped-Gates. Hence, Origami is able to achieve these energy savings with virtually no performance overhead.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133116474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes the PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds little complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage. We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Through simulation of non-accelerated multiprogrammed workloads on a 16-core system, we extend a 2MB S-NUCA baseline to show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e., accelerators whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy-efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.
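The lookup extension Roca implies can be sketched as a two-step probe: a request first checks its regular LLC bank and, on a miss, checks the tag array fronting an idle accelerator's borrowed PLM before going to memory, with the capacity handed back when the accelerator becomes active. The structures and reclaim policy below are illustrative assumptions based on the abstract, not the actual Roca microarchitecture.

```cuda
// Sketch of the lookup path when an idle accelerator's private local memories
// (PLMs) are lent to the last-level cache: probe the regular bank first, then
// the tag array added in front of the borrowed PLM, then fall through to DRAM.
// Structures and policy are illustrative assumptions, not Roca's design.
#include <cstdint>
#include <unordered_map>
#include <cstdio>

struct Bank {
    std::unordered_map<uint64_t, uint64_t> lines;  // tag -> data (stand-in for SRAM)
    bool lookup(uint64_t tag, uint64_t& data) const {
        auto it = lines.find(tag);
        if (it == lines.end()) return false;
        data = it->second;
        return true;
    }
};

struct RocaLLC {
    Bank regular;                 // normal LLC bank
    Bank borrowed_plm;            // accelerator PLM exposed while the unit is idle
    bool accelerator_idle = true; // when false, the PLM is reclaimed by the accelerator

    bool access(uint64_t tag, uint64_t& data) const {
        if (regular.lookup(tag, data)) return true;
        if (accelerator_idle && borrowed_plm.lookup(tag, data)) return true;
        return false;             // miss: the request would go to DRAM
    }
};

int main() {
    RocaLLC llc;
    llc.borrowed_plm.lines[0x100] = 42;      // line held in the lent PLM capacity
    uint64_t d;
    printf("hit in borrowed PLM: %d\n", llc.access(0x100, d));
    llc.accelerator_idle = false;            // accelerator reclaims its PLM
    printf("after reclaim: %d\n", llc.access(0x100, d));
    return 0;
}
```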
{"title":"Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration","authors":"E. G. Cota, Paolo Mantovani, L. Carloni","doi":"10.1145/2925426.2926258","DOIUrl":"https://doi.org/10.1145/2925426.2926258","url":null,"abstract":"We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds low complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage. We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Through simulation of non-accelerated multiprogrammed workloads on a 16-core system, we extend a 2MB S-NUCA baseline system to show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e. whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133955027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}