To address the memory wall problem and keep pace with the processing speed of multicore processors, there is significant demand for larger cache capacity in the future. High-density 3D die-stacked DRAM can provide a much larger cache than conventional SRAM. However, energy becomes an inevitable challenge as the size of the DRAM cache increases. STT-RAM, with near-zero leakage, can be integrated with the DRAM cache as a hybrid cache to reduce static energy, but the high write energy of STT-RAM introduces another energy challenge. We observe that volatile STT-RAM can be utilized in the hybrid cache as a buffer to balance the high static energy of DRAM and the high dynamic energy of non-volatile STT-RAM.
{"title":"Architecting a Novel Hybrid Cache with Low Energy","authors":"Jiacong He, Joseph Callenes-Sloan","doi":"10.1109/PACT.2017.47","DOIUrl":"https://doi.org/10.1109/PACT.2017.47","url":null,"abstract":"To handle the memory wall problem and satisfy the high processing speed of the multicore processors, there is significant demand for a large cache capacity in future. The 3D die-stacking DRAM cache with high density can be used as a large cache compared with conventional SRAM cache. However, energy becomes an inevitable challenge with the increasing size of DRAM cache. STT-RAM with near-zero leakage can be integrated with DRAM cache as a hybrid cache to reduce static energy, but the high write energy of STT-RAM brings another energy challenge. We observe that volatile STT-RAM can be utilized in the hybrid cache as a buffer to balance the high static energy of DRAM and the high dynamic energy of non-volatile STT-RAM.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116512409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We live in an era of specialized tasks, ranging from graphics to networking, graph processing, machine learning, and more. While hardware accelerators cater to mainstream demands, general-purpose units will always be challenged to run new software. Introspective Computing focuses on building a feedback mechanism to tune dynamic hardware features in real time. Unlike most prior work, our study is done entirely on a real system, using hardware resources tunable in most modern Intel processors.
{"title":"Introspective Computing","authors":"Karl Taht, R. Balasubramonian","doi":"10.1109/PACT.2017.49","DOIUrl":"https://doi.org/10.1109/PACT.2017.49","url":null,"abstract":"We live in an advent of specialized tasks ranging from graphics, to networking and graph processing, to machine learning and more. While hardware accelerators cater to mainstream demands, general purpose units will always be challenged to run new software. Introspective Computing focuses on building a feedback mechanism to tune dynamic hardware features in real-time. Unlike most prior work, our study is done completely on a real system using hardware resources tunable in most modern Intel processors.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123421385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIMD vectors help improve the performance of certain applications. Code gets vectorized into SIMD form either by hand or automatically with auto-vectorizing compilers. The Superword-Level Parallelism (SLP) vectorization algorithm is widely used for vectorizing straight-line code and is part of most industrial compilers. The algorithm attempts to pack scalar instructions into vectors, starting from specific seed instructions, in a bottom-up way. This approach, however, suffers from two main problems: (i) the algorithm may not reach instructions that could have been vectorized, and (ii) atomically operating on individual SLP graphs suffers from cost overestimation when consecutive SLP graphs share data. Both issues lead to missed vectorization opportunities even in simple code. In this work we propose SuperGraph-SLP (SG-SLP), an improved vectorization algorithm that overcomes these limitations of the existing algorithm. SG-SLP operates on a larger region, called the SuperGraph. This allows it to reach and successfully vectorize code that was previously unreachable. Moreover, the new region helps eliminate the inaccuracies in the cost calculation, as it allows for a more holistic view of the code. Our experiments show that SG-SLP improves the vectorization coverage and outperforms the state-of-the-art SLP across a number of kernels by 36% on average, without affecting the compilation time.
{"title":"SuperGraph-SLP Auto-Vectorization","authors":"Vasileios Porpodas","doi":"10.1109/PACT.2017.21","DOIUrl":"https://doi.org/10.1109/PACT.2017.21","url":null,"abstract":"SIMD vectors help improve the performance of certain applications. The code gets vectorized into SIMD form either by hand, or automatically with auto-vectorizing compilers. The Superword-Level Parallelism (SLP) vectorization algorithm is a widely used algorithm for vectorizing straight-line code and is part of most industrial compilers. The algorithm attempts to pack scalar instructions into vectors starting from specific seed instructions in a bottom-up way. This approach, however, suffers from two main problems: (i) the algorithm may not reach instructions that could have been vectorized, and (ii) atomically operating on individual SLP graphs suffers from cost overestimation when consecutive SLP graphs share data. Both issues lead to missed vectorization opportunities even in simple code.In this work we propose SuperGraph-SLP (SG-SLP), an improved vectorization algorithm that overcomes these limitations of the existing algorithm. SG-SLP operates on a larger region, called the SuperGraph. This allows it to reach and successfully vectorize code that was previously unreachable. Moreover, the new region helps eliminate the inaccuracies in the cost-calculation as it allows for a more holistic view of the code. Our experiments show that SG-SLP improves the vectorization coverage and outperforms the state-of-the-art SLP across a number kernels by 36% on average, without affecting the compilation time.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121761070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Restricted Boltzmann Machine (RBM) is the building block of Deep Belief Nets and other deep learning tools. Fast learning and prediction are both essential for practical usage of RBM-based machine learning techniques. This paper presents a concept named generalized redundancy elimination to avoid most of the computations required in RBM learning and prediction without changing the results. It consists of two optimization techniques. The first is called bounds-based filtering, which, through the triangle inequality, replaces expensive calculations of many vector dot products with fast bounds calculations. The second is delta product, which effectively detects and avoids many repeated calculations in the core operation of RBM, Gibbs sampling. The optimizations are applicable to both the standard contrastive divergence learning algorithm and its variations. In addition, the paper presents how to address the complexities these optimizations create so that they can be used together and implemented efficiently on massively parallel processors. Results show that the optimizations can produce several-fold speedups (up to 3X for training and 5.3X for prediction).
{"title":"POSTER: Cutting the Fat: Speeding Up RBM for Fast Deep Learning Through Generalized Redundancy Elimination","authors":"Lin Ning, Randall Pittman, Xipeng Shen","doi":"10.1109/PACT.2017.36","DOIUrl":"https://doi.org/10.1109/PACT.2017.36","url":null,"abstract":"Restricted Boltzmann Machine (RBM) is the building block of Deep Belief Nets and other deep learning tools. Fast learning and prediction are both essential for practical usage of RBM-based machine learning techniques. This paper presents a concept named generalized redundancy elimination to avoid most of the the computations required in RBM learning and prediction without changing the results. It consists of two optimization techniques. The first is called bounds-based filtering, which, through triangular inequality, replaces expensive calculations of many vector dot products with fast bounds calculations. The second is delta product, which effectively detects and avoids many repeated calculations in the core operation of RBM, Gibbs Sampling. The optimizations are applicable to both the standard contrastive divergence learning algorithm and its variations. In addition, the paper presents how to address some complexities these optimizations create for them to be used together and for them to be implemented efficiently on massively parallel processors. Results show that the optimizations can produce several-fold (up to 3X for training and 5.3X for prediction) speedups.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128679597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU-GPU heterogeneous systems are emerging as architectures of choice for high-performance, energy-efficient computing. Designing on-chip interconnects for such systems is challenging: CPUs typically benefit greatly from optimizations that reduce latency, but rarely saturate bandwidth or queueing resources. In contrast, GPUs generate intense traffic that produces local congestion, harming CPU performance. Congestion-optimized interconnects can mitigate this problem through larger virtual and physical channel resources. However, when there is little traffic, such networks become suboptimal due to higher unloaded packet latencies and critical path delays. We argue for a reconfigurable network that can activate additional channels under high load/congestion and shut them off when the network is unloaded. However, these additional resources consume more power, making it difficult to statically provision a power budget for the network. We propose Elastic Network Reconfiguration, wherein we aggressively reduce voltage to free power budget to activate additional channels. Our key observation is that, under high load, the reduced queueing due to additional channels more than compensates for the increase in per-hop latency at the reduced clock frequency. We introduce BiNoCHS, a voltage-scalable NoC that specifically targets CPU-GPU heterogeneous systems and employs elastic network reconfiguration to maintain a constant power budget while adapting between latency- and congestion-optimized modes.
{"title":"POSTER: Elastic Reconfiguration for Heterogeneous NoCs with BiNoCHS","authors":"Amirhossein Mirhosseini, Mohammad Sadrosadati, Behnaz Soltani, H. Sarbazi-Azad, T. Wenisch","doi":"10.1109/PACT.2017.46","DOIUrl":"https://doi.org/10.1109/PACT.2017.46","url":null,"abstract":"CPU-GPU heterogeneous systems are emerging are emerging as architectures of choice for high-performance energy-efficient computing. Designing on-chip interconnects for such systems is challenging: CPUs typically benefit greatly from optimizations that reduce latency, but rarely saturate bandwidth or queueing resources. In contrast, GPUs generate intense traffic that produces local congestion, harming CPU performance. Congestion-optimized interconnects can mitigate this problem through larger virtual and physical channel resources. However, when there is little traffic, such networks become suboptimal due to higher unloaded packet latencies and critical path delays. We argue for a reconfigurable network that can activate additional channels under high load/congestion and shut them off when the network is unloaded. However, these additional resources consume more power, making it difficult to statically provision a power budget for the network. We propose Elastic Network Reconfiguration, wherein we aggressively reduce voltage to free power budget to activate additional channels. Our key observation is that, under high load, the reduced queueing due to additional channels more than compensates for the increase in per-hop latency of the reduced clock frequency. We introduce BiNoCHS as a voltage-scalable NoC that specifically targets CPU-GPU heterogeneous systems and employs elastic network reconfiguration to maintain a constant power budget while adapting between latency- and congestion-optimized modes.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134298271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance. Hence, several high-level frameworks for graph processing on GPUs have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, dependent on the fraction of active graph vertices. A two-phase edge processing approach trades off extra data movement for improved load balancing across GPU threads, by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvement over current state-of-the-art GPU graph processing frameworks for many benchmark programs and data sets.
{"title":"MultiGraph: Efficient Graph Processing on GPUs","authors":"Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, P. Sadayappan","doi":"10.1109/PACT.2017.48","DOIUrl":"https://doi.org/10.1109/PACT.2017.48","url":null,"abstract":"High-level GPU graph processing frameworks are an attractive alternative for achieving both high productivity and high performance. Hence, several high-level frameworks for graph processing on GPUs have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representation and execution strategies for dense versus sparse vertex frontiers, dependent on the fraction of active graph vertices. A two-phase edge processing approach trades off extra data movement for improved load balancing across GPU threads, by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvement over current state-of-the-art GPU graph processing frameworks for many benchmark programs and data sets.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116455555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional approaches for cache-coherent shared-memory architectures running symmetric multiprocessing (SMP) operating systems are not adequate for future manycore chips, where power management presents one of the most important challenges. In this work, we present a power management framework for manycore systems that does not require coherent shared memory and supports multiple-voltage/multiple-frequency (MVMF) architectures. A hierarchical NUMA-aware power management technique combines dynamic voltage and frequency scaling (DVFS) with workload migration. The conflicting goals of grouping workloads with similar utilization patterns and placing workloads as close as possible to their data are balanced by a greedy placement algorithm. Implemented in software and evaluated on existing hardware, the proposed technique achieves a 30 and 8 percent improvement in performance per watt compared to DVFS-only and NUMA-unaware power management, respectively.
{"title":"POSTER: NUMA-Aware Power Management for Chip Multiprocessors","authors":"Changmin Ahn, Camilo A. Celis Guzman, Bernhard Egger","doi":"10.1109/PACT.2017.31","DOIUrl":"https://doi.org/10.1109/PACT.2017.31","url":null,"abstract":"Traditional approaches for cache-coherent shared-memory architectures running symmetric multiprocessing (SMP) operating systems are not adequate for future manycore chips where power management presents one of the most important challenges. In this work, we present a power management framework for many-core systems that does not require coherent shared memory and supports multiple-voltage/multiple-frequency (MVMF) architectures. A hierar-chical NUMA-aware power management technique combines dynamic voltage and frequency scaling (DVFS) with workload migration. The conflicting goals of grouping workloads with similar utilization patterns and placing workloads as close as possible to their data are considered by a greedy placement algorithm. Implemented in software and evaluated on existing hardware, the proposed technique achieves a 30 and 8 percent improvement in performance-per-watt compared to DVFS-only and NUMA-unaware power management.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129825244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processor manufacturers have adopted SIMD for decades because of its superior performance and power efficiency. The configurations of SIMD registers (i.e., their number and width) have evolved and diverged rapidly through various ISA extensions on different architectures. However, migrating legacy or proprietary applications optimized for one guest ISA to another host ISA that has fewer but longer SIMD registers through binary translation raises the issue of asymmetric SIMD register configurations. To date, this issue has been overlooked. As a result, only a small fraction of the potential performance gain is realized due to underutilization of the host's SIMD parallelism and register capacity. In this paper, we present a novel dynamic binary translation technique called spill-aware SLP (saSLP), which combines short ARMv8 NEON instructions and registers in the guest binary loops to fully utilize the x86 AVX host's parallelism as well as minimize register spilling. Our experimental results show that saSLP improves performance by 1.6X (2.3X) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 NEON to x86 AVX2 (AVX-512) translation.
{"title":"Exploiting Asymmetric SIMD Register Configurations in ARM-to-x86 Dynamic Binary Translation","authors":"Yu-Ping Liu, Ding-Yong Hong, Jan-Jan Wu, Sheng-Yu Fu, W. Hsu","doi":"10.1109/PACT.2017.15","DOIUrl":"https://doi.org/10.1109/PACT.2017.15","url":null,"abstract":"Processor manufacturers have adopted SIMD for decades because of its superior performance and power efficiency. The configurations of SIMD registers (i.e., the number and width) have evolved and diverged rapidly through various ISA extensions on different architectures. However, migrating legacy or proprietary applications optimized for one guest ISA to another host ISA that has fewer but longer SIMD registers through binary translation raises the issues of asymmetric SIMD register configurations. To date, these issues have been overlooked. As a result, only a small fraction of the potential performance gain is realized due to underutilization of the host's SIMD parallelism and register capacity.In this paper, we present a novel dynamic binary translation technique called spill-aware SLP (saSLP), which combines short ARMv8 NEON instructions and registers in the guest binary loops to fully utilize the x86 AVX host's parallelism as well as minimize register spilling. Our experiment results show that saSLP improves the performance by 1.6X (2.3X) across a number of benchmarks, and reduces spilling by 97% (99%) for ARMv8 NEON to x86 AVX2 (AVX-512) translation.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130895482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
General-purpose workloads running on modern graphics processing units (GPGPUs) rely on hardware-based barriers to synchronize warps within a thread block (TB). However, imbalance may exist before reaching a barrier if a GPGPU workload contains irregular memory accesses, i.e., some warps may be critical while others may not. Ideally, cache space should be reserved for the critical warps. Unfortunately, current cache management policies are unaware of the existence of barriers and critical warps, which significantly limits the performance of irregular memory-intensive GPGPU workloads. In this work, we propose Barrier-Aware Cache Management (BACM), which is built on top of two underlying policies: a greedy policy and a friendly policy. The greedy policy does not allow non-critical warps to allocate cache lines in the L1 data cache; only critical warps can. The friendly policy allows non-critical warps to allocate cache lines, but only over invalid or lower-priority cache lines. Based on the L1 data cache hit rate of non-critical warps, BACM dynamically chooses between the greedy and friendly policies. By doing so, BACM reserves more cache space to accelerate critical warps, thereby improving overall performance. Experimental results show that BACM achieves an average performance improvement of 24% and 20% compared to the GTO and BAWS policies, respectively. BACM's hardware cost is limited to 96 bytes per streaming multiprocessor.
{"title":"POSTER: BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads","authors":"Yuxi Liu, Xia Zhao, Zhibin Yu, Zhenlin Wang, Xiaolin Wang, Yingwei Luo, L. Eeckhout","doi":"10.1109/PACT.2017.55","DOIUrl":"https://doi.org/10.1109/PACT.2017.55","url":null,"abstract":"General-purpose workloads running on modern graphics processing units (GPGPUs) rely on hardware-based barriers to synchronize warps within a thread block (TB). However, imbalance may exist before reaching a barrier if a GPGPU workload contains irregular memory accesses, i.e., some warps may be critical while others may not. Ideally, cache space should be reserved for the critical warps. Unfortunately, current cache management policies are unaware of the existence of barriers and critical warps, which significantly limits the performance of irregular memory-intensive GPGPU workloads.In this work, we propose Barrier-Aware Cache Management (BACM), which is built on top of two underlying policies: a greedy policy and a friendly policy. The greedy policy does not allow non-critical warps to allocate cache lines in the L1 data cache; only critical warps can. The friendly policy allows non-critical warps to allocate cache lines but only over invalid or lower-priority cache lines. Based on the L1 data cache hit rate of non-critical warps, BACM dynamically chooses between the greedy and friendly policies. By doing so, BACM reserves more cache space to accelerate critical warps, thereby improving overall performance. Experimental results show that BACM achieves an average performance improvement of 24% and 20% compared to the GTO and BAWS policies, respectively. BACM's hardware cost is limited to 96 bytes per streaming multiprocessor.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122248782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently. We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial on speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33x on average, while a speculation-oblivious policy yields only a 1.85x speedup. SAM also reduces wasted work by 52%.
{"title":"SAM: Optimizing Multithreaded Cores for Speculative Parallelism","authors":"Maleen Abeydeera, Suvinay Subramanian, M. C. Jeffrey, J. Emer, Daniel Sánchez","doi":"10.1109/PACT.2017.37","DOIUrl":"https://doi.org/10.1109/PACT.2017.37","url":null,"abstract":"This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading.We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently.We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial on speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33x on average, while a speculation-oblivious policy yields a 1.85x speedup. SAM also reduces wasted work by 52%.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129291646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}