Sangkug Lym, Heonjae Ha, Yongkee Kwon, Chun-Kai Chang, Jungrae Kim, M. Erez
Memory system performance is measured by access latency and bandwidth, and DRAM access parallelism critically impacts both. To improve DRAM parallelism, prior research focused on increasing the number of effective banks by sub-dividing each physical bank. We find that without avoiding conflicts on the resources shared among (sub)banks, the benefits are limited. We propose mechanisms for efficient DRAM resource utilization and resource-conflict avoidance (ERUCA). ERUCA reduces conflicts on shared (sub)bank resources by exploiting row-address locality between sub-banks and by improving the DRAM chip-level data bus. The area overhead of ERUCA is kept near zero through a unique implementation that exploits under-utilized resources available in commercial DRAM chips. Overall, ERUCA provides a 15% speedup while incurring less than 0.3% DRAM die area overhead.
{"title":"ERUCA: Efficient DRAM Resource Utilization and Resource Conflict Avoidance for Memory System Parallelism","authors":"Sangkug Lym, Heonjae Ha, Yongkee Kwon, Chun-Kai Chang, Jungrae Kim, M. Erez","doi":"10.1109/HPCA.2018.00063","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00063","url":null,"abstract":"Memory system performance is measured by access latency and bandwidth, and DRAM access parallelism critically impacts for both. To improve DRAM parallelism, previous research focused on increasing the number of effective banks by sub-dividing one physical bank. We find that without avoiding conflicts on the shared resources among (sub)banks, the benefits are limited. We propose mechanisms for efficient DRAM resource utilization and resource-conflict avoidance (ERUCA). ERUCA reduces conflicts on shared (sub)bank resources utilizing row address locality between sub-banks and improving the DRAM chip-level data bus. Area overhead for ERUCA is kept near zero with a unique implementation that exploits under-utilized resources available in commercial DRAM chips. Overall ERUCA provides 15% speedup while incurring <0.3% DRAM die area overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132680467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processing-in-Memory (PIM) has been proposed as a solution to accelerate data-intensive applications, such as real-time big-data processing and neural networks. The acceleration of data processing using PIM relies on its high internal memory bandwidth, which comes at the cost of high power consumption. Consequently, a comprehensive quantitative study of power modeling and power management for such PIM architectures is important. In this work, we first model the relationship between the power consumption and the internal bandwidth of PIM. This model not only provides guidance for PIM designs but also demonstrates the potential of power management via bandwidth throttling. Based on bandwidth throttling, we propose three techniques, Power-Aware Subtask Throttling (PAST), Processing Unit Boost (PUB), and Power Sprinting (PS), to improve energy efficiency and performance. To demonstrate the generality of the proposed methods, we apply them to two popular PIM designs. Evaluations show that the performance of PIM can be further improved if the power consumption is carefully controlled. At the same performance, the peak power consumption of HMC-based PIM can be reduced from 20 W to 15 W. The proposed power management schemes improve the speedup of a prior RRAM-based PIM from 69× to 273× after safely raising the power usage from about 1 W to 10 W. The model also shows that emerging RRAM is more suitable for large processing-in-memory designs due to its low power cost for storing data.
{"title":"PM3: Power Modeling and Power Management for Processing-in-Memory","authors":"Chao Zhang, Tong Meng, Guangyu Sun","doi":"10.1109/HPCA.2018.00054","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00054","url":null,"abstract":"Processing-in-Memory (PIM) has been proposed as a solution to accelerate data-intensive applications, such as real-time Big Data processing and neural networks. The acceleration of data processing using a PIM relies on its high internal memory bandwidth, which always comes with the cost of high power consumption. Consequently, it is important to have a comprehensive quantitative study of the power modeling and power management for such PIM architectures. In this work, we first model the relationship between the power consumption and the internal bandwidth of PIM. This model not only provides a guidance for PIM designs but also demonstrates the potential of power management via bandwidth throttling. Based on bandwidth throttling, we propose three techniques, Power-Aware Subtask Throttling (PAST), Processing Unit Boost (PUB), and Power Sprinting (PS), to improve the energy efficiency and performance. In order to demonstrate the universality of the proposed methods, we applied them to two kinds of popular PIM designs. Evaluations show that the performance of PIM can be further improved if the power consumption is carefully controlled. Targeting at the same performance, the peak power consumption of HMC-based PIM can be reduced from 20W to 15W. The proposed power management schemes improve the speedup of prior RRAM-based PIM from 69 × to 273 ×, after pushing the power usage from about 1W to 10W safely. The model also shows that emerging RRAM is more suitable for large processing-in-memory designs, due to its low power cost to store the data.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130669262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dmitry Knyaginin, Vassilis D. Papaefstathiou, P. Stenström
Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and the slower but larger M2 (NVM). This paper considers a flat migrating organization of hybrid memories. A challenging and open issue in managing such memories is allocating M1 among co-running programs such that high fairness is achieved alongside high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from the competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance through a cost-benefit analysis that is individual to each pair of data blocks considered for migration. Within ProFess, RSM guides MDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%) compared to the state-of-the-art, while outperforming it by 12% (avg.; up to 29%).
{"title":"ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness","authors":"Dmitry Knyaginin, Vassilis D. Papaefstathiou, P. Stenström","doi":"10.1109/HPCA.2018.00022","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00022","url":null,"abstract":"Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and slower but larger M2 (NVM). This paper considers a flat migrating organization of hybrid memories. A challenging and open issue of managing such memories is to allocate M1 among co-running programs such that high fairness is achieved at the same time as high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance by realizing cost-benefit analysis that is individual for each pair of data blocks considered for migration. Within ProFess, RSM guides MDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%), compared to the state-of-the-art, while outperforming it by 12% (avg.; up to 29%).","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122964301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize them. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets leftover resources; however, it fails to optimize resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel; although it achieves better fairness, resource underutilization within an SM is not addressed. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, overall performance may be undermined in intra-SM sharing schemes due to severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. Moreover, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, impact other kernels, further hurting overall performance. In this study, we investigate approaches to overcome these problems exposed in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls: the first balances the memory accesses of concurrent kernels; the second limits the number of in-flight memory instructions issued by individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
{"title":"Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls","authors":"Hongwen Dai, Zhen Lin, C. Li, Chen Zhao, Fei Wang, Nanning Zheng, Huiyang Zhou","doi":"10.1109/HPCA.2018.00027","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00027","url":null,"abstract":"Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources and it becomes difficult for a single GPU kernel to fully utilize the vast GPU resources. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets the leftover resources. However, it fails to optimize the resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although achieving better fairness, the resource underutilization within an SM is not addressed. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, the overall performance may be undermined in the intra-SM sharing schemes due to the severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even as computing-intensive, may starve from not being able to issue memory instructions in time. Besides, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, will impact other kernels, further hurting the overall performance. In this study, we investigate various approaches to overcome the aforementioned problems exposed in intra-SM sharing. We first highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. Then we propose two approaches to reduce memory pipeline stalls. The first is to balance memory accesses of concurrent kernels. The second is to limit the number of inflight memory instructions issued from individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122970360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anisotropic filtering enabled by modern rasterization-based GPUs provides users with an extremely authentic visualization experience, but it significantly limits the performance and energy efficiency of the 3D rendering process due to its large texture data requirement. To improve 3D rendering efficiency, we build a bridge between the anisotropic filtering process and the human visual system by analyzing users' perception of image quality. We discover that anisotropic filtering does not impact user-perceived image quality on every pixel. This motivates us to approximate the anisotropic filtering process for non-perceivable pixels in order to improve overall 3D rendering performance without damaging user experience. To achieve this goal, we propose a perception-oriented runtime approximation model for 3D rendering that leverages the inherent relationship between anisotropic and isotropic filtering. We also provide a low-cost texture unit design that enables this approximation. Extensive evaluation on modern 3D games demonstrates that, at a conservative tuning point, our design achieves a significant average speedup of 17% for overall 3D rendering along with an 11% total GPU energy reduction, without visible image quality loss from the users' perspective. It also reduces the texture filtering latency by an average of 29%. Additionally, it creates a unique perception-based tuning space for performance-quality tradeoffs on graphics processors.
{"title":"Perception-Oriented 3D Rendering Approximation for Modern Graphics Processors","authors":"Chenhao Xie, Xin Fu, S. Song","doi":"10.1109/HPCA.2018.00039","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00039","url":null,"abstract":"Anisotropic filtering enabled by modern rasterization-based GPUs provides users with extremely authentic visualization experience, but significantly limits the performance and energy efficiency of 3D rendering process due to its large texture data requirement. To improve 3D rendering efficiency, we build a bridge between anisotropic filtering process and human visual system by analyzing users’ perception on image quality. We discover that anisotropic filtering does not impact user perceived image quality on every pixel. This motives us to approximate the anisotropic filtering process for non-perceivable pixels in order to improve the overall 3D rendering performance without damaging user experience. To achieve this goal, we propose a perceptionoriented runtime approximation model for 3D rendering by leveraging the inner-relationship between anisotropic and isotropic filtering. We also provide a low-cost texture unit design for enabling this approximation. Extensive evaluation on modern 3D games demonstrates that, under a conservative tuning point, our design achieves a significant average speedup of 17% for the overall 3D rendering along with 11% total GPU energy reduction, without visible image quality loss from users’ perception. It also reduces the texture filtering latency by an average of 29%. Additionally, it creates a unique perception-based tuning space for performance-quality tradeoffs on graphics processors.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128807401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference, which may lead to priority inversion, missed deadlines, and unpredictable interactive performance. A key component of effectively managing multi-core resources is performance accounting, which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate but slow down latency-sensitive processes. Transparent accounting systems do not affect performance but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieves both performance transparency and superior accuracy. We call the approach dataflow accounting; the key idea is to track dynamic dataflow properties and use them to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and of the periods in which the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph by the estimated interference-free memory latency. GDP is highly accurate, with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.
{"title":"GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime","authors":"Magnus Jahre, L. Eeckhout","doi":"10.1109/HPCA.2018.00034","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00034","url":null,"abstract":"Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference which may lead to priority inversion, missed deadlines and unpredictable interactive performance. A key component to effectively manage multi-core resources is performance accounting which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate, but slow down latency-sensitive processes. Transparent accounting systems do not affect performance, but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieve both performance-transparency and superior accuracy. We call the approach dataflow accounting, and the key idea is to track dynamic dataflow properties and use these to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and periods where the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph with the estimated interference-free memory latency. GDP is very accurate with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129890233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-grained synchronization is employed in many parallel algorithms and is often implemented using busy-wait synchronization (e.g., spin locks). However, busy-wait synchronization incurs significant overheads, and existing CPU solutions do not readily translate to single-instruction, multiple-thread (SIMT) graphics processing unit (GPU) architectures. In this paper, we propose Back-Off Warp Spinning (BOWS), a hardware warp scheduling policy that extends existing warp scheduling policies to temporarily deprioritize warps executing busy-wait code. In addition, we propose Dynamic Detection of Spinning (DDOS), a novel hardware mechanism for accurately and efficiently detecting busy-wait synchronization on GPUs. On a set of GPU kernels employing busy-wait synchronization, DDOS identifies all busy-wait loops while incurring no false detections. BOWS improves performance by 1.5× and reduces energy consumption by 1.6× versus Criticality-Aware Warp Acceleration (CAWA) [14].
{"title":"Warp Scheduling for Fine-Grained Synchronization","authors":"Ahmed Eltantawy, Tor M. Aamodt","doi":"10.1109/HPCA.2018.00040","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00040","url":null,"abstract":"Fine-grained synchronization is employed in many parallel algorithms and is often implemented using busy-wait synchronization (e.g., spin locks). However, busy-wait synchronization incurs significant overheads and existing CPU solutions do not readily translate to single-instruction, multiple-thread (SIMT) graphics processor unit (GPU) architectures. In this paper, we propose Back-Off Warp Spinning (BOWS), a hardware warp scheduling policy that extends existing warp scheduling policies to temporarily deprioritize warps executing busy wait code. In addition, we propose Dynamic Detection of Spinning (DDOS), a novel hardware mechanism for accurately and efficiently detecting busy-wait synchronization on GPUs. On a set of GPU kernels employing busy-wait synchronization, DDOS identifies all busy-wait loops incurring no false detections. BOWS improves performance by 1.5× and reduces energy consumption by 1.6× versus Criticality-Aware Warp Acceleration (CAWA) [14].,,,,","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongrui Fan, Wenming Li, Xiaochun Ye, Da Wang, Hao Zhang, Zhimin Tang, Ninghui Sun
Fast-growing high-throughput applications, such as web services, are characterized by high-concurrency processing, hard real-time response, and high-bandwidth memory access. These emerging applications pose severe challenges to processors in datacenters, in both concurrent processing performance and energy efficiency. To offer a satisfactory quality of service, it is critically important to meet these emerging demands of high-throughput applications in future datacenters more efficiently. In this paper, we propose a novel architecture, called SmarCo, which allows high-throughput applications to be processed more efficiently in datacenters. Based on the dominant characteristics of high-throughput applications, we implement a large-scale many-core architecture with in-pair threads to support high-concurrency processing; we also introduce a hierarchical ring topology and a laxity-aware task scheduler to guarantee hard real-time response; furthermore, we propose a high-throughput datapath to improve memory access efficiency. We verify the efficiency of SmarCo using simulators, a large-scale FPGA platform, and a prototype in a TSMC 40 nm technology node. The experimental results show that, compared to an Intel Xeon E7-8890 v4, SmarCo achieves a 10.11× performance improvement and a 6.95× energy-efficiency improvement, with higher throughput and a better guarantee of real-time response.
{"title":"SmarCo: An Efficient Many-Core Processor for High-Throughput Applications in Datacenters","authors":"Dongrui Fan, Wenming Li, Xiaochun Ye, Da Wang, Hao Zhang, Zhimin Tang, Ninghui Sun","doi":"10.1109/HPCA.2018.00057","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00057","url":null,"abstract":"Fast-growing high-throughput applications, such as web services, are characterized by high-concurrency processing, hard real-time response, and high-bandwidth memory access. The newly-born applications bring severe challenges to processors in datacenters, both in concurrent processing performance and energy efficiency. To offer a satisfactory quality of services, it is of critical importance to meet these newly emerging demands of high-throughput applications in the future datacenters in a more efficient way. In this paper, we propose a novel architecture, called SmarCo, which allows high-throughput applications to be processed more efficiently in datacenters. Based on the dominant characteristics of high-throughput applications, we implement large-scale many-core architecture with in-pair threads to support high-concurrency processing; we also introduce a hierarchical ring topology and laxity-aware task scheduler to guarantee hard real-time response; furthermore, we propose high-throughput datapath to improve memory access efficiency. We verify the efficiency of SmarCo by using simulators, large-scale FPGA and prototype with TSMC 40-nm technology node. The experimental results show that, compared to Intel Xeon E7-8890V4, SmarCo achieves 10.11X performance improvement and 6.95X energy-efficiency improvement with higher throughput and a better guarantee of real-time response.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"203 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121536661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phase change memory (PCM) has recently emerged as a promising technology to meet the fast-growing demand for large-capacity memory in computer systems, replacing DRAM, which is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than for programming single-level cell PCM. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity to reduce write energy actually increases the write energy because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) scheme that compresses more than 91% of the memory lines and provides enough room to store the auxiliary data, using a novel restricted coset encoding applied at small data block granularities. Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.
{"title":"Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM","authors":"Seyed Mohammad Seyedzadeh, A. Jones, R. Melhem","doi":"10.1109/HPCA.2018.00038","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00038","url":null,"abstract":"Phase change memory (PCM) has recently emerged as a promising technology to meet the fast growing demand for large capacity memory in computer systems, replacing DRAM that is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than programing single-level cell PCM. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity to reduce write energy actually increases the write energy because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) which compresses more than 91% of the memory lines and provides enough room to store the auxiliary data using a novel restricted coset encoding applied at small data block granularities. Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131633662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen
Graph processing has recently received intensive interest in light of a wide range of needs to understand relationships. It is well known for poor locality and high memory bandwidth requirements. On conventional architectures, graph workloads incur a significant amount of data movement and energy consumption, which motivates several hardware graph processing accelerators. Current graph processing accelerators rely on memory access optimizations or on placing computation logic close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; and 2) both probability calculations (e.g., PageRank and collaborative filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We show that this assumption generally holds for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers a unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity. The experimental results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. GRAPHR gains a speedup of 1.16× to 4.12× and is 3.67× to 10.96× more energy efficient compared to a PIM-based architecture.
{"title":"GraphR: Accelerating Graph Processing Using ReRAM","authors":"Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2018.00052","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00052","url":null,"abstract":"Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. It is well-known for the poor locality and high memory bandwidth requirement. In conventional architectures, they incur a significant amount of data movements and energy consumption which motivates several hardware graph processing accelerators. The current graph processing accelerators rely on memory access optimizations or placing computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. GRAPHR gains a speedup of 1.16× to 4.12×, and is 3.67× to 10.96× more energy efficiency compared to PIM-based architecture.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134434507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}