Sangkug Lym, Heonjae Ha, Yongkee Kwon, Chun-Kai Chang, Jungrae Kim, M. Erez
Memory system performance is measured by access latency and bandwidth, and DRAM access parallelism critically impacts both. To improve DRAM parallelism, prior research focused on increasing the number of effective banks by sub-dividing each physical bank. We find that without avoiding conflicts on the resources shared among (sub)banks, the benefits are limited. We propose mechanisms for efficient DRAM resource utilization and resource-conflict avoidance (ERUCA). ERUCA reduces conflicts on shared (sub)bank resources by exploiting row-address locality between sub-banks and by improving the DRAM chip-level data bus. The area overhead of ERUCA is kept near zero through a unique implementation that exploits under-utilized resources available in commercial DRAM chips. Overall, ERUCA provides a 15% speedup while incurring less than 0.3% DRAM die area overhead.
{"title":"ERUCA: Efficient DRAM Resource Utilization and Resource Conflict Avoidance for Memory System Parallelism","authors":"Sangkug Lym, Heonjae Ha, Yongkee Kwon, Chun-Kai Chang, Jungrae Kim, M. Erez","doi":"10.1109/HPCA.2018.00063","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00063","url":null,"abstract":"Memory system performance is measured by access latency and bandwidth, and DRAM access parallelism critically impacts for both. To improve DRAM parallelism, previous research focused on increasing the number of effective banks by sub-dividing one physical bank. We find that without avoiding conflicts on the shared resources among (sub)banks, the benefits are limited. We propose mechanisms for efficient DRAM resource utilization and resource-conflict avoidance (ERUCA). ERUCA reduces conflicts on shared (sub)bank resources utilizing row address locality between sub-banks and improving the DRAM chip-level data bus. Area overhead for ERUCA is kept near zero with a unique implementation that exploits under-utilized resources available in commercial DRAM chips. Overall ERUCA provides 15% speedup while incurring <0.3% DRAM die area overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132680467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processing-in-Memory (PIM) has been proposed as a solution to accelerate data-intensive applications, such as real-time big-data processing and neural networks. The acceleration of data processing using PIM relies on its high internal memory bandwidth, which comes at the cost of high power consumption. Consequently, a comprehensive quantitative study of power modeling and power management for such PIM architectures is important. In this work, we first model the relationship between the power consumption and the internal bandwidth of PIM. This model not only provides guidance for PIM designs but also demonstrates the potential of power management via bandwidth throttling. Based on bandwidth throttling, we propose three techniques, Power-Aware Subtask Throttling (PAST), Processing Unit Boost (PUB), and Power Sprinting (PS), to improve energy efficiency and performance. To demonstrate the generality of the proposed methods, we apply them to two popular PIM designs. Evaluations show that the performance of PIM can be further improved if the power consumption is carefully controlled. At the same performance, the peak power consumption of HMC-based PIM can be reduced from 20 W to 15 W. The proposed power management schemes improve the speedup of a prior RRAM-based PIM from 69× to 273× after safely raising the power usage from about 1 W to 10 W. The model also shows that emerging RRAM is more suitable for large processing-in-memory designs due to its low power cost for storing data.
{"title":"PM3: Power Modeling and Power Management for Processing-in-Memory","authors":"Chao Zhang, Tong Meng, Guangyu Sun","doi":"10.1109/HPCA.2018.00054","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00054","url":null,"abstract":"Processing-in-Memory (PIM) has been proposed as a solution to accelerate data-intensive applications, such as real-time Big Data processing and neural networks. The acceleration of data processing using a PIM relies on its high internal memory bandwidth, which always comes with the cost of high power consumption. Consequently, it is important to have a comprehensive quantitative study of the power modeling and power management for such PIM architectures. In this work, we first model the relationship between the power consumption and the internal bandwidth of PIM. This model not only provides a guidance for PIM designs but also demonstrates the potential of power management via bandwidth throttling. Based on bandwidth throttling, we propose three techniques, Power-Aware Subtask Throttling (PAST), Processing Unit Boost (PUB), and Power Sprinting (PS), to improve the energy efficiency and performance. In order to demonstrate the universality of the proposed methods, we applied them to two kinds of popular PIM designs. Evaluations show that the performance of PIM can be further improved if the power consumption is carefully controlled. Targeting at the same performance, the peak power consumption of HMC-based PIM can be reduced from 20W to 15W. The proposed power management schemes improve the speedup of prior RRAM-based PIM from 69 × to 273 ×, after pushing the power usage from about 1W to 10W safely. The model also shows that emerging RRAM is more suitable for large processing-in-memory designs, due to its low power cost to store the data.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130669262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dmitry Knyaginin, Vassilis D. Papaefstathiou, P. Stenström
Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and the slower but larger M2 (NVM). This paper considers a flat migrating organization of hybrid memories. A challenging and open issue in managing such memories is allocating M1 among co-running programs such that high fairness is achieved alongside high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from the competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance through a cost-benefit analysis that is individual to each pair of data blocks considered for migration. Within ProFess, RSM guides MDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%) compared to the state-of-the-art, while outperforming it by 12% (avg.; up to 29%).
{"title":"ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness","authors":"Dmitry Knyaginin, Vassilis D. Papaefstathiou, P. Stenström","doi":"10.1109/HPCA.2018.00022","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00022","url":null,"abstract":"Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and slower but larger M2 (NVM). This paper considers a flat migrating organization of hybrid memories. A challenging and open issue of managing such memories is to allocate M1 among co-running programs such that high fairness is achieved at the same time as high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance by realizing cost-benefit analysis that is individual for each pair of data blocks considered for migration. Within ProFess, RSM guides MDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%), compared to the state-of-the-art, while outperforming it by 12% (avg.; up to 29%).","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122964301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize them. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets leftover resources; however, it fails to optimize resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel; although it achieves better fairness, resource underutilization within an SM is not addressed. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, overall performance may be undermined in intra-SM sharing schemes due to severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. Moreover, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, impact other kernels, further hurting overall performance. In this study, we investigate approaches to overcome these problems exposed in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls: the first balances the memory accesses of concurrent kernels; the second limits the number of in-flight memory instructions issued by individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
{"title":"Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls","authors":"Hongwen Dai, Zhen Lin, C. Li, Chen Zhao, Fei Wang, Nanning Zheng, Huiyang Zhou","doi":"10.1109/HPCA.2018.00027","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00027","url":null,"abstract":"Following the advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources and it becomes difficult for a single GPU kernel to fully utilize the vast GPU resources. One solution to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets the leftover resources. However, it fails to optimize the resource utilization and does not provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although achieving better fairness, the resource underutilization within an SM is not addressed. Thus, intra-SM sharing has been proposed to issue thread blocks from different kernels to each SM. However, as shown in this study, the overall performance may be undermined in the intra-SM sharing schemes due to the severe interference among kernels. Specifically, as concurrent kernels share the memory subsystem, one kernel, even as computing-intensive, may starve from not being able to issue memory instructions in time. Besides, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, will impact other kernels, further hurting the overall performance. In this study, we investigate various approaches to overcome the aforementioned problems exposed in intra-SM sharing. We first highlight that cache partitioning techniques proposed for CPUs are not effective for GPUs. Then we propose two approaches to reduce memory pipeline stalls. The first is to balance memory accesses of concurrent kernels. The second is to limit the number of inflight memory instructions issued from individual kernels. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122970360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anisotropic filtering enabled by modern rasterization-based GPUs provides users with an extremely authentic visualization experience, but it significantly limits the performance and energy efficiency of the 3D rendering process due to its large texture data requirement. To improve 3D rendering efficiency, we build a bridge between the anisotropic filtering process and the human visual system by analyzing users' perception of image quality. We discover that anisotropic filtering does not impact user-perceived image quality on every pixel. This motivates us to approximate the anisotropic filtering process for non-perceivable pixels in order to improve overall 3D rendering performance without damaging user experience. To achieve this goal, we propose a perception-oriented runtime approximation model for 3D rendering that leverages the inherent relationship between anisotropic and isotropic filtering. We also provide a low-cost texture unit design that enables this approximation. Extensive evaluation on modern 3D games demonstrates that, at a conservative tuning point, our design achieves a significant average speedup of 17% for overall 3D rendering along with an 11% total GPU energy reduction, without visible image quality loss from the users' perspective. It also reduces the texture filtering latency by an average of 29%. Additionally, it creates a unique perception-based tuning space for performance-quality tradeoffs on graphics processors.
{"title":"Perception-Oriented 3D Rendering Approximation for Modern Graphics Processors","authors":"Chenhao Xie, Xin Fu, S. Song","doi":"10.1109/HPCA.2018.00039","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00039","url":null,"abstract":"Anisotropic filtering enabled by modern rasterization-based GPUs provides users with extremely authentic visualization experience, but significantly limits the performance and energy efficiency of 3D rendering process due to its large texture data requirement. To improve 3D rendering efficiency, we build a bridge between anisotropic filtering process and human visual system by analyzing users’ perception on image quality. We discover that anisotropic filtering does not impact user perceived image quality on every pixel. This motives us to approximate the anisotropic filtering process for non-perceivable pixels in order to improve the overall 3D rendering performance without damaging user experience. To achieve this goal, we propose a perceptionoriented runtime approximation model for 3D rendering by leveraging the inner-relationship between anisotropic and isotropic filtering. We also provide a low-cost texture unit design for enabling this approximation. Extensive evaluation on modern 3D games demonstrates that, under a conservative tuning point, our design achieves a significant average speedup of 17% for the overall 3D rendering along with 11% total GPU energy reduction, without visible image quality loss from users’ perception. It also reduces the texture filtering latency by an average of 29%. Additionally, it creates a unique perception-based tuning space for performance-quality tradeoffs on graphics processors.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128807401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference, which may lead to priority inversion, missed deadlines, and unpredictable interactive performance. A key component of effectively managing multi-core resources is performance accounting, which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate but slow down latency-sensitive processes. Transparent accounting systems do not affect performance but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieves both performance transparency and superior accuracy. We call the approach dataflow accounting; the key idea is to track dynamic dataflow properties and use them to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and of the periods in which the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph by the estimated interference-free memory latency. GDP is highly accurate, with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.
{"title":"GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime","authors":"Magnus Jahre, L. Eeckhout","doi":"10.1109/HPCA.2018.00034","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00034","url":null,"abstract":"Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference which may lead to priority inversion, missed deadlines and unpredictable interactive performance. A key component to effectively manage multi-core resources is performance accounting which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate, but slow down latency-sensitive processes. Transparent accounting systems do not affect performance, but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieve both performance-transparency and superior accuracy. We call the approach dataflow accounting, and the key idea is to track dynamic dataflow properties and use these to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and periods where the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph with the estimated interference-free memory latency. GDP is very accurate with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129890233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-grained synchronization is employed in many parallel algorithms and is often implemented using busy-wait synchronization (e.g., spin locks). However, busy-wait synchronization incurs significant overheads, and existing CPU solutions do not readily translate to single-instruction, multiple-thread (SIMT) graphics processing unit (GPU) architectures. In this paper, we propose Back-Off Warp Spinning (BOWS), a hardware warp scheduling policy that extends existing warp scheduling policies to temporarily deprioritize warps executing busy-wait code. In addition, we propose Dynamic Detection of Spinning (DDOS), a novel hardware mechanism for accurately and efficiently detecting busy-wait synchronization on GPUs. On a set of GPU kernels employing busy-wait synchronization, DDOS identifies all busy-wait loops while incurring no false detections. BOWS improves performance by 1.5× and reduces energy consumption by 1.6× versus Criticality-Aware Warp Acceleration (CAWA) [14].
{"title":"Warp Scheduling for Fine-Grained Synchronization","authors":"Ahmed Eltantawy, Tor M. Aamodt","doi":"10.1109/HPCA.2018.00040","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00040","url":null,"abstract":"Fine-grained synchronization is employed in many parallel algorithms and is often implemented using busy-wait synchronization (e.g., spin locks). However, busy-wait synchronization incurs significant overheads and existing CPU solutions do not readily translate to single-instruction, multiple-thread (SIMT) graphics processor unit (GPU) architectures. In this paper, we propose Back-Off Warp Spinning (BOWS), a hardware warp scheduling policy that extends existing warp scheduling policies to temporarily deprioritize warps executing busy wait code. In addition, we propose Dynamic Detection of Spinning (DDOS), a novel hardware mechanism for accurately and efficiently detecting busy-wait synchronization on GPUs. On a set of GPU kernels employing busy-wait synchronization, DDOS identifies all busy-wait loops incurring no false detections. BOWS improves performance by 1.5× and reduces energy consumption by 1.6× versus Criticality-Aware Warp Acceleration (CAWA) [14].,,,,","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongrui Fan, Wenming Li, Xiaochun Ye, Da Wang, Hao Zhang, Zhimin Tang, Ninghui Sun
Fast-growing high-throughput applications, such as web services, are characterized by high-concurrency processing, hard real-time response, and high-bandwidth memory access. These emerging applications pose severe challenges to processors in datacenters, in both concurrent processing performance and energy efficiency. To offer a satisfactory quality of service, it is critically important to meet these emerging demands of high-throughput applications in future datacenters more efficiently. In this paper, we propose a novel architecture, called SmarCo, which allows high-throughput applications to be processed more efficiently in datacenters. Based on the dominant characteristics of high-throughput applications, we implement a large-scale many-core architecture with in-pair threads to support high-concurrency processing; we also introduce a hierarchical ring topology and a laxity-aware task scheduler to guarantee hard real-time response; furthermore, we propose a high-throughput datapath to improve memory access efficiency. We verify the efficiency of SmarCo using simulators, a large-scale FPGA platform, and a prototype in a TSMC 40 nm technology node. The experimental results show that, compared to an Intel Xeon E7-8890 v4, SmarCo achieves a 10.11× performance improvement and a 6.95× energy-efficiency improvement, with higher throughput and a better guarantee of real-time response.
{"title":"SmarCo: An Efficient Many-Core Processor for High-Throughput Applications in Datacenters","authors":"Dongrui Fan, Wenming Li, Xiaochun Ye, Da Wang, Hao Zhang, Zhimin Tang, Ninghui Sun","doi":"10.1109/HPCA.2018.00057","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00057","url":null,"abstract":"Fast-growing high-throughput applications, such as web services, are characterized by high-concurrency processing, hard real-time response, and high-bandwidth memory access. The newly-born applications bring severe challenges to processors in datacenters, both in concurrent processing performance and energy efficiency. To offer a satisfactory quality of services, it is of critical importance to meet these newly emerging demands of high-throughput applications in the future datacenters in a more efficient way. In this paper, we propose a novel architecture, called SmarCo, which allows high-throughput applications to be processed more efficiently in datacenters. Based on the dominant characteristics of high-throughput applications, we implement large-scale many-core architecture with in-pair threads to support high-concurrency processing; we also introduce a hierarchical ring topology and laxity-aware task scheduler to guarantee hard real-time response; furthermore, we propose high-throughput datapath to improve memory access efficiency. We verify the efficiency of SmarCo by using simulators, large-scale FPGA and prototype with TSMC 40-nm technology node. The experimental results show that, compared to Intel Xeon E7-8890V4, SmarCo achieves 10.11X performance improvement and 6.95X energy-efficiency improvement with higher throughput and a better guarantee of real-time response.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"203 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121536661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phase change memory (PCM) has recently emerged as a promising technology to meet the fast-growing demand for large-capacity memory in computer systems, replacing DRAM, which is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than for programming single-level cell PCM. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity to reduce write energy actually increases the write energy because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) scheme that compresses more than 91% of the memory lines and provides enough room to store the auxiliary data, using a novel restricted coset encoding applied at small data block granularities. Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.
{"title":"Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM","authors":"Seyed Mohammad Seyedzadeh, A. Jones, R. Melhem","doi":"10.1109/HPCA.2018.00038","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00038","url":null,"abstract":"Phase change memory (PCM) has recently emerged as a promising technology to meet the fast growing demand for large capacity memory in computer systems, replacing DRAM that is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than programing single-level cell PCM. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity to reduce write energy actually increases the write energy because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) which compresses more than 91% of the memory lines and provides enough room to store the auxiliary data using a novel restricted coset encoding applied at small data block granularities. Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131633662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen
Graph processing has recently received intensive interest in light of a wide range of needs to understand relationships. It is well known for poor locality and high memory bandwidth requirements. On conventional architectures, graph workloads incur a significant amount of data movement and energy consumption, which motivates several hardware graph processing accelerators. Current graph processing accelerators rely on memory access optimizations or on placing computation logic close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; and 2) both probability calculations (e.g., PageRank and collaborative filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We show that this assumption generally holds for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers a unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity. The experimental results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. GRAPHR gains a speedup of 1.16× to 4.12× and is 3.67× to 10.96× more energy efficient compared to a PIM-based architecture.
{"title":"GraphR: Accelerating Graph Processing Using ReRAM","authors":"Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2018.00052","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00052","url":null,"abstract":"Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. It is well-known for the poor locality and high memory bandwidth requirement. In conventional architectures, they incur a significant amount of data movements and energy consumption which motivates several hardware graph processing accelerators. The current graph processing accelerators rely on memory access optimizations or placing computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. GRAPHR gains a speedup of 1.16× to 4.12×, and is 3.67× to 10.96× more energy efficiency compared to PIM-based architecture.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134434507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}