
Latest publications: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

RelaxFault Memory Repair
Dong-Wan Kim, M. Erez
Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity and has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.
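As a rough illustration of the repair idea described above, the sketch below models a remap table that redirects accesses from lines covering known-faulty DRAM locations into a small reserved region of the LLC. The structure names, slot count, and routing interface are illustrative assumptions, not the paper's design.

```python
# Hypothetical sketch of LLC-based fault remapping in the spirit of RelaxFault.
# Names (RemapTable, LLC_RESERVED_LINES) are illustrative, not from the paper.

LINE_BYTES = 64
LLC_RESERVED_LINES = 1536   # ~96 KiB of LLC set aside for remapped lines (< 100 KiB)

class RemapTable:
    def __init__(self):
        self.table = {}          # faulty line address -> reserved LLC slot
        self.free_slots = list(range(LLC_RESERVED_LINES))

    def repair(self, faulty_line_addr):
        """Map a line known (from fault logs) to be faulty onto a reserved LLC slot."""
        if faulty_line_addr in self.table:
            return True
        if not self.free_slots:
            return False         # out of reserved capacity; fall back to ECC alone
        self.table[faulty_line_addr] = self.free_slots.pop()
        return True

    def route(self, line_addr):
        """Return ('llc', slot) for remapped lines, ('dram', line_addr) otherwise."""
        slot = self.table.get(line_addr)
        return ('llc', slot) if slot is not None else ('dram', line_addr)

rt = RemapTable()
rt.repair(0x1f4000)                  # a line covering a faulty DRAM cell
print(rt.route(0x1f4000))            # -> ('llc', 0)
print(rt.route(0x200000))            # -> ('dram', 2097152)
```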
{"title":"RelaxFault Memory Repair","authors":"Dong-Wan Kim, M. Erez","doi":"10.1145/3007787.3001205","DOIUrl":"https://doi.org/10.1145/3007787.3001205","url":null,"abstract":"Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity, has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"88 1","pages":"645-657"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72922058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading
Faissal M. Sleiman, T. Wenisch
Simultaneous multithreading (SMT) out-of-order cores waste a significant portion of structural out-of-order core resources on instructions that do not need them. These resources eliminate false ordering dependences. However, because thread interleaving spreads dependent instructions, nearly half of instructions dynamically issue in program order after all false dependences have resolved. These in-sequence instructions interleave with other reordered instructions at a fine granularity within the instruction window. We develop a technique to efficiently scale in-flight instructions through a hybrid out-of-order/in-order microarchitecture, which can dispatch instructions to efficient in-order scheduling mechanisms -- using a FIFO issue queue called the shelf -- on an instruction-by-instruction basis. Instructions dispatched to the shelf do not allocate out-of-order core resources in the reorder buffer, issue queue, physical registers, or load-store queues. We measure the opportunity for such hybrid microarchitectures and design and evaluate a practical dispatch mechanism targeted at 4-threaded cores. Adding a shelf to a baseline 4-thread system with a 64-entry ROB improves normalized system throughput by 11.5% (up to 19.2% at best) and energy-delay product by 10.9% (up to 17.5% at best).
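The sketch below illustrates, under simplifying assumptions, the instruction-by-instruction dispatch choice the abstract describes: instructions expected to issue in program order go to a cheap FIFO shelf, and the rest go to the out-of-order issue queue. The readiness test and structure sizes are placeholders, not the paper's mechanism.

```python
# Illustrative dispatch policy for a hybrid OoO/in-order core with a FIFO "shelf".
from collections import deque

class HybridDispatcher:
    def __init__(self, iq_size=32, shelf_size=64):
        self.issue_queue = []            # OoO scheduling resources (limited, expensive)
        self.shelf = deque()             # in-order FIFO, cheap entries
        self.iq_size = iq_size
        self.shelf_size = shelf_size
        self.inflight_ooo = set()        # dest regs of instructions sent to the IQ

    def dispatch(self, instr):
        """instr = {'dest': reg, 'srcs': [regs]} -- send to the shelf if it will
        issue in program order, i.e., none of its sources come from an
        instruction that may still issue out of order."""
        in_sequence = not any(src in self.inflight_ooo for src in instr['srcs'])
        if in_sequence and len(self.shelf) < self.shelf_size:
            self.shelf.append(instr)     # no OoO ROB/IQ/phys-reg style entries needed
            return 'shelf'
        if len(self.issue_queue) < self.iq_size:
            self.issue_queue.append(instr)
            self.inflight_ooo.add(instr['dest'])
            return 'ooo'
        return 'stall'

d = HybridDispatcher()
print(d.dispatch({'dest': 'r1', 'srcs': ['r2']}))   # 'shelf': sources not produced OoO
print(d.dispatch({'dest': 'r3', 'srcs': ['r1']}))   # 'shelf': r1 is ahead on the shelf
```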
{"title":"Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading","authors":"Faissal M. Sleiman, T. Wenisch","doi":"10.1145/3007787.3001183","DOIUrl":"https://doi.org/10.1145/3007787.3001183","url":null,"abstract":"Simultaneous multithreading (SMT) out-of-order cores waste a significant portion of structural out-of-order core resources on instructions that do not need them. These resources eliminate false ordering dependences. However, because thread interleaving spreads dependent instructions, nearly half of instructions dynamically issue in program order after all false dependences have resolved. These in-sequence instructions interleave with other reordered instructions at a fine granularity within the instruction window. We develop a technique to efficiently scale in-flight instructions through a hybrid out-of-order/in-order microarchitecture, which can dispatch instructions to efficient in-order scheduling mechanisms -- using a FIFO issue queue called the shelf -- on an instruction-by-instruction basis. Instructions dispatched to the shelf do not allocate out-of-order core resources in the reorder buffer, issue queue, physical registers, or load-store queues. We measure opportunity for such hybrid microarchitectures and design and evaluate a practical dispatch mechanism targeted at 4-threaded cores. Adding a shelf to a baseline 4-thread system with 64- entry ROB improves normalized system throughput by 11.5% (up to 19.2% at best) and energy-delay product by 10.9% (up to 17.5% at best).","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"356 1","pages":"431-443"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77324027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
Brandon Reagen, P. Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, D. Brooks
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduce power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
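One of the listed optimizations, pruning of small activity values, can be illustrated with a short sketch; the threshold and tensor shapes here are arbitrary and only show the effect of predicating away near-zero activations so downstream multiply-accumulates can be skipped.

```python
# Minimal sketch of one Minerva-style optimization: pruning small activations.
import numpy as np

def prune_small_activations(activations, threshold=0.05):
    """Zero out activations whose magnitude falls below the threshold;
    a hardware predicate would skip the corresponding multiply-accumulates."""
    mask = np.abs(activations) >= threshold
    return activations * mask, 1.0 - mask.mean()   # pruned tensor, fraction skipped

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.2, size=(1, 256)).astype(np.float32)
pruned, skipped = prune_small_activations(acts)
print(f"skipped {skipped:.1%} of MAC operations")
```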
{"title":"Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators","authors":"Brandon Reagen, P. Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, D. Brooks","doi":"10.1145/3007787.3001165","DOIUrl":"https://doi.org/10.1145/3007787.3001165","url":null,"abstract":"The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"9 1","pages":"267-278"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85172639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 537
MITTS: Memory Inter-arrival Time Traffic Shaping
Yanqi Zhou, D. Wentzlaff
Memory bandwidth severely limits the scalability and performance of multicore and manycore systems. Application performance can be very sensitive to both the delivered memory bandwidth and latency. In multicore systems, a memory channel is usually shared by multiple cores. Having the ability to precisely provision, schedule, and isolate memory bandwidth and latency on a per-core basis is particularly important when different memory guarantees are needed on a per-customer, per-application, or per-core basis. Infrastructure as a Service (IaaS) Cloud systems, and even general purpose multicores optimized for application throughput or fairness all benefit from the ability to control and schedule memory access on a fine-grain basis. In this paper, we propose MITTS (Memory Inter-arrival Time Traffic Shaping), a simple, distributed hardware mechanism which limits memory traffic at the source (Core or LLC). MITTS shapes memory traffic based on memory request inter-arrival time, enabling fine-grain bandwidth allocation. In an IaaS system, MITTS enables Cloud customers to express their memory distribution needs and pay commensurately. For instance, MITTS enables charging customers that have bursty memory traffic more than customers with uniform memory traffic for the same aggregate bandwidth. Beyond IaaS systems, MITTS can also be used to optimize for throughput or fairness in a general purpose multi-program workload. MITTS uses an online genetic algorithm to configure hardware bins, which can adapt for program phases and variable input sets. We have implemented MITTS in Verilog and have taped-out the design in a 25-core 32nm processor and find that MITTS requires less than 0.9% of core area. We evaluate across SPECint, PARSEC, Apache, and bhm Mail Server workloads, and find that MITTS achieves an average 1.18× performance gain compared to the best static bandwidth allocation, a 2.69× average performance/cost advantage in an IaaS setting, and up to 1.17× better throughput and 1.52× better fairness when compared to conventional memory bandwidth provisioning techniques.
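A minimal software model of inter-arrival-time shaping is sketched below: each request draws a credit from a bin indexed by the gap since the source's last request, so bursty traffic is throttled more than uniform traffic of the same aggregate bandwidth. Bin edges, credit counts, and the replenish period are invented parameters, not MITTS's configuration.

```python
# Hedged sketch of inter-arrival-time traffic shaping in the spirit of MITTS.

class InterArrivalShaper:
    def __init__(self, bin_edges, credits_per_bin, replenish_period):
        self.bin_edges = bin_edges              # cycles; e.g. [8, 16, 32, 64]
        self.max_credits = list(credits_per_bin)
        self.credits = list(credits_per_bin)
        self.replenish_period = replenish_period
        self.last_request = None
        self.last_replenish = 0

    def _bin_for(self, gap):
        for i, edge in enumerate(self.bin_edges):
            if gap < edge:
                return i
        return len(self.bin_edges)              # last bin catches very long gaps

    def allow(self, now):
        """Return True if a memory request issued at cycle `now` fits the
        allocated inter-arrival distribution, else False (request is held)."""
        if now - self.last_replenish >= self.replenish_period:
            self.credits = list(self.max_credits)
            self.last_replenish = now
        if self.last_request is None:
            self.last_request = now
            return True
        b = self._bin_for(now - self.last_request)
        if self.credits[b] > 0:
            self.credits[b] -= 1
            self.last_request = now
            return True
        return False                             # bursty traffic beyond its allocation

shaper = InterArrivalShaper(bin_edges=[8, 16, 32, 64],
                            credits_per_bin=[2, 4, 8, 16, 64],
                            replenish_period=10_000)
print(shaper.allow(now=100), shaper.allow(now=104), shaper.allow(now=106))
```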
{"title":"MITTS: Memory Inter-arrival Time Traffic Shaping","authors":"Yanqi Zhou, D. Wentzlaff","doi":"10.1145/3007787.3001193","DOIUrl":"https://doi.org/10.1145/3007787.3001193","url":null,"abstract":"Memory bandwidth severely limits the scalability and performance of multicore and manycore systems. Application performance can be very sensitive to both the delivered memory bandwidth and latency. In multicore systems, a memory channel is usually shared by multiple cores. Having the ability to precisely provision, schedule, and isolate memory bandwidth and latency on a per-core basis is particularly important when different memory guarantees are needed on a per-customer, per-application, or per-core basis. Infrastructure as a Service (IaaS) Cloud systems, and even general purpose multicores optimized for application throughput or fairness all benefit from the ability to control and schedule memory access on a fine-grain basis. In this paper, we propose MITTS (Memory Inter-arrival Time Traffic Shaping), a simple, distributed hardware mechanism which limits memory traffic at the source (Core or LLC). MITTS shapes memory traffic based on memory request inter-arrival time, enabling fine-grain bandwidth allocation. In an IaaS system, MITTS enables Cloud customers to express their memory distribution needs and pay commensurately. For instance, MITTS enables charging customers that have bursty memory traffic more than customers with uniform memory traffic for the same aggregate bandwidth. Beyond IaaS systems, MITTS can also be used to optimize for throughput or fairness in a general purpose multi-program workload. MITTS uses an online genetic algorithm to configure hardware bins, which can adapt for program phases and variable input sets. We have implemented MITTS in Verilog and have taped-out the design in a 25-core 32nm processor and find that MITTS requires less than 0.9% of core area. We evaluate across SPECint, PARSEC, Apache, and bhm Mail Server workloads, and find that MITTS achieves an average 1.18× performance gain compared to the best static bandwidth allocation, a 2.69× average performance/cost advantage in an IaaS setting, and up to 1.17x better throughput and 1.52× better fairness when compared to conventional memory bandwidth provisioning techniques.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"35 1","pages":"532-544"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85785624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 39
Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units
Siyang Wang, X. Zhang, Yuxuan Li, Ramin Bashizade, Songze Yang, C. Dwyer, A. Lebeck
The increasing use of probabilistic algorithms from statistics and machine learning for data analytics presents new challenges and opportunities for the design of computing systems. One important class of probabilistic machine learning algorithms is Markov Chain Monte Carlo (MCMC) sampling, which can be used on a wide variety of applications in Bayesian Inference. However, this probabilistic iterative algorithm can be inefficient in practice on today's processors, especially for problems with high dimensionality and complex structure. The source of inefficiency is generating samples from parameterized probability distributions. This paper seeks to address this sampling inefficiency and presents a new approach to support probabilistic computing that leverages the native randomness of Resonance Energy Transfer (RET) networks to construct RET-based sampling units (RSU). Although RSUs can be designed for a variety of applications, we focus on the specific class of probabilistic problems described as Markov Random Field Inference. Our proposed RSU uses a RET network to implement a molecular-scale optical Gibbs sampling unit (RSU-G) that can be integrated into a processor / GPU as specialized functional units or organized as a discrete accelerator. We experimentally demonstrate the fundamental operation of an RSU using a macro-scale hardware prototype. Emulation-based evaluation of two computer vision applications for HD images reveals that an RSU-augmented GPU provides speedups of 3 and 16 over a GPU. Analytic evaluation shows a discrete accelerator that is limited by 336 GB/s DRAM produces speedups of 21 and 54 versus the GPU implementations.
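For context, the sketch below runs a conventional software Gibbs sweep over a small binary MRF; an RSU-G would instead generate each per-variable sample natively from RET-network randomness rather than with `random.random()`. The Ising-style model and coupling value are illustrative assumptions.

```python
# Small software Gibbs sampler for a binary (Ising-like) MRF, illustrating the
# per-variable conditional sampling an RSU-G would perform in hardware.
import math
import random

def gibbs_sweep(grid, coupling=0.8, steps=100):
    """In-place Gibbs sampling over a 2-D grid of ±1 spins with nearest-neighbour coupling."""
    h, w = len(grid), len(grid[0])
    for _ in range(steps):
        for y in range(h):
            for x in range(w):
                s = sum(grid[(y + dy) % h][(x + dx) % w]
                        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)))
                p_up = 1.0 / (1.0 + math.exp(-2.0 * coupling * s))
                grid[y][x] = 1 if random.random() < p_up else -1
    return grid

random.seed(1)
grid = [[random.choice((-1, 1)) for _ in range(16)] for _ in range(16)]
gibbs_sweep(grid)
print(sum(sum(row) for row in grid))   # net magnetisation after sampling
```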
{"title":"Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units","authors":"Siyang Wang, X. Zhang, Yuxuan Li, Ramin Bashizade, Songze Yang, C. Dwyer, A. Lebeck","doi":"10.1145/3007787.3001196","DOIUrl":"https://doi.org/10.1145/3007787.3001196","url":null,"abstract":"The increasing use of probabilistic algorithms from statistics and machine learning for data analytics presents new challenges and opportunities for the design of computing systems. One important class of probabilistic machine learning algorithms is Markov Chain Monte Carlo (MCMC) sampling, which can be used on a wide variety of applications in Bayesian Inference. However, this probabilistic iterative algorithm can be inefficient in practice on today's processors, especially for problems with high dimensionality and complex structure. The source of inefficiency is generating samples from parameterized probability distributions. This paper seeks to address this sampling inefficiency and presents a new approach to support probabilistic computing that leverages the native randomness of Resonance Energy Transfer (RET) networks to construct RET-based sampling units (RSU). Although RSUs can be designed for a variety of applications, we focus on the specific class of probabilistic problems described as Markov Random Field Inference. Our proposed RSU uses a RET network to implement a molecular-scale optical Gibbs sampling unit (RSU-G) that can be integrated into a processor / GPU as specialized functional units or organized as a discrete accelerator. We experimentally demonstrate the fundamental operation of an RSU using a macro-scale hardware prototype. Emulation-based evaluation of two computer vision applications for HD images reveal that an RSU augmented GPU provides speedups over a GPU of 3 and 16. Analytic evaluation shows a discrete accelerator that is limited by 336 GB/s DRAM produces speedups of 21 and 54 versus the GPU implementations.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"22 1","pages":"558-569"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87405112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Decoupling Loads for Nano-Instruction Set Computers
Ziqiang Huang, Andrew D. Hilton, Benjamin C. Lee
We propose an ISA extension that decouples the data access and register write operations in a load instruction. We describe system and hardware support for decoupled loads. Furthermore, we show how compilers can generate better static instruction schedules by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance with geometric mean speedups of 8.4%.
{"title":"Decoupling Loads for Nano-Instruction Set Computers","authors":"Ziqiang Huang, Andrew D. Hilton, Benjamin C. Lee","doi":"10.1145/3007787.3001181","DOIUrl":"https://doi.org/10.1145/3007787.3001181","url":null,"abstract":"We propose an ISA extension that decouples the data access and register write operations in a load instruction. We describe system and hardware support for decoupled loads. Furthermore, we show how compilers can generate better static instruction schedules by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance with geometric mean speedups of 8.4%.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"41 1","pages":"406-417"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88006446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
Jin Wang, Norman Rubin, A. Sidelnik, S. Yalamanchili
Recent developments in GPU execution models and architectures have introduced dynamic parallelism to facilitate the execution of irregular applications where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also create new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between the parent and child thread blocks (TBs) created during dynamic nested kernel and thread block launches cannot be fully leveraged using the current TB scheduling strategies. These strategies were designed for the current implementations of the BSP model but fall short when dynamic parallelism is introduced since they are oblivious to the hierarchical reference locality. We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm adopts three different scheduling decisions to i) prioritize the execution of the child TBs, ii) bind them to the stream multiprocessors (SMXs) occupied by their parents TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications executed on a cycle-level simulator employing dynamic parallelism demonstrate that LaPerm is able to achieve an average of 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.
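A toy model of the three scheduling decisions is sketched below: child thread blocks are dequeued before root ones, bound to their parent's SMX when its load is close to the minimum, and otherwise spilled to the least-loaded SMX. The queue structure, slack threshold, and TB representation are assumptions for illustration, not LaPerm's implementation.

```python
# Hedged sketch of a locality-aware thread-block scheduler in the spirit of LaPerm.
from collections import deque

class LocalityAwareScheduler:
    def __init__(self, num_smx):
        self.child_queue = deque()       # TBs launched by device-side (nested) kernels
        self.root_queue = deque()        # TBs from host-launched kernels
        self.load = [0] * num_smx        # outstanding TBs per SMX

    def enqueue(self, tb):
        (self.child_queue if tb.get('parent_smx') is not None else self.root_queue).append(tb)

    def pick_smx(self, tb, slack=2):
        preferred = tb.get('parent_smx')
        least = min(range(len(self.load)), key=self.load.__getitem__)
        # Bind to the parent's SMX to reuse cached parent data, unless it is
        # overloaded relative to the least-loaded SMX (workload balance).
        if preferred is not None and self.load[preferred] - self.load[least] <= slack:
            return preferred
        return least

    def schedule_one(self):
        queue = self.child_queue or self.root_queue   # children first
        if not queue:
            return None
        tb = queue.popleft()
        smx = self.pick_smx(tb)
        self.load[smx] += 1
        return tb['id'], smx

sched = LocalityAwareScheduler(num_smx=4)
sched.enqueue({'id': 'parent0', 'parent_smx': None})
sched.enqueue({'id': 'child0', 'parent_smx': 2})
print(sched.schedule_one())   # ('child0', 2): child runs before the root TB, on its parent's SMX
```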
{"title":"LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs","authors":"Jin Wang, Norman Rubin, A. Sidelnik, S. Yalamanchili","doi":"10.1145/3007787.3001199","DOIUrl":"https://doi.org/10.1145/3007787.3001199","url":null,"abstract":"Recent developments in GPU execution models and architectures have introduced dynamic parallelism to facilitate the execution of irregular applications where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also creates new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between the parent and child thread blocks (TBs) created during dynamic nested kernel and thread block launches cannot be fully leveraged using the current TB scheduling strategies. These strategies were designed for the current implementations of the BSP model but fall short when dynamic parallelism is introduced since they are oblivious to the hierarchical reference locality. We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm adopts three different scheduling decisions to i) prioritize the execution of the child TBs, ii) bind them to the stream multiprocessors (SMXs) occupied by their parents TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications executed on a cycle-level simulator employing dynamic parallelism demonstrate that LaPerm is able to achieve an average of 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"32 1","pages":"583-595"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80989535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
Base-Victim Compression: An Opportunistic Cache Compression Architecture
Jayesh Gaur, Alaa R. Alameldeen, S. Subramoney
The memory wall has motivated many enhancements to cache management policies aimed at reducing misses. Cache compression has been proposed to increase effective cache capacity, which potentially reduces capacity and conflict misses. However, complexity in cache compression implementations could increase cache power and access latency. On the other hand, advanced cache replacement mechanisms use heuristics to reduce misses, leading to significant performance gains. Both cache compression and replacement policies should collaborate to improve performance. In this paper, we demonstrate that cache compression and replacement policies can interact negatively. In many workloads, performance gains from replacement policies are lost due to the need to alter the replacement policy to accommodate compression. This leads to sub-optimal replacement policies that could lose performance compared to an uncompressed cache. We introduce a novel, opportunistic cache compression mechanism, Base-Victim, based on an efficient cache design. Our compression architecture improves performance on top of advanced cache replacement policies, and guarantees a hit rate at least as high as that of an uncompressed cache. For cache-sensitive applications, Base-Victim achieves an average 7.3% performance gain for single-threaded workloads, and 8.7% gain for four-thread multi-program workload mixes.
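The following is a deliberately simplified reading of the base-plus-victim idea suggested by the title: the baseline replacement policy manages the set exactly as an uncompressed cache would, preserving its hit rate, while evicted victims are retained opportunistically in space freed by compression. The structure below is an assumption for illustration, not the paper's actual design.

```python
# Simplified "base + victim" cache set model (illustrative assumptions only).
from collections import OrderedDict

class BaseVictimSet:
    def __init__(self, base_ways=4, victim_slots=2):
        self.base = OrderedDict()        # LRU-managed, identical to an uncompressed cache
        self.base_ways = base_ways
        self.victims = OrderedDict()     # compressed victims kept in leftover space
        self.victim_slots = victim_slots

    def access(self, tag):
        if tag in self.base:
            self.base.move_to_end(tag)
            return 'base hit'
        hit = 'victim hit' if self.victims.pop(tag, None) is not None else 'miss'
        self.base[tag] = True            # fill/promote into the base ways
        if len(self.base) > self.base_ways:
            victim, _ = self.base.popitem(last=False)
            self.victims[victim] = True  # opportunistically keep the victim compressed
            if len(self.victims) > self.victim_slots:
                self.victims.popitem(last=False)
        return hit

s = BaseVictimSet()
for t in ['A', 'B', 'C', 'D', 'E', 'A']:
    print(t, s.access(t))                # final 'A' is a victim hit instead of a miss
```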
{"title":"Base-Victim Compression: An Opportunistic Cache Compression Architecture","authors":"Jayesh Gaur, Alaa R. Alameldeen, S. Subramoney","doi":"10.1145/3007787.3001171","DOIUrl":"https://doi.org/10.1145/3007787.3001171","url":null,"abstract":"The memory wall has motivated many enhancements to cache management policies aimed at reducing misses. Cache compression has been proposed to increase effective cache capacity, which potentially reduces capacity and conflict misses. However, complexity in cache compression implementations could increase cache power and access latency. On the other hand, advanced cache replacement mechanisms use heuristics to reduce misses, leading to significant performance gains. Both cache compression and replacement policies should collaborate to improve performance. In this paper, we demonstrate that cache compression and replacement policies can interact negatively. In many workloads, performance gains from replacement policies are lost due to the need to alter the replacement policy to accommodate compression. This leads to sub-optimal replacement policies that could lose performance compared to an uncompressed cache. We introduce a novel, opportunistic cache compression mechanism, Base-Victim, based on an efficient cache design. Our compression architecture improves performance on top of advanced cache replacement policies, and guarantees a hit rate at least as high as that of an uncompressed cache. For cache-sensitive applications, Base-Victim achieves an average 7.3% performance gain for single-threaded workloads, and 8.7% gain for four-thread multi-program workload mixes.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"24 1","pages":"317-328"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84296037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, C. Kozyrakis
FPGAs are a popular target for application-specific accelerators because they lead to a good balance between flexibility and energy efficiency. However, FPGA lookup tables introduce significant area and power overheads, making it difficult to use FPGA devices in environments with tight cost and power constraints. This is the case for datacenter servers, where a modestly-sized FPGA cannot accommodate the large number of diverse accelerators that datacenter applications need. This paper introduces DRAF, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables. DRAF overlaps DRAM operations like bitline precharge and charge restoration with routing within the reconfigurable routing fabric to minimize the impact of DRAM latency. It also supports multiple configuration contexts that can be used to quickly switch between different accelerators with minimal latency. Overall, DRAF trades off some of the performance of FPGAs for significant gains in area and power. DRAF improves area density by 10x over FPGAs and reduces power consumption by more than 3x, enabling DRAF to satisfy demanding applications within strict power and cost constraints. While accelerators mapped to DRAF are 2-3x slower than those in FPGAs, they still deliver a 13x speedup and an 11x reduction in power consumption over a Xeon core for a wide range of datacenter tasks, including analytics and interactive services like speech recognition.
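A functional toy model of a DRAM-backed lookup table with multiple configuration contexts is sketched below; it captures only the logical behavior (truth-table rows selectable per context), not the bitline-level timing overlap the paper describes. Class and parameter names are made up.

```python
# Toy functional model of a dense lookup table with multiple configuration contexts.

class DramLut:
    def __init__(self, num_inputs, num_contexts):
        self.num_inputs = num_inputs
        # one truth-table row of 2^num_inputs bits per configuration context
        self.contexts = [[0] * (1 << num_inputs) for _ in range(num_contexts)]
        self.active = 0

    def configure(self, context, truth_table):
        assert len(truth_table) == (1 << self.num_inputs)
        self.contexts[context] = list(truth_table)

    def switch_context(self, context):
        self.active = context            # quick switch between loaded accelerators

    def evaluate(self, inputs):
        """Input bits select one entry of the active row, like reading one DRAM bit."""
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (bit & 1)
        return self.contexts[self.active][addr]

lut = DramLut(num_inputs=2, num_contexts=2)
lut.configure(0, [0, 0, 0, 1])           # context 0: 2-input AND
lut.configure(1, [0, 1, 1, 1])           # context 1: 2-input OR
print(lut.evaluate([1, 1]))              # 1 (AND)
lut.switch_context(1)
print(lut.evaluate([1, 0]))              # 1 (OR)
```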
{"title":"DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric","authors":"Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, C. Kozyrakis","doi":"10.1145/3007787.3001191","DOIUrl":"https://doi.org/10.1145/3007787.3001191","url":null,"abstract":"FPGAs are a popular target for application-specific accelerators because they lead to a good balance between flexibility and energy efficiency. However, FPGA lookup tables introduce significant area and power overheads, making it difficult to use FPGA devices in environments with tight cost and power constraints. This is the case for datacenter servers, where a modestly-sized FPGA cannot accommodate the large number of diverse accelerators that datacenter applications need. This paper introduces DRAF, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables. DRAF overlaps DRAM operations like bitline precharge and charge restoration with routing within the reconfigurable routing fabric to minimize the impact of DRAM latency. It also supports multiple configuration contexts that can be used to quickly switch between different accelerators with minimal latency. Overall, DRAF trades off some of the performance of FPGAs for significant gains in area and power. DRAF improves area density by 10x over FPGAs and power consumption by more than 3x, enabling DRAF to satisfy demanding applications within strict power and cost constraints. While accelerators mapped to DRAF are 2-3x slower than those in FPGAs, they still deliver a 13x speedup and an 11x reduction in power consumption over a Xeon core for a wide range of datacenter tasks, including analytics and interactive services like speech recognition.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"28 1","pages":"506-518"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75548208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation
Henry Duwe, Xun Jian, Daniel Petrisko, Rakesh Kumar
Voltage scaling can effectively reduce processor power, but also reduces the reliability of the SRAM cells in on-chip memories. Therefore, it is often accompanied by the use of an error correcting code (ECC). To enable reliable and efficient memory operation at low voltages, ECCs for on-chip memories must provide both high error coverage and low correction latency. In this paper, we propose error pattern transformation, a novel low-latency error correction technique that allows on-chip memories to be scaled to voltages lower than what has been previously possible. Our technique relies on the observation that the number of on-chip memory errors that many ECCs can correct differs widely depending on the error patterns in the logical words they protect. We propose adaptively rearranging the logical-to-physical bit mapping per word according to the BIST-detectable fault pattern in the physical word. The adaptive logical-to-physical bit mapping transforms many uncorrectable error patterns in the logical words into correctable error patterns, thereby improving on-chip memory reliability. This reduces the minimum voltage at which on-chip memory can run by 70mV over the best low-latency ECC baseline, leading to a 25.7% core-wide power reduction for an ARM Cortex-A7-like core. Energy per instruction is reduced by 15.7% compared to the best baseline.
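The sketch below illustrates the remapping idea under strong simplifying assumptions: an ECC that corrects one bit per contiguous group of a 64-bit word, and a logical-to-physical mapping restricted to rotations. Given a BIST-detected fault pattern, it searches for a rotation that leaves at most one faulty bit per group.

```python
# Hedged sketch of error-pattern transformation. The "1 error per group" ECC and
# rotation-only mapping are simplifying assumptions, not the paper's design.

WORD_BITS = 64
GROUPS = 4                                # assume ECC corrects 1 bit per group
GROUP_BITS = WORD_BITS // GROUPS          # 16 contiguous bits per ECC group

def group_of(physical_bit):
    return physical_bit // GROUP_BITS

def correctable(faulty_bits, rotation):
    counts = [0] * GROUPS
    for b in faulty_bits:
        counts[group_of((b + rotation) % WORD_BITS)] += 1
    return max(counts) <= 1

def choose_rotation(faulty_bits):
    """Return a rotation that makes the BIST-detected fault pattern correctable, if any."""
    for r in range(WORD_BITS):
        if correctable(faulty_bits, r):
            return r
    return None                           # pattern cannot be rescued by rotation alone

faults = {5, 9}                           # both land in group 0 with no rotation
print(correctable(faults, 0))             # False: uncorrectable as stored
print(choose_rotation(faults))            # 7: after remapping, one fault per group
```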
{"title":"Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation","authors":"Henry Duwe, Xun Jian, Daniel Petrisko, Rakesh Kumar","doi":"10.1145/3007787.3001204","DOIUrl":"https://doi.org/10.1145/3007787.3001204","url":null,"abstract":"Voltage scaling can effectively reduce processor power, but also reduces the reliability of the SRAM cells in on-chip memories. Therefore, it is often accompanied by the use of an error correcting code (ECC). To enable reliable and efficient memory operation at low voltages, ECCs for on-chip memories must provide both high error coverage and low correction latency. In this paper, we propose error pattern transformation, a novel low-latency error correction technique that allows on-chip memories to be scaled to voltages lower than what has been previously possible. Our technique relies on the observation that the number of on-chip memory errors that many ECCs can correct differs widely depending on the error patterns in the logical words they protect. We propose adaptively rearranging the logical bit to physical bit mapping per word according to the BIST-detectable fault pattern in the physical word. The adaptive logical bit to physical bit mapping transforms many uncorrectable error patterns in the logical words into correctable error patterns and, therefore, improving on-chip memory reliability. This reduces the minimum voltage at which on-chip memory can run by 70mV over the best low-latency ECC baseline, leading to a 25.7% core-wide power reduction for an ARM Cortex-A7-like core. Energy per instruction is reduced by 15.7% compared to the best baseline.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"19 74 1","pages":"634-644"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87326923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18