
Latest publications from the 2012 39th Annual International Symposium on Computer Architecture (ISCA)

Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Pub Date: 2015-04-23 DOI: 10.1145/2742910
N. Satish, Changkyu Kim, J. Chhugani, Hideki Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, P. Dubey
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors and hence that radical new languages must be invented. In this paper, we question this thinking and offer evidence that traditional programming methods on common multi-core processors and upcoming manycore architectures remain effective, in terms of performance versus programming effort, at delivering significant speedups and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the "Ninja gap", the performance gap between naively written, parallelism-unaware (often serial) C/C++ code and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes, coupled with advancements in modern compiler technology, can bring the Ninja gap down to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer more stable and predictable performance growth over future architectures, providing strong evidence that radical language changes are not required.
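The abstract does not spell out the "well-known algorithmic changes" here, but a minimal sketch of the kind of low-effort transformation involved (an assumed array-of-structures to structure-of-arrays rewrite plus OpenMP hints, not the paper's actual benchmark code) looks like this:

```cpp
// Hypothetical sketch of a low-effort "traditional" optimization: changing
// the data layout and annotating the loop lets the compiler's auto-
// vectorizer and threading close most of the gap without Ninja-level effort.
#include <cstddef>
#include <vector>

struct ParticleAoS { float x, y, z, w; };           // naive layout: strided loads

float dot_naive(const std::vector<ParticleAoS>& p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i)
        s += p[i].x * p[i].y;                       // poorly vectorizable
    return s;
}

struct ParticlesSoA { std::vector<float> x, y; };   // SIMD-friendly layout

float dot_tuned(const ParticlesSoA& p) {
    float s = 0.0f;
    const std::size_t n = p.x.size();
    #pragma omp parallel for simd reduction(+ : s)  // compile with -fopenmp
    for (std::size_t i = 0; i < n; ++i)
        s += p.x[i] * p.y[i];                       // unit-stride, vectorizable
    return s;
}
```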
Citations: 88
End-to-end sequential consistency
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337220
Abhayendra Singh, S. Narayanasamy, Daniel Marino, T. Millstein, M. Musuvathi
Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve the programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge, as it requires that both compiler and hardware optimizations preserve SC semantics. While a recent study has shown that a compiler can preserve SC semantics for a small performance cost, efficient and complexity-effective SC hardware remains elusive. Past hardware solutions relied on aggressive speculation techniques, which have not yet been realized in a practical implementation. This paper exploits the observation that hardware need not enforce any memory model constraints on accesses to thread-local and shared read-only locations. A processor can easily determine a large fraction of these safe accesses with assistance from static compiler analysis and the hardware memory management unit. We discuss a low-complexity hardware design that exploits this information to reduce the overhead of ensuring SC. Our design employs an additional unordered store buffer for fast-tracking thread-local stores, allowing later memory accesses to proceed without a memory-ordering-related stall. Our experimental study shows that the cost of guaranteeing end-to-end SC is only 6.2% on average when compared to a system with TSO hardware executing a stock compiler's output.
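As a rough illustration of the paper's central observation, the sketch below (an assumed software model, not the authors' hardware design) classifies stores by whether their page is known to be thread-private and exempts those from SC ordering:

```cpp
// Minimal model: stores to pages flagged thread-private (by compiler or MMU
// hints, as the abstract describes) go to an unordered buffer; only stores
// to potentially shared locations must drain in program order for SC.
#include <cstdint>
#include <deque>
#include <unordered_set>

struct Store { std::uintptr_t addr; std::uint64_t data; };

class StoreBuffers {
    std::unordered_set<std::uintptr_t> private_pages; // filled by hints
    std::deque<Store> ordered;    // drains in program order (SC-visible)
    std::deque<Store> unordered;  // drains in any order (thread-local only)
public:
    void mark_private(std::uintptr_t page) { private_pages.insert(page); }
    void issue(const Store& s) {
        std::uintptr_t page = s.addr >> 12;   // assumed 4 KiB pages
        if (private_pages.count(page))
            unordered.push_back(s);           // safe: no ordering stall
        else
            ordered.push_back(s);             // must retire in order
    }
};
```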
Citations: 98
Physically addressed queueing (PAQ): Improving parallelism in solid state disks
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337206
Myoungsoo Jung, E. Wilson, M. Kandemir
NAND flash storage has proven to be a competitive alternative to traditional disk for its high random-access speeds, low power, and presumed efficacy for random reads. Ironically, we demonstrate that when NAND flash is packaged in SSD format, many barriers arise to reaching full parallelism in reads, with the result that random writes outperform random reads. Motivated by this, we propose Physically Addressed Queuing (PAQ), a request scheduler that avoids contention for shared SSD resources. PAQ makes the following major contributions: First, it exposes the physical addresses of requests to the scheduler. Second, I/O clumping is utilized to select groups of operations that can be executed simultaneously without major resource conflict. Third, inter-request NAND transaction packing empowers multi-plane-mode operations. We implement PAQ in a cycle-accurate simulator and demonstrate bandwidth and IOPS improvements greater than 62% and latency decreases of as much as 41.6% for random reads, without degrading the performance of other access types.
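To make the I/O-clumping step concrete, here is a toy sketch (hypothetical request fields; the actual PAQ scheduler also performs plane-level transaction packing): it selects a batch of requests whose physical targets do not collide, so they can issue concurrently.

```cpp
// Illustrative clumping pass: claim each (channel, die) pair at most once
// per batch, so every request in the batch can execute in parallel.
#include <set>
#include <utility>
#include <vector>

struct Request { int channel, die, plane; long lba; };

std::vector<Request> clump(std::vector<Request>& queue) {
    std::set<std::pair<int, int>> busy;    // (channel, die) pairs claimed
    std::vector<Request> batch;
    for (auto it = queue.begin(); it != queue.end(); ) {
        std::pair<int, int> key{it->channel, it->die};
        if (!busy.count(key)) {            // no resource conflict with batch
            busy.insert(key);
            batch.push_back(*it);
            it = queue.erase(it);
        } else {
            ++it;                          // conflicting request waits
        }
    }
    return batch;                          // issue these simultaneously
}
```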
Citations: 83
Inspection resistant memory: Architectural support for security from physical examination
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337174
Jonathan Valamehr, Melissa Chase, S. Kamara, Andrew Putnam, D. Shumow, V. Vaikuntanathan, T. Sherwood
The ability to safely keep a secret in memory is central to the vast majority of security schemes, but storing and erasing these secrets is a difficult problem in the face of an attacker who can obtain unrestricted physical access to the underlying hardware. Depending on the memory technology, the very act of storing a 1 instead of a 0 can have physical side effects measurable even after the power has been cut. These effects cannot be hidden easily, and if the secret stored on chip is of sufficient value, an attacker may go to extraordinary lengths to learn even a few bits of that information. Solving this problem requires a new class of architectures that measurably increase the difficulty of physical analysis. In this paper we take a first step towards this goal by focusing on one of the backbones of any hardware system: on-chip memory. We examine the relationship between security, area, and efficiency in these architectures, and quantitatively examine the resulting systems through cryptographic analysis and microarchitectural impact. In the end, we are able to find an efficient scheme in which, even if an adversary is able to inspect the value of a stored bit with a probabilistic error of only 5%, our system will be able to prevent that adversary from learning any information about the original un-coded bits with 99.9999999999% probability.
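One classic construction consistent with this abstract (the paper's exact coding scheme may differ) spreads each logical bit over k physical cells so that it equals their XOR; a sketch:

```cpp
// Hedged illustration: encode a secret bit as the XOR of k cells, k-1 of
// them uniformly random. An inspector who reads each cell with even a small
// error rate learns almost nothing about the underlying bit.
#include <cstdint>
#include <random>
#include <vector>

std::vector<uint8_t> encode_bit(uint8_t b, int k, std::mt19937& rng) {
    std::vector<uint8_t> cells(k);
    uint8_t acc = 0;
    for (int i = 0; i + 1 < k; ++i) {   // k-1 uniformly random cells
        cells[i] = rng() & 1u;
        acc ^= cells[i];
    }
    cells[k - 1] = acc ^ b;             // last cell fixes the XOR to b
    return cells;
}

uint8_t decode_bit(const std::vector<uint8_t>& cells) {
    uint8_t b = 0;
    for (uint8_t c : cells) b ^= c;     // XOR of all cells recovers b
    return b;
}
```

By the piling-up lemma, an inspector who reads each cell wrongly with probability e has an advantage on the order of (1-2e)^k in guessing b, which vanishes quickly as k grows, matching the flavor of the 5%-error guarantee quoted above.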
Citations: 24
The Yin and Yang of power and performance for asymmetric hardware and managed software
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337185
Ting Cao, S. Blackburn, Tiejun Gao, K. McKinley
On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics.
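One way to picture the "differentiated performance and power characteristics" requirement is a toy placement policy (entirely illustrative metrics and thresholds; the paper studies real managed-runtime workloads):

```cpp
// Toy AMP placement sketch: memory-bound or low-IPC work wastes big-core
// energy stalling on DRAM, so steer it to small cores; keep compute-bound
// work on big cores. Thresholds here are assumptions for illustration.
#include <iostream>
#include <string>
#include <vector>

struct Thread { std::string name; double ipc; double llc_miss_rate; };

enum class Core { Big, Small };

Core place(const Thread& t) {
    if (t.llc_miss_rate > 0.05 || t.ipc < 0.8) return Core::Small;
    return Core::Big;
}

int main() {
    std::vector<Thread> ts = {{"gc", 0.5, 0.10}, {"jit", 1.9, 0.01}};
    for (const auto& t : ts)
        std::cout << t.name << " -> "
                  << (place(t) == Core::Big ? "big" : "small") << '\n';
}
```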
Citations: 107
Setting an error detection infrastructure with low cost acoustic wave detectors
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337198
Gaurang Upasani, X. Vera, Antonio González
The continuing decrease in transistor dimensions and operating voltage has increased transistors' sensitivity to radiation phenomena, making soft errors an important challenge in future chip multiprocessors (CMPs). Hence, new techniques for detecting errors in the logic and memories are required that allow meeting the desired failures-in-time (FIT) budget of CMPs. This paper proposes a low-cost dynamic particle-strike detection mechanism based on acoustic wave detectors. Our results show that our mechanism can protect both the logic and the memory arrays. As a case study, we also show how this technique can be combined with error codes to protect the last-level cache at low cost.
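To make the detection idea concrete, here is a toy localization model (assumed wave speed and die geometry; it also assumes arrival times are measured from the strike instant, whereas a real scheme would work from arrival-time differences since the strike time is unknown):

```cpp
// Toy sketch: a particle strike emits an acoustic pulse; comparing arrival
// times at three detectors brackets the strike location by grid search, so
// the affected region can be flagged for error handling.
#include <array>
#include <cmath>

constexpr double kWaveSpeed = 8000.0;  // m/s, assumed for silicon

struct Detector { double x, y; };

std::array<double, 2> locate(const std::array<Detector, 3>& d,
                             const std::array<double, 3>& t) {
    double best_x = 0, best_y = 0, best_err = 1e30;
    for (double x = 0; x <= 0.01; x += 1e-4)       // 10 mm die, 0.1 mm grid
        for (double y = 0; y <= 0.01; y += 1e-4) {
            double err = 0;
            for (int i = 0; i < 3; ++i) {
                double dt = std::hypot(x - d[i].x, y - d[i].y) / kWaveSpeed
                            - t[i];                 // residual vs. observed
                err += dt * dt;
            }
            if (err < best_err) { best_err = err; best_x = x; best_y = y; }
        }
    return {best_x, best_y};                        // least-squares estimate
}
```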
Citations: 12
Enhancing effective throughput for transmission line-based bus
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337178
A. Carpenter, Jianyun Hu, Övünç Kocabas, Michael C. Huang, Hui Wu
Main-stream general-purpose microprocessors require a collection of high-performance interconnects to supply the necessary data movement. The trend of continued increase in core count has prompted packet-switched network designs as a scalable solution for future-generation chips. However, the cost of that scalability can be significant and is especially hard to justify for smaller-scale chips. In contrast, a circuit-switched bus using transmission lines and corresponding circuits offers lower latencies and much lower energy costs for smaller-scale chips, making it a better choice than a full-blown network-on-chip (NoC) architecture. However, shared-medium designs are perceived as only a niche solution for small- to medium-scale chips. In this paper, we show that there are many low-cost mechanisms to enhance the effective throughput of a bus architecture. When a handful of highly cost-effective techniques are applied, the performance advantage of even the most idealistically configured NoCs becomes vanishingly small. We find transmission line-based buses to be a more compelling interconnect even for large-scale chip multiprocessors, thus bringing into doubt the centrality of packet switching in future on-chip interconnects.
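A back-of-envelope model conveys the latency side of this argument (all parameters below are assumptions for illustration, not measurements from the paper): a bus pays one arbitration plus one time-of-flight, while a mesh NoC pays per-hop router and link delays that grow with core count.

```cpp
// First-order latency comparison. Note this ignores bus contention, which
// is exactly what the paper's throughput-enhancing mechanisms target.
#include <cmath>
#include <iostream>

int main() {
    const double arb = 2, flight = 3;       // bus: cycles (assumed)
    const double router = 2, link = 1;      // NoC: per-hop cycles (assumed)
    for (int cores : {16, 64, 256}) {
        double bus_lat = arb + flight;              // distance-insensitive
        double hops = std::sqrt(cores);             // ~avg hops on a 2D mesh
        double noc_lat = hops * (router + link);    // grows with core count
        std::cout << cores << " cores: bus " << bus_lat
                  << " vs NoC " << noc_lat << " cycles\n";
    }
}
```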
Citations: 21
Lane decoupling for improving the timing-error resiliency of wide-SIMD architectures
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337187
Evgeni Krimer, P. Chiang, M. Erez
A significant portion of the energy dissipated in modern integrated circuits is consumed by the overhead associated with timing guardbands that ensure reliable execution. Timing speculation, where the pipeline operates at an unsafe voltage with any rare errors detected and resolved by the architecture, has been demonstrated to significantly improve the energy efficiency of scalar processor designs. Unfortunately, applying the same timing-speculative approach to wide-SIMD architectures, such as those used in highly efficient GPUs, may not provide similar gains. In this work, we make two important contributions. The first is a set of models describing a parametrized general error probability function, based on measurements of a fabricated chip, and the expected efficiency benefits of timing speculation in a SIMD context. The second contribution is a decoupled SIMD pipeline that utilizes timing speculation and recovery more effectively than a standard SIMD design that uses only conventional timing speculation. The proposed lane decoupling enables each SIMD lane to tolerate timing errors independently of other adjacent lanes, resulting in higher throughput and improved scalability. We validate our models and evaluate our design using a cycle-based GPU simulator, describe the conditions under which efficiency improvements can be obtained, and explore the benefits of decoupling across a wide range of parameters. Our results show that timing speculation can achieve up to a 10.3% improvement in efficiency.
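The decoupling idea can be caricatured in software (an assumed queue-per-lane model, not the paper's actual pipeline): each lane consumes from a private FIFO, so a lane replaying a timing-error instruction stalls alone while its neighbors keep draining.

```cpp
// Conceptual lane model: the private queue lets lanes slip relative to one
// another after an error instead of forcing a lockstep, all-lane stall.
#include <cstdint>
#include <queue>

struct Op { std::uint32_t a, b; };

class Lane {
    std::queue<Op> fifo;   // decoupling queue feeding this lane's ALU
public:
    void push(Op op) { fifo.push(op); }
    // Returns true if the lane retired an op this cycle.
    bool step(bool timing_error, std::uint32_t& out) {
        if (fifo.empty()) return false;
        if (timing_error) return false;     // replay next cycle; other lanes
        out = fifo.front().a + fifo.front().b;  // proceed independently
        fifo.pop();
        return true;
    }
};
```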
Citations: 30
PARDIS: A programmable memory controller for the DDRx interfacing standards
Pub Date: 2012-06-09 DOI: 10.1145/2534845
M. N. Bojnordi, Engin Ipek
Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable - a proven technique that has seen wide use in other control tasks ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This paper presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6-17% and reduces DRAM energy by 9-22% over four existing memory controllers.
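To see why programmability matters here, consider a scheduling policy expressed as plain code (a hypothetical first-ready, first-come-first-served loop; PARDIS's real ISA and firmware differ): swapping the policy becomes a software change rather than a hardware respin.

```cpp
// Sketch of an FR-FCFS pick: prefer the oldest request that hits an
// already-open DRAM row (cheap row-buffer hit), else fall back to the
// oldest request overall.
#include <cstdint>
#include <optional>
#include <vector>

struct MemReq { std::uint32_t row, bank; bool is_read; };

std::optional<std::size_t>
pick_next(const std::vector<MemReq>& q,
          const std::vector<std::uint32_t>& open_row) {
    for (std::size_t i = 0; i < q.size(); ++i)      // first-ready pass
        if (q[i].row == open_row[q[i].bank]) return i;
    if (!q.empty()) return 0;                       // FCFS fallback
    return std::nullopt;                            // nothing to schedule
}
```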
Citations: 61
A defect-tolerant accelerator for emerging high-performance applications
Pub Date: 2012-06-09 DOI: 10.1145/2366231.2337200
O. Temam
Due to the evolution of technology constraints, especially energy constraints that may lead to heterogeneous multi-cores, and due to the increasing number of defects, the design of defect-tolerant accelerators for heterogeneous multi-cores may become a major micro-architecture research issue. Most custom circuits are highly defect-sensitive; a single defective transistor can wreck such circuits. Artificial neural networks (ANNs), on the contrary, are inherently error-tolerant algorithms. Moreover, the emergence of high-performance applications implementing recognition and mining tasks, for which competitive ANN-based algorithms exist, drastically expands the potential application scope of a hardware ANN accelerator. However, while the error tolerance of ANN algorithms is well documented, there are few in-depth attempts at demonstrating that an actual hardware ANN would be tolerant of faulty transistors. Most fault models are abstract and cannot demonstrate that the error tolerance of ANN algorithms translates into the defect tolerance of hardware ANN accelerators. In this article, we introduce a hardware ANN geared towards defect tolerance and energy efficiency by spatially expanding the ANN. In order to precisely assess the defect tolerance of this hardware ANN, we introduce defects at the level of transistors and then assess the impact of such defects on the hardware ANN's functional behavior. We empirically show that the conceptual error tolerance of neural networks does translate into the defect tolerance of hardware neural networks, paving the way for their introduction in heterogeneous multi-cores as intrinsically defect-tolerant and energy-efficient accelerators.
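A tiny fault-injection experiment conveys the error-tolerance claim (illustrative weights and a simple stuck-at-0 fault model, not the paper's transistor-level methodology):

```cpp
// Zero out one "synapse" weight to mimic a defective transistor and observe
// that the neuron's output degrades gracefully rather than failing outright.
#include <cmath>
#include <iostream>
#include <vector>

double neuron(const std::vector<double>& w, const std::vector<double>& x) {
    double acc = 0;
    for (std::size_t i = 0; i < w.size(); ++i) acc += w[i] * x[i];
    return std::tanh(acc);                      // sigmoid-like activation
}

int main() {
    std::vector<double> w = {0.4, -0.2, 0.1, 0.3};
    std::vector<double> x = {1.0, 0.5, -1.0, 0.25};
    double healthy = neuron(w, x);
    w[2] = 0.0;                                 // inject stuck-at-0 defect
    double faulty = neuron(w, x);
    std::cout << "healthy=" << healthy << " faulty=" << faulty << '\n';
}
```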
Citations: 158