
Latest publications from the 2012 IEEE 30th International Conference on Computer Design (ICCD)

Acceleration of Monte-Carlo molecular simulations on hybrid computing architectures
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378642
Claus Braun, S. Holst, H. Wunderlich, Juan Manuel Castillo-Sanchez, J. Gross
Markov-Chain Monte-Carlo (MCMC) methods are an important class of simulation techniques, which execute a sequence of simulation steps, where each new step depends on the previous ones. Due to this fundamental dependency, MCMC methods are inherently hard to parallelize on any architecture. The upcoming generations of hybrid CPU/GPGPU architectures, with their multi-core CPUs and tightly coupled many-core GPGPUs, provide new acceleration opportunities, especially for MCMC methods, if the new degrees of freedom are exploited correctly. In this paper, the outcomes of an interdisciplinary collaboration are presented, which focused on the parallel mapping of an MCMC molecular simulation from thermodynamics to hybrid CPU/GPGPU computing systems. While the mapping is designed for upcoming hybrid architectures, the implementation of this approach on an NVIDIA Tesla system already leads to a substantial speedup of more than 87× despite the additional communication overheads.
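The sequential dependency the abstract describes shows up in even the smallest MCMC kernel: each Metropolis step consumes the state produced by the previous one, so the steps of a single chain cannot run concurrently. A minimal sketch (a 1-D Metropolis sampler for a standard normal target; this is an illustration, not the paper's molecular simulation):

```python
import math
import random

def metropolis_step(x, rng, step=0.5):
    """One Metropolis step for a target density proportional to exp(-x^2/2).
    Both the proposal and the accept/reject test depend on the current
    state x, which is what makes the chain inherently sequential."""
    proposal = x + rng.uniform(-step, step)
    log_alpha = (x * x - proposal * proposal) / 2.0  # log acceptance ratio
    if math.log(rng.random()) < log_alpha:
        return proposal
    return x

def run_chain(n_steps, x0=0.0, seed=42):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = metropolis_step(x, rng)  # step k needs the result of step k-1
        samples.append(x)
    return samples

samples = run_chain(10000)
```

In practice, parallelism for such chains tends to come from evaluating the expensive per-step work (e.g. energy terms) concurrently on the GPGPU, not from running the steps of one chain in parallel.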
Citations: 11
Timing-test scheduling for constraint-graph based post-silicon skew tuning
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378679
M. Kaneko
Post-Silicon Tuning is an emerging technology for improving the performance-yield of VLSIs under process variations. This paper focuses on post-silicon timing-skew tuning (PSST) via programmable delay elements (PDEs), and proposes a novel tuning algorithm that utilizes only the results of setup and hold timing tests, not the results of costly delay-time measurements. The basic framework of our PSST consists of the construction of a Control-value Constraint Graph from the results of timing tests, and the computation of longest path lengths on this graph for finding a safe PDE setting. Even though a timing test costs less than a delay-time measurement, timing tests still dominate the PSST cost, and reducing them is a crucial problem. The longest path lengths we need to compute depend directly on edge weights in the “longest-paths tree”, but for co-tree edges, exact edge weights are not always necessary. Based on this observation, we propose timing-test scheduling to reduce the timing-test cost of PDE tuning. The experimental simulation results show that our approach reduces the test cost by almost half or more.
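The longest-path computation at the core of the framework can be sketched on a generic weighted DAG using a topological order (Kahn's algorithm). The graph and weights below are illustrative, not an actual Control-value Constraint Graph:

```python
from collections import defaultdict, deque

def longest_path_lengths(n, edges):
    """Longest path length from any source to each node of a weighted DAG.
    Nodes are 0..n-1; edges are (u, v, w) triples."""
    adj = defaultdict(list)
    indeg = [0] * n
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
    dist = [0] * n
    queue = deque(i for i in range(n) if indeg[i] == 0)
    while queue:
        u = queue.popleft()
        for v, w in adj[u]:
            dist[v] = max(dist[v], dist[u] + w)  # relax along topological order
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return dist

# toy graph: weights stand in for timing-test derived bounds
dist = longest_path_lengths(4, [(0, 1, 3), (0, 2, 2), (1, 3, 4), (2, 3, 1)])
print(dist)  # → [0, 3, 2, 7]
```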
Citations: 4
Thermal characterization of cloud workloads on a power-efficient server-on-chip
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378637
D. Milojevic, Sachin Idgunji, Djordje Jevdjic, Emre Ozer, P. Lotfi-Kamran, Andreas Panteli, A. Prodromou, C. Nicopoulos, D. Hardy, B. Falsafi, Yiannakis Sazeides
We propose a power-efficient many-core server-on-chip system with 3D-stacked Wide I/O DRAM targeting cloud workloads in datacenters. The integration of 3D-stacked Wide I/O DRAM on top of a logic die increases available memory bandwidth by using dense and fast Through-Silicon Vias (TSVs) instead of off-chip IOs, enabling faster data transfers at much lower energy per bit. We demonstrate a methodology that includes full-system microarchitectural modeling and rapid virtual physical prototyping, with emphasis on thermal analysis. Our findings show that while executing CPU-centric benchmarks (e.g. SPECInt and Dhrystone), the temperature in the server-on-chip (logic+DRAM) is in the range of 175-200°C at a power consumption of less than 20W, exceeding the reliable operating bounds without any cooling solutions, even with embedded cores. However, with real cloud workloads, the temperature in the server-on-chip remains much below that reached by the CPU-centric workloads, because memory-intensive cloud workloads burn far less power. We show that such a server-on-chip system is feasible with a low-cost passive heat sink, eliminating the need for a high-cost active heat sink with an attached fan and creating an opportunity for overall cost and energy savings in datacenters.
Citations: 20
Design and evaluation of a four-port data cache for high instruction level parallelism reconfigurable processors
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378693
Kiyeon Lee, Moo-Kyoung Chung, Soojung Ryu, Yeon-Gon Cho, Sangyeun Cho
This paper explores high-bandwidth data cache designs for a coarse-grained reconfigurable architecture processor family capable of achieving a high degree of instruction level parallelism. To meet stringent power, area and time-to-market constraints, we take an architectural approach rather than circuit-level multi-porting approaches. We closely examine two design choices: single-level banked cache (SLC) and two-level cache (TLC). A detailed simulation study using a set of microbenchmarks and industry-strength benchmarks finds that both SLC and TLC offer a reasonably competitive performance at a small implementation cost compared with a hypothetical cache with perfect ports and a multi-bank scratchpad memory.
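Behind a banked design like SLC sits a simple idea: interleave cache lines across banks so that same-cycle accesses to different banks proceed in parallel, while accesses colliding on one bank serialize. A toy sketch of bank selection and conflict counting (bank count and line size are illustrative defaults, not the paper's configuration):

```python
from collections import Counter

def bank_of(addr, n_banks=4, line_bytes=64):
    """Interleave cache lines across banks: the address bits just above
    the line offset select the bank."""
    return (addr // line_bytes) % n_banks

def conflicts(addrs, n_banks=4):
    """Number of same-cycle accesses that must serialize behind a busy bank."""
    per_bank = Counter(bank_of(a, n_banks) for a in addrs)
    return sum(v - 1 for v in per_bank.values() if v > 1)

# four simultaneous accesses; 128 and 130 fall in the same line, hence same bank
print(conflicts([0, 64, 128, 130]))  # → 1
```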
Citations: 0
Robust optimization of a Chip Multiprocessor's performance under power and thermal constraints
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378625
M. Ghasemazar, H. Goudarzi, Massoud Pedram
Power dissipation and die temperature have become key performance limiters in today's high-performance Chip Multiprocessors (CMPs). Dynamic power management solutions have been proposed to manage resources in a CMP based on the measured power dissipation, performance, and die temperature of processing cores. In this paper, we develop a robust framework for power and thermal management of heterogeneous CMPs subject to variability and uncertainty in system parameters. More precisely, we first model and formulate the problem of maximizing the task throughput of a heterogeneous CMP (a.k.a. asymmetric multi-core architecture) subject to a total power budget and a per-core temperature limit. Next we develop a solution framework, called the Variation-aware Power/Thermal Manager (VPTM), which is a hierarchical dynamic power and thermal management solution targeting heterogeneous CMP architectures. VPTM utilizes dynamic voltage and frequency scaling (DVFS) and core consolidation techniques to control the core power consumption, which implicitly regulates the core temperatures. An algorithm is proposed for core consolidation and application assignment, and a convex program is formulated and solved to produce optimal DVFS settings. Finally, a feedback controller is employed to compensate for variations in key system parameters at runtime. Experimental results show highly promising performance improvements for VPTM compared to state-of-the-art techniques.
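The runtime feedback loop the abstract ends with can be illustrated by a toy proportional controller that steers core frequency toward a power budget. The cubic power model, the gain, and all constants below are assumptions for illustration, not VPTM's actual controller:

```python
def power_cap_step(freq, measured_power, budget, f_min=0.4, f_max=2.0, gain=0.05):
    """One proportional feedback step: nudge frequency toward the power budget,
    clamped to the DVFS operating range."""
    freq = freq + gain * (budget - measured_power)
    return min(f_max, max(f_min, freq))

# closed loop against a toy power model P ≈ f^3 (DVFS with voltage ~ frequency)
freq, budget = 2.0, 4.0
for _ in range(100):
    power = freq ** 3          # "measured" power under the toy model
    freq = power_cap_step(freq, power, budget)
# the loop settles near the frequency where f^3 equals the budget
```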
Citations: 15
Efficient code compression for coarse grained reconfigurable architectures
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378687
Moo-Kyoung Chung, Yeon-Gon Cho, Soojung Ryu
Though Coarse Grained Reconfigurable Architecture (CGRA) is a flexible alternative for high performance computing, it suffers from a crucial problem: its instruction code is so large that the instruction memory takes a significant portion of silicon area and power consumption. This article proposes an efficient dictionary-based compression method for CGRA instruction code, in which code bit-fields are rearranged and grouped together according to locality characteristics, and the most efficient compression mode is selected for each group and kernel. The proposed method can reinstall the dictionary contents adaptively for each kernel. Experimental results show that the proposed method achieves an average compression ratio of 0.56 on a 4×4 array of function units for well-optimized applications.
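A dictionary scheme of the general kind described can be sketched in a few lines: collect the most frequent bit-field values into a per-kernel dictionary, emit short indices for hits, and escape-code raw values for misses. This is a generic sketch of dictionary compression, not the paper's exact encoding or mode selection:

```python
from collections import Counter

def build_dictionary(fields, size):
    """Pick the most frequent bit-field values as the per-kernel dictionary."""
    return [value for value, _ in Counter(fields).most_common(size)]

def compress(fields, dictionary):
    """Replace a field value with a short dictionary index when possible;
    otherwise emit the raw value (escape-coded)."""
    index = {value: i for i, value in enumerate(dictionary)}
    return [('idx', index[f]) if f in index else ('raw', f) for f in fields]

def decompress(coded, dictionary):
    return [dictionary[v] if kind == 'idx' else v for kind, v in coded]

fields = [0xA0, 0xA0, 0x1F, 0xA0, 0x1F, 0x33]
d = build_dictionary(fields, 2)           # [0xA0, 0x1F]
assert decompress(compress(fields, d), d) == fields  # lossless round trip
```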
Citations: 6
Embedded way prediction for last-level caches
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378636
Faissal M. Sleiman, R. Dreslinski, T. Wenisch
This paper investigates Embedded Way Prediction for large last-level caches (LLCs): an architecture and circuit design to provide the latency of parallel tag-data access at substantial energy savings. Existing way prediction approaches for L1 caches are compromised by the high associativity and filtered temporal locality of LLCs. We demonstrate: (1) the need for wide partial tag comparison, which we implement with a dynamic CAM alongside the data sub-array wordline decode, and (2) the inhibit bit, an architectural innovation to provide accurate predictions when the partial tag comparison is inconclusive. We present circuit critical-path and architectural power/performance studies demonstrating speedups of up to 15.4% (6.6% average) for scientific and server applications, matching the performance of parallel tag-data access while reducing energy overhead by 40%.
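The interplay between partial tag comparison and the inhibit bit can be sketched as follows: compare only a slice of each way's tag, predict when exactly one way matches, and fall back to a full parallel lookup when the slice is inconclusive. Tag widths and values below are illustrative:

```python
def partial_tag_predict(way_tags, addr_tag, partial_bits=8):
    """Predict the matching way using only the low partial_bits of each tag.
    Returns (way, conclusive); conclusive=False mirrors the role of the
    paper's inhibit bit, forcing a full lookup instead of a prediction."""
    mask = (1 << partial_bits) - 1
    matches = [i for i, tag in enumerate(way_tags)
               if (tag & mask) == (addr_tag & mask)]
    if len(matches) == 1:
        return matches[0], True
    return None, False  # zero or multiple partial matches: inconclusive

way_tags = [0x12345, 0x22345, 0xABCDE, 0x00001]
# low byte of 0x12345 and 0x22345 is 0x45 in both ways → inconclusive
assert partial_tag_predict(way_tags, 0x12345) == (None, False)
# low byte 0xDE is unique to way 2 → conclusive prediction
assert partial_tag_predict(way_tags, 0xABCDE) == (2, True)
```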
Citations: 14
Analyzing the optimal ratio of SRAM banks in hybrid caches
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378655
A. Valero, J. Sahuquillo, S. Petit, P. López, J. Duato
Cache memories have typically been implemented with Static Random Access Memory (SRAM) technology. This technology presents a fast access time but high energy consumption and low density. In contrast, the recently introduced embedded Dynamic RAM (eDRAM) technology allows caches to be built with lower energy and area, although with a slower access time. The eDRAM technology provides important leakage and area savings, especially in huge Last-Level Caches (LLCs), which occupy almost half the silicon area in some recent microprocessors. This paper proposes a novel hybrid LLC, which combines SRAM and eDRAM banks to address the trade-off among performance, energy, and area. To this end, we explore the optimal percentage of SRAM and eDRAM banks that achieves the best target trade-off. Architectural mechanisms have been devised to keep the most likely accessed blocks in fast SRAM banks as well as to avoid unnecessary destructive reads. Experimental results show that, compared to a conventional SRAM LLC with the same storage capacity, performance degradation does not surpass, on average, 2.9% (even with only 12.5% of banks built with SRAM technology), whereas area savings can be as high as 46% for a 1MB 16-way LLC. For a 45nm technology node, the energy-delay squared product confirms that a hybrid cache is a better design than the conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is a quarter or an eighth of the cache banks.
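The ratio exploration can be pictured with a back-of-the-envelope model: sweep the SRAM fraction and compare area and leakage against an all-SRAM baseline. The per-bank constants below are rough illustrations, not the paper's 45nm figures:

```python
def hybrid_llc_model(sram_fraction, n_banks=16):
    """Toy area/leakage model of a hybrid SRAM/eDRAM LLC.
    Assumed (illustrative) per-bank figures: an eDRAM bank costs about
    half the area and a third of the leakage of an SRAM bank."""
    sram_banks = round(sram_fraction * n_banks)
    edram_banks = n_banks - sram_banks
    area = sram_banks * 1.0 + edram_banks * 0.5
    leakage = sram_banks * 1.0 + edram_banks * 0.33
    return sram_banks, area, leakage

all_sram = hybrid_llc_model(1.0)
mostly_edram = hybrid_llc_model(0.125)  # 1/8 SRAM, the paper's 12.5% point
```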
Citations: 6
Stealth assessment of hardware Trojans in a microcontroller
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378631
Trey Reece, D. Limbrick, Xiaowen Wang, B. Kiddie, W. H. Robinson
Many experimental hardware Trojans from the literature explore potential threat vectors, but do not address the stealthiness of the malicious hardware. If a Trojan requires a large amount of area or power, then it is easier to detect. A more focused attack, in contrast, can potentially avoid detection. This paper explores the cost, in both area and power consumption, of several small, focused attacks on an Intel 8051 microcontroller implemented with a standard cell library. The resulting cost in total area varied from a 0.4% increase down to a 0.150% increase in the design. Dynamic and leakage power showed similar results.
Citations: 17
Memory module-level testing and error behaviors for phase change memory
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378664
Zhe Zhang, Weijun Xiao, Nohhyun Park, D. Lilja
Phase change memory (PCM) is a promising technology to solve energy and performance bottlenecks for memory and storage systems. To help understand the reliability characteristics of PCM devices, we present a simple fault model to categorize four types of PCM errors. Based on our proposed fault model, we conduct extensive experiments on real PCM devices at the memory module level. Numerical results uncover many interesting trends in terms of the lifetime of PCM devices and error behaviors. Specifically, PCM lifetime for the memory chips we tested is greater than 14 million cycles, which is much longer than for flash memory devices. In addition, the distributions for four types of errors are quite different. These results can be used for estimating PCM lifetime and for measuring the fabrication quality of individual PCM memory chips.
Citations: 21
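The lifetime figure in the abstract above comes from cycling PCM cells until they fail. A minimal write/verify endurance loop can sketch that procedure; `FaultyCell` below is a toy stand-in for real PCM hardware (a simple stuck-at model, whereas the paper distinguishes four error types), and the endurance budget is a made-up number.

```python
# Sketch: a module-level write/verify endurance loop that cycles a memory
# word until read-back no longer matches, then reports the cycle count as a
# lifetime estimate. FaultyCell is a toy model, not real PCM behavior.

class FaultyCell:
    """Toy cell whose writes stop taking effect after a fixed endurance budget."""
    def __init__(self, endurance: int):
        self.endurance = endurance
        self.writes = 0
        self.value = 0

    def write(self, v: int) -> None:
        self.writes += 1
        # Past the endurance budget, the write silently fails (stuck-at the
        # last stored value); real PCM shows several distinct error types.
        if self.writes <= self.endurance:
            self.value = v

    def read(self) -> int:
        return self.value

def measure_lifetime(cell: FaultyCell, max_cycles: int = 10**8) -> int:
    """Alternate 0/1 writes with read-back verification; return the number
    of successful cycles before the first mismatch."""
    for cycle in range(max_cycles):
        pattern = cycle & 1
        cell.write(pattern)
        if cell.read() != pattern:
            return cycle
    return max_cycles

print(measure_lifetime(FaultyCell(endurance=1_000)))  # prints 1000
```

A real harness would also record *which* of the error categories each failure falls into, which is what drives the error-distribution results in the abstract.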
Journal
2012 IEEE 30th International Conference on Computer Design (ICCD)