
2009 IEEE International Conference on Computer Design: Latest Publications

Online multiple error detection in crossbar nano-architectures
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413135
Navid Farazmand, M. Tahoori
Crossbar nano-architectures based on self-assembled nano-structures are promising alternatives to current CMOS technology, which faces serious challenges for further down-scaling. One of the major challenges in this nanotechnology is the elevated failure rate due to atomic device sizes and the inherent lack of control in self-assembly fabrication. As a result, high permanent and transient failure rates lead to multiple faults during the lifetime operation of crossbar nano-architectures. In this paper, we present a concurrent multiple error detection scheme for multistage crossbar nano-architectures based on dual-rail implementations of logic functions. We prove the detectability of all single faults as well as most classes of multiple faults in this scheme. Based on statistical multiple fault injection, we compare the proposed technique with other online error detection and masking techniques such as Triple Modular Redundancy (TMR), duplication, and parity checking, in terms of fault coverage as well as area and delay overhead.
Citations: 3
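The dual-rail idea named in the abstract above can be illustrated with a small sketch. It assumes only the textbook encoding, in which each signal is carried as a (true rail, complement rail) pair so that a fault-free value is always (0, 1) or (1, 0) and any (0, 0) or (1, 1) pair flags an error. The function names and the AND-gate example are hypothetical and are not taken from the paper's crossbar implementation.

```python
from typing import Tuple

Rail = Tuple[int, int]  # (true rail, complement rail)

def encode(bit: int) -> Rail:
    """Encode a logic value as a dual-rail pair."""
    return (bit, 1 - bit)

def is_valid(sig: Rail) -> bool:
    """A dual-rail codeword is valid only when the two rails differ."""
    return sig[0] != sig[1]

def dual_rail_and(a: Rail, b: Rail) -> Rail:
    """True rail computes AND; complement rail computes OR of the complements (De Morgan)."""
    return (a[0] & b[0], a[1] | b[1])

def inject_stuck_at(sig: Rail, rail: int, value: int) -> Rail:
    """Hypothetical fault model: force one rail of a signal to a fixed value."""
    rails = list(sig)
    rails[rail] = value
    return (rails[0], rails[1])

# A stuck-at-0 fault on the true rail of a logic 1 yields the invalid pair (0, 0),
# and the invalid codeword propagates through the gate to the checker.
faulty = inject_stuck_at(encode(1), rail=0, value=0)
print(is_valid(dual_rail_and(faulty, encode(1))))
```

In this toy model a fault is caught only when it actually changes a rail, which is the usual behavior of concurrent checking: a stuck-at value that coincides with the correct value produces no error to detect.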
Efficient architectures for elliptic curve cryptography processors for RFID
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413128
Lawrence Leinweber, C. Papachristou, F. Wolff
RFID tags will supplant barcodes for product identification in the supply chain. The capability of a tag to be read without a line of sight is its principal benefit, but it compromises the privacy of the tag owner. Public key cryptography can restore this privacy. Because of the extreme economic constraints of the application, die area and power consumption for cryptographic functions must be minimized. Elliptic curve processors efficiently provide the cryptographic capability needed for RFID. This paper proposes efficient architectures for elliptic curve processors in GF(2^m). One design requires six m-bit registers and six Galois field multiply operations per key bit. The other design requires five m-bit registers and seven Galois field multiply operations per key bit. These processors require a small number of circuit elements and clock cycles while providing protection from simple side-channel attacks. Synthesis results are presented for power, area, and delay in 250, 130 and 90 nm technologies. Compared with prior designs from the literature, the proposed processors require less area and energy. For the B-163 curve, with a bit-serial multiplier, the first proposed design synthesized in an IBM low-power 130 nm technology requires an area of 9613 gate equivalents, 163,355 cycles and 4.14 µJ for an elliptic curve point multiplication. The other proposed design requires 8756 gate equivalents, 190,570 cycles and 4.19 µJ.
Citations: 7
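The bit-serial Galois field multiplication mentioned in the abstract above can be sketched in software as a shift-and-add loop over the bits of one operand. This is a generic illustration, not the paper's datapath. The reduction polynomial is the standard one for the NIST B-163 field, x^163 + x^7 + x^6 + x^3 + 1. The cycle estimate rests on two assumptions that the abstract does not state: that one bit-serial multiplication takes m cycles, and that the scalar has 163 bits. All function names are hypothetical.

```python
def gf2m_mul(a: int, b: int, m: int, poly: int) -> int:
    """MSB-first bit-serial multiplication in GF(2^m).

    Field elements are integers whose bits are polynomial coefficients;
    `poly` is the reduction polynomial including its x^m term.
    """
    result = 0
    for i in reversed(range(m)):
        result <<= 1                 # multiply the accumulator by x
        if (result >> m) & 1:        # degree reached m: reduce once
            result ^= poly
        if (b >> i) & 1:             # add a (XOR) when bit i of b is set
            result ^= a
    return result

# Reduction polynomial of the NIST B-163 field.
B163_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def estimate_cycles(mults_per_key_bit: int, m: int, key_bits: int) -> int:
    """Rough lower bound: every multiplication costs m cycles, nothing else is counted."""
    return mults_per_key_bit * m * key_bits

print(estimate_cycles(6, 163, 163))  # 159414
print(estimate_cycles(7, 163, 163))  # 185983
```

Under these assumptions the estimates land slightly below the 163,355 and 190,570 cycles reported in the abstract, which is consistent with multiplications dominating the run time; the remaining cycles are not accounted for by this sketch.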
Enabling resonant clock distribution with scaled on-chip magnetic inductors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413169
S. Sinha, W. Xu, J. Velamala, T. Dastagir, B. Bakkaloglu, Hongbin Yu, Yu Cao
Resonant clock distribution with distributed LC oscillators is promising for reducing clock power and jitter noise. Yet the difficulty of integrating on-chip inductors still limits its application in practice. This paper resolves this key issue with sub-50 µm magnetic inductors, which are fully compatible with the CMOS process. These inductors leverage soft magnetic coils to achieve inductances up to 4 nH and a Q-factor of 3 at 1 GHz with a device diameter of only 30–50 µm, resulting in area savings of nearly 100X compared to conventional designs. The latency and noise performance of the resonant clock network is demonstrated to be comparable to that of networks using conventional inductors without soft magnetic materials. In addition, inductors with integrated magnetic materials significantly reduce mutual coupling and eddy current loss in the power grid below the clock network. These design advantages enable a high density of on-chip distributed oscillators, providing better phase averaging, lower power and superior noise characteristics compared to a traditional buffer-tree based clock network.
Citations: 3
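For a sense of scale, the standard LC resonance relation f = 1 / (2π·sqrt(L·C)) shows what tank capacitance a 4 nH inductor would pair with at 1 GHz. This is a back-of-the-envelope sketch: the abstract quotes 1 GHz only as the frequency at which the Q-factor is measured and does not state the clock frequency or the capacitance, so the operating point below is an assumption and the function names are hypothetical.

```python
import math

def resonant_frequency(inductance_h: float, capacitance_f: float) -> float:
    """Resonant frequency in Hz of an ideal LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

def capacitance_for(inductance_h: float, frequency_hz: float) -> float:
    """Capacitance in farads that resonates with the given inductance at the given frequency."""
    return 1.0 / ((2.0 * math.pi * frequency_hz) ** 2 * inductance_h)

# Assumed operating point: L = 4 nH, f = 1 GHz (illustrative only).
c = capacitance_for(4e-9, 1e9)
print(f"{c * 1e12:.2f} pF")  # 6.33 pF
```

The result is only an ideal-tank figure; it ignores the loss implied by a Q-factor of 3 and any parasitic capacitance of the clock network.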
Design and test strategies for microarchitectural post-fabrication tuning
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413170
Xiaoyao Liang, Benjamin C. Lee, Gu-Yeon Wei, D. Brooks
Process variations are a major hurdle for continued technology scaling. Both systematic and random variations will affect the critical delay of fabricated chips, causing a wide frequency and power distribution. Tuning techniques adapt the microarchitecture to mitigate the impact of variations at post-fabrication testing time. This paper proposes a new post-fabrication testing framework that accounts for testing costs. This framework uses on-chip canary circuits to capture systematic variation while using statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. These techniques comprise an integrated framework that identifies the most energy efficient post-fabrication tuning configuration for each chip.
Citations: 3
Empirical performance models for 3T1D memories
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413124
Kristen Lovin, Benjamin C. Lee, Xiaoyao Liang, D. Brooks, Gu-Yeon Wei
Process variation poses a threat to the performance and reliability of the 6T SRAM cell. Research has turned to new memory cell designs, such as the 3T1D DRAM cell, as potential replacement designs. If designers are to consider 3T1D memory architectures, performance models are needed to better understand memory cell behavior. We propose a decoupled approach for collecting Monte Carlo HSPICE data, reducing simulation times by simulating memory array components separately based on their contribution to the worst-case critical path. We use this Monte Carlo data to train regression models, which accurately predict retention and access times of a 3T1D memory array with a median error of 7.39%.
Citations: 27
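The workflow in the abstract above, fitting a regression model to simulation samples and reporting a median error, can be sketched with a one-variable least-squares fit. This is a toy stand-in: the paper's models, predictors, and Monte Carlo HSPICE data are not given in the abstract, and the sample values below are synthetic numbers chosen purely for illustration.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple

def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict(xs: Sequence[float], slope: float, intercept: float) -> List[float]:
    return [slope * x + intercept for x in xs]

def median_relative_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Median of |predicted - actual| / |actual| over all samples."""
    return median(abs(p - a) / abs(a) for p, a in zip(predicted, actual))

# Synthetic samples (not from the paper): a device parameter vs. a measured time.
params = [0.0, 1.0, 2.0, 3.0, 4.0]
times = [1.0, 3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(params, times)
print(median_relative_error(predict(params, slope, intercept), times))
```

Reporting the median rather than the mean keeps a few badly predicted samples from dominating the error figure, which is one plausible reason to quote it for variation-heavy data.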
Deterministic clock gating to eliminate wasteful activity due to wrong-path instructions in out-of-order superscalar processors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413158
Nasir Mohyuddin, Kimish Patel, Massoud Pedram
In this paper we present deterministic clock gating schemes for various microarchitectural blocks of a modern out-of-order superscalar processor. We propose to make use of 1) idle stages of the pipelined functional units (FUs) and 2) wrong-path instruction execution during branch misprediction, in order to clock-gate various stages of FUs. The baseline Pipelined Functional unit Clock Gating (PFCG), presented for evaluation purposes only, disables the clock on idle stages and thus results in 13.93% chip-wide energy savings. Wrong-path instruction Clock Gating (WPCG) detects wrong-path instructions in the event of branch misprediction and prevents them from being issued to the FUs, and subsequently disables the clock of these FUs while also reducing the stress on the register file and cache. Simulations demonstrate that more than 92% of all wrong-path instructions can be detected and stopped from being executed. The WPCG architecture results in 16.26% chip-wide energy savings, which is 2.33 percentage points more than that of the baseline PFCG scheme.
Citations: 6
Algorithmic approach to designing an easy-to-program system: Can it lead to a HW-enhanced programmer's workflow add-on?
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413174
U. Vishkin
Our earlier parallel algorithmics work on the parallel random-access-machine/model (PRAM) computation model led us to a PRAM-On-Chip vision: a comprehensive many-core system that can look to the programmer like the abstract PRAM model. We introduced the eXplicit MultiThreaded (XMT) design and prototyped it in hardware and software. XMT comprises a programmer's workflow that advances from work-depth, a standard PRAM theory abstraction, to an XMT program, and, if desired, to its performance tuning. XMT provides strong performance for programs developed this way due to its hardware support of very fine-grained threads and the overhead of handling them. XMT has also shown unique promise when it comes to ease-of-programming, the biggest problem that has limited the impact of all parallel systems to date. For example, teachability of XMT programming has been demonstrated at various levels from rising 6th graders to graduate students, and students in a freshman class were able to program 3 parallel sorting algorithms. The main purpose of the current paper is to stimulate discussion on the following somewhat open-ended question. Now that we have made significant progress on a system devoted to supporting PRAM-like programming, is it possible to incorporate our hardware support as an add-on into other current and future many-core systems? The paper considers a concrete proposal for doing that: recasting our work as a hardware-enhanced programmer's workflow “module” that can then be essentially imported into the other systems.
Citations: 4
Compiler-directed leakage reduction in embedded microprocessors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413178
Soumyaroop Roy, N. Ranganathan, S. Katkoori
Compiler-directed power gating is an approach in which sleep instructions are inserted at compile time into the application code to selectively deactivate the functional units in microprocessors during their idle periods, reducing power dissipation due to leakage. Although the effect of code transformations on dynamic and system power has been investigated and reported in the literature, such a study is lacking in the context of power gating. In this paper, we investigate and report how the leakage savings in both integer and floating point units can be improved using machine-dependent and machine-independent optimizations in a compiler-directed power gating framework. In our study, it is ensured that power gating is applied only when the leakage savings are considerably more than the various overheads incurred in its implementation. The target embedded processor is modeled on the ARMv4 architecture, which is modified to support the power gating of its arithmetic functional units. For experimentation, GCC is used as the compiler infrastructure and SimpleScalar-ARM is used as the detailed architectural simulator for reporting power and performance metrics for embedded applications belonging to the MiBench and MediaBench benchmark suites. Experimental results suggest that the additional savings in leakage energy due to one or more of the optimizations may vary widely depending on the benchmark. Moreover, the overhead of sleep instructions can be reduced by up to 50 times by performing procedure inlining.
Citations: 3
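The rule stated in the abstract above, gating a unit only when the leakage savings clearly exceed the overheads, amounts to a break-even test on each idle interval. The sketch below is a simplified model of that decision and not the paper's compiler pass: it assumes idle periods are known exactly from a per-cycle usage trace, and the energy parameters, the safety margin, and all function names are hypothetical.

```python
from typing import List, Tuple

def idle_intervals(uses_unit: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, length) for each maximal run of cycles in which the unit is unused."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, used in enumerate(uses_unit):
        if not used and start is None:
            start = i
        elif used and start is not None:
            intervals.append((start, i - start))
            start = None
    if start is not None:
        intervals.append((start, len(uses_unit) - start))
    return intervals

def worth_gating(length: int, leak_per_cycle: float, overhead: float, margin: float = 2.0) -> bool:
    """Gate only if the leakage saved over the interval exceeds margin times the gating overhead."""
    return length * leak_per_cycle > margin * overhead

def sleep_points(uses_unit: List[bool], leak_per_cycle: float, overhead: float,
                 margin: float = 2.0) -> List[int]:
    """Cycle indices where a sleep instruction would be inserted under this model."""
    return [start for start, length in idle_intervals(uses_unit)
            if worth_gating(length, leak_per_cycle, overhead, margin)]

# Hypothetical trace: a 2-cycle gap (too short to gate) and a 10-cycle gap (gated).
trace = [True, False, False, True] + [False] * 10 + [True]
print(sleep_points(trace, leak_per_cycle=1.0, overhead=3.0))  # [4]
```

A transformation such as procedure inlining can merge short idle runs into longer ones, which in this model turns several intervals below the threshold into one interval above it and needs fewer sleep instructions.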
Real-time, unobtrusive, and efficient program execution tracing with stream caches and last stream predictors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413159
Vladimir Uzelac, A. Milenković, M. Milenkovic, Martin Burtscher
This paper introduces a new hardware mechanism for capturing and compressing program execution traces unobtrusively in real-time. The proposed mechanism is based on two structures called stream cache and last stream predictor. We explore the effectiveness of a trace module based on these structures and analyze the design space. We show that our trace module, with less than 600 bytes of state, achieves a trace-port bandwidth of 0.15 bits/instruction/processor, which is over six times better than state-of-the-art commercial designs.
Citations: 7
A flexible communication scheme for rationally-related clock frequencies
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413166
Jean-Michel Chabloz, A. Hemani
As a replacement for the fast-fading Globally-Synchronous model, we have defined a flexible design style for SoCs called GRLS (Globally-Ratiochronous, Locally-Synchronous), which does not rely on global synchronization and instead uses rationally-related clock frequencies derived from the same source. In this paper, using the special periodic properties of rationally-related systems, we build a latency-insensitive, maximal-throughput, low-overhead communication method based on the idea of using both clock edges to sample data at the Receiver. The validity of the method and its resistance to non-idealities such as jitter, misalignments, and clock drifts are formally proven, and experimental results including overhead are presented for 90 nm technology. Despite allowing much greater flexibility, the overhead of our method is comparable to that of state-of-the-art mesochronous communication techniques. We also show performance, complexity, and overhead improvements over all other approaches that have so far been proposed for rationally-related clock frequencies.
Citations: 14