Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413135
Navid Farazmand, M. Tahoori
Crossbar nano-architectures based on self-assembled nano-structures are promising alternatives to current CMOS technology, which faces serious challenges in further down-scaling. One of the major challenges in this nanotechnology is an elevated failure rate due to atomic-scale device sizes and the inherent lack of control in self-assembly fabrication. As a result, high permanent and transient failure rates lead to multiple faults during the operational lifetime of crossbar nano-architectures. In this paper, we present a concurrent multiple error detection scheme for multistage crossbar nano-architectures based on dual-rail implementations of logic functions. We prove the detectability of all single faults as well as most classes of multiple faults in this scheme. Based on statistical multiple fault injection, we compare the proposed technique with other online error detection and masking techniques such as Triple Modular Redundancy (TMR), duplication, and parity checking, in terms of fault coverage as well as area and delay overhead.
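The dual-rail idea behind the detection scheme can be illustrated with a minimal Python sketch. It is not the authors' crossbar implementation: the function names (`encode`, `is_valid`, `dual_rail_and`, and so on) are hypothetical, and the model assumes each signal is carried as a (true, complement) pair whose two rails are computed by separate logic. Under that assumption, a fault on a single rail either leaves the output correct or produces an invalid pair (00 or 11) that a checker can flag.

```python
from typing import Tuple

# A dual-rail signal is a (true_rail, complement_rail) pair.
# Valid codewords are (0, 1) for logic 0 and (1, 0) for logic 1.
Rail = Tuple[int, int]

def encode(bit: int) -> Rail:
    """Encode a single bit as a dual-rail pair."""
    return (bit, 1 - bit)

def is_valid(sig: Rail) -> bool:
    """A pair is a valid codeword only if its rails disagree."""
    return sig[0] != sig[1]

def dual_rail_and(a: Rail, b: Rail) -> Rail:
    # True rail: a AND b. Complement rail: (NOT a) OR (NOT b), by De Morgan.
    return (a[0] & b[0], a[1] | b[1])

def dual_rail_or(a: Rail, b: Rail) -> Rail:
    # True rail: a OR b. Complement rail: (NOT a) AND (NOT b).
    return (a[0] | b[0], a[1] & b[1])

def dual_rail_not(a: Rail) -> Rail:
    # Inversion is just a rail swap, so no inverter is needed.
    return (a[1], a[0])

if __name__ == "__main__":
    good = dual_rail_and(encode(1), encode(1))
    # Inject a fault: the true rail of the first input is stuck at 0.
    faulty = dual_rail_and((0, 0), encode(1))
    print(good, is_valid(good))
    print(faulty, is_valid(faulty))
```

Because each rail is computed independently in this sketch, the checker only needs to test whether the two output rails disagree; it does not need to know the expected value.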
{"title":"Online multiple error detection in crossbar nano-architectures","authors":"Navid Farazmand, M. Tahoori","doi":"10.1109/ICCD.2009.5413135","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413135","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413128
Lawrence Leinweber, C. Papachristou, F. Wolff
RFID tags will supplant barcodes for product identification in the supply chain. The principal benefit of a tag is that it can be read without a line of sight, but this capability compromises the privacy of the tag owner. Public key cryptography can restore this privacy. Because of the extreme economic constraints of the application, die area and power consumption for cryptographic functions must be minimized. Elliptic curve processors efficiently provide the cryptographic capability needed for RFID. This paper proposes efficient architectures for elliptic curve processors in GF(2^m). One design requires six m-bit registers and six Galois field multiply operations per key bit. The other design requires five m-bit registers and seven Galois field multiply operations per key bit. These processors require a small number of circuit elements and clock cycles while providing protection from simple side-channel attacks. Synthesis results are presented for power, area, and delay in 250, 130 and 90 nm technologies. Compared with prior designs from the literature, the proposed processors require less area and energy. For the B-163 curve, with a bit-serial multiplier, the first proposed design synthesized in an IBM low-power 130 nm technology requires an area of 9613 gate equivalents, 163,355 cycles and 4.14 µJ for an elliptic curve point multiplication. The other proposed design requires 8756 gate equivalents, 190,570 cycles and 4.19 µJ.
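To make the cost figures concrete, here is a minimal Python sketch of a bit-serial multiplication in GF(2^m). It is a software model, not the authors' hardware. The function names `gf2m_mul` and `ladder_cycle_estimate` are hypothetical. The reduction polynomial is the standard one for the NIST B-163 field, x^163 + x^7 + x^6 + x^3 + 1. The cycle estimate assumes that one multiplication takes m cycles, that multiplications do not overlap, and that the key has m bits; the abstract states none of these, so treat the result as a rough lower bound only.

```python
M = 163
# NIST B-163 field polynomial: x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf2m_mul(a: int, b: int, m: int = M, poly: int = POLY) -> int:
    """MSB-first bit-serial multiply in GF(2^m).

    Field elements are integers whose bit i is the coefficient of x^i.
    One bit of b is consumed per iteration, mirroring a bit-serial datapath.
    """
    acc = 0
    for i in reversed(range(m)):
        acc <<= 1                  # multiply the accumulator by x
        if (acc >> m) & 1:         # degree reached m: reduce modulo poly
            acc ^= poly
        if (b >> i) & 1:           # add (XOR) a when the current bit of b is set
            acc ^= a
    return acc

def ladder_cycle_estimate(mults_per_key_bit: int, m: int = M) -> int:
    """Rough estimate: m cycles per multiply, m key bits (both assumed)."""
    return mults_per_key_bit * m * m

if __name__ == "__main__":
    print(gf2m_mul(1 << 162, 2))        # x^162 * x reduced modulo the field polynomial
    print(ladder_cycle_estimate(6))     # six multiplies per key bit
    print(ladder_cycle_estimate(7))     # seven multiplies per key bit
```

Under these assumptions the estimates come to 159,414 and 185,983 cycles, within a few percent of the 163,355 and 190,570 cycles reported in the abstract. This suggests that multiplications dominate the cycle count.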
{"title":"Efficient architectures for elliptic curve cryptography processors for RFID","authors":"Lawrence Leinweber, C. Papachristou, F. Wolff","doi":"10.1109/ICCD.2009.5413128","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413128","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413169
S. Sinha, W. Xu, J. Velamala, T. Dastagir, B. Bakkaloglu, Hongbin Yu, Yu Cao
Resonant clock distribution with distributed LC oscillators is a promising way to reduce clock power and jitter. Yet the difficulty of integrating on-chip inductors still limits its application in practice. This paper addresses this key issue with sub-50 µm magnetic inductors, which are fully compatible with the CMOS process. These inductors leverage soft magnetic coils to achieve inductances of up to 4 nH and a Q-factor of 3 at 1 GHz with a device diameter of only 30–50 µm, resulting in area savings of nearly 100X compared to conventional designs. The latency and noise performance of the resonant clock network is demonstrated to be comparable to that of networks using conventional inductors without soft magnetic materials. In addition, inductors with integrated magnetic materials significantly reduce mutual coupling and eddy current loss in the power grid below the clock network. These design advantages enable a high density of on-chip distributed oscillators, providing better phase averaging, lower power and superior noise characteristics compared to a traditional buffer-tree-based clock network.
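As a back-of-the-envelope illustration of what these inductor figures imply, the Python sketch below applies the textbook LC resonance relation f = 1/(2π√(LC)) and the series-RL quality factor Q = 2πfL/R. The function names are hypothetical. The operating point of 4 nH at a 1 GHz clock and the series-resistance model are assumptions made for illustration, not values or models taken from the paper.

```python
import math

def resonant_capacitance(inductance_h: float, freq_hz: float) -> float:
    """Capacitance that resonates with the given inductance at freq_hz."""
    omega = 2.0 * math.pi * freq_hz
    return 1.0 / (omega ** 2 * inductance_h)

def series_resistance(inductance_h: float, freq_hz: float, q: float) -> float:
    """Equivalent series resistance of an inductor with quality factor q,
    assuming the simple series-RL model Q = omega * L / R."""
    return 2.0 * math.pi * freq_hz * inductance_h / q

if __name__ == "__main__":
    L = 4e-9      # 4 nH, the upper inductance quoted in the abstract
    f = 1e9       # 1 GHz, the frequency at which Q = 3 is quoted
    print(f"C = {resonant_capacitance(L, f) * 1e12:.2f} pF")
    print(f"R = {series_resistance(L, f, 3.0):.2f} ohm")
```

With these assumed values, the inductor would resonate with roughly 6.3 pF of clock load and have an equivalent series resistance of about 8.4 Ω.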
{"title":"Enabling resonant clock distribution with scaled on-chip magnetic inductors","authors":"S. Sinha, W. Xu, J. Velamala, T. Dastagir, B. Bakkaloglu, Hongbin Yu, Yu Cao","doi":"10.1109/ICCD.2009.5413169","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413169","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413170
Xiaoyao Liang, Benjamin C. Lee, Gu-Yeon Wei, D. Brooks
Process variations are a major hurdle for continued technology scaling. Both systematic and random variations affect the critical delay of fabricated chips, causing wide frequency and power distributions. Tuning techniques adapt the microarchitecture at post-fabrication testing time to mitigate the impact of variations. This paper proposes a new post-fabrication testing framework that accounts for testing costs. The framework uses on-chip canary circuits to capture systematic variation and statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. Together, these techniques form an integrated framework that identifies the most energy-efficient post-fabrication tuning configuration for each chip.
{"title":"Design and test strategies for microarchitectural post-fabrication tuning","authors":"Xiaoyao Liang, Benjamin C. Lee, Gu-Yeon Wei, D. Brooks","doi":"10.1109/ICCD.2009.5413170","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413170","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413124
Kristen Lovin, Benjamin C. Lee, Xiaoyao Liang, D. Brooks, Gu-Yeon Wei
Process variation poses a threat to the performance and reliability of the 6T SRAM cell. Research has turned to new memory cell designs, such as the 3T1D DRAM cell, as potential replacement designs. If designers are to consider 3T1D memory architectures, performance models are needed to better understand memory cell behavior. We propose a decoupled approach for collecting Monte Carlo HSPICE data, reducing simulation times by simulating memory array components separately based on their contribution to the worst-case critical path. We use this Monte Carlo data to train regression models, which accurately predict retention and access times of a 3T1D memory array with a median error of 7.39%.
{"title":"Empirical performance models for 3T1D memories","authors":"Kristen Lovin, Benjamin C. Lee, Xiaoyao Liang, D. Brooks, Gu-Yeon Wei","doi":"10.1109/ICCD.2009.5413124","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413124","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413158
Nasir Mohyuddin, Kimish Patel, Massoud Pedram
In this paper we present deterministic clock gating schemes for various microarchitectural blocks of a modern out-of-order superscalar processor. We propose to make use of 1) idle stages of the pipelined functional units (FUs) and 2) wrong-path instruction execution during branch misprediction, in order to clock gate various stages of the FUs. The baseline Pipelined Functional unit Clock Gating (PFCG) scheme, presented for evaluation purposes only, disables the clock on idle stages and thus yields a 13.93% chip-wide energy saving. Wrong-path instruction Clock Gating (WPCG) detects wrong-path instructions in the event of a branch misprediction and prevents them from being issued to the FUs. It then disables the clock of these FUs and also reduces the stress on the register file and cache. Simulations demonstrate that more than 92% of all wrong-path instructions can be detected and stopped from being executed. The WPCG architecture yields 16.26% chip-wide energy savings, which is 2.33 percentage points more than the baseline PFCG scheme.
{"title":"Deterministic clock gating to eliminate wasteful activity due to wrong-path instructions in out-of-order superscalar processors","authors":"Nasir Mohyuddin, Kimish Patel, Massoud Pedram","doi":"10.1109/ICCD.2009.5413158","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413158","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413174
U. Vishkin
Our earlier parallel algorithmics work on the parallel random-access machine (PRAM) model of computation led us to a PRAM-On-Chip vision: a comprehensive many-core system that can look to the programmer like the abstract PRAM model. We introduced the eXplicit MultiThreaded (XMT) design and prototyped it in hardware and software. XMT comprises a programmer's workflow that advances from work-depth, a standard PRAM theory abstraction, to an XMT program, and, if desired, to its performance tuning. XMT provides strong performance for programs developed this way, owing to hardware support that targets very fine-grained threads and the overhead of handling them. XMT has also shown unique promise when it comes to ease of programming, the biggest problem that has limited the impact of all parallel systems to date. For example, teachability of XMT programming has been demonstrated at various levels from rising 6th graders to graduate students, and students in a freshman class were able to program three parallel sorting algorithms. The main purpose of the current paper is to stimulate discussion on the following somewhat open-ended question. Now that we have made significant progress on a system devoted to supporting PRAM-like programming, is it possible to incorporate our hardware support as an add-on into other current and future many-core systems? The paper considers a concrete proposal for doing that: recasting our work as a hardware-enhanced programmer's workflow “module” that can then be essentially imported into the other systems.
{"title":"Algorithmic approach to designing an easy-to-program system: Can it lead to a HW-enhanced programmer's workflow add-on?","authors":"U. Vishkin","doi":"10.1109/ICCD.2009.5413174","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413174","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413178
Soumyaroop Roy, N. Ranganathan, S. Katkoori
Compiler-directed power gating is an approach in which sleep instructions are inserted into the application code at compile time to selectively deactivate the functional units of a microprocessor during their idle periods, reducing power dissipation due to leakage. Although the effect of code transformations on dynamic and system power has been investigated and reported in the literature, such a study is lacking in the context of power gating. In this paper, we investigate and report how the leakage savings in both integer and floating point units can be improved using machine-dependent and machine-independent optimizations in a compiler-directed power gating framework. In our study, power gating is applied only when the leakage savings are considerably greater than the various overheads incurred in its implementation. The target embedded processor is modeled on the ARMv4 architecture, which is modified to support the power gating of its arithmetic functional units. For experimentation, GCC is used as the compiler infrastructure and SimpleScalar-ARM is used as the detailed architectural simulator for reporting power and performance metrics for embedded applications belonging to the MiBench and MediaBench benchmark suites. Experimental results suggest that the additional savings in leakage energy due to one or more of the optimizations may vary widely depending on the benchmark. Moreover, the overhead of sleep instructions can be reduced by up to 50 times by performing procedure inlining.
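The rule that gating is applied only when savings outweigh overheads can be sketched as a simple break-even test over idle intervals. The Python sketch below is an illustration under stated assumptions, not the authors' compiler pass. The function `plan_power_gating` and its `break_even` parameter are hypothetical. The schedule is modeled as one boolean per instruction slot, and the paper's cost model is replaced by a single threshold on the length of an idle run.

```python
from typing import List, Tuple

def plan_power_gating(uses_fu: List[bool], break_even: int) -> List[Tuple[int, int]]:
    """Return (sleep_at, wake_at) slot indices for worthwhile idle runs.

    uses_fu[i] is True when slot i uses the functional unit.
    An idle run is gated only if it is strictly longer than break_even,
    a stand-in for "leakage savings exceed the gating overhead".
    """
    plan: List[Tuple[int, int]] = []
    idle_start = None
    # The trailing True is a sentinel that closes an idle run at the end.
    for i, used in enumerate(uses_fu + [True]):
        if not used and idle_start is None:
            idle_start = i                      # an idle run begins here
        elif used and idle_start is not None:
            if i - idle_start > break_even:     # long enough to pay off
                plan.append((idle_start, i))
            idle_start = None
    return plan

if __name__ == "__main__":
    schedule = [True, False, False, False, False, False, True, False, False, True]
    print(plan_power_gating(schedule, break_even=3))
```

In this example only the five-slot idle run is gated; the two-slot run is left alone because a sleep and wake pair would cost more than it saves under the assumed threshold.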
{"title":"Compiler-directed leakage reduction in embedded microprocessors","authors":"Soumyaroop Roy, N. Ranganathan, S. Katkoori","doi":"10.1109/ICCD.2009.5413178","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413178","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413159
Vladimir Uzelac, A. Milenković, M. Milenkovic, Martin Burtscher
This paper introduces a new hardware mechanism for capturing and compressing program execution traces unobtrusively in real time. The proposed mechanism is based on two structures called the stream cache and the last stream predictor. We explore the effectiveness of a trace module based on these structures and analyze the design space. We show that our trace module, with less than 600 bytes of state, requires a trace-port bandwidth of only 0.15 bits/instruction/processor, an improvement of more than six times over state-of-the-art commercial designs.
{"title":"Real-time, unobtrusive, and efficient program execution tracing with stream caches and last stream predictors","authors":"Vladimir Uzelac, A. Milenković, M. Milenkovic, Martin Burtscher","doi":"10.1109/ICCD.2009.5413159","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413159","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413166
Jean-Michel Chabloz, A. Hemani
As a replacement for the fast-fading Globally-Synchronous model, we have defined a flexible design style for SoCs called GRLS (Globally-Ratiochronous, Locally-Synchronous), which does not rely on global synchronization and is based on using rationally-related clock frequencies derived from the same source. In this paper, using the special periodic properties of rationally-related systems, we build a latency-insensitive, maximal-throughput, low-overhead communication method based on the idea of using both clock edges to sample data at the Receiver. The validity of the method and its resistance to non-idealities such as jitter, misalignments and clock drifts are formally proven, and experimental results including overhead are presented for 90 nm technology. Despite allowing much greater flexibility, the overhead of our method is comparable to that of state-of-the-art mesochronous communication techniques. We also show improvements in performance, complexity and overhead over all other approaches that have so far been proposed for rationally-related clock frequencies.
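The periodic property that the method relies on can be illustrated with exact rational arithmetic. The Python sketch below is only an illustration, not the protocol from the paper. The function `sampling_plan` is hypothetical. Both clocks are assumed to be ideal, aligned at time zero, and to have a 50% duty cycle, and the edge-selection rule (pick whichever receiver edge is farther from the data transition) is an assumption chosen to show why having two sampling edges helps.

```python
from fractions import Fraction
from math import gcd
from typing import List, Tuple

def sampling_plan(n_tx: int, n_rx: int) -> List[Tuple[Fraction, str, Fraction]]:
    """Transmitter runs at n_tx * f0, receiver at n_rx * f0, aligned at t = 0.

    For each transmitter edge in one hyperperiod, return a tuple of
    (phase within the receiver period, chosen receiver edge, margin),
    with phase and margin expressed as fractions of the receiver period.
    """
    g = gcd(n_tx, n_rx)
    plan = []
    for k in range(n_tx // g):                       # the pattern repeats after n_tx/g edges
        phase = Fraction(k * n_rx, n_tx) % 1         # where the data changes
        d_rise = min(phase, 1 - phase)               # circular distance to a rising edge (phase 0)
        d_fall = abs(phase - Fraction(1, 2))         # distance to the falling edge (phase 1/2)
        if d_rise >= d_fall:
            plan.append((phase, "rising", d_rise))
        else:
            plan.append((phase, "falling", d_fall))
    return plan

if __name__ == "__main__":
    for phase, edge, margin in sampling_plan(3, 2):
        print(f"phase {phase}: sample on {edge} edge, margin {margin}")
```

Because the two distances always sum to half a receiver period, the chosen edge is at least a quarter period away from the data transition in this idealized model. The list of phases is also finite and repeats, which is what would allow a schedule to be fixed in advance.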
{"title":"A flexible communication scheme for rationally-related clock frequencies","authors":"Jean-Michel Chabloz, A. Hemani","doi":"10.1109/ICCD.2009.5413166","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413166","journal":{"name":"2009 IEEE International Conference on Computer Design"},"publicationDate":"2009-10-04"}