
2009 IEEE International Conference on Computer Design: Latest Papers

Extending data prefetching to cope with context switch misses
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413144
Hanyu Cui, S. Sair
Among the various costs of a context switch, its impact on the performance of L2 caches is the most significant because of the resulting high miss penalty. To reduce the impact of frequent context switches, we propose restoring a program's locality by prefetching into the L2 cache the data a program was using before it was swapped out. A Global History List is used to record a process' L2 read accesses in LRU order. These accesses are saved along with the process' context when the process is swapped out and loaded to guide prefetching when it is swapped in. We also propose a feedback mechanism that greatly reduces memory traffic incurred by our prefetching scheme. Experiments show significant speedup over baseline architectures with and without traditional prefetching in the presence of frequent context switches.
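The record, save, and restore cycle described in this abstract can be sketched in software as follows. This is a minimal illustration, not the authors' hardware design: the class name `GlobalHistoryList`, the `capacity` and `budget` parameters, the `issue_prefetch` callback, and the choice to prefetch most-recently-used blocks first are all assumptions made for the example.

```python
from collections import OrderedDict


class GlobalHistoryList:
    """Toy model of a per-process list of L2 read accesses kept in LRU order."""

    def __init__(self, capacity):
        self.capacity = capacity          # hypothetical bound on tracked blocks
        self._blocks = OrderedDict()      # oldest entry first, newest last

    def record_read(self, block_addr):
        # Move a re-referenced block to the most-recently-used position.
        self._blocks.pop(block_addr, None)
        self._blocks[block_addr] = True
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # drop the least recently used

    def save(self):
        # Snapshot stored with the process context at swap-out, MRU first.
        return list(reversed(self._blocks))


def prefetch_on_swap_in(saved, issue_prefetch, budget):
    """Replay at most `budget` saved addresses as prefetches at swap-in."""
    for addr in saved[:budget]:
        issue_prefetch(addr)
```

A cap such as `budget` is only a stand-in for traffic control; the feedback mechanism mentioned in the abstract is not specified there and is not modeled here.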
Citations: 7
Defect-based test optimization for analog/RF circuits for near-zero DPPM applications
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413139
E. Yilmaz, S. Ozev
Analog circuits are often tested based on their specifications. While specification-based testing ensures the initial product quality, full testing is often not possible in high-volume production. Moreover, even full specification-based testing cannot guarantee that the circuit does not contain any physical defects. Some application domains require near-zero defect levels independent of whether the specifications are met. In this work, we present a defect-based test optimization method focusing on defective parts per million (DPPM) minimization. We extract potential defects through inductive fault analysis (IFA) and reduce the number of tests without degrading the test quality. In order to achieve near-zero DPPM, we employ outlier analysis to identify defective circuits that cannot be identified using specification-based methods. Simulation results on a low-noise amplifier (LNA) show that the proposed method reduces DPPM to 0 at a cost of 0.2% yield loss.
Citations: 7
Enabling resonant clock distribution with scaled on-chip magnetic inductors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413169
S. Sinha, W. Xu, J. Velamala, T. Dastagir, B. Bakkaloglu, Hongbin Yu, Yu Cao
Resonant clock distribution with distributed LC oscillators is a promising way to reduce clock power and jitter noise. Yet the difficulty of integrating on-chip inductors still limits its application in practice. This paper resolves this key issue with sub-50 µm magnetic inductors, which are fully compatible with the CMOS process. These inductors leverage soft magnetic coils to achieve inductances up to 4 nH and a Q-factor of 3 at 1 GHz with a device diameter of only 30–50 µm, resulting in area savings of nearly 100X compared to conventional designs. The latency and noise performance of the resonant clock network is demonstrated to be comparable to that of networks using conventional inductors without soft magnetic materials. In addition, inductors with integrated magnetic materials significantly reduce mutual coupling and eddy current loss in the power grid below the clock network. These design advantages enable a high density of on-chip distributed oscillators, providing better phase averaging, lower power and superior noise characteristics compared to traditional buffer-tree-based clock networks.
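To give a feel for the quoted figures, the short calculation below works out what an inductance of 4 nH and a Q-factor of 3 at 1 GHz would imply. It assumes the simplest textbook model, a series-loss inductor with Q = ωL/R in an ideal LC tank; neither this model nor the choice of 1 GHz as the resonant frequency is taken from the paper.

```latex
\[
Q = \frac{\omega L}{R}
\;\Rightarrow\;
R = \frac{2\pi f L}{Q}
  = \frac{2\pi\,(1\,\text{GHz})(4\,\text{nH})}{3}
  \approx \frac{25.1\,\Omega}{3}
  \approx 8.4\,\Omega
\]
\[
f_0 = \frac{1}{2\pi\sqrt{LC}}
\;\Rightarrow\;
C = \frac{1}{(2\pi f_0)^2 L}
  \approx 6.3\,\text{pF}
\quad (f_0 = 1\,\text{GHz},\; L = 4\,\text{nH})
\]
```

Under these assumptions, each 4 nH inductor would resonate at 1 GHz with roughly 6.3 pF of clock load and would behave as if it had about 8.4 Ω of series loss.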
Citations: 3
Design and test strategies for microarchitectural post-fabrication tuning
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413170
Xiaoyao Liang, Benjamin C. Lee, Gu-Yeon Wei, D. Brooks
Process variations are a major hurdle for continued technology scaling. Both systematic and random variations will affect the critical delay of fabricated chips, causing a wide frequency and power distribution. Tuning techniques adapt the microarchitecture to mitigate the impact of variations at post-fabrication testing time. This paper proposes a new post-fabrication testing framework that accounts for testing costs. This framework uses on-chip canary circuits to capture systematic variation while using statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. These techniques comprise an integrated framework that identifies the most energy efficient post-fabrication tuning configuration for each chip.
Citations: 3
Empirical performance models for 3T1D memories
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413124
Kristen Lovin, Benjamin C. Lee, Xiaoyao Liang, D. Brooks, Gu-Yeon Wei
Process variation poses a threat to the performance and reliability of the 6T SRAM cell. Research has turned to new memory cell designs, such as the 3T1D DRAM cell, as potential replacement designs. If designers are to consider 3T1D memory architectures, performance models are needed to better understand memory cell behavior. We propose a decoupled approach for collecting Monte Carlo HSPICE data, reducing simulation times by simulating memory array components separately based on their contribution to the worst-case critical path. We use this Monte Carlo data to train regression models, which accurately predict retention and access times of a 3T1D memory array with a median error of 7.39%.
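The accuracy figure quoted above is a median error over model predictions. The sketch below shows one common way such a figure could be computed; the abstract does not say how the error is defined, so the use of absolute relative error, and the function name `median_relative_error`, are assumptions for illustration.

```python
from statistics import median


def median_relative_error(predicted, actual):
    """Median of |predicted - actual| / |actual| over paired samples.

    Assumes every value in `actual` is nonzero.
    """
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return median(errors)
```

A median is less sensitive than a mean to a few badly mispredicted samples, which is one reason it is often reported for regression models fitted to Monte Carlo data.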
Citations: 27
Deterministic clock gating to eliminate wasteful activity due to wrong-path instructions in out-of-order superscalar processors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413158
Nasir Mohyuddin, Kimish Patel, Massoud Pedram
In this paper we present deterministic clock gating schemes for various microarchitectural blocks of a modern out-of-order superscalar processor. We propose to make use of 1) idle stages of the pipelined functional units (FUs) and 2) wrong-path instruction execution during branch misprediction, in order to clock-gate various stages of FUs. The baseline Pipelined Functional unit Clock Gating (PFCG), presented for evaluation purposes only, disables the clock on idle stages and thus results in 13.93% chip-wide energy savings. Wrong-path instruction Clock Gating (WPCG) detects wrong-path instructions in the event of a branch misprediction and prevents them from being issued to the FUs; it then disables the clock of these FUs and also reduces the stress on the register file and cache. Simulations demonstrate that more than 92% of all wrong-path instructions can be detected and stopped from being executed. The WPCG architecture results in 16.26% chip-wide energy savings, which is 2.33 percentage points more than that of the baseline PFCG scheme.
Citations: 6
Algorithmic approach to designing an easy-to-program system: Can it lead to a HW-enhanced programmer's workflow add-on?
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413174
U. Vishkin
Our earlier parallel algorithmics work on the parallel random-access-machine/model (PRAM) computation model led us to a PRAM-On-Chip vision: a comprehensive many-core system that can look to the programmer like the abstract PRAM model. We introduced the eXplicit MultiThreaded (XMT) design and prototyped it in hardware and software. XMT comprises a programmer's workflow that advances from work-depth, a standard PRAM theory abstraction, to an XMT program, and, if desired, to its performance tuning. XMT provides strong performance for programs developed this way owing to its hardware support for very fine-grained threads and the overhead of handling them. XMT has also shown unique promise when it comes to ease of programming, the biggest problem that has limited the impact of all parallel systems to date. For example, teachability of XMT programming has been demonstrated at various levels from rising 6th graders to graduate students, and students in a freshman class were able to program 3 parallel sorting algorithms. The main purpose of the current paper is to stimulate discussion on the following somewhat open-ended question. Now that we have made significant progress on a system devoted to supporting PRAM-like programming, is it possible to incorporate our hardware support as an add-on into other current and future many-core systems? The paper considers a concrete proposal for doing that: recasting our work as a hardware-enhanced programmer's workflow “module” that can then be essentially imported into the other systems.
Citations: 4
Compiler-directed leakage reduction in embedded microprocessors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413178
Soumyaroop Roy, N. Ranganathan, S. Katkoori
Compiler-directed power gating is an approach in which sleep instructions are inserted at compile time into the application code to selectively deactivate the functional units in microprocessors during their idle periods, reducing power dissipation due to leakage. Although the effect of code transformations on dynamic and system power has been investigated and reported in the literature, such a study is lacking in the context of power gating. In this paper, we investigate and report how the leakage savings in both integer and floating-point units can be improved using machine-dependent and machine-independent optimizations in a compiler-directed power gating framework. In our study, power gating is applied only when the leakage savings are considerably larger than the various overheads incurred in its implementation. The target embedded processor is modeled on the ARMv4 architecture, which is modified to support the power gating of its arithmetic functional units. For experimentation, GCC is used as the compiler infrastructure and SimpleScalar-ARM is used as the detailed architectural simulator for reporting power and performance metrics for embedded applications belonging to the MiBench and MediaBench benchmark suites. Experimental results suggest that the additional savings in leakage energy due to one or more of the optimizations may vary widely depending on the benchmark. Moreover, the overhead of sleep instructions can be reduced by up to 50 times by performing procedure inlining.
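The rule that gating is applied only when savings clearly exceed overheads amounts to a break-even test on each idle period. The sketch below is one way to express that test; the function name, the energy units, and the `margin` safety factor are hypothetical and are not taken from the paper.

```python
def should_power_gate(idle_cycles, leakage_per_cycle, overhead_energy, margin=2.0):
    """Return True if gating an idle functional unit looks worthwhile.

    idle_cycles       -- predicted length of the idle period, in cycles
    leakage_per_cycle -- leakage energy saved per gated cycle (arbitrary units)
    overhead_energy   -- energy cost of the sleep and wake-up transitions
    margin            -- hypothetical factor by which savings must exceed overhead
    """
    savings = idle_cycles * leakage_per_cycle
    return savings >= margin * overhead_energy
```

With a margin of 2.0 and an overhead of 40 units, for example, a 100-cycle idle period saving 1 unit per cycle would be gated, while a 50-cycle period would not.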
Citations: 3
Real-time, unobtrusive, and efficient program execution tracing with stream caches and last stream predictors
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413159
Vladimir Uzelac, A. Milenković, M. Milenkovic, Martin Burtscher
This paper introduces a new hardware mechanism for capturing and compressing program execution traces unobtrusively in real time. The proposed mechanism is based on two structures called the stream cache and the last stream predictor. We explore the effectiveness of a trace module based on these structures and analyze the design space. We show that our trace module, with less than 600 bytes of state, achieves a trace-port bandwidth of 0.15 bits/instruction/processor, which is over six times better than state-of-the-art commercial designs.
Citations: 7
A flexible communication scheme for rationally-related clock frequencies
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413166
Jean-Michel Chabloz, A. Hemani
As a replacement for the fast-fading Globally-Synchronous model, we have defined a flexible design style for SoCs, called GRLS (Globally-Ratiochronous, Locally-Synchronous), which does not rely on global synchronization and is based on rationally-related clock frequencies derived from the same source. In this paper, using the special periodic properties of rationally-related systems, we build a latency-insensitive, maximal-throughput, low-overhead communication method, based on the idea of using both clock edges to sample data at the Receiver. The validity of the method and its resistance to non-idealities such as jitter, misalignments and clock drifts are formally proven, and experimental results including overhead are presented for 90 nm technology. Despite allowing much greater flexibility, the overhead of our method is comparable to that of state-of-the-art mesochronous communication techniques. We also show performance, complexity and overhead improvements over all other approaches that have so far been proposed for rationally-related clock frequencies.
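The periodic property that the abstract relies on can be illustrated with exact rational arithmetic: when two clock frequencies have a rational ratio, their relative edge positions repeat after a fixed number of cycles. The sketch below is not the authors' communication scheme; it assumes ideal clocks whose rising edges coincide at time 0, with no jitter or skew, and the function names are made up for the example.

```python
from fractions import Fraction


def cycles_per_pattern(f_tx, f_rx):
    """Cycles of each clock in one repeating pattern, as (n_tx, n_rx)."""
    ratio = Fraction(f_tx) / Fraction(f_rx)   # reduced to lowest terms
    return ratio.numerator, ratio.denominator


def phase_offsets(f_tx, f_rx):
    """Delay from each transmitter rising edge to the next receiver edge.

    Receiver edges include both rising and falling edges. Times are
    expressed as fractions of one repeating pattern.
    """
    n_tx, n_rx = cycles_per_pattern(f_tx, f_rx)
    rx_edges = [Fraction(k, 2 * n_rx) for k in range(2 * n_rx + 1)]
    offsets = []
    for i in range(n_tx):
        t = Fraction(i, n_tx)                 # i-th transmitter rising edge
        next_edge = min(e for e in rx_edges if e >= t)
        offsets.append(next_edge - t)
    return offsets
```

For a 300 MHz transmitter and a 200 MHz receiver, the pattern spans 3 transmitter cycles and 2 receiver cycles, and the offsets are 0, 1/6, and 1/12 of the pattern. Because the list is finite and repeats, the sampling edge for each transmitter cycle can in principle be fixed in advance rather than resolved by a synchronizer.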
{"title":"A flexible communication scheme for rationally-related clock frequencies","authors":"Jean-Michel Chabloz, A. Hemani","doi":"10.1109/ICCD.2009.5413166","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413166","url":null,"abstract":"As a replacement for the fast-fading Globally-Synchronous model, we have defined a flexible design style for SoCs, called GRLS, for Globally-Ratiochronous, Locally-Synchronous, which does not rely on global synchronization and is based on using rationally-related clock frequencies derived from the same source. In this paper, using the special periodical properties of rationally-related systems, we build a latency-insensitive, maximal-throughput, low-overhead communication method, based on the idea of using both clock edges to sample data at the Receiver. The validity of the method and its resistance to non-idealities such as jitter, misalignments and clock drifts are formally proven while experimental results including overhead are presented for 90 nm technology. Despite allowing much greater flexibility, the overhead of our method is comparable to that of state-of-the-art mesochronous communication techniques. We also show performances, complexity and overhead improvements over all other approaches that have so far been proposed for rationally-related clock frequencies.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133359238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14