2009 IEEE International Conference on Computer Design: Latest Papers

SHIELDSTRAP: Making secure processors truly secure
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413140
Siddhartha Chhabra, Brian Rogers, Yan Solihin
Many systems may have security requirements such as protecting the privacy of data and code stored in the system, ensuring integrity of computations, or preventing the execution of unauthorized code. It is becoming increasingly difficult to ensure such protections as hardware-based attacks, in addition to software attacks, become more widespread and feasible. Many of these attacks target a system during booting before any employed security measures can take effect. In this paper, we propose SHIELDSTRAP, a security architecture capable of booting a system securely in the face of hardware and software attacks targeting the boot phase. SHIELDSTRAP bridges the gap between the vulnerable initialization of the system and the secure steady state execution environment provided by the secure processor. We present an analysis of the security of SHIELDSTRAP against several common boot time attacks. We also show that SHIELDSTRAP requires an on-chip area overhead of only 0.012% and incurs negligible boot time overhead of 0.37 seconds.
Citations: 12
Efficient binary translation system with low hardware cost
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413138
Weiwu Hu, Qi Liu, Jian Wang, Songsong Cai, Menghao Su, Xiaoyun Li
Binary translation is one of the most important approaches for system migration. However, software binary translation systems often suffer from inefficiency, and traditional hardware-software co-designed virtual machines require a redesign of the processor architecture. This paper presents a novel hardware-software co-designed method to accelerate binary translation on an existing architecture. Hardware support is proposed for source-architecture-only functions, partial decoding, and acceleration of the binary translation system. This hardware support helps the binary translation system achieve high performance and simplifies the design of the binary translation software, while keeping the hardware cost low. The support is implemented in Godson-3 processors to speed up x86 binary translation to the native MIPS instruction set. Performance evaluations on RTL simulation and FPGA emulation platforms show that the proposed method can speed up most benchmark programs by nearly 10 times compared to pure software-based binary translation and achieves about 70% of the performance of native program execution. The chip is fabricated in ST 65nm CMOS technology, and the physical design results show that the chip area cost is less than 5%.
Citations: 10
A high throughput FFT processor with no multipliers
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413113
S. Abdulla, Haewoon Nam, Mark McDermot, J. Abraham
A novel technique for implementing very high speed FFTs based on unrolled CORDIC structures is proposed in this paper. There has been a lot of research in the area of FFT algorithm implementation; most of the research is focused on reducing the computational complexity by selection and efficient decomposition of the FFT algorithm. However, there has not been much research on using CORDIC structures for FFT implementations, especially for large, high-speed, high-throughput FFTs, due to the recursive nature of the CORDIC algorithms. The key ideas in this paper are replacing the sine and cosine twiddle factors in the conventional FFT architecture by non-iterative CORDIC micro-rotations, which allow a substantial (~50%) reduction in read-only memory (ROM) table size, and the total removal of complex multipliers. A new method to derive the optimal unrolling/unfolding factor for a desired FFT application based on the MSE (mean square error) is also proposed in this paper. Implemented on a Virtex-4 FPGA, the CORDIC-based FFT runs 3.9 times faster and occupies 37% less area than an equivalent complex multiplier-based FFT implementation.
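As an illustration of the micro-rotation idea mentioned in the abstract, the following is a minimal textbook CORDIC rotation sketch, not the authors' architecture. The function name `cordic_rotate` and the choice of 16 stages are assumptions made here; each loop iteration stands for one stage that an unrolled design would lay out as separate hardware, and each multiplication by a power of two would be a wired shift.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians using shift-and-add micro-rotations.

    Converges for |angle| up to roughly 1.74 rad (the sum of all atan(2^-i)).
    """
    # Elementary rotation angles atan(2^-i); in hardware these are constants.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i);
    # the accumulated gain is a fixed constant for a given stage count.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotation direction for this stage
        # Only shifts (2^-i) and additions are needed per stage.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    # Remove the constant gain (shown here as a division for clarity).
    return x / gain, y / gain

# Applying a twiddle factor exp(-j*pi/4) to the sample 1 + 0j:
re, im = cordic_rotate(1.0, 0.0, -math.pi / 4)
```

Multiplying a sample by a twiddle factor of unit magnitude is a pure rotation, which is why a rotation unit of this kind can stand in for a complex multiplier and a sine/cosine table.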
Citations: 33
Analysis and optimization of pausible clocking based GALS design
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413130
Xin Fan, M. Krstic, E. Grass
Pausible clocking based globally-asynchronous locally-synchronous (GALS) system design has been proven a promising approach to SoCs and NoCs. In this paper, we analyze the throughput reduction and synchronization failures introduced by the widely used pausible clocking scheme, and propose an optimized scheme for higher throughput and more reliable GALS design. The local clock generator is improved to minimize the acknowledge latency, and a novel input port is applied to maximize the safe timing region for the clock tree insertion. Simulation results using the IHP 0.13-µm standard CMOS process demonstrate that up to one-third increase in data throughput and an almost doubled safe timing region for clock tree distribution can be achieved in comparison to the traditional pausible clocking scheme.
Citations: 28
A disruptive computer design idea: Architectures with repeatable timing
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413177
S. Edwards, Sungjun Kim, Edward A. Lee, Isaac Liu, Hiren D. Patel, Martin Schoeberl
This paper argues that repeatable timing is more important and more achievable than predictable timing. It describes microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques. Specifically, threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.
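The thread-interleaving claim can be illustrated with a small scheduling sketch. This is a simplified model written for this listing, not the authors' microarchitecture: the function names and the strict round-robin policy are assumptions. The point it shows is that when the number of interleaved threads is at least the pipeline depth, a thread's previous instruction has left the pipeline before its next one issues, so no intra-thread hazard can arise, and the issue slot of every instruction is fixed regardless of data.

```python
def interleaved_schedule(num_threads, num_cycles):
    """Thread id that issues in each cycle under strict round-robin interleaving."""
    return [cycle % num_threads for cycle in range(num_cycles)]

def hazard_free(schedule, pipeline_depth):
    """True if consecutive instructions of every thread are issued at least
    `pipeline_depth` cycles apart, i.e. the earlier one has already completed."""
    last_issue = {}
    for cycle, thread_id in enumerate(schedule):
        if thread_id in last_issue and cycle - last_issue[thread_id] < pipeline_depth:
            return False
        last_issue[thread_id] = cycle
    return True

# Four threads on a four-stage pipeline never overlap with themselves;
# two threads on the same pipeline do.
enough_threads = hazard_free(interleaved_schedule(4, 32), pipeline_depth=4)
too_few_threads = hazard_free(interleaved_schedule(2, 32), pipeline_depth=4)
```

Because the schedule depends only on the cycle counter, rerunning the same program yields the same issue cycles, which is the sense of "repeatable" used in this sketch.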
Citations: 45
ColSpace: Towards algorithm/implementation co-optimization
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413125
Jiawei Huang, J. Lach
Application-specific integrated circuits (ASICs) are physical implementations of algorithms, so implementation metrics are determined in large part by the algorithm specification. However, the system abstraction layers that have been developed to manage the ever-increasing complexity of digital systems separate algorithm designers from hardware designers, forcing the latter to work within the design space specified by the former, even for applications such as multimedia that do not have hard fidelity requirements. Designers typically employ informal iterative design to adjust fidelity, but a formal design methodology would increase designer efficiency and improve the quality of the solutions. This paper introduces such a methodology (and accompanying tool) that enables algorithm and implementation metrics to be co-optimized during early design exploration, opening the design space to include solutions that may provide, for example, significant performance improvements while only slightly compromising fidelity. Hierarchical dependency graphs (HDGs) are used to represent both the algorithm and the implementation architecture, providing a common interface through which algorithm designers and hardware designers can explore the collaborative space (ColSpace) together. Using the proposed technique, the ColSpace tool can trade off various metrics to find the best overall design while managing complexity with the HDG hierarchy. Two image processing case studies demonstrate that in ColSpace-optimized designs, latency savings can exceed fidelity losses, resulting in cost function reductions that would not have been possible without this co-optimization methodology.
Citations: 3
Topology-driven cell layout migration with collinear constraints
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413118
De-Shiun Fu, Ying-Zhih Chaung, Yen-Hung Lin, Yih-Lang Li
Traditional layout migration focuses on area minimization and therefore suffers from wire distortion, which causes loss of layout topology. A migrated layout that inherits the original topology retains the original design intent and predictable properties, such as wire length, which strongly determines path delay. This work presents a new rectangular topological layout that preserves layout topology and combines its flexibility in handling wires with a traditional scan-line-based compaction algorithm for area minimization. The proposed migration flow comprises device and wire extraction, topological layout construction, unidirectional compression combining a scan-line algorithm with a collinear equation solver, and wire restoration. Experimental results show that cell topology is well preserved and that runtime is several times faster than recent migration research based on an ILP (integer linear programming) formulation.
Citations: 9
Panoptic DVS: A fine-grained dynamic voltage scaling framework for energy scalable CMOS design
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413110
M. Putic, Liang Di, B. Calhoun, J. Lach
The energy efficiency of a CMOS architecture processing dynamic workloads directly affects its ability to provide long battery lifetimes while maintaining required application performance. Existing scalable architecture design approaches are often limited in scope, focusing either only on circuit-level optimizations or architectural adaptations individually. In this paper, we propose a circuit/architecture co-design methodology called Panoptic Dynamic Voltage Scaling (PDVS) that makes more efficient use of common circuit structures and algorithm-level processing rate control. PDVS expands upon prior work by using multiple component-level PMOS header switches to enable fine-grained rate control, allowing efficient dithering among statically scheduled algorithms with sub-block energy savings. This way, PDVS is able to achieve a wide variety of processing rates to match incoming workload as closely as possible, while each iteration takes less energy to process than on architectures with coarser levels of rate control. Measurements taken from a fabricated 90nm test chip characterize both savings and overheads and are used to inform PDVS synthesis decisions. Results show that PDVS consumes up to 34% and 44% less energy than Multi-VDD and Single-VDD systems, respectively.
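The benefit of dithering between operating points can be sketched with simple arithmetic. The model below is an assumption made for illustration and does not come from the paper: two hypothetical operating points are each described by a processing rate and an energy per operation, and the design alternates between them so that the time-averaged rate matches the workload. All numbers are invented.

```python
def dither_fraction(target_rate, rate_lo, rate_hi):
    """Fraction of time to spend at the high rate so that the
    time-averaged rate equals `target_rate`."""
    if not rate_lo <= target_rate <= rate_hi:
        raise ValueError("target rate must lie between the two operating points")
    return (target_rate - rate_lo) / (rate_hi - rate_lo)

def dithered_energy(target_rate, lo, hi):
    """Average energy per unit time when dithering between two operating points.

    `lo` and `hi` are (rate, energy_per_operation) pairs (hypothetical values).
    """
    f = dither_fraction(target_rate, lo[0], hi[0])
    return f * hi[0] * hi[1] + (1.0 - f) * lo[0] * lo[1]

# Hypothetical points: the high point is twice as fast but costs 4x per operation.
low_point = (1.0, 1.0)
high_point = (2.0, 4.0)

dithered = dithered_energy(1.5, low_point, high_point)
# Baseline: run every operation at the high point and idle the rest of the time.
single_point = 1.5 * high_point[1]
```

With these invented numbers, dithering spends 4.5 energy units per unit time against 6.0 for the single-point baseline, because half of the time is spent at the cheaper operating point.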
Citations: 26
N-way ring and square arbiters
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413164
Masashi Imai, T. Yoneda, T. Nanya
In this paper, we propose two new N-way arbiter circuits. One circuit is based on token-ring arbiters and the other is based on mesh arbiters. The idea of the ring arbiter is to generate a lock signal from a token that uses non-return-to-zero signaling. It can achieve low-latency, high-throughput arbitration in a heavy-workload environment. The idea of the mesh arbiter is to perform arbitrations between N/2 pairs at the same level and repeat them N-1 times. They can issue grant signals fairly. In this paper, we compare the performance of these N-way arbiters qualitatively and quantitatively using 65nm process technologies. We conclude that the proposed mesh arbiters are suitable when the number of inputs is 5 or less. We also conclude that, when the number of inputs is larger than 5, the appropriate arbiter must be selected by weighing the tradeoff between latency, throughput, area, and energy.
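A behavioral sketch of token-based ring arbitration is given below. It is a generic software model under assumptions made here (the function name, the list-of-booleans request format, and the rule that the token moves just past the granted input); the circuits in the paper are asynchronous hardware and are not reproduced by this code.

```python
def ring_arbitrate(requests, token):
    """Grant the first requester found at or after the token position,
    walking the ring in a fixed direction.

    Returns (granted_index or None, next_token_position).
    """
    n = len(requests)
    for step in range(n):
        idx = (token + step) % n
        if requests[idx]:
            # Move the token past the winner so it has lowest priority next time.
            return idx, (idx + 1) % n
    return None, token  # nobody is requesting; the token stays put

# Three of four inputs request continuously; grants rotate among them.
requests = [True, False, True, True]
token = 0
grants = []
for _ in range(6):
    granted, token = ring_arbitrate(requests, token)
    grants.append(granted)
```

Under continuous requests the grant order in this model cycles through the requesters, which is one simple way to see why a circulating token yields fair service.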
Citations: 6
Accurate estimation of vector dependent leakage power in the presence of process variations
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413116
Romana Fernandes, R. Vemuri
With the increasing importance of run-time leakage power dissipation (around 55% of total power), it has become necessary to accurately estimate it not only as a function of input vectors but also as a function of process parameters. Leakage power corresponding to the maximum-leakage vector serves as an upper bound for run-time leakage and is a measure of reliability. In this work, we address the problem of accurately estimating the probabilistic distribution of the maximum run-time leakage power in the presence of variations in process parameters such as threshold voltage, critical dimensions and doping concentration. Both sub-threshold and gate leakage current are considered. A heuristic approach is proposed to determine the vector that causes the maximum leakage power under the influence of random process variations. This vector is then used to estimate the lognormal distribution of the total leakage current of the circuit by summing up the lognormal leakage current distributions of the individual standard cells at their respective input levels. The proposed method has been effective in accurately estimating the leakage mean, standard deviation and probability density function (PDF) of ISCAS-85 benchmark circuits. Compared with near-exhaustive random vector testing, the average errors of our method for the mean and standard deviation are 1.32% and 1.41%, respectively.
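One common way to approximate a sum of lognormal variables by a single lognormal is moment matching (often called the Fenton-Wilkinson approximation). The abstract does not say which summation technique the authors use, so the sketch below is a substitute shown only to make the step concrete. It also assumes the per-cell leakage currents are independent, which ignores correlated process variation; the parameter values are invented.

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and variance of a lognormal with log-domain parameters (mu, sigma)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, var

def sum_lognormals(params):
    """Approximate the sum of independent lognormals by one lognormal whose
    mean and variance match those of the sum.

    `params` is a list of (mu, sigma) pairs, one per cell (hypothetical data).
    Returns the (mu, sigma) of the matched lognormal.
    """
    total_mean = 0.0
    total_var = 0.0
    for mu, sigma in params:
        mean, var = lognormal_moments(mu, sigma)
        total_mean += mean  # means always add
        total_var += var    # variances add only under independence
    sigma_sq = math.log(1.0 + total_var / total_mean ** 2)
    mu_sum = math.log(total_mean) - sigma_sq / 2.0
    return mu_sum, math.sqrt(sigma_sq)

# Invented log-domain leakage parameters for three cells.
cells = [(0.5, 0.3), (0.2, 0.4), (-0.1, 0.25)]
mu_total, sigma_total = sum_lognormals(cells)
```

By construction the matched distribution reproduces the mean and variance of the sum exactly; how well its tails agree with the true distribution depends on the spread of the individual cells.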
Citations: 8