SHIELDSTRAP: Making secure processors truly secure
Siddhartha Chhabra, Brian Rogers, Yan Solihin
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413140
Many systems may have security requirements such as protecting the privacy of data and code stored in the system, ensuring integrity of computations, or preventing the execution of unauthorized code. It is becoming increasingly difficult to ensure such protections as hardware-based attacks, in addition to software attacks, become more widespread and feasible. Many of these attacks target a system during boot, before any employed security measures can take effect. In this paper, we propose SHIELDSTRAP, a security architecture capable of booting a system securely in the face of hardware and software attacks targeting the boot phase. SHIELDSTRAP bridges the gap between the vulnerable initialization of the system and the secure steady-state execution environment provided by the secure processor. We present an analysis of the security of SHIELDSTRAP against several common boot-time attacks. We also show that SHIELDSTRAP requires an on-chip area overhead of only 0.012% and incurs a negligible boot-time overhead of 0.37 seconds.
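The abstract does not reproduce SHIELDSTRAP's hardware design, but the chain-of-trust pattern that secure boot architectures of this kind build on can be sketched in a few lines. The stage images and the idea of an on-chip golden-digest store below are illustrative assumptions, not SHIELDSTRAP's actual mechanism:

```python
import hashlib

# Illustrative golden images; a real system would provision only their
# digests, held in tamper-resistant on-chip storage.
GOLDEN_IMAGES = {"bootloader": b"stage-1 image bytes",
                 "kernel":     b"stage-2 image bytes"}
EXPECTED = {name: hashlib.sha256(img).hexdigest()
            for name, img in GOLDEN_IMAGES.items()}

def measure(stage: str, image: bytes) -> bool:
    """Hash a boot-stage image and compare it to the expected digest.

    In a hardware chain of trust, each stage is measured before control
    transfers to it; a mismatch halts the boot instead of running
    tampered code.
    """
    return hashlib.sha256(image).hexdigest() == EXPECTED[stage]

loaded = b"stage-1 image bytes"   # what was actually read from flash
if measure("bootloader", loaded):
    print("bootloader verified, transferring control")
else:
    print("measurement mismatch: halting boot")
```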
{"title":"SHIELDSTRAP: Making secure processors truly secure","authors":"Siddhartha Chhabra, Brian Rogers, Yan Solihin","doi":"10.1109/ICCD.2009.5413140","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413140","url":null,"abstract":"Many systems may have security requirements such as protecting the privacy of data and code stored in the system, ensuring integrity of computations, or preventing the execution of unauthorized code. It is becoming increasingly difficult to ensure such protections as hardware-based attacks, in addition to software attacks, become more widespread and feasible. Many of these attacks target a system during booting before any employed security measures can take effect. In this paper, we propose SHIELDSTRAP, a security architecture capable of booting a system securely in the face of hardware and software attacks targeting the boot phase. SHIELDSTRAP bridges the gap between the vulnerable initialization of the system and the secure steady state execution environment provided by the secure processor. We present an analysis of the security of SHIELDSTRAP against several common boot time attacks. We also show that SHIELDSTRAP requires an on-chip area overhead of only 0.012% and incurs negligible boot time overhead of 0.37 seconds.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"270 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133474317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient binary translation system with low hardware cost
Weiwu Hu, Qi Liu, Jian Wang, Songsong Cai, Menghao Su, Xiaoyun Li
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413138
Binary translation is one of the most important approaches for system migration. However, software-only binary translation systems often suffer from inefficiency, while traditional hardware-software co-designed virtual machines require re-design of the processor architecture. This paper presents a novel hardware-software co-designed method to accelerate binary translation on an existing architecture. Hardware support is proposed for source-architecture-only functions, partial decoding, and acceleration of the binary translation system. This support helps the binary translation system achieve high performance and simplifies the design of the binary translation software, while keeping the hardware cost low. The support is implemented in Godson-3 processors to speed up x86 binary translation to the native MIPS instruction set. Performance evaluations on RTL simulation and FPGA emulation platforms show that the proposed method speeds up most benchmark programs by nearly 10 times compared to pure software binary translation and achieves about 70% of native execution performance. The chip is fabricated in ST 65nm CMOS technology, and physical design results show that the added hardware costs less than 5% of the chip area.
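Godson-3's x86-to-MIPS translator is hardware-assisted and far more involved, but the core software loop of any dynamic binary translator (translate a guest basic block on first execution, cache the result, and reuse it on every later visit) can be sketched as follows. The toy guest ISA, the one-host-op-per-guest-op translation rule, and the fall-through control flow are invented for illustration:

```python
# Translation-cache sketch: hot guest code pays translation cost once.
code_cache = {}  # guest entry PC -> list of host "instructions"

def translate_block(guest_code, pc):
    """Translate one guest basic block starting at pc (ends at 'BR')."""
    host = []
    while pc < len(guest_code):
        op = guest_code[pc]
        host.append(f"host_{op}")   # stand-in for real instruction selection
        pc += 1
        if op == "BR":              # a branch terminates the basic block
            break
    return host

def run(guest_code, pc=0):
    while pc < len(guest_code):
        if pc not in code_cache:              # translation-cache miss
            code_cache[pc] = translate_block(guest_code, pc)
        block = code_cache[pc]
        print("executing", block)             # stand-in for host execution
        pc += len(block)                      # 1:1 translation here, so the
                                              # guest pc advances by block size
run(["ADD", "LD", "BR", "SUB", "ST"])
```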
{"title":"Efficient binary translation system with low hardware cost","authors":"Weiwu Hu, Qi Liu, Jian Wang, Songsong Cai, Menghao Su, Xiaoyun Li","doi":"10.1109/ICCD.2009.5413138","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413138","url":null,"abstract":"Binary translation is one of the most important approaches for system migration. However, software binary translation systems often suffer from the inefficiency and traditional hardware-software co-designed virtual machines require the unavoidable re-design of the processor architecture. This paper presents a novel hardware-software co-designed method to accelerate the binary translation on an existing architecture. The hardware supports for source-architecture-only functions, partial decodes and binary translation system acceleration are proposed. These hardware supports help the binary translation system to achieve high performance and simplify the design of the binary translation software. In the meantime, the hardware cost is well controlled in a certain low level. These supports are implemented in Godson-3 processors to speedup the x86 binary translation to the native MIPS instruction set. Performance evaluations on RTL simulation and FPGA emulation platforms show that the proposed method can speedup most benchmark programs by nearly 10 times compared to pure software-based binary translation and achieves about 70% performance of the native program execution. The chip is fabricated in ST 65nm CMOS technology, and the physical design results show that the chip area cost is less than 5%.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125924884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A high throughput FFT processor with no multipliers
S. Abdulla, Haewoon Nam, Mark McDermot, J. Abraham
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413113
A novel technique for implementing very high speed FFTs based on unrolled CORDIC structures is proposed in this paper. There has been a great deal of research on FFT algorithm implementation, most of it focused on reducing computational complexity through selection and efficient decomposition of the FFT algorithm. However, there has been little research on using CORDIC structures for FFT implementations, especially for large, high-speed, high-throughput FFT transforms, due to the recursive nature of CORDIC algorithms. The key ideas in this paper are replacing the sine and cosine twiddle factors in the conventional FFT architecture with non-iterative CORDIC micro-rotations, which allows a substantial (~50%) reduction in read-only memory (ROM) table size and the complete removal of complex multipliers. A new method to derive the optimal unrolling/unfolding factor for a desired FFT application based on the mean square error (MSE) is also proposed. Implemented on a Virtex-4 FPGA, the CORDIC-based FFT runs 3.9 times faster and occupies 37% less area than an equivalent complex-multiplier-based FFT implementation.
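The paper's unrolled, fixed-point datapath is not shown in the abstract; the floating-point sketch below illustrates the CORDIC micro-rotation recurrence itself, which in the proposed FFT stands in for the twiddle-factor multiply. The iteration count of 16 is an arbitrary choice for the example:

```python
import math

def cordic_rotate(angle, iterations=16):
    """Compute (cos, sin) of `angle` (radians, |angle| < pi/2) with CORDIC.

    Each micro-rotation needs only shifts and adds in hardware; the table
    of atan(2^-i) constants plays the role the ROM twiddle table plays in
    a conventional FFT butterfly, but is far smaller.
    """
    atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
    # CORDIC gain: product of 1/sqrt(1 + 2^-2i); pre-scale x to undo it.
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * atan_table[i])
    return x, y                               # (cos(angle), sin(angle))

c, s = cordic_rotate(0.7)
print(c, math.cos(0.7))   # agree to ~4-5 decimal places at 16 iterations
print(s, math.sin(0.7))
```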
{"title":"A high throughput FFT processor with no multipliers","authors":"S. Abdulla, Haewoon Nam, Mark McDermot, J. Abraham","doi":"10.1109/ICCD.2009.5413113","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413113","url":null,"abstract":"A novel technique for implementing very high speed FFTs based on unrolled CORDIC structures is proposed in this paper. There has been a lot of research in the area of FFT algorithm implementation; most of the research is focused on reduction of the computational complexity by selection and efficient decomposition of the FFT algorithm. However there has not been much research on using the CORDIC structures for FFT implementations, especially for large, high speed and high throughput FFT transforms, due to the recursive nature of the CORDIC algorithms. The key ideas in this paper are replacing the sine and cosine twiddle factors in the conventional FFT architecture by non-iterative CORDIC micro-rotations which allow substantial (~ 50%) reduction in read-only memory (ROM) table size, and total removal of complex multipliers. A new method to derive the optimal unrolling/unfolding factor for a desired FFT application based on the MSE (mean square error) is also proposed in this paper. Implemented on a Virtex-4 FPGA, the CORDIC based FFT runs 3.9 times faster and occupies 37% less area than an equivalent complex multiplier-based FFT implementation.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127438200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis and optimization of pausible clocking based GALS design
Xin Fan, M. Krstic, E. Grass
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413130
Pausible-clocking-based globally-asynchronous locally-synchronous (GALS) system design has proven to be a promising approach for SoCs and NoCs. In this paper, we analyze the throughput reduction and synchronization failures introduced by the widely used pausible clocking scheme and propose an optimized scheme for higher-throughput, more reliable GALS design. The local clock generator is improved to minimize the acknowledge latency, and a novel input port is used to maximize the safe timing region for clock tree insertion. Simulation results using the IHP 0.13-µm standard CMOS process demonstrate up to a one-third increase in data throughput and an almost doubled safe timing region for clock tree distribution compared to the traditional pausible clocking scheme.
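As a back-of-envelope illustration of why acknowledge latency costs throughput in pausible clocking (all figures below are invented for the example, not taken from the paper): a data transfer can stretch the paused local clock cycle by roughly the arbitration's acknowledge latency, so shaving that latency translates directly into transfer rate.

```python
T_clk = 2.0            # nominal local clock period, ns (illustrative)
t_ack_baseline = 1.0   # acknowledge latency, traditional port, ns
t_ack_optimized = 0.3  # acknowledge latency with improved clock generator, ns

def throughput(t_ack):
    # at most one transfer per (possibly stretched) local clock cycle
    return 1.0 / (T_clk + t_ack)

base, opt = throughput(t_ack_baseline), throughput(t_ack_optimized)
print(f"baseline:  {base:.3f} transfers/ns")
print(f"optimized: {opt:.3f} transfers/ns  (+{100 * (opt / base - 1):.0f}%)")
```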
{"title":"Analysis and optimization of pausible clocking based GALS design","authors":"Xin Fan, M. Krstic, E. Grass","doi":"10.1109/ICCD.2009.5413130","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413130","url":null,"abstract":"Pausible clocking based globally-asynchronous locally-synchronous (GALS) system design has been proven a promising approach to SoCs and NoCs. In this paper, we analyze the throughput reduction and synchronization failures introduced by the widely used pausible clocking scheme, and propose an optimized scheme for higher throughput and more reliable GALS design. The local clock generator is improved to minimize the acknowledge latency, and a novel input port is applied to maximize the safe timing region for the clock tree insertion. Simulation results using the IHP 0.13-¿m standard CMOS process demonstrate that up to one-third increase in data throughput and an almost doubled safe timing region for clock tree distribution can be achieved in comparison to the traditional pausible clocking scheme.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129948023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A disruptive computer design idea: Architectures with repeatable timing
S. Edwards, Sungjun Kim, Edward A. Lee, Isaac Liu, Hiren D. Patel, Martin Schoeberl
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413177
This paper argues that repeatable timing is more important and more achievable than predictable timing. It describes microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques. Specifically, threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.
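A minimal simulation of the thread-interleaving idea, assuming a textbook five-stage pipeline and fixed round-robin fetch (both invented for the example): with as many threads as stages, no two instructions in flight belong to the same thread, so hazards between them cannot arise and each instruction's latency is fixed by construction.

```python
# Round-robin thread interleaving: adjacent pipeline stages always hold
# instructions from different threads, so forwarding, stalls, and branch
# speculation are unnecessary and timing is repeatable.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
N = len(STAGES)
threads = {t: [f"T{t}.i{n}" for n in range(3)] for t in range(N)}

pipeline = [None] * N                      # pipeline[i] = instr in stage i
for cycle in range(N * 3 + N):             # enough cycles to drain
    pipeline = [None] + pipeline[:-1]      # everything advances one stage
    t = cycle % N                          # fixed round-robin thread select
    if threads[t]:
        pipeline[0] = threads[t].pop(0)    # fetch from the selected thread
    state = "  ".join(f"{s}:{i}" for s, i in zip(STAGES, pipeline) if i)
    print(f"cycle {cycle:2d}  {state}")
```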
{"title":"A disruptive computer design idea: Architectures with repeatable timing","authors":"S. Edwards, Sungjun Kim, Edward A. Lee, Isaac Liu, Hiren D. Patel, Martin Schoeberl","doi":"10.1109/ICCD.2009.5413177","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413177","url":null,"abstract":"This paper argues that repeatable timing is more important and more achievable than predictable timing. It describes microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise comparable or better performance compared to established techniques. Specifically, threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129456970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ColSpace: Towards algorithm/implementation co-optimization
Jiawei Huang, J. Lach
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413125
Application-specific integrated circuits (ASICs) are physical implementations of algorithms, so implementation metrics are determined in large part by the algorithm specification. However, the system abstraction layers that have been developed to manage the ever-increasing complexity of digital systems separate algorithm designers from hardware designers, forcing the latter to work within the design space specified by the former, even for applications such as multimedia that do not have hard fidelity requirements. Designers typically employ informal iterative design to adjust fidelity, but a formal design methodology would increase designer efficiency and improve the quality of the solutions. This paper introduces such a methodology (and accompanying tool) that enables algorithm and implementation metrics to be co-optimized during early design exploration, opening the design space to include solutions that may provide, for example, significant performance improvements while only slightly compromising fidelity. Hierarchical dependency graphs (HDGs) are used to represent both the algorithm and the implementation architecture, providing a common interface through which algorithm designers and hardware designers can explore the collaborative space (ColSpace) together. Using the proposed technique, the ColSpace tool can trade off various metrics to find the best overall design while managing complexity with the HDG hierarchy. Two image processing case studies demonstrate that in ColSpace-optimized designs, latency savings can exceed fidelity losses, resulting in cost function reductions that would not have been possible without this co-optimization methodology.
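A toy version of the co-optimization objective (the design points and weights are invented for the example; the paper's tool explores such spaces through hierarchical dependency graphs rather than flat enumeration): instead of fixing fidelity first and minimizing latency second, score every candidate under one cost function, so that a small fidelity concession can buy a large latency win.

```python
design_points = [
    # (name, latency in ms, fidelity loss, e.g. 1 - SSIM) -- all invented
    ("exact filter",      10.0, 0.000),
    ("truncated filter",   6.0, 0.004),
    ("subsampled filter",  3.5, 0.020),
]

w_latency, w_fidelity = 1.0, 500.0   # application-specific weights

def cost(latency, loss):
    return w_latency * latency + w_fidelity * loss

for name, lat, loss in design_points:
    print(f"{name:18s} cost = {cost(lat, loss):6.2f}")
best = min(design_points, key=lambda p: cost(p[1], p[2]))
print("selected:", best[0])   # the truncated filter wins: latency saved
                              # outweighs the small fidelity loss
```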
{"title":"ColSpace: Towards algorithm/implementation co-optimization","authors":"Jiawei Huang, J. Lach","doi":"10.1109/ICCD.2009.5413125","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413125","url":null,"abstract":"Application-specific integrated circuits (ASICs) are physical implementations of algorithms, so implementation metrics are determined in large part by the algorithm specification. However, the system abstraction layers that have been developed to manage the ever-increasing complexity of digital systems separate algorithm designers from hardware designers, forcing the latter to work within the design space specified by the former, even for applications such as multimedia that do not have hard fidelity requirements. Designers typically employ informal iterative design to adjust fidelity, but a formal design methodology would increase designer efficiency and improve the quality of the solutions. This paper introduces such a methodology (and accompanying tool) that enables algorithm and implementation metrics to be co-optimized during early design exploration, opening the design space to include solutions that may provide, for example, significant performance improvements while only slightly compromising fidelity. Hierarchical dependency graphs (HDGs) are used to represent both the algorithm and the implementation architecture, providing a common interface through which algorithm designers and hardware designers can explore the collaborative space (ColSpace) together. Using the proposed technique, the ColSpace tool can trade off various metrics to find the best overall design while managing complexity with the HDG hierarchy. Two image processing case studies demonstrate that in ColSpace-optimized designs, latency savings can exceed fidelity losses, resulting in cost function reductions that would not have been possible without this co-optimization methodology.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117227750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topology-driven cell layout migration with collinear constraints
De-Shiun Fu, Ying-Zhih Chaung, Yen-Hung Lin, Yih-Lang Li
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413118
Traditional layout migration focuses on area minimization and thus suffers from wire distortion, which causes loss of layout topology. A migrated layout that inherits the original topology retains the original design intent and its predictable properties, such as the wire lengths that largely determine path delays. This work presents a new rectangular topological layout representation that preserves layout topology and combines its flexibility in handling wires with a traditional scan-line-based compaction algorithm for area minimization. The proposed migration flow consists of device and wire extraction, topological layout construction, unidirectional compression combining the scan-line algorithm with a collinear-equation solver, and wire restoration. Experimental results show that cell topology is well preserved and that a severalfold runtime speedup is achieved compared with recent migration research based on integer linear programming (ILP) formulations.
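A one-dimensional sketch of the scan-line compaction step, with invented shapes and a single spacing rule: shapes are swept in coordinate order and packed as tightly as design rules allow, while the sweep order itself guarantees no shape overtakes a neighbor, which is the topology-preservation constraint in miniature.

```python
MIN_SPACING = 2   # illustrative design rule

# (name, x, width); the y-dimension is omitted so every pair interacts
shapes = [("A", 0, 4), ("B", 9, 3), ("C", 20, 5)]

compacted = []
frontier = -MIN_SPACING               # right edge of the last placed shape
for name, x, w in sorted(shapes, key=lambda s: s[1]):   # scan-line order
    new_x = frontier + MIN_SPACING    # leftmost legal position
    compacted.append((name, new_x, w))
    frontier = new_x + w
print(compacted)   # [('A', 0, 4), ('B', 6, 3), ('C', 11, 5)]
```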
{"title":"Topology-driven cell layout migration with collinear constraints","authors":"De-Shiun Fu, Ying-Zhih Chaung, Yen-Hung Lin, Yih-Lang Li","doi":"10.1109/ICCD.2009.5413118","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413118","url":null,"abstract":"Traditional layout migration focuses on area minimization, thus suffered wire distortion, which caused loss of layout topology. A migrated layout inheriting original topology owns original design intention and predictable property, such as wire length which determines the path delay importantly. This work presents a new rectangular topological layout to preserve layout topology and combine its flexibility of handling wires with traditional scan-line based compaction algorithm for area minimization. The proposed migration flow contains devices and wires extraction, topological layout construction, unidirectional compression combining scan-line algorithm with collinear equation solver, and wire restoration. Experimental results show that cell topology is well preserved, and a several times runtime speedup is achieved as compared with recent migration research based on ILP (integer linear programming) formulation.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132526903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panoptic DVS: A fine-grained dynamic voltage scaling framework for energy scalable CMOS design
M. Putic, Liang Di, B. Calhoun, J. Lach
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413110
The energy efficiency of a CMOS architecture processing dynamic workloads directly affects its ability to provide long battery lifetimes while maintaining required application performance. Existing scalable architecture design approaches are often limited in scope, focusing on either circuit-level optimizations or architectural adaptations in isolation. In this paper, we propose a circuit/architecture co-design methodology called Panoptic Dynamic Voltage Scaling (PDVS) that makes more efficient use of common circuit structures and algorithm-level processing rate control. PDVS expands upon prior work by using multiple component-level PMOS header switches to enable fine-grained rate control, allowing efficient dithering among statically scheduled algorithms with sub-block energy savings. This way, PDVS is able to achieve a wide variety of processing rates to match incoming workload as closely as possible, while each iteration takes less energy to process than on architectures with coarser levels of rate control. Measurements taken from a fabricated 90nm test chip characterize both savings and overheads and are used to inform PDVS synthesis decisions. Results show that PDVS consumes up to 34% and 44% less energy than Multi-VDD and Single-VDD systems, respectively.
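A toy energy model of rate dithering between two operating points, the kind of fine-grained control PDVS applies per component. The voltages, rates, the E ∝ CV² switching-energy model, and the zero-idle-power assumption are all illustrative, not figures from the paper:

```python
C = 1.0                             # normalized switched capacitance
points = [(0.6, 0.4), (1.0, 1.0)]   # (VDD in volts, normalized throughput)

def e_op(v):
    """Switching energy per operation, E ~ C * V^2 (toy model)."""
    return C * v * v

def dithered_energy_per_op(target_rate):
    (v_lo, r_lo), (v_hi, r_hi) = points
    t_hi = (target_rate - r_lo) / (r_hi - r_lo)  # time fraction at high point
    rate = t_hi * r_hi + (1 - t_hi) * r_lo       # average throughput == target
    power = t_hi * e_op(v_hi) * r_hi + (1 - t_hi) * e_op(v_lo) * r_lo
    return power / rate                          # average energy per operation

print(f"dithered E/op at 70% rate: {dithered_energy_per_op(0.7):.3f}")
print(f"full-VDD then idle E/op:   {e_op(1.0):.3f}")  # idle power assumed zero
```

Dithering between the two points meets the 70% target at roughly 0.82 units per operation versus 1.0 for racing at full VDD, which is the intuition behind matching workload with fine-grained rate control.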
{"title":"Panoptic DVS: A fine-grained dynamic voltage scaling framework for energy scalable CMOS design","authors":"M. Putic, Liang Di, B. Calhoun, J. Lach","doi":"10.1109/ICCD.2009.5413110","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413110","url":null,"abstract":"The energy efficiency of a CMOS architecture processing dynamic workloads directly affects its ability to provide long battery lifetimes while maintaining required application performance. Existing scalable architecture design approaches are often limited in scope, focusing either only on circuit-level optimizations or architectural adaptations individually. In this paper, we propose a circuit/architecture co-design methodology called Panoptic Dynamic Voltage Scaling (PDVS) that makes more efficient use of common circuit structures and algorithm-level processing rate control. PDVS expands upon prior work by using multiple component-level PMOS header switches to enable fine-grained rate control, allowing efficient dithering among statically scheduled algorithms with sub-block energy savings. This way, PDVS is able to achieve a wide variety of processing rates to match incoming workload as closely as possible, while each iteration takes less energy to process than on architectures with coarser levels of rate control. Measurements taken from a fabricated 90nm test chip characterize both savings and overheads and are used to inform PDVS synthesis decisions. Results show that PDVS consumes up to 34% and 44% less energy than Multi-VDD and Single-VDD systems, respectively.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131465545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N-way ring and square arbiters
Masashi Imai, T. Yoneda, T. Nanya
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413164
In this paper, we propose two new N-way arbiter circuits: one based on token-ring arbiters and the other on mesh arbiters. The idea of the ring arbiter is to generate a lock signal using a token based on non-return-to-zero signaling; it achieves low-latency, high-throughput arbitration under heavy workloads. The idea of the mesh arbiter is to perform arbitrations between N/2 pairs at the same level and repeat them N-1 times, which allows grant signals to be issued fairly. We compare the performance of these N-way arbiters in a 65nm process technology, both qualitatively and quantitatively. We conclude that the proposed mesh arbiters are suitable when the number of inputs is 5 or less, and that for more than 5 inputs the appropriate arbiter must be selected by weighing the tradeoff among latency, throughput, area, and energy.
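The abstract's description (N/2 disjoint pairwise arbitrations per level, repeated N-1 times) matches the classic round-robin pairing schedule, so a behavioral sketch can be built on it. The elimination rule below (the requester with fewer past grants survives) is an invented stand-in for the paper's circuit-level fairness mechanism, not its actual design:

```python
def round_robin_pairs(n):
    """Yield n-1 rounds of n/2 disjoint pairs covering every input pair once.

    Classic circle method; assumes an even number of inputs.
    """
    ids = list(range(n))
    for _ in range(n - 1):
        yield [(ids[i], ids[n - 1 - i]) for i in range(n // 2)]
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]   # rotate all but the first

def arbitrate(requests, grants_so_far):
    alive = {i for i, r in enumerate(requests) if r}
    for rnd in round_robin_pairs(len(requests)):
        for a, b in rnd:
            if a in alive and b in alive:
                # fairness: the requester with fewer past grants survives
                loser = a if grants_so_far[a] > grants_so_far[b] else b
                alive.discard(loser)
    return next(iter(alive), None)   # exactly one survivor if any request

grants = [0, 0, 0, 0]
for _ in range(6):
    winner = arbitrate([1, 1, 0, 1], grants)
    grants[winner] += 1
print(grants)   # [2, 2, 0, 2]: requesters 0, 3, 1 are granted in rotation
```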
{"title":"N-way ring and square arbiters","authors":"Masashi Imai, T. Yoneda, T. Nanya","doi":"10.1109/ICCD.2009.5413164","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413164","url":null,"abstract":"In this paper, we propose two new N-way arbiter circuits. One circuit is based on the token-ring arbiters and another circuit is based on the mesh arbiters. The idea of the ring arbiter is to generate a lock signal by a token which is based on the non-return-to-zero signaling. It can achieve low latency and high throughput arbitration for a heavy work load environment. The idea of the mesh arbiter is to perform arbitrations between N/2 pairs at the same level and repeat them N-1 times. They can issue grant signals fairly. In this paper, we compare the performance of these N-way arbiters using 65nm process technologies qualitatively and quantitatively. We conclude that the proposed mesh arbiters are suitable when the number of inputs is 5 or less. We also conclude that we must select the appropriate arbiters considering tradeoff between latency, throughput, area, and energy when the number of inputs is larger than 5.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126775074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate estimation of vector dependent leakage power in the presence of process variations
Romana Fernandes, R. Vemuri
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413116
With run-time leakage power accounting for a growing share of total power dissipation (around 55%), it has become necessary to estimate it accurately, not only as a function of input vectors but also as a function of process parameters. The leakage power corresponding to the maximum-leakage vector is an upper bound on run-time leakage and a measure of reliability. In this work, we address the problem of accurately estimating the probability distribution of the maximum run-time leakage power in the presence of variations in process parameters such as threshold voltage, critical dimensions, and doping concentration. Both sub-threshold and gate leakage currents are considered. A heuristic approach is proposed to determine the vector that causes the maximum leakage power under random process variations. This vector is then used to estimate the lognormal distribution of the circuit's total leakage current by summing the lognormal leakage current distributions of the individual standard cells at their respective input levels. The proposed method accurately estimates the leakage mean, standard deviation, and probability density function (PDF) of the ISCAS-85 benchmark circuits. Compared with near-exhaustive random vector testing, the average errors for the mean and standard deviation are 1.32% and 1.41%, respectively.
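The summation at the heart of the estimate can be illustrated with Monte Carlo (the paper instead sums the per-cell lognormals analytically, and the cell parameters below are invented): each cell's leakage is lognormal because sub-threshold current depends exponentially on normally distributed parameters such as threshold voltage, and the circuit total is the sum of the cells' currents.

```python
import random
import statistics

random.seed(1)

# (mu, sigma) of ln(I_leak) for each cell at its input state -- invented
cells = [(-2.0, 0.25), (-1.5, 0.30), (-2.3, 0.20), (-1.8, 0.35)]

samples = []
for _ in range(20_000):
    # total circuit leakage = sum of per-cell lognormal leakage currents
    total = sum(random.lognormvariate(mu, sigma) for mu, sigma in cells)
    samples.append(total)

print(f"mean total leakage: {statistics.mean(samples):.4f}")
print(f"std  total leakage: {statistics.stdev(samples):.4f}")
```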
{"title":"Accurate estimation of vector dependent leakage power in the presence of process variations","authors":"Romana Fernandes, R. Vemuri","doi":"10.1109/ICCD.2009.5413116","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413116","url":null,"abstract":"With the increasing importance of run-time leakage power dissipation (around 55% of total power), it has become necessary to accurately estimate it not only as a function of input vectors but also as a function of process parameters. Leakage power corresponding to the maximum vector presents itself as a higher bound for run-time leakage and is a measure of reliability. In this work, we address the problem of accurately estimating the probabilistic distribution of the maximum runtime leakage power in the presence of variations in process parameters such as threshold voltage, critical dimensions and doping concentration. Both sub-threshold and gate leakage current are considered. A heuristic approach is proposed to determine the vector that causes the maximum leakage power under the influence of random process variations. This vector is then used to estimate the lognormal distribution of the total leakage current of the circuit by summing up the lognormal leakage current distributions of the individual standard cells at their respective input levels. The proposed method has been effective in accurately estimating the leakage mean, standard deviation and probability density function (PDF) of ISCAS-85 benchmark circuits. The average errors of our method compared with near exhaustive random vector testing for mean and standard deviation are 1.32% and 1.41% respectively.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126752807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}