A radiation tolerant Phase Locked Loop design for digital electronics
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413108
R. Kumar, V. Karkala, Rajesh Garg, Tanuj Jindal, S. Khatri
With decreasing feature sizes, lowered supply voltages, and increasing operating frequencies, the radiation tolerance of digital circuits is becoming an increasingly important problem. Many radiation hardening techniques have been presented in the literature for combinational as well as sequential logic. However, the radiation tolerance of clock generation circuitry has received scant attention to date. Recently, it has been shown that in the deep submicron regime, the clock network contributes significantly to the chip-level Soft Error Rate (SER). The on-chip Phase Locked Loop (PLL) is particularly vulnerable to radiation strikes. In this paper, we present a radiation hardened PLL design. Each of the components of this design (the voltage controlled oscillator (VCO), the phase frequency detector (PFD), and the loop filter) is designed in a radiation tolerant manner. Whenever possible, the circuit elements used in our PLL exploit the fact that if a gate is implemented using only PMOS (NMOS) transistors, then a radiation particle strike can result only in a logic 0 to 1 (1 to 0) flip. By separating the PMOS and NMOS devices and splitting the gate output into two signals, extremely high levels of radiation tolerance are obtained. Our PLL is tested for radiation immunity for critical charge values up to 250 fC. Our results demonstrate that, over a large number of radiation strikes on the sensitive nodes in our design, the worst-case jitter is just 18%. In the worst case, our PLL returns to the locked state within 16 VCO clock cycles after a radiation strike.
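Below is a minimal behavioral sketch in Python of the split-output idea summarized above, not the authors' transistor-level circuit: the PMOS-only network and the NMOS-only network each drive their own output rail, a strike can corrupt each rail in only one direction, and the next stage simply tri-states and holds its value when the rails disagree. The gate functions, strike model, and rail-recombination rule are illustrative assumptions.

```python
# Behavioral sketch of the split-output hardening idea (assumed model, not the
# paper's transistor-level netlist). Each gate output is split into a PMOS-only
# rail (a strike can only force it 0 -> 1) and an NMOS-only rail (a strike can
# only force it 1 -> 0). The next stage drives its pull-up from rail_p and its
# pull-down from rail_n, so a single-rail upset merely tri-states the stage,
# which then holds its previous value instead of propagating a wrong level.

def split_inverter(a, strike_p=False, strike_n=False):
    rail_p = 0 if a else 1        # output copy produced by the PMOS-only network
    rail_n = 0 if a else 1        # output copy produced by the NMOS-only network
    if strike_p:
        rail_p = 1                # PMOS-only node: an upset can only go high
    if strike_n:
        rail_n = 0                # NMOS-only node: an upset can only go low
    return rail_p, rail_n

def next_stage_inverter(rail_p, rail_n, prev_out):
    pull_up = (rail_p == 0)       # next stage's PMOS is gated by rail_p
    pull_down = (rail_n == 1)     # next stage's NMOS is gated by rail_n
    if pull_up and not pull_down:
        return 1
    if pull_down and not pull_up:
        return 0
    return prev_out               # tri-stated: hold the previous value

# Input high: nominal rails are (0, 0) and the second-stage output is 1.
rp, rn = split_inverter(1, strike_p=True)        # upset the PMOS-only rail
print(next_stage_inverter(rp, rn, prev_out=1))   # still 1: the upset is masked

# Input low: nominal rails are (1, 1) and the second-stage output is 0.
rp, rn = split_inverter(0, strike_n=True)        # upset the NMOS-only rail
print(next_stage_inverter(rp, rn, prev_out=0))   # still 0: the upset is masked
```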
{"title":"A radiation tolerant Phase Locked Loop design for digital electronics","authors":"R. Kumar, V. Karkala, Rajesh Garg, Tanuj Jindal, S. Khatri","doi":"10.1109/ICCD.2009.5413108","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413108","url":null,"abstract":"With decreasing feature sizes, lowered supply voltages and increasing operating frequencies, the radiation tolerance of digital circuits is becoming an increasingly important problem. Many radiation hardening techniques have been presented in the literature for combinational as well as sequential logic. However, the radiation tolerance of clock generation circuitry has received scant attention to date. Recently, it has been shown that in the deep submicron regime, the clock network contributes significantly to the chip level Soft Error Rate (SER). The on-chip Phase Locked Loop (PLL) is particularly vulnerable to radiation strikes. In this paper, we present a radiation hardened PLL design. Each of the components of this design — the voltage controlled oscillator (VCO), the phase frequency detector (PFD) and the loop filter are designed in a radiation tolerant manner. Whenever possible, the circuit elements used in our PLL exploit the fact that if a gate is implemented using only PMOS (NMOS) transistors then a radiation particle strike can result only in a logic 0 to 1 (1 to 0) flip. By separating the PMOS and NMOS devices, and splitting the gate output into two signals, extreme high levels of radiation tolerance are obtained. Our PLL is tested for radiation immunity for critical charge values up to 250fC. Our results demonstrate that over a large number of radiation strikes on a number of sensitive nodes in our design, the worst case jitter is just 18%. In the worst case, our PLL returns to the locked state in 16 cycles of the VCO clock, after a radiation strike.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129487189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optical lithography simulation using wavelet transform
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413120
Rance Rodrigues, Aswin Sreedhar, S. Kundu
Optical lithography is an indispensable step in the process flow of Design for Manufacturability (DFM). Optical lithography simulation is a compute-intensive task, and simulation performance, or lack thereof, can be a determining factor in time to market. Thus, the efficiency of lithography simulation is of paramount importance. Coherent decomposition is a popular technique for aerial image simulation. In this paper, we propose an approximate simulation technique based on the 2D wavelet transform and use a number of optimization methods to further improve polygon edge detection. Results show that the proposed method suffers an average error of less than 5% when compared with the coherent decomposition method. The benefits of the proposed method are (i) a greater than 10X increase in performance and, more importantly, (ii) the ability to simulate very large circuits, whereas some commercial tools are severely capacity limited. Approximate simulation is quite attractive for layout optimization, where it may be used in a loop, and may even be acceptable for final layout verification.
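As a rough illustration of the core ingredient, the hedged Python sketch below builds a toy binary mask, keeps only the coarse approximation of a 2D wavelet decomposition, and measures the deviation of the reconstruction from the original. The use of PyWavelets, the 'haar' kernel, the toy layout, and the error metric are assumptions for the demo and are not the paper's algorithm or optimization methods.

```python
# Hedged illustration: approximate a binary mask with the coarse part of a 2D
# wavelet decomposition and measure the reconstruction error.
import numpy as np
import pywt

# Toy "mask": a 64x64 layout with two rectangular polygons.
mask = np.zeros((64, 64))
mask[10:30, 8:20] = 1.0
mask[36:54, 28:56] = 1.0

# Two-level 2D wavelet decomposition; keep only the coarse approximation.
coeffs = pywt.wavedec2(mask, wavelet="haar", level=2)
approx_only = [coeffs[0]] + [tuple(np.zeros_like(d) for d in detail)
                             for detail in coeffs[1:]]
recon = pywt.waverec2(approx_only, wavelet="haar")

# Relative error of the approximate image versus the exact one.
rel_err = np.abs(recon - mask).mean() / mask.mean()
print(f"mean relative error of wavelet approximation: {rel_err:.3f}")
```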
{"title":"Optical lithography simulation using wavelet transform","authors":"Rance Rodrigues, Aswin Sreedhar, S. Kundu","doi":"10.1109/ICCD.2009.5413120","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413120","url":null,"abstract":"Optical Lithography is an indispensible step in the process flow of Design for Manufacturability (DFM). Optical lithography simulation is a compute intensive task and simulation performance, or lack thereof can be a determining factor in time to market. Thus, the efficiency of lithography simulation is of paramount importance. Coherent decomposition is a popular simulation technique for aerial imaging simulation. In this paper, we propose an approximate simulation technique based on the 2D wavelet transform and use a number of optimization methods to further improve polygon edge detection. Results show that the proposed method suffers from an average error of less than 5% when compared with the coherent decomposition method. The benefits of the proposed method are (i) >10X increase in performance and more importantly (ii) it allows very large circuits to be simulated while some commercial tools are severely capacity limited. Approximate simulation is quite attractive for layout optimization where it may be used in a loop and may even be acceptable for final layout verification.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114874887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hierarchical approach towards system level static timing verification of SoCs
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413155
R. Chakraborty, D. R. Chowdhury
The high complexity and core diversity make timing verification of an entire flattened SoC design a tedious process. In this paper, the various timing issues related to modular SoC verification are first investigated, and then a bottom-up hierarchical approach to verifying the system-level timing of an SoC is presented. The timing abstractions of the cores are assumed to be provided by the core vendors. The interconnection delays of the SoC may be extracted from the SDF file generated after post-layout simulation. The hierarchical approach provides a fast and systematic way of timing verification, as opposed to the flattened approach. Experiments were conducted on synthetic SoCs, using ISCAS benchmark circuits as cores. The results validate the claims of the proposed approach.
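The hedged Python sketch below illustrates the bottom-up composition step: per-core timing abstractions (worst pin-to-pin arc delays) and SoC-level interconnect delays (as would be extracted from the SDF file) are summed along a top-level path and compared against the clock period. The data structures, pin names, and delay values are hypothetical, not the paper's flow or format.

```python
# Minimal sketch of hierarchical system-level timing: compose core timing
# abstractions with SoC interconnect delays (all values in ns, hypothetical).

# Timing abstractions as they might be supplied by core vendors.
core_arcs = {
    ("coreA", "in1", "out1"): 1.8,
    ("coreB", "in1", "out1"): 2.1,
}

# Interconnect delays between core pins at the SoC level (in practice these
# would come from the post-layout SDF file).
interconnect = {
    ("top_in", ("coreA", "in1")): 0.3,
    (("coreA", "out1"), ("coreB", "in1")): 0.6,
    (("coreB", "out1"), "top_out"): 0.2,
}

def path_delay(path):
    """Sum interconnect and core-arc delays along an alternating pin path."""
    total = 0.0
    for i in range(len(path) - 1):
        seg = (path[i], path[i + 1])
        if seg in interconnect:                  # wire segment at the SoC level
            total += interconnect[seg]
        else:                                    # arc through a core abstraction
            core, pin_in = path[i]
            _, pin_out = path[i + 1]
            total += core_arcs[(core, pin_in, pin_out)]
    return total

path = ["top_in", ("coreA", "in1"), ("coreA", "out1"),
        ("coreB", "in1"), ("coreB", "out1"), "top_out"]
clock_period = 6.0
delay = path_delay(path)
print(f"path delay = {delay:.1f} ns, slack = {clock_period - delay:.1f} ns")
```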
{"title":"A hierarchical approach towards system level static timing verification of SoCs","authors":"R. Chakraborty, D. R. Chowdhury","doi":"10.1109/ICCD.2009.5413155","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413155","url":null,"abstract":"The high complexity and the core diversities make timing verification of an entire flattened SoC design a tedious process. In this paper, at first the various timing issues related to modular SoC verification have been investigated and then a bottom-up hierarchical approach of verifying the system level timing of an SoC, is presented. The timing abstractions of the cores are assumed to be provided by the core vendors. The interconnection delays of the SoC may be extracted from the SDF file generated after post layout simulation. The hierarchical approach provides a fast and systematic way of timing verification, as opposed to the flattened approach. Experiments were conducted on synthetic SoCs, using ISCAS benchmark circuits as cores. Results validate the claim of the proposed approach.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122324397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resource sharing of pipelined custom hardware extension for energy-efficient application-specific instruction set processor design
Hai Lin, Yunsi Fei
Pub Date: 2009-10-04 | DOI: 10.1145/2348839.2348843
The Application-Specific Instruction set Processor (ASIP) has become an increasingly popular platform for embedded systems because of its high performance and flexibility. Energy efficiency is critical for portable and embedded devices and should be addressed separately from performance considerations. The hardware extension in ASIPs can speed up program execution, but also incurs area overhead and static energy consumption in the processor. Traditional data path merging techniques reduce circuit overhead by reusing hardware resources to execute multiple custom instructions. However, they introduce structural hazards for custom instructions on extended processors, and hence reduce the performance improvement. In this paper, we introduce a pipelined configurable hardware structure for the hardware extension in ASIPs, so that structural hazards can be remedied. With multiple subgraphs of operations selected for custom hardware realization, we devise a novel operation-to-hardware mapping algorithm based on Integer Linear Programming (ILP) to automatically construct a resource-efficient pipelined configurable hardware extension. We demonstrate that different resource sharing schemes affect both the hardware overhead and the datapath delay of the custom instructions. We analyze the design tradeoffs between resource efficiency and performance improvement, and present design space exploration results.
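To make the mapping step concrete, the following Python sketch poses a tiny operation-to-hardware binding problem as an ILP in the same spirit: binary variables bind each operation to a type-compatible unit, operations in the same pipeline stage may not share a unit (avoiding structural hazards), and the objective minimizes the total area of the instantiated units. The operation set, unit library, area numbers, and the use of the PuLP solver are illustrative assumptions, not the paper's formulation.

```python
# Toy ILP for operation-to-hardware binding with resource sharing across
# pipeline stages (illustrative data, not the paper's exact formulation).
import pulp

# (operation type, pipeline stage) for each operation of the custom instructions.
ops = {"add1": ("ADD", 0), "mul1": ("MUL", 0),
       "add2": ("ADD", 1), "mul2": ("MUL", 1)}
# (operation type, area) for each available hardware unit.
units = {"ADD0": ("ADD", 10), "ADD1": ("ADD", 10),
         "MUL0": ("MUL", 40), "MUL1": ("MUL", 40)}
stages = sorted({s for _, s in ops.values()})

prob = pulp.LpProblem("op_to_hw_mapping", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (list(ops), list(units)), cat="Binary")
use = pulp.LpVariable.dicts("use", list(units), cat="Binary")

# Objective: total area of the units that are actually instantiated.
prob += pulp.lpSum(units[u][1] * use[u] for u in units)

# Each operation is bound to exactly one type-compatible unit.
for o, (otype, _) in ops.items():
    prob += pulp.lpSum(x[o][u] for u in units if units[u][0] == otype) == 1
    for u in units:
        if units[u][0] != otype:
            prob += x[o][u] == 0

# Operations in the same pipeline stage cannot share a unit (no structural
# hazard), and binding an operation to a unit forces that unit to exist.
for u in units:
    for s in stages:
        prob += pulp.lpSum(x[o][u] for o, (_, st) in ops.items() if st == s) <= use[u]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
mapping = {o: u for o in ops for u in units if x[o][u].value() == 1}
print("mapping:", mapping, "total area:", pulp.value(prob.objective))
```

With the values above, operations in different pipeline stages share one adder and one multiplier, so the solver reports a total area of 50 rather than 100.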
{"title":"Resource sharing of pipelined custom hardware extension for energy-efficient application-specific instruction set processor design","authors":"Hai Lin, Yunsi Fei","doi":"10.1145/2348839.2348843","DOIUrl":"https://doi.org/10.1145/2348839.2348843","url":null,"abstract":"Application-Specific Instruction set Processor (ASIP) has become an increasingly popular platform for embedded systems because of its high performance and flexibility. Energy efficiency is critical for portable and embedded devices, and should be addressed separately from performance consideration. The hardware extension in ASIPs can speed-up program execution, but also incurs area overhead and static energy consumption of the processors. Traditional data path merging techniques reduce circuit overhead by reusing hardware resources for executing multiple custom instructions. However, they introduce structural hazard for custom instructions on extended processors, and hence reduce the performance improvement. In this paper, we introduce a pipelined configurable hardware structure for the hardware extension in ASIPs, so that structural hazards can be remedied. With multiple subgraphs of operations selected for custom hardware realization, we devise a novel operation-to-hardware mapping algorithm based on Integer Linear Programming (ILP) to automatically construct a resource-efficient pipelined configurable hardware extension. We demonstrate that different resource sharing schemes would affect both the hardware overhead and datapath delay of the custom instructions. We analyze the design tradeoffs between resource efficiency and performance improvement, and present the design space exploration results.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130513785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing dynamic power dissipation in pipelined forwarding engines
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413163
Weirong Jiang, V. Prasanna
Power consumption has become a limiting factor in next-generation routers. IP forwarding engines dominate the overall power dissipation in a router. Although SRAM-based pipeline architectures have recently been developed as a promising alternative to power-hungry TCAM-based solutions for high-throughput IP forwarding, it remains a challenge to achieve low power. This paper proposes several novel architecture-specific techniques to reduce the dynamic power consumption of SRAM-based pipelined IP forwarding engines. First, the pipeline architecture itself is built as an inherent cache, exploiting the data locality in Internet traffic. The number of memory accesses, which contribute the majority of the power consumption, is thus reduced, and no external cache is needed. Second, instead of using a global clock, different pipeline stages are driven by separate clocks. The local clocking scheme is carefully designed to exploit traffic rate variation and improve the caching performance. Third, a fine-grained memory enabling scheme is developed to eliminate unnecessary memory accesses while preserving the packet order. Simulation experiments using real-life traces show that our solutions can achieve up to a 15-fold reduction in dynamic power dissipation over the baseline pipeline architecture that does not employ the proposed schemes. FPGA implementation results show that our design sustains 40 Gbps throughput for minimum-size (40-byte) packets while consuming a small amount of logic resources.
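The toy Python model below illustrates the first technique only: each pipeline stage remembers the node it fetched for the previous packet, and a lookup that revisits that node skips the SRAM access, so traffic locality directly translates into fewer memory accesses. The stage count, trie size, locality parameter, and synthetic trace are assumptions, not the paper's traces or architecture.

```python
# Toy model of the inherent-caching idea: each stage caches the last node it
# fetched; a lookup that needs the same node skips the stage's SRAM access.
import random

random.seed(1)
STAGES = 8

def lookup_trace(num_packets, locality):
    """Synthetic per-stage node indices; `locality` is the probability that a
    packet revisits the previous packet's node at a given stage."""
    trace, prev = [], [0] * STAGES
    for _ in range(num_packets):
        nodes = [prev[s] if random.random() < locality else random.randrange(1024)
                 for s in range(STAGES)]
        trace.append(nodes)
        prev = nodes
    return trace

def count_accesses(trace):
    cached = [None] * STAGES
    accesses = 0
    for nodes in trace:
        for s, node in enumerate(nodes):
            if cached[s] != node:       # miss: read the stage's SRAM
                accesses += 1
                cached[s] = node
    return accesses

trace = lookup_trace(10000, locality=0.8)
base = 10000 * STAGES                   # every stage reads memory every packet
print(f"memory accesses: {count_accesses(trace)} / {base}")
```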
{"title":"Reducing dynamic power dissipation in pipelined forwarding engines","authors":"Weirong Jiang, V. Prasanna","doi":"10.1109/ICCD.2009.5413163","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413163","url":null,"abstract":"Power consumption has become a limiting factor in next-generation routers. IP forwarding engines dominate the overall power dissipation in a router. Although SRAM-based pipeline architectures have recently been developed as a promising alternative to power-hungry TCAM-based solutions for high-throughput IP forwarding, it remains a challenge to achieve low power. This paper proposes several novel architecture-specific techniques to reduce the dynamic power consumption in SRAM-based pipelined IP forwarding engines. First, the pipeline architecture itself is built as an inherent cache, exploiting the data locality in Internet traffic. The number of memory accesses which contribute to the majority of power consumption, is thus reduced. No external cache is needed. Second, instead of using a global clock, different pipeline stages are driven by separate clocks. The local clocking scheme is carefully designed to exploit the traffic rate variation and improve the caching performance. Third, a fine-grained memory enabling scheme is developed to eliminate unnecessary memory accesses, while preserving the packet order. Simulation experiments using real-life traces show that our solutions can achieve up to 15-fold reduction in dynamic power dissipation, over the baseline pipeline architecture that does not employ the proposed schemes. FPGA implementation results show that our design sustains 40 Gbps throughput for minimum size (40 bytes) packets while consuming a small amount of logic resources.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131017792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intra-vector SIMD instructions for core specialization
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413112
C. Meenderinck, B. Juurlink
Current research mainly focuses on exploiting TLP to increase performance. Another avenue for achieving performance scalability, however, is specialization. In this paper we propose application-specific intra-vector instructions for two-dimensional signal processing kernels. In such kernels, significant data rearrangement overhead is usually required in order to use the SIMD capabilities. When using intra-vector instructions, this overhead can be avoided. We have implemented intra-vector instructions in the Cell SPU core and measured speedups of up to 2.06, with an average of 1.45.
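The short Python sketch below illustrates where the savings come from: without intra-vector support, summing across a W-wide register takes log2(W) shuffle-plus-add pairs (emulated here with np.roll), whereas a dedicated intra-vector instruction would do it in one operation. The width, kernel size, and operation counts are illustrative, not Cell SPU measurements.

```python
# Illustrative operation counting plus a functional check of the emulated
# horizontal sum (not Cell SPU measurements).
import math
import numpy as np

W = 4                      # SIMD width, e.g. four 32-bit words per 128-bit register
ROWS = 8                   # vectors to reduce in a small 2D kernel

shuffle_add_ops = ROWS * 2 * int(math.log2(W))   # log2(W) shuffle+add pairs each
intra_vector_ops = ROWS                          # one hypothetical instruction each
print(f"baseline ops: {shuffle_add_ops}, intra-vector ops: {intra_vector_ops}")

def reduce_with_shuffles(vec):
    """Emulate the shuffle+add sequence needed without intra-vector support."""
    v = vec.copy()
    step = W // 2
    while step >= 1:
        v = v + np.roll(v, step)   # 'shuffle' the lanes, then element-wise add
        step //= 2
    return v[0]                    # every lane now holds the full sum

data = np.arange(ROWS * W, dtype=np.float32).reshape(ROWS, W)
assert all(reduce_with_shuffles(row) == row.sum() for row in data)
```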
{"title":"Intra-vector SIMD instructions for core specialization","authors":"C. Meenderinck, B. Juurlink","doi":"10.1109/ICCD.2009.5413112","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413112","url":null,"abstract":"Current research is mainly focussing on exploiting TLP to increase performance. Another avenue, however, for achieving performance scalability is specialization. In this paper we propose application specific intra-vector instructions for two dimensional signal processing kernels. In such kernels usually significant data rearrangement overhead is required in order to use the SIMD capabilities. When using the intra-vector instructions the overhead can be avoided. We have implemented intra-vector instructions in the Cell SPU core and measured speedups of up to 2.06, with an average of 1.45.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131989049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating mobile augmented reality on a handheld platform
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413123
Seung Eun Lee, Yong Zhang, Zhen Fang, S. Srinivasan, R. Iyer, D. Newell
Mobile Augmented Reality (MAR) is an emerging visual computing application for the mobile internet device (MID). In one MAR usage model, the user points the handheld device at an object (such as a wine bottle or a building) and the MID automatically recognizes the object and displays information about it. Achieving this in software on the handheld requires significant compute processing for object recognition and matching. In this paper, we identify hotspot functions of the MAR workload on a low-power x86 platform that motivate acceleration. We present the detailed design of two hardware accelerators, one for object recognition (MAR-HA) and the other for match processing (MAR-MA). We also quantify the performance and area efficiency of the hardware accelerators. Our analysis shows that hardware acceleration has the potential to improve the individual hotspot functions by as much as 20x, and the overall response time by 7x. As a result, user response time can be reduced significantly.
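A short worked check of the two headline numbers, using Amdahl's law: for a 20x hotspot speedup to yield roughly a 7x overall improvement, the accelerated hotspots must account for about 90% of the baseline response time. The 90% figure is derived below, not quoted from the paper.

```python
# Amdahl's-law consistency check of the abstract's 20x / 7x figures.
k = 20.0                      # hotspot speedup from hardware acceleration
s_overall = 7.0               # reported overall response-time improvement

# Amdahl's law: s_overall = 1 / ((1 - f) + f / k)  ->  solve for f.
f = (1 - 1 / s_overall) / (1 - 1 / k)
print(f"hotspots must account for about {100 * f:.1f}% of the baseline time")

# Sanity check: plug f back in.
print(f"implied overall speedup: {1 / ((1 - f) + f / k):.2f}x")
```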
{"title":"Accelerating mobile augmented reality on a handheld platform","authors":"Seung Eun Lee, Yong Zhang, Zhen Fang, S. Srinivasan, R. Iyer, D. Newell","doi":"10.1109/ICCD.2009.5413123","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413123","url":null,"abstract":"Mobile Augmented Reality (MAR) is an emerging visual computing application for the mobile internet device (MID). In one MAR usage model, the user points the handheld device to an object (like a wine bottle or a building) and the MID automatically recognizes and displays information regarding the object. Achieving this in software on the handheld requires significant compute processing for object recognition and matching. In this paper, we identify hotspot functions of the MAR workload on a low-power x86 platform that motivates acceleration. We present the detailed design of two hardware accelerators, one for object recognition (MAR-HA) and the other for match processing (MAR-MA). We also quantify the performance and area efficiency of the hardware accelerators. Our analysis shows that hardware acceleration has the potential to improve the individual hotspot functions by as much as 20x, and overall response time by 7x. As a result, user response time can be reduced significantly.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129614221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Symmetrical buffer placement in clock trees for minimal skew immune to global on-chip variations
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413180
Rensheng Wang, Takumi Okamoto, Chung-Kuan Cheng
As the feature size of VLSI circuits scales down and clock rates increase, circuit performance is becoming more sensitive to process variations. This paper proposes an algorithm for symmetrical buffer placement in symmetrical clock trees that achieves zero skew in theory, as well as robustly low skew under process or environmental variations. With the completely symmetrical structure, we can eliminate many contributors to clock skew, such as model inaccuracy, environmental temperature, and intra-die process variations. We devise a new dynamic programming scheme to handle buffer placement and wire sizing under the constraint of symmetry. By classifying the wires by tree level and defining level-dependent blockages, the potential candidate points in the gaps between circuit blocks can be fully explored. The algorithm is efficient for minimizing source-sink delay as well as other linear cost functions. Experiments show that our method helps to obtain a balanced clock tree design with low delay, skew, and power.
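The Python sketch below uses a first-order Elmore delay model with hypothetical RC values, not the paper's dynamic program, to show why the symmetric structure is attractive: two branches with identically placed buffers have identical source-to-sink delays (zero nominal skew), while shifting one buffer immediately introduces skew.

```python
# Elmore-delay sketch (hypothetical values) of symmetric vs. asymmetric
# buffer placement on two clock-tree branches.

R_DRV, R_BUF = 100.0, 80.0      # driver / buffer output resistance (ohm)
C_IN, C_SINK = 5e-15, 10e-15    # buffer input cap / sink load (F)
T_BUF = 20e-12                  # buffer intrinsic delay (s)
R_W, C_W = 0.2, 0.2e-15         # wire resistance / capacitance per um
L = 1000.0                      # branch wire length (um)

def branch_delay(p):
    """Elmore delay of one branch with its buffer placed after fraction p of the wire."""
    l1, l2 = p * L, (1 - p) * L
    stage1 = R_DRV * (C_W * l1 + C_IN) + R_W * l1 * (C_W * l1 / 2 + C_IN)
    stage2 = R_BUF * (C_W * l2 + C_SINK) + R_W * l2 * (C_W * l2 / 2 + C_SINK)
    return stage1 + T_BUF + stage2

symmetric = abs(branch_delay(0.5) - branch_delay(0.5))
asymmetric = abs(branch_delay(0.5) - branch_delay(0.7))
print(f"skew, symmetric placement:  {symmetric * 1e12:.2f} ps")
print(f"skew, asymmetric placement: {asymmetric * 1e12:.2f} ps")
```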
{"title":"Symmetrical buffer placement in clock trees for minimal skew immune to global on-chip variations","authors":"Rensheng Wang, Takumi Okamoto, Chung-Kuan Cheng","doi":"10.1109/ICCD.2009.5413180","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413180","url":null,"abstract":"As the feature size of VLSI circuits scales down and clock rates increases, circuit performance is becoming more sensitive to process variations. This paper proposes an algorithm of symmetrical buffer placement in symmetrical clock trees to achieve zero-skew in theory, as well as robust low skew under process or environment variations. With the completely symmetrical structure, we can eliminate many factors of clock skew such as model inaccuracy, environment temperature and intra-die process variations. We devise a new dynamic programming scheme to handle buffer placement and wire sizing under the constraint of symmetry. By classifying the wires by tree levels and defining the level-dependent blockages, the potential candidate points in the gaps of circuit blocks can be fully explored. The algorithm is efficient for minimizing source-sink delay as well as other linear cost functions. Experiments show that our method helps to obtain a balanced design of clock tree with low delay, skew and power.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122092845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pragmatic design of gated-diode FinFET DRAMs
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413127
A. Bhoj, N. Jha
Scaling bulk CMOS SRAM technology for on-chip caches beyond the 22nm node is questionable, on account of high leakage power consumption, performance degradation, and instability due to process variations. Recently, two/three-transistor one gated-diode (2T/3T1D) DRAMs were proposed as alternatives to address the SRAM variability problem, with an emphasis on high-activity embedded cache applications. They are highly competitive with SRAM in terms of performance, while having a smaller power and area footprint at lower technology nodes. The current evolutionary trend in transistor structures is toward an era of multi-gate devices, which makes it necessary to identify the design issues and advantages of gated-diode DRAMs implemented in a multi-gate technology. In this work, we address gated-diode DRAM design in FinFET technology using mixed-mode 2D device simulations. We revisit the model of internal voltage gain in bulk gated-diodes and extend it to provide quantitative insight into designing Fin gated-diodes, i.e., gated-diodes in FinFET technology. To this end, we propose FinFET variants of the bulk gated-diode configuration and identify parameters that are critical to enhancing the retention time and read current in 2T/3T1D FinFET DRAMs. Additionally, we show the superiority of 2T1D FinFET DRAM over 6T FinFET SRAM with pass-gate feedback (6T PGFB) and over 2T1D bulk DRAM under the effect of variations, using a quasi-Monte Carlo method implemented in FinE, an environment we have developed for double-gate circuit design that integrates Sentaurus TCAD from Synopsys with the Spice3-UFDG double-gate compact model from the University of Florida under a single framework. Finally, we present a new tunable-threshold gated-diode FinFET amplifier which uses an n-type gated-diode for voltage boosting, along with a p-type gated-diode for zero suppression.
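As a rough sense of scale for the retention-time knob mentioned above, the sketch below applies the standard dynamic-storage estimate t ≈ C·ΔV/I_leak with loudly hypothetical device numbers; it is a back-of-the-envelope aid, not a result from the paper's mixed-mode TCAD simulations.

```python
# Back-of-the-envelope retention estimate for a dynamic storage node
# (hypothetical numbers, not the paper's simulated devices): the node survives
# roughly t = C * dV / I_leak before drooping past the sense margin.
C_STORE = 1.0e-15        # storage-node capacitance (F), assumed
DV_MARGIN = 0.3          # voltage droop the sense path can tolerate (V), assumed

for i_leak in (1e-12, 10e-12, 100e-12):          # leakage current sweep (A)
    t_ret = C_STORE * DV_MARGIN / i_leak
    print(f"I_leak = {i_leak * 1e12:5.0f} pA  ->  retention ~ {t_ret * 1e6:8.1f} us")
```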
{"title":"Pragmatic design of gated-diode FinFET DRAMs","authors":"A. Bhoj, N. Jha","doi":"10.1109/ICCD.2009.5413127","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413127","url":null,"abstract":"Scaling bulk CMOS SRAM technology for on-chip caches beyond the 22nm node is questionable, on account of high leakage power consumption, performance degradation, and instability due to process variations. Recently, two/three transistor one gated-diode (2T/3T1D) DRAMs were proposed as alternatives to address the SRAM variability problem, with an emphasis on high-activity embedded cache applications. They are highly competitive with an SRAM in terms of performance, while having a smaller power and area footprint at lower technology nodes. The current evolutionary trend in transistor structures is toward an era of multi-gate devices, which makes it necessary to identify design issues and advantages of gated-diode DRAMs implemented in a multi-gate technology. In this work, we address gated-diode DRAM design in FinFET technology using mixed-mode 2D-device simulations. We revisit the model of internal voltage gain in bulk gated-diodes and extend it to provide quantitative insight into designing Fin gated-diodes, i.e., gated-diodes in FinFET technology. To this effect, we propose FinFET variants of the bulk gated-diode configuration and identify parameters that are critical to enhancing the retention time and read current in 2T/3T1D FinFET DRAMs. Additionally, we show the superiority of 2T1D FinFET DRAM over 6T FinFET SRAM having pass-gate feedback (6T PGFB) and 2T1D bulk DRAM under the effect of variations using a quasi-Monte Carlo method implemented in FinE, an environment we have developed for double-gate circuit design that integrates Sentaurus TCAD from Synopsys with the Spice3-UFDG double-gate compact model from University of Florida under a single framework. Finally, we present a new tunable threshold gated-diode FinFET amplifier which uses an n-type gated-diode for voltage-boosting, along with a p-type gated-diode for zero-suppression.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124091347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A PLL design based on a standing wave resonant oscillator
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413109
V. Karkala, Kalyana C. Bollapalli, Rajesh Garg, S. Khatri
In this paper, we present a new continuously variable high-frequency standing wave oscillator and demonstrate its use in generating the phase-locked clock signal of a digital IC. The ring-based standing wave resonant oscillator is implemented with a plurality of wires connected in a Möbius configuration, with a cross-coupled inverter pair connected across the wires. The oscillation frequency can be modulated by two means. Coarse modification is achieved by altering the number of wires in the ring that participate in the oscillation, by driving a digital word to a set of pass gates connected to each wire in the ring. Fine tuning of the oscillation frequency is achieved by varying the body bias voltage of both PMOS transistors in the cross-coupled inverter pair which sustains the oscillations in the resonant ring. We have validated our PLL design in a 90nm process technology. 3D parasitic RLC models for our oscillator simulations were extracted, with the skin effect accounted for. Our PLL has been implemented to provide a frequency locking range from ∼6 GHz to ∼9 GHz, with a center frequency of 7.5 GHz. The oscillator alone consumes about 25 mW of power, and the complete PLL consumes 28.5 mW. The observed jitter of the PLL is 2.56%.
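The Python sketch below captures the first-order frequency relation usually quoted for a Möbius-connected standing-wave ring: the cross-connection forces a half wavelength around the ring, so f ≈ v/(2·l) with v = 1/√(L′C′). The line parameters and the crude modeling of coarse and fine tuning as shifts in the effective L′ and C′ are assumptions, not the paper's extracted 3D RLC model.

```python
# First-order frequency relation for a Mobius-connected standing-wave ring
# (textbook model with hypothetical parameters): the ring holds half a
# wavelength, so f ~ v / (2 * l), where v = 1 / sqrt(L' * C').
import math

def ring_freq(l_per_m, c_per_m, ring_len_m):
    v = 1.0 / math.sqrt(l_per_m * c_per_m)   # wave velocity on the line
    return v / (2.0 * ring_len_m)            # half-wavelength Mobius mode

L_PER_M = 4.0e-7     # per-metre inductance (H/m), assumed
C_PER_M = 2.5e-10    # per-metre capacitance (F/m), assumed
RING = 0.01          # effective ring length (m), assumed

f_nom = ring_freq(L_PER_M, C_PER_M, RING)
f_fine = ring_freq(L_PER_M, C_PER_M * 1.05, RING)   # fine tuning: slightly higher C'
f_coarse = ring_freq(L_PER_M * 0.6, C_PER_M, RING)  # coarse tuning: lower effective L'

print(f"nominal {f_nom/1e9:.2f} GHz, fine-tuned {f_fine/1e9:.2f} GHz, "
      f"coarse-tuned {f_coarse/1e9:.2f} GHz")
```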
{"title":"A PLL design based on a standing wave resonant oscillator","authors":"V. Karkala, Kalyana C. Bollapalli, Rajesh Garg, S. Khatri","doi":"10.1109/ICCD.2009.5413109","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413109","url":null,"abstract":"In this paper, we present a new continuously variable high frequency standing wave oscillator, and demonstrate its use in generating the phase locked clock signal of a digital IC. The ring based standing wave resonant oscillator is implemented with a plurality of wires connected in a mobius configuration, with a cross coupled inverter pair connected across the wires. The oscillation frequency can be modulated by two means. Coarse modification is achieved by altering the number of wires in the ring that participate in the oscillation, by driving a digital word to a set of passgates which are connected to each wire in the ring. Fine tuning of the oscillation frequency is achieved by varying the body bias voltage of both the PMOS transistors in the cross coupled inverter pair which sustains the oscillations in the resonant ring. We have validated our PLL design in a 90nm process technology. 3D parasitic RLCs for our oscillator simulations were extracted, with skin effect accounted for. Our PLL has been implemented to provide a frequency locking range from ∼6 GHz to ∼9 GHz, with a center frequency of 7.5 GHz. The oscillator alone consumes about 25 mW of power, and the complete PLL consumes a power of 28.5 mW. The observed jitter of the PLL is 2.56%.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123398944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}