
2009 IEEE International Conference on Computer Design: Latest Papers

A radiation tolerant Phase Locked Loop design for digital electronics
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413108
R. Kumar, V. Karkala, Rajesh Garg, Tanuj Jindal, S. Khatri
With decreasing feature sizes, lowered supply voltages and increasing operating frequencies, the radiation tolerance of digital circuits is becoming an increasingly important problem. Many radiation hardening techniques have been presented in the literature for combinational as well as sequential logic. However, the radiation tolerance of clock generation circuitry has received scant attention to date. Recently, it has been shown that in the deep submicron regime, the clock network contributes significantly to the chip level Soft Error Rate (SER). The on-chip Phase Locked Loop (PLL) is particularly vulnerable to radiation strikes. In this paper, we present a radiation hardened PLL design. Each component of this design, namely the voltage controlled oscillator (VCO), the phase frequency detector (PFD) and the loop filter, is designed in a radiation tolerant manner. Whenever possible, the circuit elements used in our PLL exploit the fact that if a gate is implemented using only PMOS (NMOS) transistors, then a radiation particle strike can result only in a logic 0 to 1 (1 to 0) flip. By separating the PMOS and NMOS devices, and splitting the gate output into two signals, extremely high levels of radiation tolerance are obtained. Our PLL is tested for radiation immunity for critical charge values up to 250 fC. Our results demonstrate that over a large number of radiation strikes on a number of sensitive nodes in our design, the worst case jitter is just 18%. In the worst case, our PLL returns to the locked state in 16 cycles of the VCO clock after a radiation strike.
Citations: 17
Optical lithography simulation using wavelet transform
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413120
Rance Rodrigues, Aswin Sreedhar, S. Kundu
Optical lithography is an indispensable step in the process flow of Design for Manufacturability (DFM). Optical lithography simulation is a compute intensive task, and simulation performance, or lack thereof, can be a determining factor in time to market. Thus, the efficiency of lithography simulation is of paramount importance. Coherent decomposition is a popular simulation technique for aerial imaging simulation. In this paper, we propose an approximate simulation technique based on the 2D wavelet transform and use a number of optimization methods to further improve polygon edge detection. Results show that the proposed method suffers from an average error of less than 5% when compared with the coherent decomposition method. The benefits of the proposed method are (i) a >10X increase in performance and, more importantly, (ii) the ability to simulate very large circuits, whereas some commercial tools are severely capacity limited. Approximate simulation is quite attractive for layout optimization, where it may be used in a loop, and may even be acceptable for final layout verification.
Citations: 10
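As an illustration of the 2D wavelet transform named in the abstract above, the following is a minimal sketch of a single-level 2D Haar transform applied to a small binary layout mask. The choice of the Haar wavelet, the averaging normalization, and the function names `haar_1d` and `haar_2d` are assumptions made for this example; the abstract does not say which wavelet or decomposition depth the authors use, and this is not their simulation method.

```python
def haar_1d(values):
    """One level of an averaging Haar transform on an even-length sequence.

    Returns the pairwise averages followed by the pairwise half-differences.
    """
    avg = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    diff = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return avg + diff


def haar_2d(image):
    """Single-level 2D Haar transform: transform every row, then every column."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(c) for c in zip(*rows)]  # zip(*rows) iterates over columns
    return [list(r) for r in zip(*cols)]     # transpose back to row-major order


# Hypothetical 4x4 mask: a polygon covering the three leftmost columns.
mask = [[1, 1, 1, 0] for _ in range(4)]
coeffs = haar_2d(mask)
for row in coeffs:
    print(row)
```

In the output, the top-left quadrant holds the coarse approximation of the mask, and the nonzero entries in the top-right quadrant mark the column pair in which the polygon edge falls, which is why wavelet detail coefficients are a natural fit for edge-oriented analysis.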
A hierarchical approach towards system level static timing verification of SoCs
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413155
R. Chakraborty, D. R. Chowdhury
The high complexity and the diversity of cores make timing verification of an entire flattened SoC design a tedious process. In this paper, the various timing issues related to modular SoC verification are first investigated, and then a bottom-up hierarchical approach to verifying the system level timing of an SoC is presented. The timing abstractions of the cores are assumed to be provided by the core vendors. The interconnection delays of the SoC may be extracted from the SDF file generated after post layout simulation. The hierarchical approach provides a fast and systematic way of timing verification, as opposed to the flattened approach. Experiments were conducted on synthetic SoCs, using ISCAS benchmark circuits as cores. Results validate the claim of the proposed approach.
Citations: 4
Resource sharing of pipelined custom hardware extension for energy-efficient application-specific instruction set processor design
Pub Date : 2009-10-04 DOI: 10.1145/2348839.2348843
Hai Lin, Yunsi Fei
The Application-Specific Instruction set Processor (ASIP) has become an increasingly popular platform for embedded systems because of its high performance and flexibility. Energy efficiency is critical for portable and embedded devices, and should be addressed separately from performance considerations. The hardware extension in ASIPs can speed up program execution, but it also incurs area overhead and static energy consumption in the processor. Traditional data path merging techniques reduce circuit overhead by reusing hardware resources for executing multiple custom instructions. However, they introduce structural hazards for custom instructions on extended processors, and hence reduce the performance improvement. In this paper, we introduce a pipelined configurable hardware structure for the hardware extension in ASIPs, so that structural hazards can be remedied. With multiple subgraphs of operations selected for custom hardware realization, we devise a novel operation-to-hardware mapping algorithm based on Integer Linear Programming (ILP) to automatically construct a resource-efficient pipelined configurable hardware extension. We demonstrate that different resource sharing schemes affect both the hardware overhead and the datapath delay of the custom instructions. We analyze the design tradeoffs between resource efficiency and performance improvement, and present the design space exploration results.
Citations: 10
Reducing dynamic power dissipation in pipelined forwarding engines
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413163
Weirong Jiang, V. Prasanna
Power consumption has become a limiting factor in next-generation routers. IP forwarding engines dominate the overall power dissipation in a router. Although SRAM-based pipeline architectures have recently been developed as a promising alternative to power-hungry TCAM-based solutions for high-throughput IP forwarding, it remains a challenge to achieve low power. This paper proposes several novel architecture-specific techniques to reduce the dynamic power consumption in SRAM-based pipelined IP forwarding engines. First, the pipeline architecture itself is built as an inherent cache, exploiting the data locality in Internet traffic. The number of memory accesses, which contribute the majority of the power consumption, is thus reduced. No external cache is needed. Second, instead of using a global clock, different pipeline stages are driven by separate clocks. The local clocking scheme is carefully designed to exploit the traffic rate variation and improve the caching performance. Third, a fine-grained memory enabling scheme is developed to eliminate unnecessary memory accesses, while preserving the packet order. Simulation experiments using real-life traces show that our solutions can achieve up to a 15-fold reduction in dynamic power dissipation over the baseline pipeline architecture that does not employ the proposed schemes. FPGA implementation results show that our design sustains 40 Gbps throughput for minimum-size (40-byte) packets while consuming a small amount of logic resources.
Citations: 13
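To put the throughput figure from the abstract above in per-packet terms, the short calculation below converts 40 Gbps at 40-byte packets into a lookup rate and a time budget per packet. It is a back-of-the-envelope sketch: the function names are hypothetical, and it assumes back-to-back packets with no framing or inter-packet overhead, which the abstract does not specify.

```python
def packets_per_second(throughput_bps: float, packet_bytes: int) -> float:
    """Packet rate needed to sustain a line rate with fixed-size packets."""
    return throughput_bps / (packet_bytes * 8)


def ns_per_packet(throughput_bps: float, packet_bytes: int) -> float:
    """Time budget per packet, in nanoseconds, at the given line rate."""
    return 1e9 / packets_per_second(throughput_bps, packet_bytes)


rate = packets_per_second(40e9, 40)   # 40 Gbps, 40-byte packets
budget = ns_per_packet(40e9, 40)
print(f"{rate / 1e6:.0f} million lookups per second")  # 125 million lookups per second
print(f"{budget:.0f} ns per packet")                   # 8 ns per packet
```

Under these assumptions the engine has to complete one forwarding decision every 8 ns, which helps explain why the number of memory accesses per packet matters so much for dynamic power.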
Intra-vector SIMD instructions for core specialization
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413112
C. Meenderinck, B. Juurlink
Current research mainly focuses on exploiting thread-level parallelism (TLP) to increase performance. Another avenue for achieving performance scalability, however, is specialization. In this paper we propose application specific intra-vector instructions for two dimensional signal processing kernels. Such kernels usually require significant data rearrangement overhead in order to use the SIMD capabilities. With the intra-vector instructions, this overhead can be avoided. We have implemented intra-vector instructions in the Cell SPU core and measured speedups of up to 2.06, with an average of 1.45.
Citations: 3
Accelerating mobile augmented reality on a handheld platform
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413123
Seung Eun Lee, Yong Zhang, Zhen Fang, S. Srinivasan, R. Iyer, D. Newell
Mobile Augmented Reality (MAR) is an emerging visual computing application for the mobile internet device (MID). In one MAR usage model, the user points the handheld device at an object (like a wine bottle or a building) and the MID automatically recognizes and displays information regarding the object. Achieving this in software on the handheld requires significant compute processing for object recognition and matching. In this paper, we identify hotspot functions of the MAR workload on a low-power x86 platform that motivate acceleration. We present the detailed design of two hardware accelerators, one for object recognition (MAR-HA) and the other for match processing (MAR-MA). We also quantify the performance and area efficiency of the hardware accelerators. Our analysis shows that hardware acceleration has the potential to improve the individual hotspot functions by as much as 20x, and overall response time by 7x. As a result, user response time can be reduced significantly.
Citations: 34
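The gap between the 20x hotspot improvement and the 7x overall improvement quoted in the abstract above can be read through Amdahl's law. The sketch below is an interpretation, not something the abstract states: it assumes every accelerated function gets the full 20x, that "7x" means a 7x reduction in response time, and that the rest of the workload is unchanged. The function names are hypothetical.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the run time is sped up."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def implied_fraction(overall: float, local_speedup: float) -> float:
    """Invert Amdahl's law: fraction of run time that must have been accelerated."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


f = implied_fraction(overall=7.0, local_speedup=20.0)
print(f"accelerated fraction of run time: {f:.3f}")
print(f"check, overall speedup: {overall_speedup(f, 20.0):.2f}")
```

Under these assumptions roughly 90% of the original run time would have to sit in the accelerated hotspots, which is consistent with the abstract's description of recognition and matching as the dominant compute cost.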
Symmetrical buffer placement in clock trees for minimal skew immune to global on-chip variations
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413180
Rensheng Wang, Takumi Okamoto, Chung-Kuan Cheng
As the feature size of VLSI circuits scales down and clock rates increase, circuit performance is becoming more sensitive to process variations. This paper proposes an algorithm for symmetrical buffer placement in symmetrical clock trees to achieve zero skew in theory, as well as robust low skew under process or environment variations. With the completely symmetrical structure, we can eliminate many sources of clock skew, such as model inaccuracy, environment temperature and intra-die process variations. We devise a new dynamic programming scheme to handle buffer placement and wire sizing under the constraint of symmetry. By classifying the wires by tree levels and defining the level-dependent blockages, the potential candidate points in the gaps of circuit blocks can be fully explored. The algorithm is efficient for minimizing source-sink delay as well as other linear cost functions. Experiments show that our method helps to obtain a balanced clock tree design with low delay, skew and power.
Citations: 2
Pragmatic design of gated-diode FinFET DRAMs
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413127
A. Bhoj, N. Jha
Scaling bulk CMOS SRAM technology for on-chip caches beyond the 22nm node is questionable, on account of high leakage power consumption, performance degradation, and instability due to process variations. Recently, two/three transistor one gated-diode (2T/3T1D) DRAMs were proposed as alternatives to address the SRAM variability problem, with an emphasis on high-activity embedded cache applications. They are highly competitive with an SRAM in terms of performance, while having a smaller power and area footprint at lower technology nodes. The current evolutionary trend in transistor structures is toward an era of multi-gate devices, which makes it necessary to identify design issues and advantages of gated-diode DRAMs implemented in a multi-gate technology. In this work, we address gated-diode DRAM design in FinFET technology using mixed-mode 2D-device simulations. We revisit the model of internal voltage gain in bulk gated-diodes and extend it to provide quantitative insight into designing Fin gated-diodes, i.e., gated-diodes in FinFET technology. To this effect, we propose FinFET variants of the bulk gated-diode configuration and identify parameters that are critical to enhancing the retention time and read current in 2T/3T1D FinFET DRAMs. Additionally, we show the superiority of 2T1D FinFET DRAM over 6T FinFET SRAM having pass-gate feedback (6T PGFB) and 2T1D bulk DRAM under the effect of variations using a quasi-Monte Carlo method implemented in FinE, an environment we have developed for double-gate circuit design that integrates Sentaurus TCAD from Synopsys with the Spice3-UFDG double-gate compact model from University of Florida under a single framework. Finally, we present a new tunable threshold gated-diode FinFET amplifier which uses an n-type gated-diode for voltage-boosting, along with a p-type gated-diode for zero-suppression.
Cited by: 12
A PLL design based on a standing wave resonant oscillator
Pub Date : 2009-10-04 DOI: 10.1109/ICCD.2009.5413109
V. Karkala, Kalyana C. Bollapalli, Rajesh Garg, S. Khatri
In this paper, we present a new continuously variable, high-frequency standing-wave oscillator and demonstrate its use in generating the phase-locked clock signal of a digital IC. The ring-based standing-wave resonant oscillator is implemented with a plurality of wires connected in a Möbius configuration, with a cross-coupled inverter pair connected across the wires. The oscillation frequency can be modulated in two ways. Coarse tuning is achieved by altering the number of wires in the ring that participate in the oscillation: a digital word drives a set of passgates, which are connected to each wire in the ring. Fine tuning is achieved by varying the body bias voltage of both PMOS transistors in the cross-coupled inverter pair that sustains the oscillations in the resonant ring. We have validated our PLL design in a 90 nm process technology. 3D parasitic RLCs were extracted for our oscillator simulations, with skin effect accounted for. Our PLL is implemented to provide a frequency locking range from ∼6 GHz to ∼9 GHz, with a center frequency of 7.5 GHz. The oscillator alone consumes about 25 mW, and the complete PLL consumes 28.5 mW. The observed jitter of the PLL is 2.56%.
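The figures quoted in the abstract above can be turned into a few derived quantities. The following is only a rough sketch, not part of the paper: the function names are illustrative, and it assumes the 2.56% jitter is expressed relative to the clock period at the center frequency, which the abstract does not state.

```python
"""Back-of-the-envelope checks on the figures quoted in the abstract."""

# Values taken from the abstract (the abstract marks the range limits and
# the oscillator power as approximate).
F_LOW_GHZ = 6.0
F_HIGH_GHZ = 9.0
F_CENTER_GHZ = 7.5
P_OSC_MW = 25.0
P_PLL_MW = 28.5

# ASSUMPTION: the 2.56% jitter is relative to the clock period at the
# center frequency. The abstract does not state the reference.
JITTER_FRACTION = 0.0256


def clock_period_ps(freq_ghz: float) -> float:
    """Return the clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def summarize() -> dict:
    """Derive a few secondary quantities from the quoted figures."""
    period = clock_period_ps(F_CENTER_GHZ)
    return {
        "tuning_range_fraction": (F_HIGH_GHZ - F_LOW_GHZ) / F_CENTER_GHZ,
        "period_ps_at_center": period,
        "jitter_ps_at_center": JITTER_FRACTION * period,
        "oscillator_power_share": P_OSC_MW / P_PLL_MW,
        "non_oscillator_power_mw": P_PLL_MW - P_OSC_MW,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the locking range spans 40% of the center frequency, the jitter corresponds to roughly 3.4 ps at 7.5 GHz, and the oscillator accounts for about 88% of the total PLL power.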
Cited by: 12