
2007 25th International Conference on Computer Design: Latest Papers

Power efficient register file update approach for embedded processors
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601935
R. Ayoub, A. Orailoglu
In this paper we present an approach to a low-power register file for embedded processors. The approach saves power by eliminating unnecessary writes to the register file for short-lived registers. A write to the register file is essentially redundant when an instruction forwards its result to all of its dependents through the forwarding hardware. Because the percentage of registers with short lifetimes is shown to be significant, eliminating these writes delivers appreciable power savings. We show that this can be done efficiently through a register-based encoding scheme. The scheme exploits application-specific information and renames all or most of the short-lived registers to a small subset of registers that is prespecified during hardware design. The renaming is performed at the compiler level. Power is saved by preventing the prespecified registers from writing to the register file. We propose efficient renaming algorithms: one for cases without register pressure and another for cases with register pressure. Under register pressure, some of the prespecified registers may need to be turned into normal registers, a process managed through reprogrammable hardware support. Although register pressure could reduce power savings, our detailed analysis shows that the prespecified subset is typically small, which makes register pressure an infrequent event. Experimental analysis on numerical and DSP codes indicates appreciable power savings.
Pages: 431-437
Citations: 2
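The redundancy described in the abstract above can be illustrated with a small sketch. It counts register writes whose value is read only within a forwarding window before the register is redefined. The instruction format, the `forward_window` parameter, and the function name are assumptions made for illustration; the paper's renaming algorithms are not reproduced here.

```python
from typing import List, Optional, Tuple

# One instruction: (destination register or None, list of source registers).
Instr = Tuple[Optional[str], List[str]]


def count_redundant_writes(program: List[Instr], forward_window: int = 2) -> int:
    """Count writes whose value is consumed only through forwarding.

    Hypothetical model: a write is redundant when the register is redefined
    later and every read of the value before that redefinition happens within
    `forward_window` instructions of the producer. Values that are never
    redefined are conservatively treated as needing the write.
    """
    redundant = 0
    for i, (dest, _) in enumerate(program):
        if dest is None:
            continue
        all_forwarded = True
        redefined = False
        for j in range(i + 1, len(program)):
            next_dest, srcs = program[j]
            # Sources are read before the destination is written.
            if dest in srcs and j - i > forward_window:
                all_forwarded = False
                break
            if next_dest == dest:
                redefined = True
                break
        if redefined and all_forwarded:
            redundant += 1
    return redundant


example = [
    ("r1", ["r2", "r3"]),  # r1 produced
    ("r4", ["r1"]),        # r1 read at distance 1
    ("r1", ["r5"]),        # r1 redefined: the first write was redundant
    ("r6", ["r4"]),        # r4 read at distance 2
    ("r4", ["r1"]),        # r4 redefined: its first write was redundant
    ("r7", ["r1"]),        # r1 read at distance 3, beyond the window
]
redundant_count = count_redundant_writes(example, forward_window=2)
```

In this toy program two of the six writes could be suppressed. A compiler pass in the spirit of the abstract would rename such destinations to the prespecified registers.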
A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601881
A. Kumar, Partha Kundu, A.P. Singh, Li-Shiuan Peh, N. K. Jha
As chip multiprocessors (CMPs) become the only viable way to scale up and utilize the abundant transistors made available in current microprocessors, the design of on-chip networks is becoming critically important. These networks face unique design constraints and are required to provide extremely fast and high bandwidth communication, yet meet tight power and area budgets. In this paper, we present a detailed design of our on-chip network router targeted at a 36-core shared-memory CMP system in 65 nm technology. Our design targets an aggressive clock frequency of 3.6 GHz, thus posing tough design challenges that led to several unique circuit and microarchitectural innovations and design choices, including a novel high throughput and low latency switch allocation mechanism, a non-speculative single-cycle router pipeline which uses advanced bundles to remove control setup overhead, a low-complexity virtual channel allocator and a dynamically-managed shared buffer design which uses prefetching to minimize critical path delay. Our router takes up 1.19 mm² area and expends 551 mW power at 10% activity, delivering a single-cycle no-load latency at 3.6 GHz clock frequency while achieving a peak switching data rate in excess of 4.6 Tbits/s per router node.
Pages: 63-70
Citations: 212
Optimized design of a double-precision floating-point multiply-add-fused unit for data dependence
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601918
Gongqiong Li, Zhaolin Li
This paper presents a novel double-precision floating-point multiply-add-fused unit implemented in three pipeline stages. The main improvement over the conventional design is that data dependence between two consecutive floating-point instructions is taken into account. In the new design, if two consecutive floating-point instructions are data dependent, the intermediate results of the first instruction are pretreated and then fed back to the first stage, where the second instruction uses them directly. In this way, a floating-point instruction can execute immediately after its predecessor without stalling on the data dependence. Eleven data dependence cases are accelerated in this paper. Experiments on four SPEC2000 benchmark programs show that a 25% performance increase can be attained at the cost of a 0.27 ns delay added to the critical path.
Pages: 311-316
Citations: 0
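To see why removing dependence stalls matters, the following sketch counts cycles for a stream of floating-point instructions on a simple in-order pipeline, with and without a feedback path for back-to-back dependent instructions. The stall model (a dependent instruction waits `stages - 1` extra cycles when no feedback path exists) is an assumption for illustration. It does not model the 0.27 ns critical-path cost or the eleven specific cases mentioned in the abstract.

```python
from typing import List


def total_cycles(depends_on_prev: List[bool], stages: int = 3,
                 feedback: bool = False) -> int:
    """Cycles needed to run the stream on a hypothetical in-order pipeline.

    depends_on_prev[i] is True when instruction i reads the result of
    instruction i-1 (the first entry is ignored). Without a feedback path,
    a dependent instruction is assumed to stall for `stages - 1` cycles.
    """
    if not depends_on_prev:
        return 0
    issue_cycle = 0  # cycle in which the most recent instruction issued
    for dep in depends_on_prev[1:]:
        stall = (stages - 1) if (dep and not feedback) else 0
        issue_cycle += 1 + stall
    # The last instruction still needs `stages` cycles to drain.
    return issue_cycle + stages


stream = [False, True, False, True, True]
baseline_cycles = total_cycles(stream, stages=3, feedback=False)
feedback_cycles = total_cycles(stream, stages=3, feedback=True)
```

The gap between the two counts grows with the fraction of back-to-back dependent instructions, which is why the measured benefit depends on the benchmark mix.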
Placement and routing of RF embedded passive designs in LCP substrate
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601913
M. Pathak, S. Mukherjee, M. Swaminathan, E. Engin, S. Lim
Physical layout generation for RF embedded passive designs is not an easy task, since the response of a given layout is tightly coupled with the response of the individual components and the effect of interconnect parasitics. In this paper we propose a methodology for automatic layout generation of embedded passive RF circuits. We use circuit models to represent and optimize a given layout, and apply non-linear optimization at various stages of the methodology to reach the design goals. Full-wave EM simulation is kept completely out of the design loop, so our methodology significantly reduces the design time for RF embedded passive circuits. The proposed approach has been used successfully to generate layouts for band-pass filters of varying sizes.
Pages: 273-279
Citations: 0
Memory based computation using embedded cache for processor yield and reliability improvement
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601922
Somnath Paul, S. Bhunia
VLSI systems in the nanometer regime suffer from high defect rates and large parametric variations that lead to yield loss as well as reduced reliability of operation. In this paper, we propose a novel memory-based computation framework that exploits on-chip memory for reliable operation by transferring activity from a defective or unreliable functional unit to the embedded memory. This allows the die to run at a reduced performance level instead of being completely discarded or being throttled (in case of variations). We show that the proposed method improves yield and reliability in a superscalar out-of-order processor by tolerating defective functional units and allowing dynamic thermal management. The simulation results show that it entails only a small loss in performance (average 1.8%) at the cost of 9.5% of area overhead required with hardware duplication.
Pages: 341-346
Citations: 1
Prioritizing verification via value-based correctness criticality
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601921
Joonhyuk Yoo, M. Franklin
Microprocessors are becoming increasingly susceptible to soft errors due to the current trends of semiconductor technology scaling. Traditional redundant multi-threading architectures provide good fault tolerance by re-executing all the computations. However, such a full re-execution significantly increases the demand on the processor resources, resulting in severe performance degradation. To address this problem, this paper introduces a correctness criticality based filter checker, which prioritizes the verification candidates so as to selectively do verification. Binary Correctness Criticality (BCC) and Likelihood of Correctness Criticality (LoCC) are metrics that quantify whether an instruction is important for reliability or how likely an instruction is correctness-critical, respectively. A likelihood of correctness criticality is computed by a value vulnerability factor, which is defined by the numerically significant bit-width used to compute a result. The proposed technique is accomplished by exploiting information redundancy of compressing computationally useful data bits. Based on the likelihood of correctness criticality test, the filter checker mitigates the verification workload by bypassing instructions that are unimportant for correct execution. Extensive measurements prove that the LoCC metric yields quite a wide distribution of values, indicating that it has the potential to differentiate diverse degrees of correctness criticality. Experimental results show that the proposed scheme accelerates a traditional fully-fault-tolerant processor by 1.7 times, while it reduces the soft error rate to 18% of that of a non-fault-tolerant processor.
Pages: 333-340
Citations: 1
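The abstract above ties the likelihood of correctness criticality to the numerically significant bit-width of a result. A minimal sketch of that idea follows. The normalization by register width, the `threshold` value, and the rule that narrow results bypass verification are all assumptions for illustration, not the paper's definitions.

```python
def significant_bits(value: int) -> int:
    """Bits needed to hold a two's-complement value, including one sign bit."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def value_vulnerability(value: int, width: int = 64) -> float:
    """Fraction of the register occupied by significant bits (assumed metric)."""
    return min(significant_bits(value), width) / width


def needs_verification(value: int, threshold: float = 0.25,
                       width: int = 64) -> bool:
    """Hypothetical filter: only results with enough significant bits are
    sent to the checker; narrow results bypass it."""
    return value_vulnerability(value, width) >= threshold


narrow_result = 5          # 3 magnitude bits + sign bit
wide_result = 2 ** 40      # 41 magnitude bits + sign bit
```

With these assumed settings, `needs_verification(narrow_result)` is false and `needs_verification(wide_result)` is true, so only the wide result would consume checker bandwidth.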
A low overhead hardware technique for software integrity and confidentiality
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601889
Austin Rogers, M. Milenkovic, A. Milenković
Software integrity and confidentiality play a central role in making embedded computer systems resilient to various malicious actions, such as software attacks; probing and tampering with buses, memory, and I/O devices; and reverse engineering. In this paper we describe an efficient hardware mechanism that protects software integrity and guarantees software confidentiality. To provide software integrity, each instruction block is signed during program installation with a cryptographically secure signature. The signatures embedded in the code are verified during program execution. Software confidentiality is provided by encrypting instruction blocks. To achieve low performance overhead, the proposed mechanism combines several architectural enhancements: a variation of one-time-pad encryption, parallelizable signatures, and conditional execution of unverified instructions. A relatively high memory overhead due to embedded signatures can be reduced by protecting multiple instruction blocks with one signature, with minimal effects on complexity and performance overhead.
Pages: 113-120
Citations: 8
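The combination of per-block signatures and pad-based encryption described in the abstract above can be sketched in software. This is only an illustration under stated assumptions: SHA-256 and HMAC-SHA256 stand in for whatever primitives the hardware uses, the 32-byte block size is arbitrary, and the names `protect_block` and `fetch_block` are hypothetical.

```python
import hashlib
import hmac
from typing import Tuple

BLOCK_SIZE = 32  # bytes per instruction block (assumed)


def _pad(key: bytes, address: int) -> bytes:
    """Pad derived only from the key and block address, so it can be
    computed while the block itself is still being fetched."""
    return hashlib.sha256(key + address.to_bytes(8, "big")).digest()[:BLOCK_SIZE]


def _sign(key: bytes, address: int, cipher: bytes) -> bytes:
    return hmac.new(key, address.to_bytes(8, "big") + cipher,
                    hashlib.sha256).digest()


def protect_block(enc_key: bytes, sig_key: bytes, address: int,
                  block: bytes) -> Tuple[bytes, bytes]:
    """Install-time step: encrypt one instruction block and sign it."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("block must be exactly BLOCK_SIZE bytes")
    cipher = bytes(b ^ p for b, p in zip(block, _pad(enc_key, address)))
    return cipher, _sign(sig_key, address, cipher)


def fetch_block(enc_key: bytes, sig_key: bytes, address: int,
                cipher: bytes, signature: bytes) -> bytes:
    """Run-time step: verify the signature, then decrypt the block."""
    if not hmac.compare_digest(_sign(sig_key, address, cipher), signature):
        raise ValueError("signature mismatch: block was modified")
    return bytes(c ^ p for c, p in zip(cipher, _pad(enc_key, address)))
```

Binding the address into both the pad and the signature is a design choice in this sketch that prevents a valid block from being replayed at a different location. Whether the paper does the same is not stated in the abstract.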
Why we need statistical static timing analysis
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601885
C. Forzan, D. Pandini
As technology continues to advance deeper into the nanometer regime, a tight control on the process parameters is increasingly difficult. As a consequence, variability has turned out to be a dominant factor in the design of complex ICs. Traditional static timing analysis (STA) is becoming insufficient to accurately evaluate the process variation impact on the design performance considering the increasing number of process, power supply voltage, and temperature (PVT) corners. In contrast, statistical static timing analysis (SSTA) is a promising innovative technique to handle increasingly larger environmental and process fluctuations, especially on-chip parameter variations. However, the statistical approach needs a set of costly additional data such as an accurate process variation description, and a statistical standard cell library characterization. In this paper, STA and SSTA are applied on a real industrial design to compare the two techniques, in terms of both accuracy and cost. From our analysis, we have concluded that the potential advantages offered by SSTA exceed the additional library characterization cost and process data assembly effort.
Pages: 91-96
Citations: 17
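The pessimism of corner-based analysis that motivates SSTA can be shown with a small calculation. The sketch assumes a path of identical gates whose delays are independent Gaussian variables, which is a strong simplification: real SSTA must handle correlated variation. The function names and the numbers are illustrative only.

```python
import math


def corner_delay(n_gates: int, mean: float, sigma: float, k: float = 3.0) -> float:
    """Corner-style bound: every gate sits at its k-sigma slow value at once."""
    return n_gates * (mean + k * sigma)


def statistical_delay(n_gates: int, mean: float, sigma: float, k: float = 3.0) -> float:
    """k-sigma point of the path delay when gate delays are independent:
    means add linearly, standard deviations add in quadrature."""
    return n_gates * mean + k * sigma * math.sqrt(n_gates)


# Hypothetical 16-gate path, 100 ps mean and 10 ps sigma per gate.
worst_case_ps = corner_delay(16, 100.0, 10.0)
three_sigma_ps = statistical_delay(16, 100.0, 10.0)
```

Under these assumptions the corner bound is 2080 ps while the statistical 3-sigma delay is 1720 ps. The difference is margin that a corner-based flow would give away.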
CMOS logic design with independent-gate FinFETs
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601953
Anish Muttreja, Niket Agarwal, N. Jha
Fin-type field-effect transistors (FinFETs) are promising substitutes for bulk CMOS in nano-scale circuits. In this paper, it is observed that in spite of improved device characteristics, high active leakage may remain a problem for FinFET logic circuits. Leakage is found to contribute 31.3% of total power consumption in power-optimized FinFET logic circuits. Various FinFET logic design styles, based on independent control of FinFET gates, are studied. A new low-leakage logic style is presented. Leakage (total) power savings of 64.7% (14.5%) under tight delay constraints and 91.2% (37.2%) under relaxed delay constraints, through the judicious use of FinFET logic styles, are demonstrated.
Pages: 560-567
Citations: 150
Maximizing the throughput-area efficiency of fully-parallel low-density parity-check decoding with C-slow retiming and asynchronous deep pipelining
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601964
Ming Su, Lili Zhou, C. Shi
In this paper, we apply C-slow retiming and asynchronous deep pipelining to maximize the throughput-area efficiency of fully parallel low-density parity-check (LDPC) decoding. Pipelined decoders are implemented in a 0.18 μm FDSOI CMOS process. Experimental results show that our pipelining technique is an efficient approach to maximizing LDPC decoding throughput while minimizing area consumption. First, pipelined decoders can achieve extraordinarily high throughput that non-pipelined designs cannot. Second, for the same throughput, pipelined decoders use less area than non-pipelined designs. Our approach can improve the throughput of a published implementation by 4 times with only about 80% area overhead. Without using clocks, the proposed asynchronous pipelined decoders are more scalable in design complexity and more robust to process-voltage-temperature variations than existing clock-based LDPC decoders.
Pages: 636-643
Citations: 10
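As a rough illustration of the throughput-area figure quoted in the LDPC decoding abstract above, the sketch below computes the change in throughput per unit area. It assumes that "about 80% area overhead" means the pipelined design occupies roughly 1.8 times the baseline area; the function name `throughput_area_gain` is hypothetical and does not come from the paper.

```python
def throughput_area_gain(throughput_gain: float, area_overhead: float) -> float:
    """Return the ratio of new to baseline throughput-per-area.

    throughput_gain: new throughput divided by baseline throughput (e.g. 4.0).
    area_overhead:   extra area as a fraction of baseline area (e.g. 0.80).
    Assumption: overhead is additive, so new area = baseline * (1 + overhead).
    """
    if area_overhead <= -1.0:
        raise ValueError("area_overhead must be greater than -1.0")
    return throughput_gain / (1.0 + area_overhead)


# Figures quoted in the abstract: 4x throughput, about 80% area overhead.
gain = throughput_area_gain(4.0, 0.80)
print(f"throughput-area efficiency gain: {gain:.2f}x")
```

Under that reading, a 4x throughput gain at 1.8x area corresponds to roughly a 2.2x improvement in throughput per unit area.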