2008 IEEE International Conference on Computer Design: Latest Publications

Energy-delay tradeoffs in 32-bit static shifter designs
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751926
Steve Huntzicker, Michael Dayringer, Justin Soprano, Anthony Weerasinghe, D. Harris, D. Patil
This paper compares the energy-delay tradeoff curves of 32-bit static barrel and funnel shifters. The Stanford Circuit Optimization Tool (SCOT) is used to determine best transistor sizes in a 90 nm process. The paper evaluates the effect of multiplexer valency, circuit design, and physical placement. It also quantifies the costs of various shift operations. A funnel shifter using 4- and 8-input static multiplexer stages gives the best energy-delay tradeoff, with a knee at 440 ps (15 FO4 inverter delays) consuming 0.9 pJ per shift.
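As a behavioral illustration of the funnel shifter named in the abstract, the following minimal Python sketch models a 32-bit funnel shift as selecting a 32-bit window from a 64-bit concatenation, and derives the common shift operations from it. The function names are hypothetical and the model says nothing about the multiplexer staging or transistor sizing the paper studies. For reference, the quoted knee of 440 ps at 15 FO4 delays implies roughly 29 ps per FO4 delay.

```python
MASK32 = 0xFFFFFFFF

def funnel_shift(hi: int, lo: int, k: int) -> int:
    """Return the 32-bit window starting k bits above the LSB of hi:lo."""
    assert 0 <= k < 32
    wide = ((hi & MASK32) << 32) | (lo & MASK32)
    return (wide >> k) & MASK32

def rotate_right(x: int, k: int) -> int:
    # Feeding the same word into both halves turns the window into a rotate.
    return funnel_shift(x, x, k)

def shift_right_logical(x: int, k: int) -> int:
    # Zeros in the upper half are shifted in from the left.
    return funnel_shift(0, x, k)

def shift_right_arith(x: int, k: int) -> int:
    # The upper half is filled with copies of the sign bit.
    sign_fill = MASK32 if x & 0x80000000 else 0
    return funnel_shift(sign_fill, x, k)

def shift_left(x: int, k: int) -> int:
    # A left shift by k is the window at offset 32 - k of x:0.
    assert 0 <= k < 32
    return x & MASK32 if k == 0 else funnel_shift(x, 0, 32 - k)

# Per-FO4 delay implied by the figures quoted in the abstract.
FO4_PS = 440 / 15
```

In this model every shift type differs only in what is presented on the two input halves, which is one reason the cost of the various shift operations can be compared on a single datapath.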
Citations: 13
Energy-aware opcode design
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751918
Balaji V. Iyer, Jason A. Poovey, T. Conte
Embedded processors are required to achieve high performance while running on batteries. Thus, they must exploit all available means to reduce energy consumption without sacrificing performance. This work explores one energy-reduction technique: intelligently designing a processor's instruction opcodes based on a target workload. The optimization uses a heuristic that not only minimizes switching between adjacent instructions, but also simplifies decoding to reduce the number of latches and save dynamic energy. On average, an optimized opcode can be decoded using 40-60% fewer latches in the decoder. In addition, it is shown that a decoder optimized for algorithms with similar program structure, similar data types, or similar behavior exhibits consistent patterns of energy reduction. The techniques presented in this paper yield an average 10% reduction in total dynamic energy. It is also shown that this heuristic can be used to achieve similar results on processors with different issue widths.
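To make "switching between adjacent instructions" concrete, the sketch below scores an opcode assignment by the Hamming distance between the opcodes of instructions that follow each other in a trace, then finds the cheapest assignment for a toy four-instruction ISA. It uses exhaustive search rather than the paper's heuristic, which the abstract does not describe; the mnemonics, the trace, and all function names are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations

def hamming(a: int, b: int) -> int:
    """Number of opcode bits that toggle when a is followed by b."""
    return bin(a ^ b).count("1")

def adjacency_counts(trace):
    """Count how often each unordered pair of distinct instructions is adjacent."""
    pairs = Counter()
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            pairs[frozenset((prev, cur))] += 1
    return pairs

def switching_cost(encoding, pairs) -> int:
    """Total opcode bit toggles over the trace for a given encoding."""
    total = 0
    for pair, count in pairs.items():
        a, b = tuple(pair)
        total += count * hamming(encoding[a], encoding[b])
    return total

def best_encoding(mnemonics, pairs, width):
    """Exhaustive search over assignments; only viable for tiny toy ISAs."""
    best = None
    for perm in permutations(range(1 << width), len(mnemonics)):
        encoding = dict(zip(mnemonics, perm))
        cost = switching_cost(encoding, pairs)
        if best is None or cost < best[0]:
            best = (cost, encoding)
    return best

# Hypothetical workload trace and a naive sequential opcode assignment.
TRACE = ["add", "ld", "add", "ld", "st", "mul", "add", "ld"]
PAIRS = adjacency_counts(TRACE)
NAIVE = {"add": 0b00, "ld": 0b01, "st": 0b10, "mul": 0b11}
NAIVE_COST = switching_cost(NAIVE, PAIRS)
BEST_COST, BEST = best_encoding(list(NAIVE), PAIRS, width=2)
```

On this trace the naive assignment toggles 9 opcode bits while the best assignment toggles 7, because frequently adjacent instructions end up one bit apart.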
Citations: 0
Improved combined binary/decimal fixed-point multipliers
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751845
Brian J. Hickmann, M. Schulte, M. A. Erle
Decimal multiplication is important in many commercial applications including banking, tax calculation, currency conversion, and other financial areas. This paper presents several combined binary/decimal fixed-point multipliers that use the BCD-4221 recoding for the decimal digits. This allows the use of binary carry-save hardware to perform decimal addition with a small correction. Our proposed designs contain several novel improvements over previously published designs. These include an improved reduction tree organization to reduce the area and delay of the multiplier and improved reduction tree components that leverage the redundant decimal encodings to help reduce delay. A novel split reduction tree architecture is also introduced that reduces the delay of the binary product with only a small increase in total area. Area and delay estimates are presented that show that the proposed designs have significant area improvements over separate binary and decimal multipliers while still maintaining similar latencies for both decimal and binary operations.
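The claim that BCD-4221 digits can be reduced with binary carry-save hardware can be checked with a small sketch. In the 4221 code the four bits of a digit carry weights 4, 2, 2, and 1, so every 4-bit pattern is a valid digit between 0 and 9. A bitwise 3:2 carry-save step on three such digits therefore yields a valid sum digit and a valid carry digit, with the carry digit weighted by two. That doubling is presumably the small correction the abstract mentions, which is an interpretation rather than something the abstract states. The encoder below picks one of several valid 4221 encodings, and all names are hypothetical.

```python
from itertools import product

WEIGHTS = (4, 2, 2, 1)

def value_4221(bits) -> int:
    """Decimal value of a 4-bit pattern under the 4221 weights."""
    return sum(w * b for w, b in zip(WEIGHTS, bits))

def encode_4221(digit: int):
    """Greedy encoding of a decimal digit; other valid encodings exist."""
    assert 0 <= digit <= 9
    bits = []
    for w in WEIGHTS:
        if digit >= w:
            bits.append(1)
            digit -= w
        else:
            bits.append(0)
    return tuple(bits)

def carry_save_3to2(a, b, c):
    """Plain binary full adders applied bit position by bit position."""
    sum_bits = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    carry_bits = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return sum_bits, carry_bits

def check_all_digit_triples() -> bool:
    """Verify a + b + c == value(sum) + 2 * value(carry) for all digits."""
    for a, b, c in product(range(10), repeat=3):
        s, h = carry_save_3to2(encode_4221(a), encode_4221(b), encode_4221(c))
        if value_4221(s) + 2 * value_4221(h) != a + b + c:
            return False
        if not (0 <= value_4221(s) <= 9 and 0 <= value_4221(h) <= 9):
            return False
    return True
```

The identity holds because each bit position satisfies the full-adder relation independently and the 4221 weights are applied linearly, so no invalid digit patterns can arise as they would with the 8421 code.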
Citations: 11
The 2D DBM: An attractive alternative to the simple 2D mesh topology for on-chip networks
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751905
R. Sabbaghi‐Nadooshan, M. Modarressi, H. Sarbazi-Azad
In recent years, the 2D mesh network-on-chip (NoC) has attracted much attention due to its suitability for VLSI implementation. This paper introduces the 2-dimensional de Bruijn topology for networks-on-chip as an attractive alternative to the popular simple 2D mesh NoC. Its cost is equal to that of the simple 2D mesh, but it has a logarithmic diameter. We compare the proposed network and the popular mesh network in terms of power consumption and network performance. Compared to the equal-sized simple mesh NoC, the proposed de Bruijn-based network has better performance while consuming less energy.
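The diameter argument can be illustrated numerically. The sketch below compares an 8 x 8 mesh with a grid in which every row and every column is wired as an undirected binary de Bruijn graph, and measures both diameters by breadth-first search. This product construction is an assumption made for illustration only: it uses more links per node than a mesh, so it is not the equal-cost 2D DBM of the paper, whose exact wiring the abstract does not give.

```python
from collections import deque

def mesh_neighbors(k):
    """Neighbor function for a k x k mesh without wraparound links."""
    def neighbors(node):
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < k and 0 <= c + dc < k:
                yield (r + dr, c + dc)
    return neighbors

def de_bruijn_1d(u, k):
    """Undirected binary de Bruijn neighbors of u among k = 2**n nodes."""
    successors = {(2 * u) % k, (2 * u + 1) % k}
    predecessors = {u // 2, u // 2 + k // 2}
    return (successors | predecessors) - {u}

def de_bruijn_2d_neighbors(k):
    """Assumed construction: de Bruijn links along each row and column."""
    def neighbors(node):
        r, c = node
        for r2 in de_bruijn_1d(r, k):
            yield (r2, c)
        for c2 in de_bruijn_1d(c, k):
            yield (r, c2)
    return neighbors

def diameter(k, neighbors):
    """Longest shortest path over all node pairs, found by BFS."""
    worst = 0
    for start in ((r, c) for r in range(k) for c in range(k)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors(node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

MESH_DIAMETER = diameter(8, mesh_neighbors(8))
DE_BRUIJN_DIAMETER = diameter(8, de_bruijn_2d_neighbors(8))
```

For 64 nodes the mesh diameter is 14 hops, while the assumed de Bruijn grid needs at most 6, which matches the 2 * log2(8) expected from a logarithmic diameter in each dimension.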
Citations: 11
Reversi: Post-silicon validation system for modern microprocessors
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751878
I. Wagner, V. Bertacco
Verification remains an integral and crucial phase of today's microprocessor design and manufacturing process. Unfortunately, with soaring design complexities and decreasing time-to-market windows, today's verification approaches are incapable of fully validating a microprocessor before its release to the public. Increasingly, post-silicon validation is deployed to detect complex functional bugs in addition to exposing electrical and manufacturing defects. This is due to the significantly higher execution performance offered by post-silicon methods, compared to pre-silicon approaches. Validation in the post-silicon domain is predominantly carried out by executing constrained-random test instruction sequences directly on a hardware prototype. However, to identify errors, the state obtained from executing tests directly in hardware must be compared to the one produced by an architectural simulation of the design's golden model. Therefore, the speed of validation is severely limited by the necessity of a costly simulation step. In this work we address this bottleneck in the traditional flow and present a novel solution for post-silicon validation that exposes its native high performance. Our framework, called Reversi, generates random programs in such a way that their correct final state is known at generation time, eliminating the need for architectural simulations. Our experiments show that Reversi generates tests exposing more bugs faster, and can speed up post-silicon validation by 20x compared to traditional flows.
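One way a random program can have a final state that is known at generation time is to follow a random sequence of invertible operations with their inverses in reverse order, so that a correct machine must end in its initial state. The abstract does not say that Reversi works this way; the sketch below is an assumption for illustration, with a hypothetical two-operation register machine standing in for a real instruction set.

```python
import random

MASK = 0xFFFFFFFF
NUM_REGS = 4

def make_forward_ops(rng, length):
    """Random sequence of invertible register operations."""
    ops = []
    for _ in range(length):
        kind = rng.choice(("add", "xor"))
        reg = rng.randrange(NUM_REGS)
        imm = rng.randrange(1 << 32)
        ops.append((kind, reg, imm))
    return ops

def invert(op):
    """Return the operation that undoes op."""
    kind, reg, imm = op
    if kind == "add":
        return ("add", reg, (-imm) & MASK)  # add the two's-complement negation
    return op  # xor with the same immediate is its own inverse

def make_self_checking_program(rng, length):
    """Forward ops followed by their inverses in reverse order."""
    forward = make_forward_ops(rng, length)
    backward = [invert(op) for op in reversed(forward)]
    return forward + backward

def execute(program, regs):
    """Reference interpreter standing in for the hardware under test."""
    regs = list(regs)
    for kind, reg, imm in program:
        if kind == "add":
            regs[reg] = (regs[reg] + imm) & MASK
        else:
            regs[reg] ^= imm
    return regs

def passes(program, initial_regs, run=execute):
    """The check needs no golden-model simulation: the end state must equal the start."""
    return run(program, initial_regs) == list(initial_regs)
```

With this construction the pass/fail check is a register comparison against the initial state, so the costly simulation step described in the abstract is not needed for these tests.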
Citations: 54
Exploiting producer patterns and L2 cache for timely dependence-based prefetching
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751935
C. Lim, G. Byrd
This paper proposes an architecture that efficiently prefetches for loads whose effective addresses are directly dependent on previously loaded values. This dependence-based prefetching scheme covers the most frequently missed loads in programs that contain linked data structures (LDS). For timely prefetches, memory access patterns of producing loads are dynamically learned. These patterns (such as strides) are used to prefetch well ahead of the consumer load. The proposed prefetcher is placed near the processor core and targets L1 cache misses, because removing L1 cache misses has greater performance potential than removing L2 cache misses. We also examine how to capture pointers in LDS with a pure hardware implementation. We find that the space requirement can be reduced, compared to previous work, if we selectively record patterns. Still, to make the prefetching scheme generally applicable, a large table is required for storing pointers. We show that storing the prefetch table in a partition of the L2 cache outperforms using the L2 cache conventionally.
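The producer and consumer relationship can be sketched with a toy model. A producer load walks an array of pointers at a fixed stride, and a consumer load dereferences each pointer. The prefetcher learns the producer's stride, looks a few strides ahead, reads the pointer stored there, and prefetches the address that pointer refers to. The memory layout, the prefetch distance, and the class and function names are all assumptions for illustration; the paper's table organization and L2 partitioning are not modeled.

```python
class ProducerStridePrefetcher:
    """Learns the stride of a producer load and prefetches for its consumer."""

    def __init__(self, memory, distance=4):
        self.memory = memory      # address -> pointer value stored there
        self.distance = distance  # how many strides ahead to look
        self.last_addr = None
        self.stride = None
        self.confident = False

    def observe_producer(self, addr):
        """Called on each producer load; returns consumer addresses to prefetch."""
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Confident once the same nonzero stride is seen twice in a row.
            self.confident = stride == self.stride and stride != 0
            self.stride = stride
        self.last_addr = addr
        if not self.confident:
            return []
        future_producer = addr + self.distance * self.stride
        if future_producer not in self.memory:
            return []
        # The value the producer will load later is the consumer's address.
        return [self.memory[future_producer]]

def simulate(n=16, distance=4):
    """Count consumer loads whose address was prefetched before use."""
    # Pointer array at stride 8; targets are scattered, so the consumer
    # addresses themselves have no simple stride.
    memory = {0x1000 + 8 * i: 0x9000 + 64 * ((i * 7) % 16) for i in range(n)}
    prefetcher = ProducerStridePrefetcher(memory, distance)
    prefetched, covered = set(), 0
    for i in range(n):
        producer_addr = 0x1000 + 8 * i
        prefetched.update(prefetcher.observe_producer(producer_addr))
        consumer_addr = memory[producer_addr]
        covered += consumer_addr in prefetched
    return covered
```

In this toy run the stride is confirmed on the third producer access, so with a distance of 4 the consumer loads from the seventh element onward are covered, even though their addresses are irregular.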
Citations: 6
Energy-precision tradeoffs in mobile Graphics Processing Units
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751841
Jeff Pool, A. Lastra, Montek Singh
In mobile devices, limiting the Graphics Processing Unit's (GPU's) energy usage is of great importance to extending battery life. This paper focuses on the first stage of the graphics processor pipeline - the vertex transformation stage - and introduces an approach to lowering its switching activity by reducing the precision of arithmetic operations. As a result, the approach enables a tradeoff between energy efficiency and the quality of the rendered image. This paper makes the following specific contributions: 1) a transition-based energy model for quantifying energy consumed as a function of arithmetic precision, and 2) detailed simulation results on several real-world graphics applications to evaluate the tradeoff between energy and precision. In most examples, over 23% of the energy can be saved by lowering arithmetic precision while still maintaining a faithful reproduction of the full-precision image. Pushing the idea further, over 36% energy can be saved by further lowering the precision while preserving acceptable result accuracy. We assert that this represents a significant energy savings that warrants further investigation and extension of our approach to the remaining stages of the graphics processor pipeline.
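The link between arithmetic precision and switching activity can be sketched by truncating the mantissa of single-precision values and counting the bits that toggle between consecutive values. This is only a rough proxy on a stream of operands, assumed here for illustration; the paper's transition-based energy model is not specified in the abstract, and the function names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    """IEEE 754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def reduce_precision(x: float, mantissa_bits: int) -> float:
    """Keep only the top mantissa_bits of the 23-bit mantissa (truncation)."""
    assert 0 <= mantissa_bits <= 23
    drop = 23 - mantissa_bits
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    return bits_to_float(float_to_bits(x) & mask)

def transitions(values) -> int:
    """Total bit toggles between consecutive values, a proxy for switching."""
    bits = [float_to_bits(v) for v in values]
    return sum(bin(a ^ b).count("1") for a, b in zip(bits, bits[1:]))

# Hypothetical stream of vertex coordinates.
COORDS = [0.1 * i for i in range(1, 50)]
FULL = transitions(COORDS)
REDUCED = transitions([reduce_precision(v, 10) for v in COORDS])
```

Masking low-order bits can only remove toggles, so the reduced stream never switches more than the full-precision one; the cost is a relative error on the order of 2^-10 per value when 10 mantissa bits are kept.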
Citations: 37
Gate planning during placement for gated clock network
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751851
Weixiang Shen, Yici Cai, Xianlong Hong, Jiang Hu
Clock gating is a popular technique for reducing power dissipation in clock networks. Although there have been numerous research efforts on clock gating, the previous approaches still have a significant weakness. That is, they usually construct a gated clock tree after cell placement, i.e., cell placement is performed without considering clock gating and may generate a solution unfriendly to subsequent gated clock tree construction. As a result, the control gates inserted during tree construction are very likely to cause cell overlap. Even though the overlap can eventually be removed in placement legalization, considerable wirelength and power overhead is incurred. In this paper, we propose a gate planning technique which is integrated with a partition-based cell placer. During cell placement, the planning judiciously inserts clock gates based on power estimation. In addition, pseudo edges are inserted between clock gates and registers in order to reduce clock wirelength and enable long shut-off periods. At the end, when a relatively detailed placement is obtained, a post-processing step is performed to degrade the inefficient clock gates to clock buffers. We compared our approach with recent previous work on ISCAS89 benchmark circuits. Our method reduces the clock tree wirelength and power by 22.06% and 40.80%, respectively, with a very limited increase in signal net wirelength and power compared with the conventional (register-oblivious) placement. The results also indicate that our algorithm outperforms the clock-gating-oblivious placement on power reduction and performance improvement.
Citations: 10
A random and pseudo-gradient approach for analog circuit sizing with non-uniformly discretized parameters
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751860
Michael Pehl, Tobias Massier, H. Graeb, Ulf Schlichtmann
Many methods for analog circuit sizing are available as commercial, in-house and academic tools. They are based on continuous optimization, e.g., of transistor geometries, although the subsequent layout step requires values on a pre-defined grid. In addition, sizing of transistors for bipolar and RF circuits frequently necessitates the use of multiples of predefined values for the design parameters. This paper presents a novel method for solving this type of discrete optimization problem. An iterative approach is presented, which is based on pseudo-gradients and a randomized calculation of search regions and steps. Experimental comparisons with simulated annealing and a continuous sizing approach with subsequent discretization clearly show the effectiveness and efficiency of the presented method.
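The general idea of combining a pseudo-gradient with randomized steps on a non-uniform grid can be sketched as follows. Each parameter may only take values from its own list, the pseudo-gradient is estimated from cost differences to neighboring grid values, and the step length along that direction is drawn at random. This is a generic illustration under those assumptions, not the authors' algorithm; the grids, the cost function, and all names are hypothetical.

```python
import random

def values_at(grids, idx):
    """Map grid indices to the actual parameter values."""
    return [grid[i] for grid, i in zip(grids, idx)]

def pseudo_gradient(cost, grids, idx):
    """Per-parameter improving direction (-1, 0 or +1) from neighbor costs."""
    base = cost(values_at(grids, idx))
    direction = []
    for d, grid in enumerate(grids):
        best_dir, best_gain = 0, 0.0
        for step in (-1, 1):
            j = idx[d] + step
            if 0 <= j < len(grid):
                trial = idx[:d] + [j] + idx[d + 1:]
                gain = base - cost(values_at(grids, trial))
                if gain > best_gain:
                    best_dir, best_gain = step, gain
        direction.append(best_dir)
    return direction

def size_on_grid(cost, grids, start, iterations=200, max_step=3, seed=0):
    """Accept randomized steps along the pseudo-gradient only if they improve."""
    rng = random.Random(seed)
    idx = list(start)
    best = cost(values_at(grids, idx))
    for _ in range(iterations):
        direction = pseudo_gradient(cost, grids, idx)
        trial = [
            min(len(grid) - 1, max(0, i + d * rng.randint(1, max_step)))
            for grid, i, d in zip(grids, idx, direction)
        ]
        trial_cost = cost(values_at(grids, trial))
        if trial_cost < best:
            idx, best = trial, trial_cost
    return values_at(grids, idx), best

# Hypothetical non-uniform grids for a width multiplier and a gate length.
GRIDS = [[1, 2, 3, 5, 8, 13, 21], [0.10, 0.13, 0.18, 0.25, 0.35, 0.50]]

def toy_cost(params):
    width, length = params
    return (width - 8) ** 2 + (100 * (length - 0.25)) ** 2
```

Because every candidate is built from grid indices, the result always lies on the grid, which avoids the rounding step that a continuous optimizer would need before layout.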
Citations: 6
Is there always performance overhead for regular fabric?
Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751916
Yi-Wei Lin, M. Marek-Sadowska, W. Maly, A. Pfitzner, D. Kasprowicz
In this paper, we study circuits built from super-regular, high-density transistor arrays that can be prefabricated and customized using an OPC-free interconnect manufacturing process. The super-regular layout style greatly enhances the chip's manufacturability. Unlike other regular fabrics that sacrifice area and performance to improve regularity, the new layout style, combined with a new 3-D geometry transistor, makes it possible to produce circuits with timing and power density comparable to or better than those of conventional CMOS circuits while using less chip area.
Citations: 23