
2007 25th International Conference on Computer Design: Latest Publications

A technique for selecting CMOS transistor orders
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601936
T. Chiang, C. Y. Chen, Weiyu Chen
Transistor reordering is known to be effective at reducing circuit delay with nearly zero penalty. However, techniques for determining good transistor orders have not been proposed in the literature. Previous work has had to resort to running SPICE for all meaningful transistor orders and selecting the best one, which is extremely time-consuming. This paper proposes an efficient and accurate technique for determining the best transistor orders without running SPICE simulations. Experimental results from SPICE3 show that the predictions are very accurate.
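The abstract does not describe the selection technique itself, so the sketch below only illustrates why transistor order matters: it enumerates the input orderings of a series NMOS pull-down stack and ranks them with a textbook Elmore-style RC estimate. The on-resistances and node capacitances are invented illustrative values, input arrival times are ignored, and this is not the paper's method.

```python
from itertools import permutations

def elmore_stack_delay(order, r_on, node_caps):
    """Elmore-style discharge delay of a series NMOS stack.

    order     -- transistor names listed from the ground end up to the output
    r_on      -- dict: name -> on-resistance of that transistor
    node_caps -- capacitance at the node just above position i (last = gate output);
                 these stay with the positions, not with the transistors
    """
    delay, r_to_gnd = 0.0, 0.0
    for name, cap in zip(order, node_caps):
        r_to_gnd += r_on[name]      # resistance of the discharge path from this node to ground
        delay += cap * r_to_gnd
    return delay

# Hypothetical 3-input NAND pull-down stack (illustrative numbers only).
r_on = {"A": 1.0, "B": 1.4, "C": 0.8}   # normalized on-resistances
node_caps = [0.3, 0.3, 2.0]             # two internal nodes plus a heavily loaded output

ranking = sorted(permutations(r_on), key=lambda p: elmore_stack_delay(p, r_on, node_caps))
for p in ranking:
    print(p, round(elmore_stack_delay(p, r_on, node_caps), 3))
print("best order (ground -> output):", ranking[0])
```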
{"title":"A technique for selecting CMOS transistor orders","authors":"T. Chiang, C. Y. Chen, Weiyu Chen","doi":"10.1109/ICCD.2007.4601936","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601936","url":null,"abstract":"Transistor reordering has been known to be effective in reducing delays of a circuit with nearly zero penalties. However, techniques to determine good transistor orders have not been proposed in literature. Previous work on this has to resort to running SPICE for all meaningful transistor orders and selecting a best one, which is extremely time-consuming. This paper proposes an efficient and accurate technique for determining best transistor orders without running SPICE simulations. Experimental results from SPICE3 show that the predictions are very accurate.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"5 1","pages":"438-443"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84995446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Dynamically compressible context architecture for low power coarse-grained reconfigurable array
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601930
Yoonjin Kim, R. Mahapatra
Most coarse-grained reconfigurable array architectures (CGRAs) are composed of reconfigurable ALU arrays and a configuration cache (or context memory) to achieve high performance and flexibility. In particular, the configuration cache is the main component in a CGRA that enables dynamic reconfiguration in every cycle. However, the frequent memory-read operations required for dynamic reconfiguration consume considerable power. Reducing the power of the configuration cache has therefore become critical for CGRAs to remain competitive and reliable in embedded systems. In this paper, we propose a dynamically compressible context architecture for saving power in the configuration cache. This power-efficient context architecture works without degrading the performance or flexibility of the CGRA. Experimental results show that the proposed approach saves up to 39.72% of the configuration-cache power with negligible area overhead.
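The compression scheme is not spelled out in the abstract. As a generic illustration of the underlying idea — fetching only the parts of a context word that change between cycles — the sketch below counts bits read from the context memory as a crude power proxy. The field layout, the context stream, and the proxy itself are assumptions, not the paper's design.

```python
# Crude illustration: fetch only changed fields of each context word and use
# bits-read as a proxy for configuration-cache read power.
FIELDS = {"opcode": 4, "src_mux": 3, "dst_mux": 3, "const": 16}   # hypothetical PE context fields

def bits_read(contexts):
    full = compressed = 0
    prev = None
    for ctx in contexts:
        full += sum(FIELDS.values())                 # baseline: read the whole word every cycle
        if prev is None:
            compressed += sum(FIELDS.values())       # first cycle has nothing to reuse
        else:
            # read a 1-bit "changed" flag per field, plus only the fields that actually changed
            compressed += len(FIELDS)
            compressed += sum(w for f, w in FIELDS.items() if ctx[f] != prev[f])
        prev = ctx
    return full, compressed

stream = [
    {"opcode": 3, "src_mux": 1, "dst_mux": 2, "const": 100},
    {"opcode": 3, "src_mux": 1, "dst_mux": 2, "const": 100},   # identical -> flags only
    {"opcode": 5, "src_mux": 1, "dst_mux": 2, "const": 100},   # one field changes
    {"opcode": 5, "src_mux": 4, "dst_mux": 2, "const": 100},
]
full, comp = bits_read(stream)
print(f"bits read: baseline={full}, compressed={comp} ({100*(1-comp/full):.1f}% fewer)")
```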
{"title":"Dynamically compressible context architecture for low power coarse-grained reconfigurable array","authors":"Yoonjin Kim, R. Mahapatra","doi":"10.1109/ICCD.2007.4601930","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601930","url":null,"abstract":"Most of the coarse-grained reconfigurable array architectures (CGRAs) are composed of reconfigurable ALU arrays and configuration cache (or context memory) to achieve high performance and flexibility. Specially, configuration cache is the main component in CGRA that provides distinct feature for dynamic reconfiguration in every cycle. However, frequent memory-read operations for dynamic reconfiguration cause much power consumption. Thus, reducing power in configuration cache has become critical for CGRA to be more competitive and reliable for its use in embedded systems. In this paper, we propose dynamically compressible context architecture for power saving in configuration cache. This power-efficient design of context architecture works without degrading the performance and flexibility of CGRA. Experimental results show that the proposed approach saves up to 39.72% power in configuration cache with negligible area overhead.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"1 1","pages":"395-400"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85328436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Cache replacement based on reuse-distance prediction
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601909
G. Keramidas, Pavlos Petoumenos, S. Kaxiras
Several cache management techniques have been proposed that indirectly try to base their decisions on cacheline reuse distance, such as Cache Decay, which is a postdiction of reuse distances: if a cacheline has not been accessed for some "decay interval", we know that its reuse distance is at least as large as this decay interval. In this work, we propose to directly predict reuse distances via instruction-based (PC) prediction and to use this information for cache-level optimizations. In this paper, we choose the replacement policy of the L2 cache as our optimization target, because the gap between LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reuse-distance based replacement policy using a subset of the most memory-intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
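A minimal sketch of the general idea (not the authors' predictor or cache model): a table indexed by the accessing instruction's PC keeps a running estimate of reuse distance, and on a miss the policy evicts the line whose predicted next use lies farthest in the future.

```python
from collections import defaultdict

class ReuseDistanceCache:
    """Toy fully-associative cache that evicts the line predicted to be reused last.

    The predictor is a per-PC running average of observed reuse distances
    (measured in accesses); this only sketches the general mechanism.
    """
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}                       # addr -> (pc of last access, time of last access)
        self.pred = defaultdict(lambda: 0.0)  # pc -> predicted reuse distance
        self.last_seen = {}                   # addr -> time of last access (for training)
        self.time = 0

    def access(self, pc, addr):
        self.time += 1
        # train the predictor with the observed reuse distance of this address
        if addr in self.last_seen:
            observed = self.time - self.last_seen[addr]
            self.pred[pc] = 0.5 * self.pred[pc] + 0.5 * observed
        self.last_seen[addr] = self.time

        hit = addr in self.lines
        if not hit and len(self.lines) >= self.num_lines:
            # evict the line whose predicted next use is farthest away
            def next_use(item):
                _, (line_pc, t_last) = item
                return t_last + self.pred[line_pc] - self.time
            victim = max(self.lines.items(), key=next_use)[0]
            del self.lines[victim]
        self.lines[addr] = (pc, self.time)
        return hit

cache = ReuseDistanceCache(num_lines=2)
trace = [(0x10, "A"), (0x20, "B"), (0x10, "A"), (0x30, "C"), (0x10, "A")]
print([cache.access(pc, addr) for pc, addr in trace])   # hit/miss per access
```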
{"title":"Cache replacement based on reuse-distance prediction","authors":"G. Keramidas, Pavlos Petoumenos, S. Kaxiras","doi":"10.1109/ICCD.2007.4601909","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601909","url":null,"abstract":"Several cache management techniques have been proposed that indirectly try to base their decisions on cacheline reuse-distance, like Cache Decay which is a postdiction of reuse-distances: if a cacheline has not been accessed for some ldquodecay intervalrdquo we know that its reuse-distance is at least as large as this decay interval. In this work, we propose to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between the LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reusedistance based replacement policy using a subset of the most memory intensive SPEC2000 and our results show significant benefits across the board.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"97 1","pages":"245-250"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90815539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 128
Power reduction of chip multi-processors using shared resource control cooperating with DVFS
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601961
Ryoma Watanabe, Masaaki Kondo, Hiroshi Nakamura, T. Nanya
This paper presents a novel power reduction method for chip multi-processors (CMPs) under real-time constraints. While the power consumption of processing units (PUs) on CMPs can be reduced without violating real-time constraints by dynamic voltage and frequency scaling (DVFS), the clock frequency of each PU cannot be determined independently because of the performance impact caused by contention for shared resources. To minimize power consumption in this situation, we first derive an analytical model that provides the optimal priority and clock-frequency settings, and then propose a method of controlling the priority of shared-resource accesses in cooperation with DVFS. From the analytical model, we show that in dual-core CMPs the total power consumption is minimized when the clock frequencies of the two PUs are the same. An experiment with a synthetic benchmark supports the validity of the analytical model, and evaluation results with real applications show that the proposed method reduces power consumption by up to 15%, and by 6.7% on average, compared with a conventional DVFS technique.
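The equal-frequency result can be sanity-checked with a toy convexity argument. The sketch below assumes a cubic power law and a combined throughput constraint, which is a deliberately simplified stand-in for the paper's shared-resource model; the constants and the constraint form are assumptions, not the authors' analytical model.

```python
# Toy convexity check (not the paper's model): with a cubic power law P(f) = k * f**3
# and a combined throughput requirement f1 + f2 = F, the sum P(f1) + P(f2) is
# minimized at f1 = f2.  This only illustrates why symmetric frequency settings
# tend to win under convex power curves.
def total_power(f1, f2, k=1.0):
    return k * (f1 ** 3 + f2 ** 3)

F = 2.0  # required combined frequency (arbitrary units)
best = min(((f1, F - f1) for f1 in [i / 100 for i in range(1, 200)]),
           key=lambda pair: total_power(*pair))
print("best split:", best, "power:", round(total_power(*best), 4))
print("equal-split power:", total_power(F / 2, F / 2))
```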
{"title":"Power reduction of chip multi-processors using shared resource control cooperating with DVFS","authors":"Ryoma Watanabe, Masaaki Kondo, Hiroshi Nakamura, T. Nanya","doi":"10.1109/ICCD.2007.4601961","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601961","url":null,"abstract":"This paper presents a novel power reduction method for chip multi-processors (CMPs) under real-time constraints. While the power consumption of processing units (PUs) on CMPs can be reduced without violating real-time constraints by dynamic voltage and frequency scaling (DVFS), the clock frequency of each PU cannot be determined independently because of the performance impact caused by the conflict for the shared resources. To minimize power consumption in this situation, we first derive an analytical model which provides the optimal priority and clock frequency setting, and then propose a method of controlling the priority of shared resource accesses in cooperation with DVFS. From the analytical model, in dual-core CMPs, we reveal that the total power consumption is minimized when the clock frequency of two PUs becomes the same. An experiment with a synthetic benchmark supports the validity of the analytical model and the evaluation results with real applications show that the proposed method reduces the power consumption by up to 15% and 6.7% on average compared with a conventional DVFS technique.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"96 1","pages":"615-622"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86609224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
FPGA routing architecture analysis under variations
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601894
S. Srinivasan, P. Mangalagiri, Yuan Xie, N. Vijaykrishnan
Systems that combine the features of ASICs and field-programmable gate arrays (FPGAs) are increasingly being considered technology forerunners because of their extraordinary benefits. This drags FPGAs into the technology-scaling race along with ASICs, exposing the FPGA industry to the problems associated with scaling. Extensive process variation is one such issue, and it directly impacts the profit margins of hardware designs beyond the 65 nm gate-length technology node. Since the resources in FPGAs are dominated primarily by the interconnect fabric, variations in the interconnect that impact critical-path timing and leakage yield need rigorous analysis. In this work we provide a statistical model of the individual routing components in an FPGA, followed by a statistical methodology to analyze the timing and leakage distributions. This statistical model is incorporated into the routing algorithm to form a new statistically intelligent routing algorithm (SIRA), which simultaneously optimizes the leakage and timing yield of the FPGA device. We demonstrate an average leakage-yield increase of 9% and a timing-yield increase of 11% using our final algorithm.
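As a rough illustration of this kind of statistical analysis (not the paper's model or SIRA itself), the sketch below samples per-segment delay and leakage variations for one routed path with Monte Carlo and reports timing and leakage yield against fixed cutoffs; all distributions and thresholds are invented.

```python
import math
import random

def monte_carlo_yield(n_trials=20000, n_segments=8, seed=1,
                      delay_mu=1.0, delay_sigma=0.08, leak_sigma=0.25,
                      delay_cutoff=8.3, leak_cutoff=9.5):
    """Crude Monte Carlo yield estimate for one routed FPGA path.

    Each routing segment gets an independent Gaussian delay variation and a
    lognormal leakage variation; all distributions and cutoffs are invented.
    """
    random.seed(seed)
    timing_pass = leakage_pass = 0
    for _ in range(n_trials):
        delay = sum(random.gauss(delay_mu, delay_sigma) for _ in range(n_segments))
        leakage = sum(math.exp(random.gauss(0.0, leak_sigma)) for _ in range(n_segments))
        timing_pass += delay <= delay_cutoff
        leakage_pass += leakage <= leak_cutoff
    return timing_pass / n_trials, leakage_pass / n_trials

t_yield, l_yield = monte_carlo_yield()
print(f"timing yield ~ {t_yield:.3f}, leakage yield ~ {l_yield:.3f}")
```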
{"title":"FPGA routing architecture analysis under variations","authors":"S. Srinivasan, P. Mangalagiri, Yuan Xie, N. Vijaykrishnan","doi":"10.1109/ICCD.2007.4601894","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601894","url":null,"abstract":"Systems with the combined features of ASICs and field programmable gate arrays(FPGAs) are increasingly being considered as technology forerunners looking at their extraordinary benefits. This drags FPGAs into the technology scaling race along with ASICs exposing the FPGA industries to the problems associated with scaling. Extensive process variations is one such issue which directly impacts the profit margins of hardware design beyond 65 nm gate length technology. Since the resources in FPGAs are primarily dominated by the interconnect fabric, variations in the interconnect impacting the critical path timing and leakage yield needs rigorous analysis. In this work we provide a statistical modeling of individual routing components in an FPGA followed by a statistical methodology to analyze the timing and leakage distribution. This statistical model is incorporated into the routing algorithm to model a new statistically intelligent routing algorithm (SIRA), which simultaneously optimizes the leakage and timing yield of the FPGA device. We demonstrate and average leakage yield increase of 9% and timing yield by 11% using our final algorithm.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"100 1","pages":"152-157"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87002578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Negative-skewed shadow registers for at-speed delay variation characterization
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601924
Jie Li, J. Lach
The increased process, voltage, and temperature (PVT) variability that comes with integrated circuit (IC) technology scaling has become a major problem in the semiconductor industry. In order to refine manufacturing processes and develop circuit design techniques to cope with variability, we must be able to accurately and precisely characterize the variations that occur. In this paper, we introduce a technique for characterizing combinational path delay variations by measuring a designer-controlled number of register-to-register delays in manufactured ICs with negative-skewed shadow registers. This technique enables delay measurements to be performed with at-speed tests that are run in parallel with and are orthogonal to other testing techniques, and therefore does not add combinatorial complexity to the testing process. This technique can be implemented cost-effectively on a large number of otherwise unobservable internal combinational paths to get accurate, precise data about delay variability.
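A toy behavioral model of the measurement idea, with invented timing numbers: the shadow register is clocked earlier than the main register by a programmable skew, and sweeping the skew until the two registers disagree brackets the combinational path delay. This is only a sketch of the principle, not the paper's circuit.

```python
def characterize_path_delay(true_delay, clock_period, skew_steps):
    """Toy model of delay bracketing with a negative-skewed shadow register.

    The shadow register is clocked `skew` earlier than the main register, so it
    captures correctly only if the combinational path settles within
    (clock_period - skew).  Sweeping the skew until the shadow and main registers
    disagree brackets the path delay.  All numbers are invented for illustration.
    """
    last_passing_window = clock_period           # skew = 0 is the plain capture edge
    for skew in sorted(skew_steps):              # sweep toward larger negative skew
        window = clock_period - skew
        if true_delay <= window:                 # shadow still agrees with the main register
            last_passing_window = window
        else:                                    # first disagreement: delay exceeds this window
            return window, last_passing_window
    return 0.0, last_passing_window              # never failed within the sweep

lo, hi = characterize_path_delay(true_delay=0.74, clock_period=1.0,
                                 skew_steps=[i * 0.05 for i in range(11)])
print(f"path delay bracketed in ({lo:.2f}, {hi:.2f}] time units")
```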
{"title":"Negative-skewed shadow registers for at-speed delay variation characterization","authors":"Jie Li, J. Lach","doi":"10.1109/ICCD.2007.4601924","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601924","url":null,"abstract":"The increased process, voltage, and temperature (PVT) variability that comes with integrated circuit (IC) technology scaling has become a major problem in the semiconductor industry. In order to refine manufacturing processes and develop circuit design techniques to cope with variability, we must be able to accurately and precisely characterize the variations that occur. In this paper, we introduce a technique for characterizing combinational path delay variations by measuring a designer-controlled number of register-to-register delays in manufactured ICs with negative-skewed shadow registers. This technique enables delay measurements to be performed with at-speed tests that are run in parallel with and are orthogonal to other testing techniques, and therefore does not add combinatorial complexity to the testing process. This technique can be implemented cost-effectively on a large number of otherwise unobservable internal combinational paths to get accurate, precise data about delay variability.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"32 1","pages":"354-359"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88482713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Constraint satisfaction in incremental placement with application to performance optimization under power constraints
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601910
Huan Ren, S. Dutt
We present new techniques for explicit constraint satisfaction in the incremental placement process. Our algorithm employs a Lagrangian relaxation (LR) type approach in the analytical global placement stage to solve the constrained optimization problem. We establish theoretical results that prove the optimality of this stage. In the detailed placement stage, we develop a constraint-monitoring and satisfaction mechanism within a recently proposed network-flow-based detailed placement framework, and empirically show its near-optimality. We establish the effectiveness of our general constraint-satisfaction methods by applying them to the problem of timing-driven optimization under power constraints. We overlay our algorithms on a recently developed unconstrained timing-driven incremental placement method, flow-place. On a large number of benchmarks with up to 210K cells, our constraint-satisfaction algorithms obtain an average timing improvement of 12.4% under a 3% power-increase limit (the actual average power increase incurred is only 2.1%), while the original unconstrained method gives an average power increase of 8.4% for a timing improvement of 17.3%. Our techniques thus yield a tradeoff of a 75% power improvement for a 28% timing deterioration under the given constraint. Our constraint-satisfying incremental placer is also quite fast; for example, its run time for the 210K-cell circuit ibm18 is only 1541 seconds.
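The abstract names a Lagrangian-relaxation formulation for the constrained global-placement stage. The sketch below shows only the generic LR pattern on a made-up one-dimensional problem (a timing-like cost with a power-like budget); the objective, constraint, and step size are assumptions, not the authors' formulation.

```python
def lagrangian_relaxation(constraint, solve_inner, lam=0.0, step=0.5, iters=50):
    """Generic Lagrangian-relaxation loop (a sketch, not the paper's formulation).

    For a fixed multiplier `lam`, solve_inner(lam) returns the minimizer of
    cost(x) + lam * constraint(x); the multiplier is then updated with a
    subgradient step on the constraint violation and clamped at zero.
    """
    x = None
    for _ in range(iters):
        x = solve_inner(lam)
        lam = max(0.0, lam + step * constraint(x))
    return x, lam

# Made-up 1-D instance: minimize a timing-like cost (x - 4)^2 subject to a
# power-like budget x <= 2, i.e. constraint(x) = x - 2 <= 0.
constraint = lambda x: x - 2.0
# The inner problem min_x (x - 4)^2 + lam * (x - 2) has the closed form x = 4 - lam / 2.
solve_inner = lambda lam: 4.0 - lam / 2.0

x_opt, lam_opt = lagrangian_relaxation(constraint, solve_inner)
print(f"x -> {x_opt:.3f} (constrained optimum is 2.0), multiplier -> {lam_opt:.3f}")
```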
{"title":"Constraint satisfaction in incremental placement with application to performance optimization under power constraints","authors":"Huan Ren, S. Dutt","doi":"10.1109/ICCD.2007.4601910","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601910","url":null,"abstract":"We present new techniques for explicit constraint satisfaction in the incremental placement process. Our algorithm employs a Lagrangian relaxation (LR) type approach in the analytical global placement stage to solve the constrained optimization problem. We establish theoretical results that prove the optimality of this stage. In the detailed placement stage, we develop a constraint-monitoring and satisfaction mechanism in a network (n/w) flow based detailed placement framework proposed recently, and empirically show its near-optimality. We establish the effectiveness of our general constraint-satisfaction methods by applying them to the problem of timing-driven optimization under power constraints. We overlay our algorithms on a recently developed unconstrained timing-driven incremental placement method flow-place. On a large number of benchmarks with up to 210K cells, our constraint satisfaction algorithms obtain an average timing improvement of 12.4% under a 3% power increase limit (the actual average power increase incurred is only 2.1%), while the original unconstrained method gives an average power increase of 8.4% for a timing improvement of 17.3%. Our techniques thus yield a tradeoff of 75% power improvement to 28% timing deterioration for the given constraint. Our constraint-satisfying incremental placer is also quite fast, e.g., its run time for the 210 K-cell circuit ibm18 is only 1541 secs.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"66 1","pages":"251-258"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79532801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Exploring the interplay of yield, area, and performance in processor caches
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601905
Hyunjin Lee, Sangyeun Cho, B. Childers
The deployment of future deep submicron technology calls for a careful review of existing cache organizations and design practices in terms of yield and performance. This paper presents a cache design flow that enables processor architects to consider yield, area, and performance (YAP) together in a unified framework. Since there is a complex, changing trade-off between these metrics depending on the technology, the cache organization, and the yield enhancement scheme employed, such a design flow becomes invaluable to processor architects when they assess a design and explore the design space quickly at an early stage. We develop a complete set of tools supporting the proposed design flow, from injecting defects into a wafer to evaluating program performance of individual processors in the wafer. A case study is presented to demonstrate the effectiveness of the proposed design flow and developed tools.
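The toolchain itself is not reproduced here; the sketch below illustrates the defect-injection step in its simplest form, scattering random block failures over dies and counting how many dies still have a usable cache when up to one way per set can be disabled. All rates and dimensions are invented, and this is not the paper's flow.

```python
import random

def wafer_yield(n_dies=500, sets=64, ways=8, defect_rate=0.01, spare_ways=1, seed=7):
    """Toy defect-injection yield model (all parameters invented).

    Each cache block fails independently with probability `defect_rate`.
    A die counts as good if every set still has at least (ways - spare_ways)
    working ways, i.e. the faulty blocks can be covered by disabling up to
    `spare_ways` ways per set.
    """
    random.seed(seed)
    good = 0
    for _ in range(n_dies):
        die_ok = True
        for _ in range(sets):
            faulty_ways = sum(random.random() < defect_rate for _ in range(ways))
            if faulty_ways > spare_ways:
                die_ok = False
                break
        good += die_ok
    return good / n_dies

print("yield with no repair   :", wafer_yield(spare_ways=0))
print("yield with 1 spare way :", wafer_yield(spare_ways=1))
```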
{"title":"Exploring the interplay of yield, area, and performance in processor caches","authors":"Hyunjin Lee, Sangyeun Cho, B. Childers","doi":"10.1109/ICCD.2007.4601905","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601905","url":null,"abstract":"The deployment of future deep submicron technology calls for a careful review of existing cache organizations and design practices in terms of yield and performance. This paper presents a cache design flow that enables processor architects to consider yield, area, and performance (YAP) together in a unified framework. Since there is a complex, changing trade-off between these metrics depending on the technology, the cache organization, and the yield enhancement scheme employed, such a design flow becomes invaluable to processor architects when they assess a design and explore the design space quickly at an early stage. We develop a complete set of tools supporting the proposed design flow, from injecting defects into a wafer to evaluating program performance of individual processors in the wafer. A case study is presented to demonstrate the effectiveness of the proposed design flow and developed tools.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"51 1","pages":"216-223"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86469673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Automatic SystemC TLM generation for custom communication platforms
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601878
Lochi Yu, S. Abdi
This paper presents a tool for the automatic generation of transaction-level models (TLMs) in SystemC for MPSoC designs with custom communication platforms. The MPSoC platform is captured as a graphical net-list of components, busses, and bridge elements. The application is captured as C processes mapped to the platform components. Once the platform is decided, a set of transaction-level communication APIs is automatically generated for each application C process. After the C code is input, an executable SystemC TLM of the design is automatically generated by our tool. This TLM can be executed using standard SystemC simulators for early functional verification of the design. Although several TLM styles and standards have been proposed in the past, our approach differs in that designers do not need to understand the underlying SystemC code or TLM modeling style to verify that their application executes on the selected platform. Another key advantage of our tool is that the platform can be easily customized for the application, and a new TLM for that platform can be generated automatically. The TLM can be used to program the custom platform early in the design cycle, before the components are available. Our experimental results demonstrate that for large industrial applications such as an MP3 decoder and H.264, high-speed TLMs can be generated for several platforms in a few seconds.
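A heavily simplified illustration of the generation idea: walk a platform net-list description and emit per-process communication API stubs for the channels each mapped C process uses. The net-list format and the emitted function names below are invented for illustration, not the tool's actual input or output.

```python
# Toy illustration of TLM API generation: for each C process mapped to a platform
# component, emit simple send/receive wrapper declarations for the channels it uses.
PLATFORM = {
    "CPU0": {"process": "mp3_decoder", "channels": ["bus0_to_dsp", "bus0_to_mem"]},
    "DSP0": {"process": "filter",      "channels": ["bus0_to_dsp"]},
}

def emit_tlm_api(platform):
    out = []
    for component, info in platform.items():
        out.append(f"// communication API for process '{info['process']}' on {component}")
        for ch in info["channels"]:
            out.append(f"void {info['process']}_send_{ch}(const void* data, unsigned len);")
            out.append(f"void {info['process']}_recv_{ch}(void* data, unsigned len);")
        out.append("")
    return "\n".join(out)

print(emit_tlm_api(PLATFORM))
```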
{"title":"Automatic SystemC TLM generation for custom communication platforms","authors":"Lochi Yu, S. Abdi","doi":"10.1109/ICCD.2007.4601878","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601878","url":null,"abstract":"This paper presents a tool for automatic generation of transaction level models (TLMs) in SystemC for MPSoC designs with custom communication platforms. The MPSoC platform is captured as a graphical net-list of components, busses and bridge elements. The application is captured as C processes mapped to the platform components. Once the platform is decided, a set of transaction level communication APIs is automatically generated for each application C process. After the C code is input, an executable SystemC TLM of the design is automatically generated using our tool. This TLM can be executed using standard SystemC simulators for early functional verification of the design. Although, several TLM styles and standards have been proposed in the past, our approach differs in the fact that the designers do not need to understand the underlying SystemC code or TLM modeling style to verify that their application executes on the selected platform. Another key advantage of our tool is that the platform can be easily customized for the application and a new TLM for that platform can be automatically generated. The TLM can be used to program the custom platform early in the design cycle before the components are available. Our experimental results demonstrate that for large industrial applications such as MP3 decoder and H.264, high-speed TLMs can be generated for several platforms in a few seconds.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"59 1","pages":"41-46"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91538619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A radix-10 SRT divider based on alternative BCD codings
Pub Date : 2007-10-01 DOI: 10.1109/ICCD.2007.4601914
Álvaro Vázquez, E. Antelo, P. Montuschi
In this paper we present the algorithm and architecture of a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed up radix-2^k division, such as a signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save estimate of the partial remainder. To optimize area and latency for decimal, we include novel features such as the use of alternative BCD codings to represent decimal operands, estimates by truncation at any binary position inside a decimal digit, a single customized fast carry-propagate decimal adder for partial-remainder computation, initial odd-multiple generation and final normalization with rounding, and register placement that exploits advanced high-fanin mux-latch circuits. Rough area-delay estimations show that the proposed divider has similar latency but lower hardware complexity (a 1.3 area ratio) than a recently published high-performance digit-by-digit implementation.
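The SRT selection function with signed digits and carry-save remainder estimates is beyond a short sketch, but the underlying radix-10 digit recurrence is easy to show: each iteration multiplies the partial remainder by ten and subtracts the largest fitting multiple of the divisor, producing one decimal quotient digit. The restoring variant below omits the paper's redundant digit set and constant-comparison selection.

```python
from fractions import Fraction

def radix10_divide(x, d, digits):
    """Radix-10 digit-recurrence division (simple restoring variant, not SRT).

    Operands are assumed normalized so that 0 <= x < d; each iteration produces
    one decimal quotient digit q_i, with the remainder kept exact via Fraction.
    """
    w = Fraction(x)
    d = Fraction(d)
    q = []
    for _ in range(digits):
        w *= 10                # shift the partial remainder one decimal position
        qi = int(w // d)       # largest multiple of d not exceeding w (always 0..9 here)
        w -= qi * d            # recurrence: w_{i+1} = 10*w_i - q_i*d
        q.append(qi)
    return q

# 0.3 / 0.7 = 0.428571..., so the first digits should be 4, 2, 8, 5, 7, 1
print(radix10_divide(Fraction(3, 10), Fraction(7, 10), digits=6))
```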
{"title":"A radix-10 SRT divider based on alternative BCD codings","authors":"Álvaro Vázquez, E. Antelo, P. Montuschi","doi":"10.1109/ICCD.2007.4601914","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601914","url":null,"abstract":"In this paper we present the algorithm and architecture a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed-up radix-2k division such as signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save estimate of the partial remainder. To optimize area and latency for decimal, we include novel features such as the use of alternative BCD codings to represent decimal operands, estimates by truncation at any binary position inside a decimal digit, a single customized fast carry propagate decimal adder for partial remainder computation, initial odd multiple generation and final normalization with rounding, and register placement to exploit advanced high fanin mux-latch circuits. The rough area-delay estimations performed show that the proposed divider has a similar latency but less hardware complexity (1.3 area ratio) than a recently published high performance digit-by-digit implementation.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"69 1","pages":"280-287"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91176668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29