Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors: Latest Articles

Methodologies and tools for pipelined on-chip interconnect

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106763

L. Scheffer

As processes shrink, gate delay improves much faster than the delay in long wires. Therefore, the long wires increasingly determine the maximum clock rate, and hence performance, of more and more chips. One solution to this problem is to pipeline the global interconnect, enabling the whole chip to run at the speed of local operations. While known to work well, this optimization is seldom used because of practical difficulties - it is hard to change the RTL, test vectors become invalid, and it's hard to prove correctness of any changes. Here we look at some ways these difficulties could be overcome.

Citations: 43

Branch behavior of a commercial OLTP workload on Intel IA32 processors

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106777

M. Annavaram, T. Diep, John Paul Shen

This paper presents a detailed branch characterization of an Oracle-based commercial on-line transaction processing workload, Oracle Database Benchmark (ODB), running on an IA32 processor. We ran a well-tuned ODB on Simics, a full system simulator, to collect the instruction traces used in this study. We compare the branch behavior of ODB with the branch behaviors of gcc, gzip and mcf from the SPECINT 2000 benchmark suite. Contrary to the popular belief that databases have unpredictable branches, we show that, with larger predictors that capture enough branch history information and with branch prediction schemes that reduce aliasing, conditional branches in ODB are more predictable than in gcc, gzip and mcf. Due to frequent context switching in ODB, a hardware return address stack is ineffective in predicting return addresses for ODB. Based on further analysis, we propose and evaluate an enhanced return address predictor, which reduces return address mispredictions in ODB by 40%.
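
The effect of context switching on a return address stack can be illustrated with a small model. This is only a sketch under stated assumptions, not the predictor evaluated in the paper: the class `ReturnAddressStack`, the helper `count_mispredictions`, the trace format, and the default depth of 8 entries are all hypothetical choices made for illustration. The model assumes one stack shared by all threads, which is why interleaved calls and returns from different threads corrupt the predictions.

```python
class ReturnAddressStack:
    """Toy model of a hardware return address stack (RAS) with a fixed depth."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # When the stack is full, the oldest entry is overwritten.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        # The prediction is the most recently pushed return address, if any.
        return self.entries.pop() if self.entries else None


def count_mispredictions(trace, depth=8):
    """Count mispredicted returns in a trace of ("call" | "ret", address) events."""
    ras = ReturnAddressStack(depth)
    misses = 0
    for kind, address in trace:
        if kind == "call":
            ras.on_call(address)
        elif kind == "ret":
            if ras.predict_return() != address:
                misses += 1
    return misses


# One thread: nested calls return in last-in, first-out order.
single_thread = [("call", 0x10), ("call", 0x20), ("ret", 0x20), ("ret", 0x10)]

# Two threads sharing the stack: a context switch lands between call and return.
interleaved = [("call", 0x10), ("call", 0x90), ("ret", 0x10), ("ret", 0x90)]
```

In this model the single-thread trace is predicted perfectly, while both returns in the interleaved trace are mispredicted, because each return pops an address pushed by the other thread.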

Citations: 15

Power-constrained microprocessor design

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106740

H. P. Hofstee

Power dissipation and power density have become first-order design constraints, even for high-performance systems. For future designs it will be the dominant constraint. In this paper we suggest a systematic approach to optimizing a processor design under (only) a power constraint. The approach uses the energy-performance ratio (EPR) of the various design parameters as the key to identifying opportunities for improving energy-efficiency.

Citations: 29

A 10 Gbps full-AES crypto design with a twisted-BDD S-Box architecture

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106754

S. Morioka, Akashi Satoh

In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0.13 µm CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delays of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.
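
The data dependency that prevents pipelining in CBC mode can be seen in a short sketch. The code is illustrative only: `toy_block_encrypt` is a hypothetical placeholder permutation, not AES and not the authors' circuit, and the function names are assumptions introduced here. What matters is the loop structure: in CBC, each block's cipher input depends on the previous block's ciphertext, so block i+1 cannot enter the cipher until block i has finished, whereas in ECB the blocks are independent.

```python
BLOCK_BYTES = 16  # AES operates on 128-bit (16-byte) blocks


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_block_encrypt(block, key):
    """Placeholder block transform (NOT AES): XOR with the key, rotate by one byte."""
    mixed = xor_bytes(block, key)
    return mixed[1:] + mixed[:1]


def ecb_encrypt(blocks, key):
    """ECB: every block is independent, so a pipeline can overlap them freely."""
    return [toy_block_encrypt(block, key) for block in blocks]


def cbc_encrypt(blocks, key, iv):
    """CBC: each input is XORed with the previous ciphertext before encryption."""
    previous = iv
    ciphertext = []
    for block in blocks:
        # This line is the serial dependency: `previous` must be complete
        # before the next block can start.
        previous = toy_block_encrypt(xor_bytes(block, previous), key)
        ciphertext.append(previous)
    return ciphertext


key = bytes(range(BLOCK_BYTES))
iv = bytes(BLOCK_BYTES)
blocks = [b"A" * BLOCK_BYTES, b"A" * BLOCK_BYTES]
ecb = ecb_encrypt(blocks, key)
cbc = cbc_encrypt(blocks, key, iv)
```

With two identical plaintext blocks, `ecb` contains two identical ciphertext blocks, while the two blocks in `cbc` differ because the second depends on the first. That chaining is why a CBC design has to shorten the latency of a single block rather than overlap several blocks.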

Citations: 55

GPE: a new representation for VLSI floorplan problem

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106745

Chang-Tzu Lin, De-Sheng Chen, Yiwen Wang

In this paper, we propose a new representation for the VLSI floorplan and building block problem. The representation is a generalization of the Polish expression. By introducing a new relational operator, the representation can efficiently reuse area that cannot be utilized when only the vertical and horizontal operators of the Polish expression are available, and it can represent non-slicing floorplan structures. The experimental results show that the representation achieves promising area utilization on commonly used MCNC benchmark circuits.
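
For background, the classical Polish expression that GPE generalizes can be evaluated with a simple stack. The sketch below covers only the two standard slicing operators and does not implement the new operator proposed in the paper. The operator names `"V"` (place side by side) and `"H"` (stack vertically), the function `bounding_box`, and the module sizes are assumptions chosen for illustration.

```python
def bounding_box(postfix, sizes):
    """Return (width, height) of a slicing floorplan given as a postfix expression.

    postfix: list of module names and the operators "V" and "H"
    sizes:   dict mapping module name to (width, height)
    """
    stack = []
    for token in postfix:
        if token in ("V", "H"):
            w2, h2 = stack.pop()
            w1, h1 = stack.pop()
            if token == "V":
                # Side by side: widths add, the taller module sets the height.
                stack.append((w1 + w2, max(h1, h2)))
            else:
                # Stacked: heights add, the wider module sets the width.
                stack.append((max(w1, w2), h1 + h2))
        else:
            stack.append(sizes[token])
    if len(stack) != 1:
        raise ValueError("malformed Polish expression")
    return stack[0]


sizes = {"a": (2, 3), "b": (2, 1), "c": (4, 2)}
width, height = bounding_box(["a", "b", "V", "c", "H"], sizes)
module_area = sum(w * h for w, h in sizes.values())
utilization = module_area / (width * height)
```

Here module `b` is shorter than its neighbor `a`, so the slicing structure leaves unused space beside it and the utilization falls below 1. Wasted area of this kind is what the abstract says a new operator can help reuse.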

Citations: 15

Requirements for automotive system engineering tools

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106795

Joachim Schlosser

The requirements that the automotive industry places on system and software development tools differ from those of other customers. The important catchwords here are heterogeneity of suppliers, tools, and technical background of the engineers, and, partially resulting from these, the overall complexity of the systems that are built. There are multiple suppliers delivering multiple programs and units, and all these are to be integrated into a car that has to meet a huge number of constraints regarding safety, reliability and consumer demands. This paper shows what the design of electric and electronic car systems is and has to be like, and what qualifications the methodology and the process therefore have to meet. From these two points a collection of requirements for the tools and the tool chain is derived, with a special focus on simulation tools.

Citations: 9

Power-performance trade-offs for energy-efficient architectures: A quantitative study

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106766

Hongbo Yang, R. Govindarajan, G. Gao, K. B. Theobald

The drastic increase in power consumption by modern processors emphasizes the need for power-performance trade-offs in architecture design space exploration and compiler optimizations. This paper reports a quantitative study on the power-performance trade-offs in software pipelined schedules for an Itanium-like EPIC architecture with dual-speed pipelines, in which functional units are partitioned into fast ones and slow ones. We have developed an integer linear programming formulation to capture the power-performance tradeoffs for software pipelined loops. The proposed integer linear programming formulation and its solution method have been implemented and tested on a set of SPEC2000 benchmarks. The results are compared with an Itanium-like architecture (baseline) in which there are four functional units (FUs) and all of them are fast units. Our quantitative study reveals that by introducing a few slow FUs in place of fast FUs in the baseline architecture, the total energy consumed by FUs can be considerably reduced. When 2 out of 4 FUs are set as slow, the total energy consumed by FUs is reduced by up to 31.1% (with an average reduction of 25.2%) compared with the baseline configuration, while the performance degradation caused by using slow FUs is small. If performance demand is less critical, then energy reduction of up to 40.3% compared with the baseline configuration can be achieved.

Citations: 8

Cache design for eliminating the address translation bottleneck and reducing the tag area cost

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106791

Yen-Jen Chang, F. Lai, S. Ruan

For physical caches, the address translation delay can be partially masked, but it is hard to avoid completely. In this paper, we propose a cache partition architecture, called paged cache, which not only masks the address translation delay completely but also reduces the tag area dramatically. In the paged cache, we divide the entire cache into a set of partitions, and each partition is dedicated to only one page cached in the TLB. By restricting the range in which the cached block can be placed, we can eliminate the total or partial tag depending on the partition size. In addition, because the paged cache can be accessed without waiting for the generation of the physical address, i.e., the paged cache and the TLB are accessed in parallel, the extended cache access time can be reduced significantly. We use SimpleScalar to simulate SPEC2000 benchmarks and perform HSPICE simulations (with a 0.18 µm technology and a 1.8 V supply voltage) to evaluate the proposed architecture. Experimental results show that the paged cache is very effective in reducing tag area of the on-chip L1 caches, while the average extended cache access time can be improved dramatically.
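
One way to read the tag-elimination claim is through a bit count. The sketch below is a simplified model under loudly stated assumptions, not the authors' design. It assumes power-of-two sizes and a partition that is direct-mapped internally. It also assumes the page number is implied by which partition a block lives in, so only the page-offset bits not covered by the partition need to be stored as a tag. The function names and the example sizes (32-bit addresses, 4 KB pages, a 16 KB cache) are hypothetical.

```python
def log2_exact(value):
    """Base-2 logarithm of a positive power of two, as an integer."""
    if value <= 0 or value & (value - 1):
        raise ValueError("expected a positive power of two")
    return value.bit_length() - 1


def conventional_tag_bits(address_bits, cache_bytes, ways=1):
    """Tag width of a conventional physically tagged cache.

    The index and block-offset bits together address one way of the cache;
    the remaining address bits are stored as the tag.
    """
    return address_bits - log2_exact(cache_bytes // ways)


def paged_cache_tag_bits(page_bytes, partition_bytes):
    """Tag width per block under the assumed paged-cache model.

    The partition identifies the page, so only the part of the page offset
    that the partition cannot index has to be kept as a (partial) tag.
    """
    if partition_bytes >= page_bytes:
        return 0  # the partition covers the whole page: no tag needed
    return log2_exact(page_bytes // partition_bytes)


conventional = conventional_tag_bits(address_bits=32, cache_bytes=16 * 1024)
full_page_partition = paged_cache_tag_bits(page_bytes=4096, partition_bytes=4096)
quarter_page_partition = paged_cache_tag_bits(page_bytes=4096, partition_bytes=1024)
```

Under these assumptions the conventional direct-mapped cache stores 18 tag bits per block. A partition as large as a page needs none, and a quarter-page partition needs 2. This matches the abstract's wording that the tag is eliminated totally or partially depending on the partition size. The model ignores valid bits and whatever bookkeeping ties a partition to its TLB entry.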

Citations: 3

Analysis of the tradeoffs for the implementation of a high-radix logarithm

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106760

José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera

An analysis of the tradeoffs between area and speed for a sequential implementation of a high-radix recurrence for logarithm computation is presented in this paper. The high-radix algorithm is outlined and a sequential architecture is proposed, with the use of selection by rounding of the digits and redundant representation. Estimates of the execution time and total area are obtained for n = 16, 32 and 64 bits of precision and for radix values from r = 8 to r = 1024. An analysis of the tradeoffs between area and speed is presented, showing that the most efficient implementations are obtained with radix r = 256 for 16- and 32-bit computations and r = 128 for 64-bit computations.

Citations: 3

Physical design challenges for billion transistor chips

Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors

Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106751

P. Groeneveld

Advancing process technology will necessitate an even more rigorous automation of the IC design trajectory. The design scale will increase with Moore's law, approaching 1,000,000,000 transistors in the coming years. This enables the design of SoC systems with complexities unprecedented in human history. At the same time the physics of silicon manufacturing is increasing the 'silicon complexity'. Additional design steps are required to address cross talk, voltage drop, antenna rules and others. Much more so than in previous technology nodes, the effects of parasitics must be addressed at various stages of the IC design flow. Nothing less than a full automation of the silicon complexity issues is required to stop the design productivity gap from growing.

Citations: 8
