TAXI: Trace Analysis for x86 Interpretation
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106821
S. Vlaovic, E. Davidson
Although x86 processors have been around for a long time and are the most ubiquitous processors in the world, the amount of academic research regarding details of their performance has been minimal. We introduce an x86 simulation environment, called TAXI (Trace Analysis for X86 Interpretation), and use it to present results for eight Win32 applications. In this paper, we explain the design and implementation of TAXI.
{"title":"TAXI: Trace Analysis for x86 Interpretation","authors":"S. Vlaovic, E. Davidson","doi":"10.1109/ICCD.2002.1106821","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106821","url":null,"abstract":"Although x86 processors have been around for a long time and are the most ubiquitous processors in the world, the amount of academic research regarding details of their performance has been minimal. We introduce an x86 simulation environment, called TAXI (Trace Analysis for X86 Interpretation), and use it to present results for eight Win32 applications. In this paper, we explain the design and implementation of TAXI.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130800408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A 10 Gbps full-AES crypto design with a twisted-BDD S-Box architecture
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106754
S. Morioka, Akashi Satoh
In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0.13 µm CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delays of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.
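The T-Box idea the abstract mentions is the standard AES T-table construction, in which the S-Box and one MixColumns column are folded into a single lookup. The sketch below is an illustrative software model, not the authors' hardware: it rebuilds the S-Box from its GF(2^8) definition and then forms one merged table.

```python
# Hedged sketch: the generic AES T-table (what the paper calls the T-Box
# algorithm), merging SubBytes and MixColumns into one lookup. This is
# textbook AES background, not the paper's circuit implementation.

AES_MOD = 0x11B  # GF(2^8) reduction polynomial x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_MOD
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse in GF(2^8); the inverse of 0 is defined as 0."""
    if a == 0:
        return 0
    for x in range(1, 256):
        if gf_mul(a, x) == 1:
            return x

def sbox(a):
    """AES S-Box: GF(2^8) inverse followed by the affine transform."""
    x = gf_inv(a)
    r = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8)) ^
               (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        r |= bit << i
    return r

# One of the four T-tables: each entry is the S-Box output with the first
# MixColumns column (02, 01, 01, 03) already applied.
T0 = [(gf_mul(sbox(a), 2) << 24) | (sbox(a) << 16) |
      (sbox(a) << 8) | gf_mul(sbox(a), 3) for a in range(256)]

assert sbox(0x00) == 0x63 and sbox(0x53) == 0xED  # known S-Box values
```

The table only illustrates the functional merge; the paper's speedup in hardware comes from restructuring the S-Box logic itself with the twisted-BDD circuit to shorten the critical path.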
{"title":"A 10 Gbps full-AES crypto design with a twisted-BDD S-Box architecture","authors":"S. Morioka, Akashi Satoh","doi":"10.1109/ICCD.2002.1106754","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106754","url":null,"abstract":"In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0. 13 /spl mu/m CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delays of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126260936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cost-effective concurrent test hardware design for linear analog circuits
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106779
S. Ozev, A. Orailoglu
Concurrent detection of failures in analog circuits is becoming increasingly important as safety-critical systems become more widespread. A methodology for the automatic design of concurrent failure detection circuitry for linear analog systems is discussed in this paper. In contrast to previous approaches, the methodology aims at providing coverage of all the circuit components while minimizing the loading overhead by reducing the number of internal circuit nodes that need to be tapped. Parameter tolerances are incorporated through either statistical or mathematical analysis to determine the threshold for the failure alarm. Experimental results confirm that full coverage can be attained while keeping the hardware overhead within a pre-specified budget.
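As a rough illustration of how parameter tolerances can set the failure-alarm threshold statistically, the sketch below runs a Monte Carlo over resistor tolerances for a hypothetical inverting amplifier (my own toy example, not the paper's checker design) and places the alarm threshold just above the largest fault-free deviation.

```python
# Hedged toy example: pick a failure-alarm threshold from component tolerances
# so that tolerance-induced deviations never raise a false alarm.
import random

def amplifier_gain(r1, r2):
    """Ideal inverting-amplifier gain -R2/R1 (illustrative circuit only)."""
    return -r2 / r1

NOMINAL_R1, NOMINAL_R2, TOL = 1e3, 10e3, 0.01   # 1% resistor tolerances

# Monte Carlo over tolerances: how far can a fault-free output deviate?
nominal = amplifier_gain(NOMINAL_R1, NOMINAL_R2)
deviations = []
for _ in range(100_000):
    r1 = NOMINAL_R1 * (1 + random.uniform(-TOL, TOL))
    r2 = NOMINAL_R2 * (1 + random.uniform(-TOL, TOL))
    deviations.append(abs(amplifier_gain(r1, r2) - nominal))

threshold = max(deviations) * 1.1   # small guard band above worst-case spread
print(f"alarm if |gain deviation| > {threshold:.3f}")
```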
{"title":"Cost-effective concurrent test hardware design for linear analog circuits","authors":"S. Ozev, A. Orailoglu","doi":"10.1109/ICCD.2002.1106779","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106779","url":null,"abstract":"Concurrent detection of failures in analog circuits is becoming increasingly more important as safety-critical systems become more widespread. A methodology for the automatic design of concurrent failure detection circuitry for linear analog systems is discussed in this paper In contrast to previous approaches, the methodology aims at providing coverage in terms of all the circuit components while minimizing the loading overhead by reducing the number of internal circuit nodes that need to be tapped Parameter tolerances are incorporated through either statistical or mathematical analysis to determine the threshold for failure alarm. Experimental results confirm that full coverage can be attained while keeping the hardware overhead within a pre-specified budget.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125618593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of scaling on the effectiveness of dynamic power reduction schemes
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106798
D. Duarte, N. Vijaykrishnan, M. J. Irwin, Hyun Suk Kim, G. McFarland
Power is considered to be the major limiter to the design of faster and more complex processors in the near future. In order to address this challenge, a combination of process, circuit design, and micro-architectural changes is required. Consequently, to focus optimization efforts in the right direction, the models proposed and studies performed in this work are a first step toward understanding the relative importance of leakage and dynamic energy in future technologies. Further, we analyze the effectiveness of two energy reduction mechanisms that employ voltage scaling, namely supply and threshold voltage selection. We consider the impact of imminent technology changes and packaging improvements, and show that neglecting the impact of temperature may lead to underestimating power savings by up to 19.5%.
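The trade-offs being quantified can be previewed with textbook first-order models: dynamic energy scales with the square of the supply voltage, subthreshold leakage grows exponentially as the threshold voltage is lowered (and with temperature, which is why ignoring temperature skews the savings estimate), and gate delay follows an alpha-power law. The sketch below uses generic constants of my own choosing, not the paper's calibrated technology models.

```python
# Hedged first-order power/delay models; all constants are illustrative.
import math

def dynamic_energy(C_eff, Vdd, activity=0.1):
    """Switching energy per cycle: activity * C * Vdd^2."""
    return activity * C_eff * Vdd ** 2

def leakage_power(I0, Vth, n=1.5, Vt_thermal=0.026, Vdd=1.0):
    """Subthreshold leakage: grows exponentially as Vth is lowered."""
    return Vdd * I0 * math.exp(-Vth / (n * Vt_thermal))

def gate_delay(k, Vdd, Vth, alpha=1.3):
    """Alpha-power-law delay: lowering Vdd or raising Vth slows gates."""
    return k * Vdd / (Vdd - Vth) ** alpha

# Lowering Vdd from 1.2 V to 1.0 V cuts dynamic energy by (1.0/1.2)^2 ~ 31%
# but slows gates; lowering Vth buys speed back at an exponential leakage cost.
for Vdd, Vth in ((1.2, 0.35), (1.0, 0.35), (1.0, 0.25)):
    print(f"Vdd={Vdd} V, Vth={Vth} V: "
          f"E_dyn={dynamic_energy(1e-9, Vdd):.2e} J, "
          f"P_leak={leakage_power(1e-6, Vth):.2e} W, "
          f"delay={gate_delay(1e-10, Vdd, Vth):.2e} s")
```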
{"title":"Impact of scaling on the effectiveness of dynamic power reduction schemes","authors":"D. Duarte, N. Vijaykrishnan, M. J. Irwin, Hyun Suk Kim, G. McFarland","doi":"10.1109/ICCD.2002.1106798","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106798","url":null,"abstract":"Power is considered to be the major limiter to the design of faster and more complex processors in the near future. In order to address this challenge, a combination of process, circuit design and micro-architectural changes are required Consequently, to focus optimization efforts in the right direction, the models proposed and studies performed in this work are a first step for understanding the relative importance of leakage and dynamic energy in future technologies. Further, we analyze the effectiveness of two energy reduction mechanisms that employ voltage scaling, namely, supply and threshold voltage selection. We consider the impact of imminent technology changes and packaging improvements while showing that neglecting the impact of temperature may lead to underestimating power savings by up to 19.5%.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129647927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power-performance trade-offs for energy-efficient architectures: A quantitative study
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106766
Hongbo Yang, R. Govindarajan, G. Gao, K. B. Theobald
The drastic increase in power consumption by modern processors emphasizes the need for power-performance trade-offs in architecture design space exploration and compiler optimizations. This paper reports a quantitative study on the power-performance trade-offs in software pipelined schedules for an Itanium-like EPIC architecture with dual-speed pipelines, in which functional units are partitioned into fast ones and slow ones. We have developed an integer linear programming formulation to capture the power-performance tradeoffs for software pipelined loops. The proposed integer linear programming formulation and its solution method have been implemented and tested on a set of SPEC2000 benchmarks. The results are compared with an Itanium-like architecture (baseline) in which there are four functional units (FUs) and all of them are fast units. Our quantitative study reveals that by introducing a few slow FUs in place of fast FUs in the baseline architecture, the total energy consumed by FUs can be considerably reduced. When 2 out of 4 FUs are set as slow, the total energy consumed by FUs is reduced by up to 31.1% (with an average reduction of 25.2%) compared with the baseline configuration, while the performance degradation caused by using slow FUs is small. If performance demand is less critical, then energy reduction of up to 40.3% compared with the baseline configuration can be achieved.
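The abstract does not reproduce the formulation itself; as a minimal sketch of the flavor of such an ILP, the toy model below (using the PuLP solver and a hypothetical four-operation loop body, far simpler than the paper's model) assigns each operation to a fast or slow FU so that per-cycle FU usage within the initiation interval is respected while total FU energy is minimized.

```python
# Hedged toy ILP: fast/slow FU assignment for a software-pipelined loop.
# Requires PuLP (pip install pulp); the loop body and costs are made up.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

ops = {'load': 0, 'mul': 0, 'add': 1, 'store': 1}   # op -> cycle within II
II, N_FAST, N_SLOW = 2, 2, 1                        # 2 fast + 1 slow FU
E_FAST, E_SLOW = 3.0, 1.0                           # energy per op on each FU type

prob = LpProblem("fu_assignment", LpMinimize)
fast = {o: LpVariable(f"fast_{o}", cat=LpBinary) for o in ops}
slow = {o: LpVariable(f"slow_{o}", cat=LpBinary) for o in ops}

prob += lpSum(E_FAST * fast[o] + E_SLOW * slow[o] for o in ops)   # total FU energy
for o in ops:
    prob += fast[o] + slow[o] == 1                  # each op runs on one FU type
for c in range(II):
    prob += lpSum(fast[o] for o in ops if ops[o] == c) <= N_FAST  # fast FUs per cycle
    prob += lpSum(slow[o] for o in ops if ops[o] == c) <= N_SLOW  # slow FUs per cycle

prob.solve(PULP_CBC_CMD(msg=False))
print({o: ('fast' if fast[o].value() else 'slow') for o in ops})
```

The real formulation additionally models dependences, latencies of slow FUs, and the resulting initiation interval, which is where the performance side of the trade-off enters.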
{"title":"Power-performance trade-offs for energy-efficient architectures: A quantitative study","authors":"Hongbo Yang, R. Govindarajan, G. Gao, K. B. Theobald","doi":"10.1109/ICCD.2002.1106766","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106766","url":null,"abstract":"The drastic increase in power consumption by modern processors emphasizes the need for power-performance trade-offs in architecture design space exploration and compiler optimizations. This paper reports a quantitative study on the power-performance trade-offs in software pipelined schedules for an Itanium-like EPIC architecture with dual-speed pipelines, in which functional units are partitioned into fast ones and slow ones. We have developed an integer linear programming formulation to capture the power-performance tradeoffs for software pipelined loops. The proposed integer linear programming formulation and its solution method have been implemented and tested on a set of SPEC2000 benchmarks. The results are compared with an Itanium-like architecture (baseline) in which there are four functional units (FUs) and all of them are fast units. Our quantitative study reveals that by introducing a few slow FUs in place of fast FUs in the baseline architecture, the total energy consumed by FUs can be considerably reduced. When 2 out of 4 FUs are set as slow, the total energy consumed by FUs is reduced by up to 31.1% (with an average reduction of 25.2%) compared with the baseline configuration, while the performance degradation caused by using slow FUs is small. If performance demand is less critical, then energy reduction of up to 40.3% compared with the baseline configuration can be achieved.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126100608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPE: a new representation for VLSI floorplan problem
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106745
Chang-Tzu Lin, De-Sheng Chen, Yiwen Wang
In this paper, we propose a new representation for the VLSI floorplan and building-block placement problem. The representation is a generalization of the Polish expression. By introducing a new relational operator, the representation can efficiently reuse area that cannot be utilized with only the vertical and horizontal operators defined in the Polish expression, and it can represent non-slicing floorplan structures. The experimental results show that the representation achieves promising area utilization on the commonly used MCNC benchmark circuits.
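For context, a classical Polish expression encodes a slicing floorplan as a postfix string over blocks and the two cut operators, and its bounding box can be evaluated with a simple stack. The sketch below (with made-up block sizes) shows that baseline evaluation; GPE's additional relational operator for non-slicing structures is not reproduced here.

```python
# Hedged sketch: evaluating a classical (slicing) Polish expression, the
# representation that GPE generalizes. Module sizes are hypothetical.

def eval_polish(expr, sizes):
    """expr: postfix list of module names and operators 'V'/'H'.
    'V' places two subfloorplans side by side, 'H' stacks them vertically.
    Returns the (width, height) of the resulting bounding box."""
    stack = []
    for tok in expr:
        if tok == 'V':    # vertical cut: widths add, heights take the max
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((w1 + w2, max(h1, h2)))
        elif tok == 'H':  # horizontal cut: heights add, widths take the max
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((max(w1, w2), h1 + h2))
        else:
            stack.append(sizes[tok])
    return stack.pop()

sizes = {'a': (2, 3), 'b': (4, 2), 'c': (3, 3)}      # hypothetical blocks
w, h = eval_polish(['a', 'b', 'V', 'c', 'H'], sizes)  # (a beside b) above c
print(w, h, 'area =', w * h)
```

Dead space in such slicing trees (e.g., the gap left beside a short block) is exactly what the new relational operator is meant to let neighboring blocks reuse.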
{"title":"GPE: a new representation for VLSI floorplan problem","authors":"Chang-Tzu Lin, De-Sheng Chen, Yiwen Wang","doi":"10.1109/ICCD.2002.1106745","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106745","url":null,"abstract":"In this paper, we propose a new representation of VLSI floorplan and building block problem. The representation is the generalization of Polish expression. By proposing a new relational operator, the representation can efficiently reuse some area that cannot be utilized if only having vertical and horizontal operators defined in Polish expression, and is able to present non-slicing structural floorplan. The experimental results show that the representation achieves promising area utilization in commonly used MCNC benchmark circuits.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127267843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cache design for eliminating the address translation bottleneck and reducing the tag area cost
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106791
Yen-Jen Chang, F. Lai, S. Ruan
For physical caches, the address translation delay can be partially masked, but it is hard to avoid completely. In this paper, we propose a cache partition architecture, called paged cache, which not only masks the address translation delay completely but also reduces the tag area dramatically. In the paged cache, we divide the entire cache into a set of partitions, and each partition is dedicated to only one page cached in the TLB. By restricting the range in which a cached block can be placed, we can eliminate the whole tag or part of it, depending on the partition size. In addition, because the paged cache can be accessed without waiting for the generation of the physical address, i.e., the paged cache and the TLB are accessed in parallel, the extended cache access time can be reduced significantly. We use SimpleScalar to simulate SPEC2000 benchmarks and perform HSPICE simulations (with a 0.18 µm technology and a 1.8 V supply voltage) to evaluate the proposed architecture. Experimental results show that the paged cache is very effective in reducing the tag area of the on-chip L1 caches, while the average extended cache access time is improved dramatically.
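A back-of-the-envelope way to see the tag savings: in a conventional cache the tag stores every address bit above the index and offset, whereas in a partition dedicated to a single TLB-resident page only the page-offset bits not implied by the block's slot within the partition remain. The configuration below is my own illustrative example, not the paper's.

```python
# Hedged tag-bit estimate for a conventional cache vs. a per-page-partitioned
# cache. Cache, page, and TLB sizes are illustrative choices of mine.
import math

def conventional_tag_bits(addr_bits, cache_bytes, block_bytes, ways):
    sets = cache_bytes // (block_bytes * ways)
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(block_bytes))
    return addr_bits - index_bits - offset_bits

def paged_cache_tag_bits(page_bytes, partition_bytes, block_bytes):
    """Each partition holds blocks of exactly one TLB-resident page, so only
    page-offset bits not implied by the block's slot need to be stored."""
    page_offset_bits = int(math.log2(page_bytes // block_bytes))
    slot_bits = int(math.log2(partition_bytes // block_bytes))
    return max(0, page_offset_bits - slot_bits)

# 16 KB direct-mapped L1, 32 B blocks, 32-bit addresses, 4 KB pages,
# 64 TLB entries -> 16 KB / 64 = 256 B per partition.
print(conventional_tag_bits(32, 16 * 1024, 32, 1))   # 18 tag bits per block
print(paged_cache_tag_bits(4096, 256, 32))           # 4 tag bits per block
```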
{"title":"Cache design for eliminating the address translation bottleneck and reducing the tag area cost","authors":"Yen-Jen Chang, F. Lai, S. Ruan","doi":"10.1109/ICCD.2002.1106791","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106791","url":null,"abstract":"For physical caches, the address translation delay can be partially masked, but it is hard to avoid completely. In this paper, we propose a cache partition architecture, called paged cache, which not only masks the address translation delay completely but also reduces the tag area dramatically. In the paged cache, we divide the entire cache into a set of partitions, and each partition is dedicated to only one page cached in the TLB. By restricting the range in which the cached block can be placed, we can eliminate the total or partial tag depending on the partition size. In addition, because the paged cache can be accessed without waiting for the generation of physical address, i.e., the paged cache and the TLB are accessed in parallel, the extended cache access time can be reduced significantly. We use SimpleScalar to simulate SPEC2000 benchmarks and perform HSPICE simulations (with a 0.18 /spl mu/m technology and 1.8 V voltage supply) to evaluate the proposed architecture. Experimental results show that the paged cache is very effective in reducing tag area of the on-chip Ll caches, while the average extended cache access time can be improved dramatically.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131585799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of the tradeoffs for the implementation of a high-radix logarithm
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106760
José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera
An analysis of the tradeoffs between area and speed for a sequential implementation of a high-radix recurrence for logarithm computation is presented in this paper. The high-radix algorithm is outlined and a sequential architecture is proposed, using selection by rounding of the digits and a redundant representation. Estimates of the execution time and total area are obtained for n = 16, 32, and 64 bits of precision and for radix values from r = 8 to r = 1024. The area-speed analysis shows that the most efficient implementations are obtained for radix r = 256 for 16- and 32-bit computations and r = 128 for 64-bit computations.
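A high-radix digit recurrence of this kind can be modeled in software by repeatedly multiplying the residual toward 1 by factors (1 + d_j r^-j), choosing the digit d_j by rounding the scaled residual, and accumulating -ln(1 + d_j r^-j). The sketch below is a floating-point model under my own simplifications (input prescaled into (0, 2), unbounded digits, math.log in place of the small lookup tables a hardware unit would use); it is not the paper's architecture.

```python
# Hedged software model of a radix-r digit recurrence for ln(x) with digit
# selection by rounding. Input x is assumed prescaled into (0, 2).
import math

def high_radix_log(x, r=128, iterations=10):
    E, L = x, 0.0
    for j in range(1, iterations + 1):
        d = round((1.0 - E) * r ** j)   # selection by rounding of the digit
        f = 1.0 + d / r ** j            # multiplicative correction factor
        E *= f                          # drive the residual E toward 1
        L -= math.log(f)                # since x * prod(f_j) -> 1, L -> ln(x)
    return L

print(high_radix_log(1.7), math.log(1.7))   # both ~0.5306
```

In hardware, the ln(1 + d r^-j) values come from per-iteration tables and the residual is kept in a redundant representation, which is what makes selection by rounding cheap enough to sustain a short cycle time.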
{"title":"Analysis of the tradeoffs for the implementation of a high-radix logarithm","authors":"José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera","doi":"10.1109/ICCD.2002.1106760","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106760","url":null,"abstract":"An analysis of the tradeoffs between area and speed for a sequential implementation of a high-radix recurrence for logarithm computation is presented in this paper The high-radix algorithm is outlined and a sequential architecture is proposed, with the use of selection by rounding of the digits and redundant representation. Estimates of the execution time and total area are obtained for n = 16, 32 and 64 bits of precision and for radix values from r = 8 to r = 1024. An analysis of the tradeoffs between area and speed is presented, showing that the most efficient implementations are obtained for radices r = 256 for 16, 32 bit and r = 128 for 64 bit computations.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132837188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Requirements for automotive system engineering tools
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106795
Joachim Schlosser
The requirements that the automotive industry places on system and software development tools differ from those of other customers. The key issues are the heterogeneity of suppliers, tools, and the engineers' technical backgrounds, and, partly as a result, the overall complexity of the systems being built. Multiple suppliers deliver multiple programs and units, and all of these must be integrated into a car that has to meet a huge number of constraints regarding safety, reliability, and consumer demands. This paper shows what the design of electric and electronic car systems is and has to be like, and what requirements the methodology and the process therefore have to meet. From these two points, a collection of requirements for the tools and the tool chain is derived, with a special focus on simulation tools.
{"title":"Requirements for automotive system engineering tools","authors":"Joachim Schlosser","doi":"10.1109/ICCD.2002.1106795","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106795","url":null,"abstract":"The requirements to system and software development tools brought up by the automotive industry differ from the requirements that other customers have. The important catchwords here are heterogeneity of suppliers, tools, technical background of the engineers, and - partially resulting from the just mentioned - the overall complexity of the systems that are built up. There are multiple suppliers delivering multiple programs and units, and all these are to be integrated into a car that has to meet a huge number of constraints regarding safety, reliability and consumer demands. This paper shows what the design of electric and electronic car systems is and has to be like, and what qualifications the methodology and the process therefore has to meet. From these two points a collection of requirements to the tools and the tool chain is derived, with a special focus on simulation tools.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129810650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Physical design challenges for billion transistor chips
Pub Date: 2002-09-16  DOI: 10.1109/ICCD.2002.1106751
P. Groeneveld
Advancing process technology will necessitate an even more rigorous automation of the IC design trajectory. The design scale will increase with Moore's law, approaching 1,000,000,000 transistors in the coming years. This enables the design of SoC systems with complexities unprecedented in human history. At the same time, the physics of silicon manufacturing is increasing the 'silicon complexity': additional design steps are required to address crosstalk, voltage drop, antenna rules, and other effects. Much more so than in previous technology nodes, the effects of parasitics must be addressed at various stages of the IC design flow. Nothing less than full automation of these silicon complexity issues is required to stop the design productivity gap from growing.
{"title":"Physical design challenges for billion transistor chips","authors":"P. Groeneveld","doi":"10.1109/ICCD.2002.1106751","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106751","url":null,"abstract":"Advancing process technology will necessitate and even more rigorous automation of the IC design trajectory. The design scale will increase with Moore's law, approaching 1,000,000,000 transistors in the coming years. This enables the design of SoC systems with complexities unprecedented unhuman history. At the same time the physics of silicon manufacturing is increasing the 'silicon complexity'. Additional design steps are required to address cross talk, voltage drop, antenna rules and others. Much more so than in previous technology nodes, the effects of parasitics must be addressed at various stages of the IC design flow. Nothing less than a full automation of the silicon complexity issues is required to stop the design productivity gap from growing.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123807600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}