Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106763
L. Scheffer
As processes shrink, gate delay improves much faster than the delay in long wires. Therefore, the long wires increasingly determine the maximum clock rate, and hence performance, of more and more chips. One solution to this problem is to pipeline the global interconnect, enabling the whole chip to run at the speed of local operations. While known to work well, this optimization is seldom used because of practical difficulties - it is hard to change the RTL, test vectors become invalid, and it's hard to prove correctness of any changes. Here we look at some ways these difficulties could be overcome.
{"title":"Methodologies and tools for pipelined on-chip interconnect","authors":"L. Scheffer","doi":"10.1109/ICCD.2002.1106763","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106763","url":null,"abstract":"As processes shrink, gate delay improves much faster than the delay in long wires. Therefore, the long wires increasingly determine the maximum clock rate, and hence performance, of more and more chips. One solution to this problem is to pipeline the global interconnect, enabling the whole chip to run at the speed of local operations. While known to work well, this optimization is seldom used because of practical difficulties - it is hard to change the RTL, test vectors become invalid, and it's hard to prove correctness of any changes. Here we look at some ways these difficulties could be overcome.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"195 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116105560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106777
M. Annavaram, T. Diep, John Paul Shen
This paper presents a detailed branch characterization of an Oracle based commercial on-line transaction processing workload, Oracle Database Benchmark (ODB), running on an IA32 processor. We ran a well-tuned ODB on Simics, a full system simulator, to collect the instruction traces used in this study. We compare the branch behavior of ODB with the branch behaviors of gcc, gzip and mcf from the SPECINT 2000 benchmark suite. Contrary to the popular belief that databases have unpredictable branches, we show that using larger predictors that capture enough branch history information, and using branch prediction schemes that reduce aliasing, conditional branches in ODB are more predictable than in gcc, gzip and mcf Due to frequent context switching in ODB, a hardware return address stack is ineffective in predicting return addresses for ODB. Based on further analysis, we propose and evaluate an enhanced return address predictor, which reduces return address mispredictions in ODB by 40%.
{"title":"Branch behavior of a commercial OLTP workload on Intel IA32 processors","authors":"M. Annavaram, T. Diep, John Paul Shen","doi":"10.1109/ICCD.2002.1106777","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106777","url":null,"abstract":"This paper presents a detailed branch characterization of an Oracle based commercial on-line transaction processing workload, Oracle Database Benchmark (ODB), running on an IA32 processor. We ran a well-tuned ODB on Simics, a full system simulator, to collect the instruction traces used in this study. We compare the branch behavior of ODB with the branch behaviors of gcc, gzip and mcf from the SPECINT 2000 benchmark suite. Contrary to the popular belief that databases have unpredictable branches, we show that using larger predictors that capture enough branch history information, and using branch prediction schemes that reduce aliasing, conditional branches in ODB are more predictable than in gcc, gzip and mcf Due to frequent context switching in ODB, a hardware return address stack is ineffective in predicting return addresses for ODB. Based on further analysis, we propose and evaluate an enhanced return address predictor, which reduces return address mispredictions in ODB by 40%.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116897515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106740
H. P. Hofstee
Power dissipation and power density have become first-order design constraints, even for high-performance systems. For future designs it will be the dominant constraint. In this paper we suggest a systematic approach to optimizing a processor design under (only) a power constraint. The approach uses the energy-performance ratio (EPR) of the various design parameters as the key to identifying opportunities for improving energy-efficiency.
{"title":"Power-constrained microprocessor design","authors":"H. P. Hofstee","doi":"10.1109/ICCD.2002.1106740","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106740","url":null,"abstract":"Power dissipation and power density have become first-order design constraints, even for high-performance systems. For future designs it will be the dominant constraint. In this paper we suggest a systematic approach to optimizing a processor design under (only) a power constraint. The approach uses the energy-performance ratio (EPR) of the various design parameters as the key to identifying opportunities for improving energy-efficiency.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132269001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106754
S. Morioka, Akashi Satoh
In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0. 13 /spl mu/m CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delays of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.
{"title":"A 10 Gbps full-AES crypto design with a twisted-BDD S-Box architecture","authors":"S. Morioka, Akashi Satoh","doi":"10.1109/ICCD.2002.1106754","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106754","url":null,"abstract":"In this paper, we present a high-speed AES IP-core, which runs at 780 MHz on a 0. 13 /spl mu/m CMOS standard cell library, and which achieves 10 Gbps throughput in all encryption modes, including CBC mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining techniques cannot be applied. To reduce the propagation delays of the S-Box, the most critical function block, we developed a special circuit architecture that we call twisted-BDD, where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126260936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106745
Chang-Tzu Lin, De-Sheng Chen, Yiwen Wang
In this paper, we propose a new representation of VLSI floorplan and building block problem. The representation is the generalization of Polish expression. By proposing a new relational operator, the representation can efficiently reuse some area that cannot be utilized if only having vertical and horizontal operators defined in Polish expression, and is able to present non-slicing structural floorplan. The experimental results show that the representation achieves promising area utilization in commonly used MCNC benchmark circuits.
{"title":"GPE: a new representation for VLSI floorplan problem","authors":"Chang-Tzu Lin, De-Sheng Chen, Yiwen Wang","doi":"10.1109/ICCD.2002.1106745","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106745","url":null,"abstract":"In this paper, we propose a new representation of VLSI floorplan and building block problem. The representation is the generalization of Polish expression. By proposing a new relational operator, the representation can efficiently reuse some area that cannot be utilized if only having vertical and horizontal operators defined in Polish expression, and is able to present non-slicing structural floorplan. The experimental results show that the representation achieves promising area utilization in commonly used MCNC benchmark circuits.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127267843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106795
Joachim Schlosser
The requirements to system and software development tools brought up by the automotive industry differ from the requirements that other customers have. The important catchwords here are heterogeneity of suppliers, tools, technical background of the engineers, and - partially resulting from the just mentioned - the overall complexity of the systems that are built up. There are multiple suppliers delivering multiple programs and units, and all these are to be integrated into a car that has to meet a huge number of constraints regarding safety, reliability and consumer demands. This paper shows what the design of electric and electronic car systems is and has to be like, and what qualifications the methodology and the process therefore has to meet. From these two points a collection of requirements to the tools and the tool chain is derived, with a special focus on simulation tools.
{"title":"Requirements for automotive system engineering tools","authors":"Joachim Schlosser","doi":"10.1109/ICCD.2002.1106795","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106795","url":null,"abstract":"The requirements to system and software development tools brought up by the automotive industry differ from the requirements that other customers have. The important catchwords here are heterogeneity of suppliers, tools, technical background of the engineers, and - partially resulting from the just mentioned - the overall complexity of the systems that are built up. There are multiple suppliers delivering multiple programs and units, and all these are to be integrated into a car that has to meet a huge number of constraints regarding safety, reliability and consumer demands. This paper shows what the design of electric and electronic car systems is and has to be like, and what qualifications the methodology and the process therefore has to meet. From these two points a collection of requirements to the tools and the tool chain is derived, with a special focus on simulation tools.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129810650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106766
Hongbo Yang, R. Govindarajan, G. Gao, K. B. Theobald
The drastic increase in power consumption by modern processors emphasizes the need for power-performance trade-offs in architecture design space exploration and compiler optimizations. This paper reports a quantitative study on the power-performance trade-offs in software pipelined schedules for an Itanium-like EPIC architecture with dual-speed pipelines, in which functional units are partitioned into fast ones and slow ones. We have developed an integer linear programming formulation to capture the power-performance tradeoffs for software pipelined loops. The proposed integer linear programming formulation and its solution method have been implemented and tested on a set of SPEC2000 benchmarks. The results are compared with an Itanium-like architecture (baseline) in which there are four functional units (FUs) and all of them are fast units. Our quantitative study reveals that by introducing a few slow FUs in place of fast FUs in the baseline architecture, the total energy consumed by FUs can be considerably reduced. When 2 out of 4 FUs are set as slow, the total energy consumed by FUs is reduced by up to 31.1% (with an average reduction of 25.2%) compared with the baseline configuration, while the performance degradation caused by using slow FUs is small. If performance demand is less critical, then energy reduction of up to 40.3% compared with the baseline configuration can be achieved.
{"title":"Power-performance trade-offs for energy-efficient architectures: A quantitative study","authors":"Hongbo Yang, R. Govindarajan, G. Gao, K. B. Theobald","doi":"10.1109/ICCD.2002.1106766","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106766","url":null,"abstract":"The drastic increase in power consumption by modern processors emphasizes the need for power-performance trade-offs in architecture design space exploration and compiler optimizations. This paper reports a quantitative study on the power-performance trade-offs in software pipelined schedules for an Itanium-like EPIC architecture with dual-speed pipelines, in which functional units are partitioned into fast ones and slow ones. We have developed an integer linear programming formulation to capture the power-performance tradeoffs for software pipelined loops. The proposed integer linear programming formulation and its solution method have been implemented and tested on a set of SPEC2000 benchmarks. The results are compared with an Itanium-like architecture (baseline) in which there are four functional units (FUs) and all of them are fast units. Our quantitative study reveals that by introducing a few slow FUs in place of fast FUs in the baseline architecture, the total energy consumed by FUs can be considerably reduced. When 2 out of 4 FUs are set as slow, the total energy consumed by FUs is reduced by up to 31.1% (with an average reduction of 25.2%) compared with the baseline configuration, while the performance degradation caused by using slow FUs is small. If performance demand is less critical, then energy reduction of up to 40.3% compared with the baseline configuration can be achieved.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126100608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106791
Yen-Jen Chang, F. Lai, S. Ruan
For physical caches, the address translation delay can be partially masked, but it is hard to avoid completely. In this paper, we propose a cache partition architecture, called paged cache, which not only masks the address translation delay completely but also reduces the tag area dramatically. In the paged cache, we divide the entire cache into a set of partitions, and each partition is dedicated to only one page cached in the TLB. By restricting the range in which the cached block can be placed, we can eliminate the total or partial tag depending on the partition size. In addition, because the paged cache can be accessed without waiting for the generation of physical address, i.e., the paged cache and the TLB are accessed in parallel, the extended cache access time can be reduced significantly. We use SimpleScalar to simulate SPEC2000 benchmarks and perform HSPICE simulations (with a 0.18 /spl mu/m technology and 1.8 V voltage supply) to evaluate the proposed architecture. Experimental results show that the paged cache is very effective in reducing tag area of the on-chip Ll caches, while the average extended cache access time can be improved dramatically.
{"title":"Cache design for eliminating the address translation bottleneck and reducing the tag area cost","authors":"Yen-Jen Chang, F. Lai, S. Ruan","doi":"10.1109/ICCD.2002.1106791","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106791","url":null,"abstract":"For physical caches, the address translation delay can be partially masked, but it is hard to avoid completely. In this paper, we propose a cache partition architecture, called paged cache, which not only masks the address translation delay completely but also reduces the tag area dramatically. In the paged cache, we divide the entire cache into a set of partitions, and each partition is dedicated to only one page cached in the TLB. By restricting the range in which the cached block can be placed, we can eliminate the total or partial tag depending on the partition size. In addition, because the paged cache can be accessed without waiting for the generation of physical address, i.e., the paged cache and the TLB are accessed in parallel, the extended cache access time can be reduced significantly. We use SimpleScalar to simulate SPEC2000 benchmarks and perform HSPICE simulations (with a 0.18 /spl mu/m technology and 1.8 V voltage supply) to evaluate the proposed architecture. Experimental results show that the paged cache is very effective in reducing tag area of the on-chip Ll caches, while the average extended cache access time can be improved dramatically.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131585799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106760
José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera
An analysis of the tradeoffs between area and speed for a sequential implementation of a high-radix recurrence for logarithm computation is presented in this paper The high-radix algorithm is outlined and a sequential architecture is proposed, with the use of selection by rounding of the digits and redundant representation. Estimates of the execution time and total area are obtained for n = 16, 32 and 64 bits of precision and for radix values from r = 8 to r = 1024. An analysis of the tradeoffs between area and speed is presented, showing that the most efficient implementations are obtained for radices r = 256 for 16, 32 bit and r = 128 for 64 bit computations.
{"title":"Analysis of the tradeoffs for the implementation of a high-radix logarithm","authors":"José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera","doi":"10.1109/ICCD.2002.1106760","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106760","url":null,"abstract":"An analysis of the tradeoffs between area and speed for a sequential implementation of a high-radix recurrence for logarithm computation is presented in this paper The high-radix algorithm is outlined and a sequential architecture is proposed, with the use of selection by rounding of the digits and redundant representation. Estimates of the execution time and total area are obtained for n = 16, 32 and 64 bits of precision and for radix values from r = 8 to r = 1024. An analysis of the tradeoffs between area and speed is presented, showing that the most efficient implementations are obtained for radices r = 256 for 16, 32 bit and r = 128 for 64 bit computations.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132837188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2002-09-16DOI: 10.1109/ICCD.2002.1106751
P. Groeneveld
Advancing process technology will necessitate and even more rigorous automation of the IC design trajectory. The design scale will increase with Moore's law, approaching 1,000,000,000 transistors in the coming years. This enables the design of SoC systems with complexities unprecedented unhuman history. At the same time the physics of silicon manufacturing is increasing the 'silicon complexity'. Additional design steps are required to address cross talk, voltage drop, antenna rules and others. Much more so than in previous technology nodes, the effects of parasitics must be addressed at various stages of the IC design flow. Nothing less than a full automation of the silicon complexity issues is required to stop the design productivity gap from growing.
{"title":"Physical design challenges for billion transistor chips","authors":"P. Groeneveld","doi":"10.1109/ICCD.2002.1106751","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106751","url":null,"abstract":"Advancing process technology will necessitate and even more rigorous automation of the IC design trajectory. The design scale will increase with Moore's law, approaching 1,000,000,000 transistors in the coming years. This enables the design of SoC systems with complexities unprecedented unhuman history. At the same time the physics of silicon manufacturing is increasing the 'silicon complexity'. Additional design steps are required to address cross talk, voltage drop, antenna rules and others. Much more so than in previous technology nodes, the effects of parasitics must be addressed at various stages of the IC design flow. Nothing less than a full automation of the silicon complexity issues is required to stop the design productivity gap from growing.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123807600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}