Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106748
Christoph Scholl, B. Becker
We consider the problem of checking whether an implementation that contains parts with incomplete information is equivalent to a given full specification. We study implementations that are not completely specified but contain boxes associated with incompletely specified functions (called Incompletely Specified Boxes, or IS-Boxes). After motivating the use of implementations with Incompletely Specified Boxes, we define our notion of equivalence for this kind of implementation and present a method to solve the problem. A series of experimental results demonstrates the effectiveness and feasibility of the methods presented.
Title: Checking equivalence for circuits containing incompletely specified boxes
Published in: Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors
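The question posed in the abstract above can be illustrated on a toy circuit. The following is a minimal brute-force sketch, not the authors' method: it assumes a box is given as a truth table whose don't-care entries are `None`, and it reports whether some completion, and whether every completion, of the box makes the implementation match the specification. The abstract does not say which of these notions the paper adopts, and the names `spec`, `impl`, and `partial_box` are hypothetical.

```python
from itertools import product

def completions(partial):
    """Yield every total truth table that agrees with `partial`.
    Entries mapped to None are don't-cares and may become 0 or 1."""
    free = [key for key, value in partial.items() if value is None]
    for bits in product((0, 1), repeat=len(free)):
        total = dict(partial)
        total.update(zip(free, bits))
        yield total

def matches_spec(spec, impl, box, n_inputs):
    """True if impl (with the box filled in) equals spec on all input vectors."""
    return all(spec(*x) == impl(box, *x)
               for x in product((0, 1), repeat=n_inputs))

def check(spec, impl, partial_box, n_inputs):
    """Return (some completion matches, every completion matches)."""
    results = [matches_spec(spec, impl, box, n_inputs)
               for box in completions(partial_box)]
    return any(results), all(results)

# Hypothetical specification: 3-input majority.
def spec(a, b, c):
    return int(a + b + c >= 2)

# Hypothetical implementation: out = (a AND (b OR c)) OR box(b, c).
def impl(box, a, b, c):
    return (a & (b | c)) | box[(b, c)]

# The box is fixed on three input combinations and unspecified on (0, 1).
partial_box = {(0, 0): 0, (0, 1): None, (1, 0): 0, (1, 1): 1}

some_ok, all_ok = check(spec, impl, partial_box, 3)
print(some_ok, all_ok)
```

Enumeration is exponential in both the number of inputs and the number of don't-cares, so this only works on toy examples; it is meant to make the problem statement concrete, nothing more.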
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106749
F. Aloul, I. Markov, K. Sakallah
Boolean functions are fundamental to synthesis and verification of digital logic, and compact representations of Boolean functions have great practical significance. Popular representations, such as CNF, DNF, circuits and ROBDDs [4], offer different advantages and are preferred for different tasks. Conversion between those representations is common, especially when one is used to represent the input and another speeds up relevant algorithms. Our work addresses the construction of ROBDDs that represent outputs of a given Boolean circuit, a step used in synthesis and verification. Earlier works (Fujita, Fujisawa, and Kawato, 1988; Malik et al., 1988) proposed ordering circuit inputs and gates by graph traversals. We contribute orderings based on circuit partitioning and placement, leveraging the progress in recursive bisection and multi-level min-cut partitioning achieved in the late 1990s. Our empirical results show that the proposed orderings based on circuit partitioning and placement are more successful than straightforward DFS and BFS, as well as related heuristics.
Title: Improving the efficiency of circuit-to-BDD conversion by gate and input ordering
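For context, the traversal-based baseline mentioned in the abstract above can be sketched as a depth-first walk from the circuit outputs that records primary inputs in the order they are first reached. This is a generic illustration of a DFS ordering, not the partitioning-based method the paper contributes; the netlist format and the names `gates` and `dfs_input_order` are assumptions made for the example.

```python
def dfs_input_order(gates, outputs):
    """Order primary inputs by first visit in a depth-first walk from the outputs.

    `gates` maps a gate name to the list of its fanin names; any name that is
    not a key of `gates` is treated as a primary input.
    """
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        if node not in gates:      # primary input: record it
            order.append(node)
            return
        for fanin in gates[node]:  # gate: descend into its fanins
            visit(fanin)

    for out in outputs:
        visit(out)
    return order

# Hypothetical five-input netlist with two outputs, g3 and g4.
gates = {
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
    "g4": ["g2", "e"],
}
order = dfs_input_order(gates, ["g3", "g4"])
print(order)  # ['a', 'b', 'c', 'd', 'e']
```

The resulting list could be handed to a BDD package as its variable order. Inputs that feed the same gates end up adjacent, which is the intuition behind traversal-based heuristics.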
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106793
A. Hossain, D. Pease, James S. Burns, N. Parveen
The instruction fetch mechanism is a performance bottleneck of a superscalar processor. Fetch performance can be improved with the aid of an instruction memory structure known as a Trace Cache. This paper presents parameters and analytical expressions that describe the instruction fetch performance of a Trace Cache microarchitecture. The instruction fetch rates predicted by the expressions differ by seven percent from the simulated fetch rates for SPEC2000 benchmark programs. The analytical expressions are implemented in a computer program named Tulip, which is used to explore the parameters and their influence on fetch performance, and to understand Trace Cache performance tradeoffs.
Title: Trace Cache performance parameters
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106765
B. Chappell, Xinning Wang, Priyadarsan Patra, Prashant Saxena, J. Vendrell, Satyanarayan Gupta, S. Varadarajan, W. Gomes, S. Hussain, H. Krishnamurthy, M. Venkateshmurthy, S. Jain
The system structure and a taped-out 0.18 µm, 2 GHz product application result are described for a domino synthesis capability that covers all aspects of domino design, from estimation to silicon-ready layout, with custom-class optimization. The described optimization flow, abstraction modes, and key cost factors deliver power-optimized, noise-correct domino performance on complex logic.
Title: A system-level solution to domino synthesis with 2 GHz application
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106758
T. Thorp, D. Liu, P. Trivedi
For dynamic circuits to operate correctly, their inputs must be monotonically rising during evaluation. Blocking dynamic circuits satisfy this constraint by delaying evaluation until all inputs have been properly set up relative to the evaluation clock. By viewing dynamic gates as latches, we demonstrate that the optimal delay of a blocking dynamic gate may occur when the setup time is negative. With blocking dynamic circuits, cascading low-skew dynamic gates allows each dynamic gate to tolerate a degraded input level. The larger noise margin provides greater flexibility in the delay vs. noise margin trade-off (i.e., the circuit robustness vs. speed trade-off). This paper generalizes blocking dynamic circuits and provides a systematic approach for assigning clock phases, given delay and noise margin constraints. Using this framework, one can analyze any logic network consisting of blocking dynamic circuits.
Title: Analysis of blocking dynamic circuits
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106777
M. Annavaram, T. Diep, John Paul Shen
This paper presents a detailed branch characterization of an Oracle-based commercial on-line transaction processing workload, Oracle Database Benchmark (ODB), running on an IA32 processor. We ran a well-tuned ODB on Simics, a full system simulator, to collect the instruction traces used in this study. We compare the branch behavior of ODB with the branch behaviors of gcc, gzip and mcf from the SPECINT 2000 benchmark suite. Contrary to the popular belief that databases have unpredictable branches, we show that with larger predictors that capture enough branch history information, and with branch prediction schemes that reduce aliasing, conditional branches in ODB are more predictable than in gcc, gzip and mcf. Due to frequent context switching in ODB, a hardware return address stack is ineffective in predicting return addresses for ODB. Based on further analysis, we propose and evaluate an enhanced return address predictor, which reduces return address mispredictions in ODB by 40%.
Title: Branch behavior of a commercial OLTP workload on Intel IA32 processors
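The observation in the abstract above, that context switching undermines a hardware return address stack, can be made concrete with a small simulation. This is a minimal sketch of a generic fixed-depth return address stack shared by all threads, not the enhanced predictor the paper proposes; the class name, the event format, and the addresses are hypothetical.

```python
class ReturnAddressStack:
    """Fixed-depth stack of predicted return addresses, shared by all threads."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # full: drop the oldest entry
        self.stack.append(return_addr)

    def on_return(self):
        """Predict the return target; None if the stack is empty."""
        return self.stack.pop() if self.stack else None

def count_mispredictions(events, depth=8):
    """events is a list of ("call", return_addr) or ("ret", actual_target)."""
    ras = ReturnAddressStack(depth)
    misses = 0
    for kind, addr in events:
        if kind == "call":
            ras.on_call(addr)
        elif ras.on_return() != addr:
            misses += 1
    return misses

# One thread, properly nested calls: every return is predicted.
single = [("call", 0x10), ("call", 0x20), ("ret", 0x20), ("ret", 0x10)]

# Two threads interleaved by a context switch between call and return:
# thread A calls, thread B calls, then A returns before B does.
interleaved = [("call", 0x10), ("call", 0xA0), ("ret", 0x10), ("ret", 0xA0)]

print(count_mispredictions(single))       # 0
print(count_mispredictions(interleaved))  # 2
```

In the interleaved trace the stack is popped in an order that no longer mirrors any single thread's call nesting, so both returns are mispredicted.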
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106824
Panit Watcharawitch, S. Moore
Embedded processors are increasingly deployed in applications requiring high performance with good real-time characteristics whilst being low power. Parallelism has to be extracted in order to improve the performance at an architectural level. Extracting instruction-level parallelism requires extensive speculation, which adds complexity and increases power consumption. Alternatively, parallelism can be provided at the thread level. Many embedded applications can be written in a threaded manner in Java, which can be directly translated to use hardware-level multithreaded operations. This paper presents an architectural study of JMA, a high-performance multithreaded architecture which supports Java-multithreading and real-time scheduling whilst remaining low-power.
Title: JMA: the Java-multithreading architecture for embedded processors
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106746
Xiaojian Yang, Bo-Kyung Choi, M. Sarrafzadeh
In this paper we study the correlation between wirelength and routability for the standard-cell placement problem under the modern place-and-route environment. We present a placement tool named Dragon (version 2.1), and show its ability to produce good-quality placements for designs with high row utilization. Compared to an industrial placer and an academic state-of-the-art placer, Dragon can produce placements with better routability and shorter total wirelength. We describe many novel algorithmic and implementation details of this placement tool. Experimental results show that minimizing wirelength improves routability and layout quality.
Title: A standard-cell placement tool for designs with high row utilization
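Since the abstract above treats total wirelength as the quantity being minimized, a small example of how wirelength is commonly estimated during placement may help. The sketch below computes half-perimeter wirelength (HPWL), a widely used bounding-box estimate; the abstract does not state which wirelength model Dragon uses, so this is an assumption, and the cell names and coordinates are hypothetical.

```python
def hpwl(nets, positions):
    """Total half-perimeter wirelength of a placement.

    `nets` is a list of nets, each a list of cell names; `positions` maps a
    cell name to its (x, y) location. Each net contributes the half-perimeter
    of the bounding box around its cells.
    """
    total = 0.0
    for cells in nets:
        xs = [positions[c][0] for c in cells]
        ys = [positions[c][1] for c in cells]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical three-cell placement with one two-pin and one three-pin net.
positions = {"u1": (0, 0), "u2": (3, 1), "u3": (1, 4)}
nets = [["u1", "u2"], ["u1", "u2", "u3"]]

total = hpwl(nets, positions)
print(total)  # 11.0  (4 for the first net, 7 for the second)
```

A placer would evaluate a cost like this after each candidate move and keep moves that lower it, subject to whatever congestion or row-utilization terms the tool adds.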
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106763
L. Scheffer
As processes shrink, gate delay improves much faster than the delay in long wires. Therefore, the long wires increasingly determine the maximum clock rate, and hence performance, of more and more chips. One solution to this problem is to pipeline the global interconnect, enabling the whole chip to run at the speed of local operations. While known to work well, this optimization is seldom used because of practical difficulties - it is hard to change the RTL, test vectors become invalid, and it's hard to prove correctness of any changes. Here we look at some ways these difficulties could be overcome.
Title: Methodologies and tools for pipelined on-chip interconnect
Pub Date: 2002-09-16 DOI: 10.1109/ICCD.2002.1106740
H. P. Hofstee
Power dissipation and power density have become first-order design constraints, even for high-performance systems. For future designs, power will be the dominant constraint. In this paper we suggest a systematic approach to optimizing a processor design under (only) a power constraint. The approach uses the energy-performance ratio (EPR) of the various design parameters as the key to identifying opportunities for improving energy efficiency.
Title: Power-constrained microprocessor design