Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528797
K. Shimamura, Shigeya Tanaka, Tetsuya Shimomura, T. Hotta, E. Kamada, H. Sawamoto, Teruhisa Shimizu, K. Nakazawa
A novel architectural extension, in which floating-point data are transferred directly from main memory to floating-point registers, has been successfully implemented in a superscalar RISC processor. This extension allows main memory access throughput of 1.2 Gbyte/s, and effective performance reaches 267 MFLOPS (89% of the peak performance) for typical floating-point applications. The processor utilizes 0.3-micron 4-level metal CMOS technology with 2.5 V power supply and contains 3.9 million transistors in 15.7 mm/spl times/15.7 mm die size. Only 4.5% of the die area is used for the extension. Pipeline stage optimization and scoreboard-based dependency check method allow the extension to be realized without affecting the operating frequency.
{"title":"A superscalar RISC processor with pseudo vector processing feature","authors":"K. Shimamura, Shigeya Tanaka, Tetsuya Shimomura, T. Hotta, E. Kamada, H. Sawamoto, Teruhisa Shimizu, K. Nakazawa","doi":"10.1109/ICCD.1995.528797","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528797","url":null,"abstract":"A novel architectural extension, in which floating-point data are transferred directly from main memory to floating-point registers, has been successfully implemented in a superscalar RISC processor. This extension allows main memory access throughput of 1.2 Gbyte/s, and effective performance reaches 267 MFLOPS (89% of the peak performance) for typical floating-point applications. The processor utilizes 0.3-micron 4-level metal CMOS technology with 2.5 V power supply and contains 3.9 million transistors in 15.7 mm/spl times/15.7 mm die size. Only 4.5% of the die area is used for the extension. Pipeline stage optimization and scoreboard-based dependency check method allow the extension to be realized without affecting the operating frequency.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125066723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528806
Yu Fang, A. Albicki
We propose a novel testability enhancement scheme based on XOR Chain Structure. The structure is effective for improving both controllability and observability. The insertion points are selected by fast testability analysis and random pattern resistant node source tracking. Experiments with ISCAS85 benchmark circuits show that the scheme is effective. The incurred hardware overhead and performance penalty is relatively low.
{"title":"Efficient testability enhancement for combinational circuit","authors":"Yu Fang, A. Albicki","doi":"10.1109/ICCD.1995.528806","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528806","url":null,"abstract":"We propose a novel testability enhancement scheme based on XOR Chain Structure. The structure is effective for improving both controllability and observability. The insertion points are selected by fast testability analysis and random pattern resistant node source tracking. Experiments with ISCAS85 benchmark circuits show that the scheme is effective. The incurred hardware overhead and performance penalty is relatively low.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124444177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528909
H. Yamada, T. Hotta, T. Nishiyama, F. Murabayashi, T. Yamauchi, H. Sawamoto
One-bit pre-shifting before alignment shift, normalization with anticipated leading '1' bit and pre-rounding techniques have been developed for a floating-point arithmetic logic unit (ALU). In addition, carry select addition and pre-rounding techniques have been developed for a floating-point multiplier. A noise tolerant precharge (NTP) circuit was designed and applied to the ALU and multiplier. These techniques reduced the delay time of the critical path by 24%. Each unit was fabricated in 0.3 /spl mu/m 2.5 V four-layer-metal CMOS technology and achieved a two-cycle latency at 150 MHz.
{"title":"A 13.3ns double-precision floating-point ALU and multiplier","authors":"H. Yamada, T. Hotta, T. Nishiyama, F. Murabayashi, T. Yamauchi, H. Sawamoto","doi":"10.1109/ICCD.1995.528909","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528909","url":null,"abstract":"One-bit pre-shifting before alignment shift, normalization with anticipated leading '1' bit and pre-rounding techniques have been developed for a floating-point arithmetic logic unit (ALU). In addition, carry select addition and pre-rounding techniques have been developed for a floating-point multiplier. A noise tolerant precharge (NTP) circuit was designed and applied to the ALU and multiplier. These techniques reduced the delay time of the critical path by 24%. Each unit was fabricated in 0.3 /spl mu/m 2.5 V four-layer-metal CMOS technology and achieved a two-cycle latency at 150 MHz.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125698159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528804
J. Kumar, N. Strader, J. Freeman, Michael Miller
Large-scale hardware logic emulation using software configurable hardware provides a new means to significantly improve verification of complex integrated circuits such as today's advanced microprocessors. The essence of hardware logic emulation is the provision of a hardware prototype of the circuit being designed. Such a hardware prototype can execute both pseudo-random verification vectors and software application programs up to six orders-of-magnitude faster than conventional software logic simulators. Trillions of verification vectors can be run on the emulation model for verification in only a few weeks compared to the prior best practice of running only billions of verification vectors in many months. Application of hardware logic emulation requires a sound design methodology with an HDL model (RTL or at least gate-level), an unlimited source of vectors or software applications intended to exercise the design in a target system.
{"title":"Emulation verification of the Motorola 68060","authors":"J. Kumar, N. Strader, J. Freeman, Michael Miller","doi":"10.1109/ICCD.1995.528804","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528804","url":null,"abstract":"Large-scale hardware logic emulation using software configurable hardware provides a new means to significantly improve verification of complex integrated circuits such as today's advanced microprocessors. The essence of hardware logic emulation is the provision of a hardware prototype of the circuit being designed. Such a hardware prototype can execute both pseudo-random verification vectors and software application programs up to six orders-of-magnitude faster than conventional software logic simulators. Trillions of verification vectors can be run on the emulation model for verification in only a few weeks compared to the prior best practice of running only billions of verification vectors in many months. Application of hardware logic emulation requires a sound design methodology with an HDL model (RTL or at least gate-level), an unlimited source of vectors or software applications intended to exercise the design in a target system.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132861093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528943
C. Ykman-Couvreur, Bill Lin
This paper presents a new efficient state assignment framework for synthesizing asynchronous state graphs. This framework operates purely at the state graph level and is applicable to a broad class of behaviors. In this paper we focus the framework for solving the complete state coding problem. This method has been automated and applied to a large set of asynchronous circuits. It achieves significant improvements in terms of both circuit area and computation time.
{"title":"Efficient state assignment framework for asynchronous state graphs","authors":"C. Ykman-Couvreur, Bill Lin","doi":"10.1109/ICCD.1995.528943","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528943","url":null,"abstract":"This paper presents a new efficient state assignment framework for synthesizing asynchronous state graphs. This framework operates purely at the state graph level and is applicable to a broad class of behaviors. In this paper we focus the framework for solving the complete state coding problem. This method has been automated and applied to a large set of asynchronous circuits. It achieves significant improvements in terms of both circuit area and computation time.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133571029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528931
M. Amin, B. Vinnakota
Fault simulation is a compute intensive problem. Data parallel simulation on multiple processors is one method to reduce fault simulation time. We discuss a novel technique to partition the fault set for data parallel fault simulation. When applied statically, the technique can scale well for up to eight processors. The fault set partitioning technique is simple, can itself be parallelized, and can be implemented with extreme ease. Therefore, the technique can be used on a low cost parallel resource, such as a network of workstations.
{"title":"Data parallel fault simulation","authors":"M. Amin, B. Vinnakota","doi":"10.1109/ICCD.1995.528931","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528931","url":null,"abstract":"Fault simulation is a compute intensive problem. Data parallel simulation on multiple processors is one method to reduce fault simulation time. We discuss a novel technique to partition the fault set for data parallel fault simulation. When applied statically, the technique can scale well for up to eight processors. The fault set partitioning technique is simple, can itself be parallelized, and can be implemented with extreme ease. Therefore, the technique can be used on a low cost parallel resource, such as a network of workstations.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116318113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528840
Shashidhar Thakur, D. F. Wong
We address the technology mapping problem for lookup table FPGAs. The area minimization problem for mapping K-bounded networks, consisting of nodes with at most K inputs using K-input lookup tables is known to be NP-complete for K/spl ges/5. The complexity was unknown for K=2, 3, and 4. The corresponding delay minimization problem (under the constant delay model) was solved in polynomial time by the flow-map algorithm, for arbitrary values of K. We study the class of K-bounded networks, where all nodes have exactly K inputs. We call such networks K-exact. We give a characterization of mapping solutions for such networks. This leads to a polynomial time algorithm for computing the simultaneous area and delay minimum mapping for such networks using K-input lookup tables. We also show that the flow-map algorithm minimizes the area of the mapped network as well, for K-exact networks. We then show that for K=2 the mapping solution for a 2-bounded network, minimizing the area and delay simultaneously, can be easily obtained from that of a 2-exact network derived from it by eliminating single input nodes. Thus the area minimization problem for 2-input lookup tables can be solved in polynomial time, resolving an open problem.
{"title":"Simultaneous area and delay minimum K-LUT mapping for K-exact networks","authors":"Shashidhar Thakur, D. F. Wong","doi":"10.1109/ICCD.1995.528840","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528840","url":null,"abstract":"We address the technology mapping problem for lookup table FPGAs. The area minimization problem for mapping K-bounded networks, consisting of nodes with at most K inputs using K-input lookup tables is known to be NP-complete for K/spl ges/5. The complexity was unknown for K=2, 3, and 4. The corresponding delay minimization problem (under the constant delay model) was solved in polynomial time by the flow-map algorithm, for arbitrary values of K. We study the class of K-bounded networks, where all nodes have exactly K inputs. We call such networks K-exact. We give a characterization of mapping solutions for such networks. This leads to a polynomial time algorithm for computing the simultaneous area and delay minimum mapping for such networks using K-input lookup tables. We also show that the flow-map algorithm minimizes the area of the mapped network as well, for K-exact networks. We then show that for K=2 the mapping solution for a 2-bounded network, minimizing the area and delay simultaneously, can be easily obtained from that of a 2-exact network derived from it by eliminating single input nodes. Thus the area minimization problem for 2-input lookup tables can be solved in polynomial time, resolving an open problem.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116330029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528811
M. Allen, W. Lewchuk, J. Coddington
The PowerPC 620 microprocessor introduces a new integrated secondary cache controller and system bus interface. The secondary cache interface is 128 bits wide, supports L2 sizes from 1 MB to 128 MB, is ECC protected, can transfer 2.0 GB/sec at 133 MHz and supports an optional co-processor mode. The 620 bus is optimized for server-class systems requiring significant multiprocessing capability and supports the 64-bit PowerPC architecture with a 40-bit physical address bus and a separate 128-bit data bus. Address transfer rates of up to 33 M Addresses/sec at 66 MHz are achieved by pipelining the address snoop response with the address bus. The address and data buses are explicitly tagged allowing data transfers to be reordered with respect to the addresses. The data bus can transfer up to 1.0 GB/sec at 66 MHz. The bus protocol and the integrated L2 controller presented support the snoop-based MESI cache coherency protocol and direct cache-to-cache data transfers.
{"title":"A high performance bus and cache controller for PowerPC multiprocessing systems","authors":"M. Allen, W. Lewchuk, J. Coddington","doi":"10.1109/ICCD.1995.528811","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528811","url":null,"abstract":"The PowerPC 620 microprocessor introduces a new integrated secondary cache controller and system bus interface. The secondary cache interface is 128 bits wide, supports L2 sizes from 1 MB to 128 MB, is ECC protected, can transfer 2.0 GB/sec at 133 MHz and supports an optional co-processor mode. The 620 bus is optimized for server-class systems requiring significant multiprocessing capability and supports the 64-bit PowerPC architecture with a 40-bit physical address bus and a separate 128-bit data bus. Address transfer rates of up to 33 M Addresses/sec at 66 MHz are achieved by pipelining the address snoop response with the address bus. The address and data buses are explicitly tagged allowing data transfers to be reordered with respect to the addresses. The data bus can transfer up to 1.0 GB/sec at 66 MHz. The bus protocol and the integrated L2 controller presented support the snoop-based MESI cache coherency protocol and direct cache-to-cache data transfers.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121572342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528819
G. Cabodi, S. Quer, P. Camurati
Computing equivalence classes for finite state machines (FSMs) has several applications to synthesis and verification problems, like state minimization, automata reduction, and logic optimization with don't cares. Symbolic traversal techniques are applicable to medium-small circuits. This paper extends their use to large FSMs by means of cofactor-based enhancements to the state-of-the-art approaches and of underestimations of equivalence classes. The key to success is pruning the search space by constraining it. Experimental results on some of the larger ISCAS'89 and MCNC circuits show its applicability.
{"title":"Extending equivalence class computation to large FSMs","authors":"G. Cabodi, S. Quer, P. Camurati","doi":"10.1109/ICCD.1995.528819","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528819","url":null,"abstract":"Computing equivalence classes for finite state machines (FSMs) has several applications to synthesis and verification problems, like state minimization, automata reduction, and logic optimization with don't cares. Symbolic traversal techniques are applicable to medium-small circuits. This paper extends their use to large FSMs by means of cofactor-based enhancements to the state-of-the-art approaches and of underestimations of equivalence classes. The key to success is pruning the search space by constraining it. Experimental results on some of the larger ISCAS'89 and MCNC circuits show its applicability.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124297506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528919
Y. Hoskote, Dinos Moundanos, J. Abraham
Simulation is still the primary, although inadequate, resource for verifying the conformity of a design to its functional specification. Fortunately, most errors in the early stages of design involve only the control flow in the circuit. We define the functional coverage of a given sequence of verification vectors as the amount of control behavior exercised by them. We present a novel technique for automatically extracting the control flow of a design on the basis of the underlying mathematical model. Significantly, this extraction is independent of the circuit description style. The Extracted Control Flow Machine (ECFM) is then used for estimation of functional coverage and to provide information that will help the designer improve the quality of his or her tests.
{"title":"Automatic extraction of the control flow machine and application to evaluating coverage of verification vectors","authors":"Y. Hoskote, Dinos Moundanos, J. Abraham","doi":"10.1109/ICCD.1995.528919","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528919","url":null,"abstract":"Simulation is still the primary, although inadequate, resource for verifying the conformity of a design to its functional specification. Fortunately, most errors in the early stages of design involve only the control flow in the circuit. We define the functional coverage of a given sequence of verification vectors as the amount of control behavior exercised by them. We present a novel technique for automatically extracting the control flow of a design on the basis of the underlying mathematical model. Significantly, this extraction is independent of the circuit description style. The Extracted Control Flow Machine (ECFM) is then used for estimation of functional coverage and to provide information that will help the designer improve the quality of his or her tests.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115628103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}