Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528926
T. Chou, K. Roy
We present an exact and an approximate method for estimating signal activity at the internal nodes of sequential logic circuits. The methodology takes spatial and temporal correlations of logic signals into consideration. Given the state transition graph (STG) of a finite state machine (FSM), we create an extended state transition graph (ESTG), where the temporal correlations of the input signals are explicitly represented. From the graph we derive the equations to calculate exact signal probabilities and activities. For large circuits an approximate method for calculating the activities by unrolling the next state logic is proposed. Experimental results show that if temporal and spatial correlations are not considered, the switching activities of the internal nodes can be off by more than 40% compared to simulation based techniques. However, the results of the approximate method proposed in the paper is within 5% of logic simulation results.
{"title":"Estimation of sequential circuit activity considering spatial and temporal correlations","authors":"T. Chou, K. Roy","doi":"10.1109/ICCD.1995.528926","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528926","url":null,"abstract":"We present an exact and an approximate method for estimating signal activity at the internal nodes of sequential logic circuits. The methodology takes spatial and temporal correlations of logic signals into consideration. Given the state transition graph (STG) of a finite state machine (FSM), we create an extended state transition graph (ESTG), where the temporal correlations of the input signals are explicitly represented. From the graph we derive the equations to calculate exact signal probabilities and activities. For large circuits an approximate method for calculating the activities by unrolling the next state logic is proposed. Experimental results show that if temporal and spatial correlations are not considered, the switching activities of the internal nodes can be off by more than 40% compared to simulation based techniques. However, the results of the approximate method proposed in the paper is within 5% of logic simulation results.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130446464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528814
G. Holt, A. Tyagi
This paper reports our experiences with incorporating energy (or switched capacitance) based algorithms into an automated layout synthesis system based on standard cells. Our experimental results show an average savings of 18.5% in interconnect energy at a cost of about 6.2% area increase relative to area-minimized layouts on MCNC Logic Synthesis '93 benchmarks. The basic premise is that the wires with high switching should be made short even if it involves stretching several low switching wires. We modified an existing layout system, VPNR, to include these techniques during the placement and global routing phases. Attempts to include switching probabilities into channel routing did not produce appreciable results. Our experiments also lend insight into the composition of the solution space for VLSI energy minimization problems.
{"title":"EPNR: an energy-efficient automated layout synthesis package","authors":"G. Holt, A. Tyagi","doi":"10.1109/ICCD.1995.528814","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528814","url":null,"abstract":"This paper reports our experiences with incorporating energy (or switched capacitance) based algorithms into an automated layout synthesis system based on standard cells. Our experimental results show an average savings of 18.5% in interconnect energy at a cost of about 6.2% area increase relative to area-minimized layouts on MCNC Logic Synthesis '93 benchmarks. The basic premise is that the wires with high switching should be made short even if it involves stretching several low switching wires. We modified an existing layout system, VPNR, to include these techniques during the placement and global routing phases. Attempts to include switching probabilities into channel routing did not produce appreciable results. Our experiments also lend insight into the composition of the solution space for VLSI energy minimization problems.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129217425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528789
K. Yun, D. Dill
We describe the design of a high performance asynchronous SCSI (small computer systems interface) controller data path and the associated control circuits. The data path is an asynchronous pipeline and the control circuits for the data path are built out of extended burst-mode machines. This design is functionally compatible with a widely used commercial SCSI controller and was simulated correctly with respect to all of the applicable test vectors used for the commercial design. The technology used for this design is a 0.8 /spl mu/m CMOS standard cell. The performance is limited by the SCSI specification, not the design itself, and the area is competitive with the commercial design. This design improves the data transfer throughput by up to 2.5 times from previous work by incorporating a FIFO and a distributed control scheme based on extended burst-mode state machines.
{"title":"A high-performance asynchronous SCSI controller","authors":"K. Yun, D. Dill","doi":"10.1109/ICCD.1995.528789","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528789","url":null,"abstract":"We describe the design of a high performance asynchronous SCSI (small computer systems interface) controller data path and the associated control circuits. The data path is an asynchronous pipeline and the control circuits for the data path are built out of extended burst-mode machines. This design is functionally compatible with a widely used commercial SCSI controller and was simulated correctly with respect to all of the applicable test vectors used for the commercial design. The technology used for this design is a 0.8 /spl mu/m CMOS standard cell. The performance is limited by the SCSI specification, not the design itself, and the area is competitive with the commercial design. This design improves the data transfer throughput by up to 2.5 times from previous work by incorporating a FIFO and a distributed control scheme based on extended burst-mode state machines.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131438935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528941
H. Hsieh, Wentai Liu, R. Cavin, C. T. Gray
Many techniques have been proposed to optimize digital system timing. Each technique can be advantageous in particular applications, however they are most often applied individually rather than concurrently. The framework presented here allows for concurrent timing optimization using retiming, intentional clock skew, and wave pipelining for latch-based designed systems with single or multi-phase clocking. This optimization is formulated as a mixed integer linear program. Our integrated framework also includes a new optimization technique called resynchronization which allows for the insertion of latches in the shortest paths and thus avoids race conditions. Our work has been applied to several designs and is able to significantly reduce the clock period.
{"title":"Concurrent timing optimization of latch-based digital systems","authors":"H. Hsieh, Wentai Liu, R. Cavin, C. T. Gray","doi":"10.1109/ICCD.1995.528941","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528941","url":null,"abstract":"Many techniques have been proposed to optimize digital system timing. Each technique can be advantageous in particular applications, however they are most often applied individually rather than concurrently. The framework presented here allows for concurrent timing optimization using retiming, intentional clock skew, and wave pipelining for latch-based designed systems with single or multi-phase clocking. This optimization is formulated as a mixed integer linear program. Our integrated framework also includes a new optimization technique called resynchronization which allows for the insertion of latches in the shortest paths and thus avoids race conditions. Our work has been applied to several designs and is able to significantly reduce the clock period.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132230045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528797
K. Shimamura, Shigeya Tanaka, Tetsuya Shimomura, T. Hotta, E. Kamada, H. Sawamoto, Teruhisa Shimizu, K. Nakazawa
A novel architectural extension, in which floating-point data are transferred directly from main memory to floating-point registers, has been successfully implemented in a superscalar RISC processor. This extension allows main memory access throughput of 1.2 Gbyte/s, and effective performance reaches 267 MFLOPS (89% of the peak performance) for typical floating-point applications. The processor utilizes 0.3-micron 4-level metal CMOS technology with 2.5 V power supply and contains 3.9 million transistors in 15.7 mm/spl times/15.7 mm die size. Only 4.5% of the die area is used for the extension. Pipeline stage optimization and scoreboard-based dependency check method allow the extension to be realized without affecting the operating frequency.
{"title":"A superscalar RISC processor with pseudo vector processing feature","authors":"K. Shimamura, Shigeya Tanaka, Tetsuya Shimomura, T. Hotta, E. Kamada, H. Sawamoto, Teruhisa Shimizu, K. Nakazawa","doi":"10.1109/ICCD.1995.528797","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528797","url":null,"abstract":"A novel architectural extension, in which floating-point data are transferred directly from main memory to floating-point registers, has been successfully implemented in a superscalar RISC processor. This extension allows main memory access throughput of 1.2 Gbyte/s, and effective performance reaches 267 MFLOPS (89% of the peak performance) for typical floating-point applications. The processor utilizes 0.3-micron 4-level metal CMOS technology with 2.5 V power supply and contains 3.9 million transistors in 15.7 mm/spl times/15.7 mm die size. Only 4.5% of the die area is used for the extension. Pipeline stage optimization and scoreboard-based dependency check method allow the extension to be realized without affecting the operating frequency.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125066723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528934
M. Hsiao, J. Patel
A new technique is proposed to handle fault simulation at the architectural level. The technique bypasses the need for complete gate level structure and efficiently uses the architectural information. Symbolic data representing groups of stuck at faults, known as fault effects, are propagated across the circuit with intelligent propagation prediction. Fault effects may combine and form new groups in the process. Automated behavioral simulation using only three data types is used to propagate fault effects at the architectural level by propagation prediction; no additional high level constraints or precomputation of faulty behavior are needed for simulation. Although not a fully deterministic algorithm, the results of ALFSIM, Architectural Level Fault Simulation, show high accuracy when compared with the gate level fault simulation.
{"title":"A new architectural-level fault simulation using propagation prediction of grouped fault-effects","authors":"M. Hsiao, J. Patel","doi":"10.1109/ICCD.1995.528934","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528934","url":null,"abstract":"A new technique is proposed to handle fault simulation at the architectural level. The technique bypasses the need for complete gate level structure and efficiently uses the architectural information. Symbolic data representing groups of stuck at faults, known as fault effects, are propagated across the circuit with intelligent propagation prediction. Fault effects may combine and form new groups in the process. Automated behavioral simulation using only three data types is used to propagate fault effects at the architectural level by propagation prediction; no additional high level constraints or precomputation of faulty behavior are needed for simulation. Although not a fully deterministic algorithm, the results of ALFSIM, Architectural Level Fault Simulation, show high accuracy when compared with the gate level fault simulation.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125373919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528806
Yu Fang, A. Albicki
We propose a novel testability enhancement scheme based on XOR Chain Structure. The structure is effective for improving both controllability and observability. The insertion points are selected by fast testability analysis and random pattern resistant node source tracking. Experiments with ISCAS85 benchmark circuits show that the scheme is effective. The incurred hardware overhead and performance penalty is relatively low.
{"title":"Efficient testability enhancement for combinational circuit","authors":"Yu Fang, A. Albicki","doi":"10.1109/ICCD.1995.528806","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528806","url":null,"abstract":"We propose a novel testability enhancement scheme based on XOR Chain Structure. The structure is effective for improving both controllability and observability. The insertion points are selected by fast testability analysis and random pattern resistant node source tracking. Experiments with ISCAS85 benchmark circuits show that the scheme is effective. The incurred hardware overhead and performance penalty is relatively low.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124444177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528788
M. Greenstreet
STARI is a high-speed signaling technique that uses both synchronous and self-timed circuits. To demonstrate STARI, a chip has been fabricated using the MOSIS 2/spl mu/ CMOS process. In a simple test fixture, it operates at data rates of 120 Mbits/sec over a pair of wires. Because STARl uses both synchronous and self-timed circuits, it provides an opportunity to compare these two design methods. The synchronous circuits of the STARI chip achieve rates of operation two to three times those of the self-timed circuits. However, the self-timed FIFO in the receiver provides robust compensation for clock skew that could not be achieved with synchronous circuitry alone. Thus, the STARI chip demonstrates advantages of combining these two design techniques.
{"title":"Implementing a STARI chip","authors":"M. Greenstreet","doi":"10.1109/ICCD.1995.528788","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528788","url":null,"abstract":"STARI is a high-speed signaling technique that uses both synchronous and self-timed circuits. To demonstrate STARI, a chip has been fabricated using the MOSIS 2/spl mu/ CMOS process. In a simple test fixture, it operates at data rates of 120 Mbits/sec over a pair of wires. Because STARl uses both synchronous and self-timed circuits, it provides an opportunity to compare these two design methods. The synchronous circuits of the STARI chip achieve rates of operation two to three times those of the self-timed circuits. However, the self-timed FIFO in the receiver provides robust compensation for clock skew that could not be achieved with synchronous circuitry alone. Thus, the STARI chip demonstrates advantages of combining these two design techniques.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122704374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528916
Chi-Hung Chi, Siu-Chung Lau
In the latest processor architectures such as IBM PowerPC and HP Precision Architecture (PA), it is found that certain important compound opcodes such as LOAD-UPDATE and LOAD-MODIFY contain accurate information about how data will be referenced in the near future. Furthermore, these opcodes have been fully utilized by the compiler in the program code generation. With the migration of data cache onto the processor chip, it is now possible for the on-chip cache controller to perform intelligent data prefetching based on the information from the instruction decode unit. In this paper, a novel hardware-driven data prefetching scheme, called the Instruction Opcode-Based Prefetching (IOBP), is proposed. Our simulation shows that this IOBP scheme is very effective in reducing processor stall time due to memory accesses, especially for array or pointer references with constant strides.
{"title":"Reducing data access penalty using intelligent opcode-driven cache prefetching","authors":"Chi-Hung Chi, Siu-Chung Lau","doi":"10.1109/ICCD.1995.528916","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528916","url":null,"abstract":"In the latest processor architectures such as IBM PowerPC and HP Precision Architecture (PA), it is found that certain important compound opcodes such as LOAD-UPDATE and LOAD-MODIFY contain accurate information about how data will be referenced in the near future. Furthermore, these opcodes have been fully utilized by the compiler in the program code generation. With the migration of data cache onto the processor chip, it is now possible for the on-chip cache controller to perform intelligent data prefetching based on the information from the instruction decode unit. In this paper, a novel hardware-driven data prefetching scheme, called the Instruction Opcode-Based Prefetching (IOBP), is proposed. Our simulation shows that this IOBP scheme is very effective in reducing processor stall time due to memory accesses, especially for array or pointer references with constant strides.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114636509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-10-02DOI: 10.1109/ICCD.1995.528822
T. Yokota, H. Matsuoka, K. Okamoto, Hideo Hirono, A. Hori, S. Sakai
The RWC-1 is a massively parallel computer based on a multi-threaded architecture. This architecture requires extremely high communication performance with reasonable hardware cost. ln this paper, we first introduce a new class of direct interconnection networks called MDCE (Multidimensional Directed Cycles Ensemble extension). MDCE has many desirable features for RWC-1 including small degree, low latency, and high throughput. MDCE is thus adopted for a RWC-1 network. We have designed an MDCE router and fabricated an experimental VLSI chip. We explain the design details in this paper. The chip employs operating system support features as well as communication functions, and enables advanced resource management, A prototype chip with about 125,000 gates has been fabricated using 0.6-/spl mu/m CMOS gate array technology. Its clock runs at 50 MHz and a transmission rate of 300 M bytes per second per communication port is achieved.
{"title":"A prototype router for the massively parallel computer RWC-1","authors":"T. Yokota, H. Matsuoka, K. Okamoto, Hideo Hirono, A. Hori, S. Sakai","doi":"10.1109/ICCD.1995.528822","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528822","url":null,"abstract":"The RWC-1 is a massively parallel computer based on a multi-threaded architecture. This architecture requires extremely high communication performance with reasonable hardware cost. ln this paper, we first introduce a new class of direct interconnection networks called MDCE (Multidimensional Directed Cycles Ensemble extension). MDCE has many desirable features for RWC-1 including small degree, low latency, and high throughput. MDCE is thus adopted for a RWC-1 network. We have designed an MDCE router and fabricated an experimental VLSI chip. We explain the design details in this paper. The chip employs operating system support features as well as communication functions, and enables advanced resource management, A prototype chip with about 125,000 gates has been fabricated using 0.6-/spl mu/m CMOS gate array technology. Its clock runs at 50 MHz and a transmission rate of 300 M bytes per second per communication port is achieved.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126868076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}