Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528926
T. Chou, K. Roy
We present an exact and an approximate method for estimating signal activity at the internal nodes of sequential logic circuits. The methodology takes spatial and temporal correlations of logic signals into consideration. Given the state transition graph (STG) of a finite state machine (FSM), we create an extended state transition graph (ESTG), where the temporal correlations of the input signals are explicitly represented. From the graph we derive the equations to calculate exact signal probabilities and activities. For large circuits, an approximate method for calculating the activities by unrolling the next-state logic is proposed. Experimental results show that if temporal and spatial correlations are not considered, the switching activities of the internal nodes can be off by more than 40% compared to simulation-based techniques. However, the results of the approximate method proposed in the paper are within 5% of logic simulation results.
{"title":"Estimation of sequential circuit activity considering spatial and temporal correlations","authors":"T. Chou, K. Roy","doi":"10.1109/ICCD.1995.528926","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528926","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
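The effect of ignoring temporal correlation can be illustrated with a small sketch. This is not the ESTG method of the paper; it only contrasts a measured switching activity with the estimate 2p(1-p) that follows from assuming successive values of a signal are independent. The function names and the toggling test signal are assumptions chosen for illustration.

```python
def signal_probability(bits):
    """Fraction of cycles in which the signal is 1."""
    return sum(bits) / len(bits)


def switching_activity(bits):
    """Fraction of consecutive cycle pairs in which the signal changes value."""
    changes = sum(a != b for a, b in zip(bits, bits[1:]))
    return changes / (len(bits) - 1)


def independent_estimate(p):
    """Activity predicted if successive values were temporally independent."""
    return 2 * p * (1 - p)


# Hypothetical signal: the low-order bit of a free-running counter,
# which toggles on every cycle and is therefore strongly correlated in time.
toggle = [i % 2 for i in range(1000)]

p = signal_probability(toggle)
measured = switching_activity(toggle)
estimated = independent_estimate(p)
print(p, measured, estimated)
```

For this signal the independence assumption predicts half the transitions that actually occur, which is the kind of discrepancy the abstract attributes to neglected correlations.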
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528789
K. Yun, D. Dill
We describe the design of a high-performance asynchronous SCSI (small computer systems interface) controller data path and the associated control circuits. The data path is an asynchronous pipeline and the control circuits for the data path are built out of extended burst-mode machines. This design is functionally compatible with a widely used commercial SCSI controller and was simulated correctly with respect to all of the applicable test vectors used for the commercial design. The technology used for this design is a 0.8 µm CMOS standard cell. The performance is limited by the SCSI specification, not the design itself, and the area is competitive with the commercial design. This design improves the data transfer throughput by up to 2.5 times over previous work by incorporating a FIFO and a distributed control scheme based on extended burst-mode state machines.
{"title":"A high-performance asynchronous SCSI controller","authors":"K. Yun, D. Dill","doi":"10.1109/ICCD.1995.528789","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528789","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528941
H. Hsieh, Wentai Liu, R. Cavin, C. T. Gray
Many techniques have been proposed to optimize digital system timing. Each technique can be advantageous in particular applications; however, they are most often applied individually rather than concurrently. The framework presented here allows for concurrent timing optimization using retiming, intentional clock skew, and wave pipelining for latch-based systems with single- or multi-phase clocking. This optimization is formulated as a mixed integer linear program. Our integrated framework also includes a new optimization technique called resynchronization, which allows for the insertion of latches in the shortest paths and thus avoids race conditions. Our work has been applied to several designs and is able to significantly reduce the clock period.
{"title":"Concurrent timing optimization of latch-based digital systems","authors":"H. Hsieh, Wentai Liu, R. Cavin, C. T. Gray","doi":"10.1109/ICCD.1995.528941","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528941","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528832
L. Cherkasova, V. Kotov, Tomas Rokicki
The fibre channel standard, developed by the ANSI X3T9.3 task group, defines a serial I/O channel for interconnecting a number of peripheral devices and computer systems. In this paper we consider how fibre channel switches can be cascaded to form a fibre channel fabric. We begin with an analytical model of topology performance that provides a theoretical upper bound on fabric performance and a method for the practical evaluation of fabric topologies. Next, we present simulation results for a single fibre channel switch having 16 ports and a specific high-level architecture. Finally, we consider cascades of this switch, and discuss some subtleties, such as different routing strategies, deadlocks and unfairness.
{"title":"Designing fibre channel fabrics","authors":"L. Cherkasova, V. Kotov, Tomas Rokicki","doi":"10.1109/ICCD.1995.528832","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528832","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528838
C. Wey, Haiyan Wang, Cheng-Ping Wang
This paper presents a self-timed converter circuit which converts an n-digit redundant binary number to an (n+1)-bit binary number. Self-timed refers to the fact that the conversion is problem-dependent and requires variable conversion time to complete the operation. The propagation delay of the proposed converter circuit does not increase with the number of digits to be converted, but it is determined by the maximum number of consecutive 0's in that number. This study shows that the statistical upper bound of the average maximum number of consecutive 0's is log₃ n, or 3.78 for 64 digits. This implies that the proposed self-timed circuit can be approximately 17 times faster than the ripple-type converter. Thus the proposed converter is well-suited to high-speed, long-word digital arithmetic processors.
{"title":"A self-timed redundant-binary number to binary number converter for digital arithmetic processors","authors":"C. Wey, Haiyan Wang, Cheng-Ping Wang","doi":"10.1109/ICCD.1995.528838","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528838","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
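The figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes, as a simplification not stated in the abstract, that the ripple-type converter's delay is proportional to n while the self-timed delay is proportional to the log base 3 bound; `longest_zero_run` is a hypothetical helper showing the quantity that the abstract says determines the delay.

```python
import math


def longest_zero_run(digits):
    """Length of the longest run of consecutive 0 digits."""
    longest = current = 0
    for d in digits:
        current = current + 1 if d == 0 else 0
        longest = max(longest, current)
    return longest


n = 64
bound = math.log(n, 3)   # log base 3 of 64, roughly 3.786
speedup = n / bound      # ratio under the proportional-delay assumption

# A redundant binary number uses digits from {-1, 0, 1}.
example = [1, 0, 0, -1, 0, 0, 0, 1]
print(bound, speedup, longest_zero_run(example))
```

The computed bound is close to the 3.78 quoted in the abstract, and the ratio n / log₃ n rounds to the quoted factor of 17.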
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528822
T. Yokota, H. Matsuoka, K. Okamoto, Hideo Hirono, A. Hori, S. Sakai
The RWC-1 is a massively parallel computer based on a multi-threaded architecture. This architecture requires extremely high communication performance with reasonable hardware cost. In this paper, we first introduce a new class of direct interconnection networks called MDCE (Multidimensional Directed Cycles Ensemble extension). MDCE has many desirable features for RWC-1, including small degree, low latency, and high throughput. MDCE is thus adopted for an RWC-1 network. We have designed an MDCE router and fabricated an experimental VLSI chip. We explain the design details in this paper. The chip employs operating system support features as well as communication functions, and enables advanced resource management. A prototype chip with about 125,000 gates has been fabricated using 0.6-µm CMOS gate array technology. Its clock runs at 50 MHz and a transmission rate of 300 Mbytes per second per communication port is achieved.
{"title":"A prototype router for the massively parallel computer RWC-1","authors":"T. Yokota, H. Matsuoka, K. Okamoto, Hideo Hirono, A. Hori, S. Sakai","doi":"10.1109/ICCD.1995.528822","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528822","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528916
Chi-Hung Chi, Siu-Chung Lau
In the latest processor architectures such as IBM PowerPC and HP Precision Architecture (PA), it is found that certain important compound opcodes such as LOAD-UPDATE and LOAD-MODIFY contain accurate information about how data will be referenced in the near future. Furthermore, these opcodes have been fully utilized by the compiler in the program code generation. With the migration of data cache onto the processor chip, it is now possible for the on-chip cache controller to perform intelligent data prefetching based on the information from the instruction decode unit. In this paper, a novel hardware-driven data prefetching scheme, called the Instruction Opcode-Based Prefetching (IOBP), is proposed. Our simulation shows that this IOBP scheme is very effective in reducing processor stall time due to memory accesses, especially for array or pointer references with constant strides.
{"title":"Reducing data access penalty using intelligent opcode-driven cache prefetching","authors":"Chi-Hung Chi, Siu-Chung Lau","doi":"10.1109/ICCD.1995.528916","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528916","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
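A rough sketch of why a load-with-update opcode carries stride information: such an instruction computes the effective address as base plus displacement and writes that address back to the base register, so the next execution of the same instruction will access the address one displacement further on. The code below is not the IOBP scheme itself, whose details the abstract does not give; `run_load_update_loop` and its prefetch set are hypothetical, and the model ignores cache capacity and timing.

```python
def run_load_update_loop(start, stride, count):
    """Model a loop of load-with-update accesses with a next-address prefetch.

    Returns the list of accessed addresses and the number of accesses
    that had already been prefetched.
    """
    base = start - stride      # base register value before the first access
    prefetched = set()         # addresses requested ahead of use (assumed unbounded)
    accessed = []
    hits = 0
    for _ in range(count):
        ea = base + stride     # effective address = base + displacement
        if ea in prefetched:
            hits += 1
        accessed.append(ea)
        base = ea              # the update step: base register takes the new address
        prefetched.add(ea + stride)  # opcode implies the next access is one stride on
    return accessed, hits


accessed, hits = run_load_update_loop(0x1000, 8, 4)
print([hex(a) for a in accessed], hits)
```

In this model only the first access misses; every later access in the constant-stride walk was predicted from the preceding instruction.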
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528914
S. Dutta, W. Wolf, A. Wolfe
This paper addresses the design of memory-system architectures for video signal processors. The memory subsystem is the bottleneck of most video computing systems and demands a careful analysis of the design tradeoffs related to area, cycle time, and utilization. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture, particularly that of a video processor, and present a method whereby the conceptual organization of the memory architecture can be evaluated before a detailed design is undertaken. Our analysis suggests that the organization of an efficient memory hierarchy for video signal processors is different from the register-cache based hierarchy of general-purpose programmable microprocessors.
{"title":"VLSI issues in memory-system design for video signal processors","authors":"S. Dutta, W. Wolf, A. Wolfe","doi":"10.1109/ICCD.1995.528914","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528914","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528934
M. Hsiao, J. Patel
A new technique is proposed to handle fault simulation at the architectural level. The technique bypasses the need for a complete gate-level structure and efficiently uses the architectural information. Symbolic data representing groups of stuck-at faults, known as fault effects, are propagated across the circuit with intelligent propagation prediction. Fault effects may combine and form new groups in the process. Automated behavioral simulation using only three data types is used to propagate fault effects at the architectural level by propagation prediction; no additional high-level constraints or precomputation of faulty behavior are needed for simulation. Although ALFSIM (Architectural Level Fault Simulation) is not a fully deterministic algorithm, its results show high accuracy when compared with gate-level fault simulation.
{"title":"A new architectural-level fault simulation using propagation prediction of grouped fault-effects","authors":"M. Hsiao, J. Patel","doi":"10.1109/ICCD.1995.528934","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528934","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}
Pub Date: 1995-10-02 DOI: 10.1109/ICCD.1995.528788
M. Greenstreet
STARI is a high-speed signaling technique that uses both synchronous and self-timed circuits. To demonstrate STARI, a chip has been fabricated using the MOSIS 2 µm CMOS process. In a simple test fixture, it operates at data rates of 120 Mbits/sec over a pair of wires. Because STARI uses both synchronous and self-timed circuits, it provides an opportunity to compare these two design methods. The synchronous circuits of the STARI chip achieve rates of operation two to three times those of the self-timed circuits. However, the self-timed FIFO in the receiver provides robust compensation for clock skew that could not be achieved with synchronous circuitry alone. Thus, the STARI chip demonstrates advantages of combining these two design techniques.
{"title":"Implementing a STARI chip","authors":"M. Greenstreet","doi":"10.1109/ICCD.1995.528788","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528788","journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors"},"publicationDate":"1995-10-02"}