[1992] Proceedings of the International Conference on Application Specific Array Processors

Advanced technology for improved signal processor efficiency
E. Swartzlander
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218567
Wafer scale integration technology offers the promise of implementing application specific processors with significantly higher data rates, lower power, and smaller size than conventional VLSI implementations. Wafer scale integration replaces most of the signal lines between chips with intra-wafer lines that exhibit one to two orders of magnitude less stray capacitance, so they may be driven at higher rates while consuming much less power. Application specific processors implemented with regular arrays of processing elements are attractive because their regularity simplifies the design, fabrication, and circumvention of faulty elements. This paper shows that one dimensional systolic arrays are more attractive for this application than other regular architectures. It also shows that (1:N) and (M:N) pooled sparing at the macrocell level is feasible to overcome the defects inherent in the fabrication process. Finally, an example design for a systolic FFT processor illustrates the wafer scale implementation of a signal processor.

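The feasibility claim for pooled sparing can be illustrated with a standard binomial yield model (a sketch assuming independent, identically distributed macrocell defects; the paper's own yield analysis may use a different defect model):

```python
from math import comb

def pooled_sparing_yield(n_needed, n_spares, p_good):
    """Probability that at least n_needed of the n_needed + n_spares
    fabricated macrocells in a pool are defect-free, assuming
    independent defects with per-cell survival probability p_good."""
    total = n_needed + n_spares
    return sum(comb(total, k) * p_good**k * (1 - p_good)**(total - k)
               for k in range(n_needed, total + 1))

no_spare = pooled_sparing_yield(8, 0, 0.90)   # plain array, no redundancy
one_spare = pooled_sparing_yield(8, 1, 0.90)  # (1:N) sparing, here N = 8
two_spare = pooled_sparing_yield(8, 2, 0.90)  # (M:N) sparing with M = 2
print(no_spare, one_spare, two_spare)
```

Even one spare per pool raises the pool yield substantially (from roughly 0.43 to about 0.77 for these illustrative numbers), which is the economic argument for macrocell-level sparing on a wafer.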
On partitioning of multistage algorithms and design of intermediate memories
M. Sauer, E. Bernard, J. Nossek
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218579
Partitioning of a class of algorithms with global data dependencies, called multistage algorithms, is investigated. Partitioning requires intermediate results of computations of a specific block of the partition to be stored in an intermediate memory. Furthermore, a decomposition of the global interconnection structure of the algorithm is necessary. The authors outline a design methodology for intermediate memories that perform the data rearrangements required by the interconnection relation and consist of locally connected synchronous modules. Additionally, procedures for deriving control signals for the intermediate memory are presented, which can serve as a basis for control minimization.

Associative information processing: algorithms and system
Werner Pöchmüller, A. König, M. Glesner
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218546
Associative systems provide flexibility far beyond the scope of a conventional associative memory, which simply performs a parallel search over a large set of keywords to retrieve associated information. This paper presents several approaches to associative data processing. Algorithms are discussed that can easily be implemented or supported on an array computer. Using dedicated VLSI chips, a prototype array computer was implemented at Darmstadt University of Technology. Together with simulations on conventional sequential computers, this array computer serves to prove the validity of the developed algorithms on a running system.

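For contrast with the associative systems the paper develops, the conventional associative memory it mentions can be modelled in a few lines (a purely conceptual sketch; real content-addressable memory hardware performs all keyword comparisons simultaneously):

```python
def associative_lookup(entries, key, mask=None):
    """Return the data of every (keyword, data) entry whose keyword
    matches the search key; an optional bit mask restricts the
    comparison to selected bit positions."""
    return [data for keyword, data in entries
            if (keyword if mask is None else keyword & mask)
               == (key if mask is None else key & mask)]

cam = [(0b1010, "A"), (0b1011, "B"), (0b0110, "C")]
print(associative_lookup(cam, 0b1010))               # exact-match search
print(associative_lookup(cam, 0b1000, mask=0b1100))  # masked search
```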
An integrated system for rapid prototyping of high performance algorithm specific data paths
D. Chen, L. Guerra, E. Ng, M. Potkonjak, D. P. Schultz, J. Rabaey
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218576
A system has been developed that targets the rapid prototyping of high performance data computation units typical of real-time digital signal processing applications. The hardware platform of the system is a family of multiprocessor integrated circuits. The prototype chip of this family contains 8 processors connected via a dynamically controlled crossbar switch. With a maximum clock rate of 25 MHz, it can support a computation rate of 200 MIPS and can sustain a data I/O bandwidth of 400 Mbyte/s. An assembler and simulator provide low-level programmability of the hardware. A compiler that accepts input in the high-level data flow language Silage and performs estimation, transformations, partitioning, assignment, and scheduling before generating assembly code provides an automated software compilation path.

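The quoted performance figures are mutually consistent under the natural assumption of one operation per processor per clock cycle; a quick arithmetic check:

```python
processors = 8
clock_hz = 25e6                     # 25 MHz maximum clock rate

# 8 processors each completing one operation per cycle -> 200 MIPS.
mips = processors * clock_hz / 1e6
assert mips == 200

# Sustaining 400 Mbyte/s at 25 MHz means 16 bytes of I/O per cycle
# (e.g. four 32-bit words per clock, an illustrative breakdown).
bytes_per_cycle = 400e6 / clock_hz
assert bytes_per_cycle == 16
```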
ARREST: an interactive graphic analysis tool for VLSI arrays
W. Burleson, Bongjin Jung
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218575
The authors present a graphical CAD tool, Array Estimator (ARREST), for VLSI array architectures. In real VLSI arrays, piecewise regular computations are spread across space and time and occur at a fine grain, which can make visualization quite difficult. Consequently, a graphical interface environment is desirable to enhance the design, verification, and analysis of VLSI arrays by providing feedback at all levels of the design process. ARREST reads a high level description of structured VLSI algorithms in terms of affine recurrence equations (AREs) and permits a broad range of transformations on the algorithm. The system does not target a fully automated design process; instead, it provides a designer with a means to systematically explore various array architectures and evaluate design trade-offs between VLSI cost and performance. To give a human designer better insight into the design process, ARREST uses the Xt/MOTIF window system for graphics and interfaces to the Cadence VERILOG simulator.

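As a small illustration of the input class ARREST accepts (not an example from the paper), a sliding-window correlation can be written as an affine recurrence equation: the data dependence x(i + j) is an affine, but not uniform, function of the index vector (i, j):

```python
def correlate_as_are(h, x):
    """Evaluate the ARE  Y(i, j) = Y(i, j-1) + h(j) * x(i + j)
    with Y(i, -1) = 0 over the index space
    0 <= i <= len(x) - len(h), 0 <= j < len(h)."""
    y = []
    for i in range(len(x) - len(h) + 1):
        acc = 0                          # Y(i, -1) = 0
        for j in range(len(h)):
            acc = acc + h[j] * x[i + j]  # Y(i, j) from Y(i, j-1)
        y.append(acc)                    # y(i) = Y(i, len(h)-1)
    return y

print(correlate_as_are([1, 2], [1, 1, 1, 1]))  # -> [3, 3, 3]
```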
Heterogeneous digital signal processing systems for sonar
T. E. Curtis
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218564
Current operational UK sonars use processors with throughputs in excess of five hundred million arithmetic operations per second. An increase of several orders of magnitude in computing power is required to maintain long range surveillance capabilities in the 1990s, and within the next decade typical applications will need throughputs approaching one million million (10^12) arithmetic operations per second, significantly greater than that currently achieved with fifth generation computers. This paper discusses some of the problems in realising systems with this level of performance.

On cycle borrowing analyses for interconnected chips driven by clocks having different but commensurable speeds
G. Jennings
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218580
The author considers the construction of synchronous systems having components driven at different rates by different, but commensurable, clocks. Furthermore, these systems are to be constructed using level-sensitive latches with the intent of exploiting cycle borrowing over the entire system. The author presents a framework in which the entire system is managed as a single clocked entity and investigates a timing analysis technique for such systems. Results for small examples are presented. The interface between such chips is studied; no resynchronizers are required. Alternate clock waveforms, and their effect on analysis complexity, are discussed.

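Commensurability is what makes the single-clocked-entity view possible: all clock periods are integer multiples of a common base period, and the joint clock pattern repeats with the least common multiple of the periods. A sketch of that computation (function name and nanosecond units are illustrative; multi-argument gcd/lcm needs Python 3.9+):

```python
from fractions import Fraction
from math import gcd, lcm

def base_and_hyperperiod(periods_ns):
    """For commensurable clock periods, return the common base period
    (GCD, the finest shared time grid) and the hyperperiod (LCM, after
    which the joint clock pattern repeats)."""
    fracs = [Fraction(p) for p in periods_ns]
    den = lcm(*(f.denominator for f in fracs))
    nums = [f.numerator * (den // f.denominator) for f in fracs]
    return Fraction(gcd(*nums), den), Fraction(lcm(*nums), den)

# Two chips clocked at 25 MHz (40 ns) and 12.5 MHz (80 ns):
base, hyper = base_and_hyperperiod([40, 80])
print(base, hyper)
```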
A transformative approach to the partitioning of processor arrays
J. Teich, L. Thiele
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218585
The paper describes the systematic design of processor arrays with a given dimension and a given number of processing elements. The unified approach to the solution of this problem, called partitioning, is based on the following concepts: (1) Algorithms and processor arrays are represented by (piecewise regular) programs. (2) The concept of stepwise refinement of programs is used to solve the partitioning problem by applying a sequence of provably correct program transformations. In contrast to other approaches, nonperfect tilings may be considered. The parameters of the introduced program transformations enable the realization of different partitioning schemes. (3) It is shown that the class of piecewise regular programs is closed under partitioning.

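One member of the family of partitioning schemes such transformations can express is locally sequential, globally parallel (LSGP) tiling; a one-dimensional sketch (illustrative notation, not the paper's) shows how a nonperfect tiling arises whenever the iteration count is not a multiple of the processor count:

```python
def lsgp_partition(n_iterations, n_pe):
    """Map iteration i to (processor, local time): each PE executes one
    contiguous tile sequentially while all PEs run in parallel."""
    tile = -(-n_iterations // n_pe)          # ceil division: tile size
    return [(i // tile, i % tile) for i in range(n_iterations)]

mapping = lsgp_partition(10, 4)              # tile size 3; last tile partial
print(mapping)
```

Here the last processor receives a single iteration, so the tiling is nonperfect; the point of provably correct transformations is that the derived program stays correct in exactly this situation.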
Parallel architecture for a pel-recursive motion estimation algorithm
Emmanuel D. Frimout, J. Driessen, E. Deprettere
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218545
The paper presents a parallel architecture for a pel-recursive motion estimation algorithm. It is a linear array of processors, each consisting of an initialization part, a data-routing part and an updating part. The initialization part performs a prediction of the motion vector. The routing parts constitute a routing path along which previous-frame data is routed from processors that store such data to processors that request it. A clocked version of the router is presented in some detail. The updating part calculates an update to the predicted motion vector. The proposed architecture is derived in a systematic way and is parameterized with respect to certain window sizes. It is thus completely different from the few existing pel-recursive motion estimation architectures.

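The update step itself is not specified in the abstract; a classic rule it could resemble is the Netravali-Robbins steepest-descent refinement of the displaced frame difference (DFD), sketched here as an assumption, not as the paper's actual algorithm:

```python
def refine_motion_vector(cur, prev, x, y, d, eps=0.1):
    """One pel-recursive refinement: move the predicted vector d = (dx, dy)
    downhill on the squared displaced frame difference at pel (x, y).
    Frames are row-major lists of lists; d points from prev to cur."""
    dx, dy = round(d[0]), round(d[1])
    px, py = x - dx, y - dy                           # displaced pel in prev
    dfd = cur[y][x] - prev[py][px]                    # displaced frame difference
    gx = (prev[py][px + 1] - prev[py][px - 1]) / 2.0  # central-difference
    gy = (prev[py + 1][px] - prev[py - 1][px]) / 2.0  # spatial gradients
    return (d[0] - eps * dfd * gx, d[1] - eps * dfd * gy)

# A horizontal ramp shifted one pel to the right: starting from (0, 0),
# the refined vector moves toward the true displacement (1, 0).
prev = [[float(c) for c in range(8)] for _ in range(8)]
cur = [[float(max(c - 1, 0)) for c in range(8)] for _ in range(8)]
print(refine_motion_vector(cur, prev, 4, 4, (0.0, 0.0)))
```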
Pipelining: just another transformation
M. Potkonjak, J. Rabaey
Pub Date: 1992-08-04. DOI: 10.1109/ASAP.1992.218574
A simple formulation of pipelining: 'Pipelining with N stages is equivalent to retiming where the number of delays on all inputs or all outputs, but not both, is increased by N' is used as the basis for a convenient and efficient treatment of pipelining in the design of application specific computers. A classification of pipelining according to the optimization goal (throughput and resource utilization) and the latency is introduced. For pipelining classes of polynomial complexity, optimal algorithms are presented. For the other classes, both proofs of NP-completeness and efficient probabilistic algorithms are presented. Both theoretical and experimental properties of pipelining are discussed. In particular, the relationship with other transformations is explored. Due to the close relationship between software pipelining and the pipelining presented here, all results can easily be adapted for use in compilers for general purpose computers. As a side result, the exact solution for the iteration bound is derived.

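The quoted formulation translates almost literally into code; in this minimal sketch (illustrative data structure, not from the paper) a dataflow graph is a map from edges to register counts, and pipelining by N adds N delays to every edge leaving a primary input:

```python
def pipeline(edges, primary_inputs, n_stages):
    """Retime a dataflow graph {(src, dst): delay_count} by placing
    n_stages extra delays on every edge leaving a primary input."""
    return {(u, v): d + (n_stages if u in primary_inputs else 0)
            for (u, v), d in edges.items()}

# First-order IIR  y[k] = a * x[k] + y[k-1]: the feedback self-edge
# on the adder carries the one recurrence delay.
edges = {("in", "mul"): 0, ("mul", "add"): 0,
         ("add", "add"): 1, ("add", "out"): 0}
piped = pipeline(edges, {"in"}, 2)
print(piped)
```

Note that the delay count inside the feedback loop is untouched: pipelining, unlike general retiming, cannot shorten cycles, which is why the iteration bound limits what it can achieve.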