Optimal design of lower dimensional processor arrays for uniform recurrences
K. Ganapathy, B. Wah
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218539
Abstract: The authors present a parameter-based approach for synthesizing systolic architectures from uniform recurrence equations. The scheme is a generalization of the parameter method proposed by G.J. Li and B.W. Wah (1985). The approach synthesizes optimal arrays of any lower dimension from a general uniform recurrence description of the problem. In previous attempts at mapping uniform recurrences onto lower-dimensional arrays, optimality of the resulting designs was not guaranteed. As an illustration of the technique, optimal linear arrays for matrix multiplication are given. A detailed design for solving path-finding problems is also presented.
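The linear-array matrix multiplication cited as an illustration can be pictured with a small cycle-accurate simulation. The schedule below is a generic, hypothetical one-dimensional mapping (not the paper's optimal parameter-derived design): PE j holds column j of B, and the elements of A march through the array one PE per cycle.

```python
import numpy as np

def linear_systolic_matmul(A, B):
    """Toy cycle-accurate model of a linear systolic array for C = A @ B.

    Assumed (illustrative) design: PE j stores column j of B; the
    elements of A enter PE 0 in row-major order and shift right one
    PE per cycle, so A[i, k] reaches PE j at cycle i*n + k + j.
    """
    n = A.shape[0]
    n_cycles = n * n + n          # until the last element leaves the last PE
    C = np.zeros((n, n))
    for t in range(n_cycles):
        for j in range(n):        # every PE fires once per cycle
            s = t - j             # which A-stream element is at PE j now
            if 0 <= s < n * n:
                i, k = divmod(s, n)
                C[i, j] += A[i, k] * B[k, j]   # one MAC per PE per cycle
    return C
```

Because each triple (i, k, j) is visited exactly once by the schedule, the simulated array reproduces the ordinary matrix product.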
Application and packaging of the AT&T DSP3 parallel signal processor
R. Shively, L. J. Wu
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218562
Abstract: Achieving the potential performance of highly parallel MIMD processor architectures is critically dependent on both the speed and routing capabilities of the network fabric. The routing network of the AT&T DSP3 processor is described, together with an indication of how the 40 megabyte/s links can be configured to meet diverse application requirements. Scaling to very large configurations is aided by compact packaging. Silicon-on-silicon multi-chip modules, together with a novel three-dimensional vertical interconnection technology, are being used to repackage the DSP3 into an ultra-dense processor.
Constant capacity signal flow signal processor architecture benchmark
H. Habereder, R. Harrison
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218563
Abstract: This paper describes the implementation and benchmark testing of a high-performance signal processor architecture based on the alternate low level primitive structures (ALPS) concept developed by the Naval Research Laboratory. The research shows that such digital signal processor architectures are not only feasible but provide a modular solution to a wide range of signal processing applications. In addition, the benchmark tests show that such architectures provide higher efficiency and lower data transfer network contention than existing global memory-based data flow architectures. The processor system consists of high-performance, fully programmable, embedded signal processors and controllers networked on a set of high-bandwidth busses to provide a processing capability far in excess of that offered by current systems. The modular array processor (MAP) is a networked multiprocessor with VLSI-based signal and control processing modules.
High speed bit-level pipelined architectures for redundant CORDIC implementation
H. Dawid, H. Meyr
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218559
Abstract: The CORDIC algorithm is well known as an efficient method for the computation of trigonometric/hyperbolic functions and vector rotations. The achievable throughput and the latency of CORDIC processors using conventional arithmetic are determined by the carry propagation occurring in additions/subtractions, since the CORDIC iterations are directed by the signs of intermediate results. Using a redundant number system, much higher throughput is possible due to the elimination of carry propagation, but an exact sign detection cannot be implemented efficiently. The authors derive transformations of the original CORDIC algorithm which result in partially fixed iteration sequences no longer dependent on intermediate signs, for the CORDIC vectoring mode as well as the rotation mode. Very fast and efficient carry-save architectures using redundant absolute value computation, resulting from the transformed algorithms, are described. A CORDIC processor (rotation mode) is presented as an implementation example which, to the best of the authors' knowledge, is the fastest CMOS CORDIC realization today.
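The sign-directed iteration the abstract refers to is easiest to see in a conventional (non-redundant) floating-point sketch of rotation-mode CORDIC. This toy version still performs the exact sign detection on the residual angle at every step — removing that serial dependency is precisely the paper's contribution.

```python
import math

def cordic_rotate(angle, iterations=32):
    """Rotate the unit vector (1, 0) by `angle` radians with classic CORDIC.

    At step i the sign of the residual angle z selects a micro-rotation
    by +/- atan(2^-i); this data-dependent sign is what limits throughput
    in carry-propagating hardware.
    """
    alphas = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Constant scale factor K = prod 1/sqrt(1 + 2^-2i), applied at the end.
    K = 1.0
    for i in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, angle
    for i in range(iterations):
        sigma = 1.0 if z >= 0 else -1.0          # exact sign detection
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * alphas[i]
    return x * K, y * K                          # ~ (cos(angle), sin(angle))
```

After 32 iterations the result agrees with `math.cos`/`math.sin` to well below single-precision accuracy (convergence holds for |angle| up to about 1.74 rad).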
An architecture for tree search based vector quantization for single chip implementation
Heonchul Park, V. Prasanna, Cho-Li Wang
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218557
Abstract: Vector quantization (VQ) has become feasible for use in real-time applications by employing VLSI technology. The authors propose a new search algorithm and an architecture for implementing it, which can be used in real-time image processing. This search algorithm takes O(k) time units on a sequential machine, where k is the dimension of the codevectors, assuming unit time corresponds to one comparison operation. The proposed architecture employs a single processing element (PE) and O(N) external memory for storing the N hyperplanes used in the search, where N is the number of codevectors. Compared with known architectures for VQ in the literature, the proposed design does not perform any multiplication operation, since the search method is independent of any L_q metric, 1 ≤ q ≤ ∞. It leads to an area-efficient design with the PE consisting of a comparator and O(k) registers. Also, the memory used by the design is significantly less than that employed in the known architectures.
A method to synthesize modular systolic arrays with local broadcast facility
T. Risset
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218555
Abstract: The author proposes a method to synthesize modular systolic arrays with a local broadcast facility (i.e. arrays containing wires shorter than a fixed, technology-dependent constant). The synthesis starts from a dependence graph which is not uniform but 'locally broadcast'. This method aims at generalizing isolated results that have recently been reported on the acceleration of systolic algorithms by using extensions of the 'pure' systolic model (wires of length > 1, wraparound, folded arrays, etc.).
Mapping locally recursive SFGs upon a multiprocessor system in a ring network
Wonyong Sung, S. Mitra, Ki-II Kum
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218544
Abstract: A multiprocessor code generation method for digital signal processing algorithms represented by SFGs (signal flow graphs) is developed. To reduce the number of communication operations as well as distribute the workload evenly among the processors, a multiprocessor scheduling method based on a parallel block processing scheme, which processes multiple blocks of input data concurrently, is employed. The developed method first divides an SFG into graph segments to reduce the dependency time. A segment merging process follows, which results in fewer temporary data stores and data transfers. Multiprocessor code is generated by applying a single-processor code generation method to each of these segments. The implementation result for the QR-RLS algorithm using the developed method is included.
A projective geometry architecture for scientific computation
B. Amrutur, Rajeev Joshi, N. Karmarkar
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218581
Abstract: A large fraction of scientific and engineering computations involve sparse matrices. While dense matrix computations can be parallelized relatively easily, sparse matrices with arbitrary or irregular structure pose a real challenge to designers of highly parallel machines. A recent paper by N.K. Karmarkar (1991) proposed a new parallel architecture for sparse matrix computations based on finite projective geometries. The mathematical structure of these geometries plays an important role in defining the interconnections between the processors and memories in this architecture, and also aids in efficiently solving several difficult problems (such as load balancing, data routing, and memory-access conflicts) that are encountered in the design of parallel systems. The authors discuss some of the key issues in the system design of such a machine, and show how exploiting the structure of the geometry results in an efficient hardware implementation of the machine. They also present circuit designs and simulation results for key elements of the system: a 200 MHz pipelined memory; a pipelined multiplier based on an adder unit with a delay of 2 ns; and a 500 Mbit/s CMOS input/output buffer.
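The finite projective geometries underlying the architecture can be sampled with the smallest case, PG(2, 2) (the Fano plane). The incidence property checked below — every pair of points lies on exactly one line — is the kind of regularity that lets processor-memory connections be laid out without access conflicts. This is a minimal illustration, not the paper's actual construction.

```python
from itertools import combinations

# The 7 points of PG(2, 2) are the nonzero vectors over GF(2); encode a
# vector (x, y, z) as the 3-bit integer 1..7. A line is the set of three
# nonzero vectors closed under XOR (i.e. a 2-dimensional subspace minus 0).
points = list(range(1, 8))
lines = sorted({frozenset({a, b, a ^ b}) for a, b in combinations(points, 2)},
               key=sorted)

# Incidence property: every pair of distinct points lies on exactly one line.
for a, b in combinations(points, 2):
    assert sum(1 for line in lines if a in line and b in line) == 1

print(len(points), len(lines))  # → 7 7 (7 points, 7 lines, 3 points per line)
```

The same counts generalize: a projective plane of order q has q^2 + q + 1 points and equally many lines, giving a balanced, symmetric interconnection pattern.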
A parallel sorting algorithm on an eight-neighbor processor array
K. Tanno, T. Takeda, Susumu Horoguchi
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218552
Abstract: The authors deal with a new parallel sorting algorithm on an eight-neighbor processor array with wraparounds in the rows. The algorithm is very simple because it consists of the iteration of a single primitive operation: comparing and exchanging four elements simultaneously. Each processor (processing element), arranged in a two-dimensional array, can communicate with its eight neighboring processors (if they exist). By fully exploiting this communication capability and the wraparound properties, the algorithm sorts n × n elements into row-major order, and yields a sorting time of 3(n+1)(2t_r + t_c), where t_r and t_c are defined as the times for a unit routing step and a comparison operation, respectively.
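The four-element compare-and-exchange primitive the abstract builds on can be sketched as follows. The window semantics here are an assumption for illustration (sort a 2×2 block into row-major order); the paper's full algorithm iterates such a primitive over shifted windows with row wraparound.

```python
def compare_exchange_4(grid, r, c):
    """Sort the four elements of the 2x2 window with top-left corner
    (r, c) into row-major ascending order, in place.

    Illustrative primitive only: in the eight-neighbor array each PE can
    reach the diagonal neighbors, so all four values of a window can be
    compared and exchanged in one parallel step.
    """
    window = sorted([grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1]])
    grid[r][c], grid[r][c + 1] = window[0], window[1]
    grid[r + 1][c], grid[r + 1][c + 1] = window[2], window[3]

g = [[4, 1],
     [3, 2]]
compare_exchange_4(g, 0, 0)
# g is now [[1, 2], [3, 4]] — the window in row-major order
```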
Algorithms and architectures for high performance recursive filtering
S. E. McQuillan, J. McCanny
Pub Date: 1992-08-04 | DOI: 10.1109/ASAP.1992.218569
Abstract: Recently, a number of most-significant-digit (msd) first bit-parallel multipliers for recursive filtering have been reported. However, the design approach used has, in general, been heuristic, and consequently optimality has not always been assured. In this paper, msd-first multiply-accumulate algorithms are described, and important relationships governing the dependencies between latency, number representations, etc. are derived. A more systematic approach to designing recursive filters is illustrated by applying the algorithms and associated relationships to the design of cascadable modules for high sample rate IIR filtering and wave digital filtering.
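Why latency in the multiply-accumulate matters for recursive (IIR) filtering can be seen in the defining recursion itself: each output feeds back into the next computation, so the loop cannot be pipelined arbitrarily. A minimal first-order example (not from the paper):

```python
def iir_first_order(x, a, b):
    """First-order recursive filter: y[n] = b*x[n] + a*y[n-1].

    The dependence of y[n] on y[n-1] is the feedback loop whose
    multiply-accumulate latency msd-first designs aim to minimize:
    y[n] cannot be started until y[n-1] is (at least partially) known.
    """
    y, prev = [], 0.0
    for xn in x:
        prev = b * xn + a * prev   # one multiply-accumulate per sample
        y.append(prev)
    return y

# Impulse response of y[n] = x[n] + 0.5*y[n-1] decays geometrically:
print(iir_first_order([1.0, 0.0, 0.0], 0.5, 1.0))  # → [1.0, 0.5, 0.25]
```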