Pub Date : 1987-05-18DOI: 10.1109/ARITH.1987.6158715
I. Scherson, Yiming Ma
An Orthogonal Memory Access system allows a multiplicity of processors to concurrently access distinct rows or columns of a rectangular array of data elements. The resulting tightly-coupled multi-processing system is feasible with current technology and has even been suggested for VLSI as a “reduced mesh”. In this paper we introduce the architecture and concentrate on its application to a number of basic vector and numerical computations. Matrix multiplication, L-U decomposition, polynomial evaluation and solutions to linear systems and partial differential equations, all show a speed-up of 0(n) for a n-processor system. The flexibility in the choice of the number of PEs makes the architecture a strong competitor in the world of special-purpose parallel systems. Actually, we prove that the machine exhibits the same performance as any other system with the same number of processors within a factor of 3.
{"title":"Vector computations on an orthogonal memory access multiprocessing system","authors":"I. Scherson, Yiming Ma","doi":"10.1109/ARITH.1987.6158715","DOIUrl":"https://doi.org/10.1109/ARITH.1987.6158715","url":null,"abstract":"An Orthogonal Memory Access system allows a multiplicity of processors to concurrently access distinct rows or columns of a rectangular array of data elements. The resulting tightly-coupled multi-processing system is feasible with current technology and has even been suggested for VLSI as a “reduced mesh”. In this paper we introduce the architecture and concentrate on its application to a number of basic vector and numerical computations. Matrix multiplication, L-U decomposition, polynomial evaluation and solutions to linear systems and partial differential equations, all show a speed-up of 0(n) for a n-processor system. The flexibility in the choice of the number of PEs makes the architecture a strong competitor in the world of special-purpose parallel systems. Actually, we prove that the machine exhibits the same performance as any other system with the same number of processors within a factor of 3.","PeriodicalId":424620,"journal":{"name":"1987 IEEE 8th Symposium on Computer Arithmetic (ARITH)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1987-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132140890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1987-05-18DOI: 10.1109/ARITH.1987.6158696
J. Fandrianto
An algorithm to implement radix four division and radix four square-root in a shared hardware for IEEE standard for binary floating point format will be described. The algorithm is best suited to be implemented in either off-the-shelf components or being a portion of a VLSI floating-point chip. Division and square-root bits are generated by a non-restoring method while keeping the partial remainder, partial radicand, quotient and root all in redundant forms. The core iteration involves a 8-bit carry look-ahead adder, a multiplexer to convert two's complement to sign magnitude, a 19-term next quotient/root prediction PLA, a divisor/root multiple selector, and a carry save adder. At the end, two iterations of carry look-ahead adder across the length of the mantissa are required to generate the quotient/root in a correctly rounded form. Despite its simplicity in the hardware requirement, the algorithm takes only about 30 cycles to compute double precision division or square-root. Finally, extending the algorithm to radix eight or higher division/square-root will be discussed.
{"title":"Algorithm for high speed shared radix 4 division and radix 4 square-root","authors":"J. Fandrianto","doi":"10.1109/ARITH.1987.6158696","DOIUrl":"https://doi.org/10.1109/ARITH.1987.6158696","url":null,"abstract":"An algorithm to implement radix four division and radix four square-root in a shared hardware for IEEE standard for binary floating point format will be described. The algorithm is best suited to be implemented in either off-the-shelf components or being a portion of a VLSI floating-point chip. Division and square-root bits are generated by a non-restoring method while keeping the partial remainder, partial radicand, quotient and root all in redundant forms. The core iteration involves a 8-bit carry look-ahead adder, a multiplexer to convert two's complement to sign magnitude, a 19-term next quotient/root prediction PLA, a divisor/root multiple selector, and a carry save adder. At the end, two iterations of carry look-ahead adder across the length of the mantissa are required to generate the quotient/root in a correctly rounded form. Despite its simplicity in the hardware requirement, the algorithm takes only about 30 cycles to compute double precision division or square-root. Finally, extending the algorithm to radix eight or higher division/square-root will be discussed.","PeriodicalId":424620,"journal":{"name":"1987 IEEE 8th Symposium on Computer Arithmetic (ARITH)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1987-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124989761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1987-05-18DOI: 10.1109/ARITH.1987.6158699
T. Han, D. A. Carlson
In this paper, we study area-time tradeoffs in VLSI for prefix computation using graph representations of this problem. Since the problem is intimately related to binary addition, the results we obtain lead to the design of area-time efficient VLSI adders. This is a major goal of our work: to design very low latency addition circuitry that is also area efficient. To this end, we present a new graph representation for prefix computation that leads to the design of a fast, area-efficient binary adder. The new graph is a combination of previously known graph representations for prefix computation, and its area is close to known lower bounds on the VLSI area of parallel prefix graphs. Using it, we are able to design VLSI adders having area A = 0(n log n) whose delay time is the lowest possible value, i. e. the fastest possible area-efficient VLSI adder.
{"title":"Fast area-efficient VLSI adders","authors":"T. Han, D. A. Carlson","doi":"10.1109/ARITH.1987.6158699","DOIUrl":"https://doi.org/10.1109/ARITH.1987.6158699","url":null,"abstract":"In this paper, we study area-time tradeoffs in VLSI for prefix computation using graph representations of this problem. Since the problem is intimately related to binary addition, the results we obtain lead to the design of area-time efficient VLSI adders. This is a major goal of our work: to design very low latency addition circuitry that is also area efficient. To this end, we present a new graph representation for prefix computation that leads to the design of a fast, area-efficient binary adder. The new graph is a combination of previously known graph representations for prefix computation, and its area is close to known lower bounds on the VLSI area of parallel prefix graphs. Using it, we are able to design VLSI adders having area A = 0(n log n) whose delay time is the lowest possible value, i. e. the fastest possible area-efficient VLSI adder.","PeriodicalId":424620,"journal":{"name":"1987 IEEE 8th Symposium on Computer Arithmetic (ARITH)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1987-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115011280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}