The residue logarithmic number system (RLNS) represents real values as quantized logarithms which, in turn, are represented using the residue number system (RNS). Compared with the conventional logarithmic number system (LNS), in which quantized logarithms are represented as binary integers, RLNS offers faster multiplication and division. Both RLNS and LNS use a table lookup involving all bits for addition. The width, dynamic range, precision, and naive table size of RLNS (with careful moduli selection) are as good as those of conventional LNS. Conventional LNS addition, however, can be more efficient than a naive table lookup. First, commutativity allows the arguments to be interchanged. Second, the addition function is often essentially zero and does not have to be tabulated. Exploiting these optimizations requires comparison: in binary, comparisons are easy; in residue representation, they are slow. Although RLNS inherently demands comparison, this paper shows a novel way in which comparisons can be performed in parallel with the lookup from a small table. The paper also describes a novel tool that generates synthesizable Verilog, making RLNS viable in practical applications that can benefit from shorter multiply and divide times.
{"title":"The residue logarithmic number system: theory and implementation","authors":"M. Arnold","doi":"10.1109/ARITH.2005.44","DOIUrl":"https://doi.org/10.1109/ARITH.2005.44","url":null,"abstract":"The residue logarithmic number system (RLNS) represents real values as quantized logarithms which, in turn, are represented using the residue number system (RNS). Compared to the conventional logarithmic number system (LNS) in which quantized logarithms are represented as binary integers, RLNS offers faster multiplication and division times. RLNS and LNS use a table lookup involving all bits for addition. The width, dynamic range, precision and naive table size of RLNS (with careful moduli selection) is as good as those for conventional LNS. Conventional LNS can be more efficient than naive addition lookup. First, commutativity allows interchanging arguments. Second, the addition function is often essentially zero, and does not have to be tabulated. In binary, comparisons are easy. In residue, comparisons are slow. Although RLNS inherently demands comparison, this paper shows a novel way comparisons can be performed in parallel to the lookup from a small table. This paper also describes a novel tool that generates synthesizable Verilog, making RLNS viable in practical applications that can benefit from shorter multiply and divide times.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124141776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research has demonstrated the vulnerability of certain smart card architectures to power and electromagnetic analysis when multiplier operations are insufficiently shielded from external monitoring. In this paper, several standard multipliers are investigated in more detail in order to provide a foundation for understanding potential weaknesses and enabling the subsequent successful repair of those systems. A model is built which accurately predicts power use as a function of the Hamming weights of the inputs, without the combinatorial explosion of exhaustive simulation. This confirms that power use is indeed data dependent, at least for these multipliers. Laboratory experiments confirm that electromagnetic radiation (EMR) also corresponds closely to these power predictions over a wide range of frequencies.
{"title":"Data dependent power use in multipliers","authors":"C. D. Walter, David Samyde","doi":"10.1109/ARITH.2005.14","DOIUrl":"https://doi.org/10.1109/ARITH.2005.14","url":null,"abstract":"Recent research has demonstrated the vulnerability of certain smart card architectures to power and electromagnetic analysis when multiplier operations are insufficiently shielded from external monitoring. In this paper several standard multipliers are investigated in more detail in order to provide the foundation for understanding potential weaknesses and enabling the subsequent successful repair of those systems. A model is built which accurately predicts power use as a function of the Hamming weights of inputs without the combinatorial explosion of exhaustive simulation. This confirms that power use is indeed data dependent at least for those multipliers. Laboratory experiments confirm that EMR also corresponds closely to these power predictions over a wide range of frequencies.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128284055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper presents a one-shot batch process that generates a wide range of designs for a family of parallel prefix adders. Each prefix adder is represented by two two-dimensional matrices and two vectors. This matrix representation makes it possible to compose two functions for gate sizing, which calculate the delay and the total transistor width of the adder's carry-propagation graph. After gate sizing, the critical-path netlists of the carry-propagation graph are generated from the matrix representation for SPICE delay calculation. The process is illustrated by generating sets of delay and total-transistor-width pairs for 32-bit and 64-bit adders.
{"title":"Parallel prefix adder design with matrix representation","authors":"Youngmoon Choi, E. Swartzlander","doi":"10.1109/ARITH.2005.35","DOIUrl":"https://doi.org/10.1109/ARITH.2005.35","url":null,"abstract":"The paper presents a one-shot batch process that generates a wide range of designs for a group of parallel prefix adders. The prefix adders are represented by two two-dimensional matrices and two vectors. This matrix representation makes it possible to compose two functions for gate sizing which calculate the delay and the total transistor width of the carry propagation graph of adders. After gate sizing, the critical path net-lists of the carry propagation graph are generated from the matrix representation for spice delay calculation. The process is illustrated by generating sets of delay and total transistor width pairs for 32-bit and 64-bit cases.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131452149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pipelined CORDIC with a linear approximation to rotation has been proposed to achieve reductions in delay, power, and area; however, the schemes for rotation (multiplication) and vectoring (division) complicate implementation in a single unit. In this work, we improve the linear approximation scheme, leading to a unified implementation for rotation and vectoring in which fully parallel tree multipliers are used instead of the second half of the CORDIC iterations. We also combine the linear approximation to rotation with scale-factor compensation, so that the compensation is performed concurrently with the rotation process. A comparison with other designs is also provided.
{"title":"Low latency pipelined circular CORDIC","authors":"E. Antelo, J. Villalba","doi":"10.1109/ARITH.2005.30","DOIUrl":"https://doi.org/10.1109/ARITH.2005.30","url":null,"abstract":"The pipelined CORDIC with linear approximation to rotation has been proposed to achieve reductions in delay, power and area; however, the schemes for rotation (multiplication) and vectoring (division) complicate implementation in a single unit. In this work, we improve the linear approximation scheme, leading to a unified implementation for rotation and vectoring where fully parallel tree multipliers are used instead of the second half of CORDIC iterations. We also combine the linear approximation to rotation with the scale factor compensation so that the compensation is performed concurrently with the rotation process. Comparison with other designs is also provided.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128154748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient adder design requires proper selection of a recurrence algorithm and its realization. Each of the algorithms of Weinberger, Ling, and Doran was analyzed for its flexibility of representation and its suitability for realization in CMOS. We describe general techniques for developing efficient realizations of Ling's algorithm under CMOS technology constraints. From these techniques we propose two high-performance realizations that achieve a 1 FO4 delay improvement at the same energy, and a 50% energy reduction at the same delay, compared with existing Ling and Weinberger designs.
{"title":"Efficient mapping of addition recurrence algorithms in CMOS","authors":"B. Zeydel, Ties Kluter, V. Oklobdzija","doi":"10.1109/ARITH.2005.19","DOIUrl":"https://doi.org/10.1109/ARITH.2005.19","url":null,"abstract":"Efficient adder design requires proper selection of a recurrence algorithm and its realization. Each of the algorithms: Weinberger's, Ling's and Doran's were analyzed for its flexibility in representation and suitability for realization in CMOS. We describe general techniques for developing efficient realizations based on CMOS technology constraints when using Ling's algorithm. From these techniques we propose two high-performance realizations that achieve 1 FO4 delay improvement at the same energy and 50% energy reduction at the same delay than existing Ling and Weinberger designs.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126848237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes an improved version of the Tenca-Koc unified scalable radix-2 Montgomery multiplier with half the latency for small and moderate precision operands and half the queue memory requirement. Like the Tenca-Koc multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2^n), up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.
{"title":"An improved unified scalable radix-2 Montgomery multiplier","authors":"D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu","doi":"10.1109/ARITH.2005.9","DOIUrl":"https://doi.org/10.1109/ARITH.2005.9","url":null,"abstract":"This paper describes an improved version of the Tenca-Koc unified scalable radix-2 Montgomery multiplier with half the latency for small and moderate precision operands and half the queue memory requirement. Like the Tenca-Koc multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2/sup n/) up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129724388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel error-free (infinite-precision) architecture for the fast implementation of both the 8x8 2-D discrete cosine transform (DCT) and the inverse DCT (IDCT). The architecture uses a new algebraic integer quantization of a 1-D radix-8 DCT that allows the separable computation of an 8x8 2-D DCT without any intermediate number-representation conversions. This is a considerable improvement over previously introduced algebraic integer encoding techniques for computing the DCT and IDCT: it eliminates the need to approximate the transformation matrix elements by obtaining their exact representations, and hence maps the transcendental functions without any error. Using this encoding scheme, an entire 8x8 1-D DCT-SQ (scalar quantization) algorithm can be implemented with only 24 adders. Besides being multiplication-free, the new mapping scheme fits the algorithm, eliminating any computational or quantization errors and resulting in a short word length and a high-speed design.
{"title":"Error-free computation of 8/spl times/8 2D DCT and IDCT using two-dimensional algebraic integer quantization","authors":"K. Wahid, V. Dimitrov, G. Jullien","doi":"10.1109/ARITH.2005.20","DOIUrl":"https://doi.org/10.1109/ARITH.2005.20","url":null,"abstract":"This paper presents a novel error-free (infinite-precision) architecture for the fast implementation of both 8/spl times/8 2D discrete cosine transform and inverse DCT. The architecture uses a new algebraic integer quantization of a 1D radix-8 DCT that allows the separable computation of a 2D 8/spl times/8 DCT without any intermediate number representation conversions. This is a considerable improvement on previously introduced algebraic integer encoding techniques to compute both DCT and IDCT which eliminates the requirements to approximate the transformation matrix elements by obtaining their exact representations and hence mapping the transcendental functions without any errors. Using this encoding scheme, an entire 8/spl times/8 1D DCT-SQ (scalar quantization) algorithm can be implemented with only 24 adders. Apart from the multiplication-free nature, this new mapping scheme fits to this algorithm, eliminating any computational or quantization errors and resulting short-word-length and high-speed-design.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132186608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modular reduction is a fundamental operation in cryptographic systems. Most well-known modular reduction methods, including Barrett's and Montgomery's algorithms, leverage precomputations to avoid divisions, so that the main cost of these methods lies in a sequence of two long multiplications. For large wordlengths, a multiplication, which is tantamount to a linear convolution, is performed via the fast Fourier transform (FFT) or other transform-based techniques, as in the Schönhage-Strassen multiplication algorithm. We show a fundamental property (the separation principle): in a modular reduction based on long multiplications, the linear convolution required by one of the two long multiplications can be replaced by a cyclic convolution, and the halves can be separated using other information available due to the intrinsic redundancy of the operations. This reduces the number of operations by about 25%. We demonstrate that both Barrett's and Montgomery's methods can be sped up using this principle. A direct application of the algorithm to modular exponentiation (with either Barrett's or Montgomery's method) can be expected to yield about a 17% speedup.
{"title":"Fast modular reduction for large wordlengths via one linear and one cyclic convolution","authors":"D. Phatak, T. Goff","doi":"10.1109/ARITH.2005.21","DOIUrl":"https://doi.org/10.1109/ARITH.2005.21","url":null,"abstract":"Modular reduction is a fundamental operation in cryptographic systems. Most well known modular reduction methods including Barrett's and Montgomery's algorithms leverage some-pre computations to avoid divisions so that the main complexity of these methods lies in a sequence of two long multiplications. For large wordlengths a multiplication which is tantamount to a linear convolution is performed via the fast Fourier transform (FFT) or other transform-based techniques as in the Schonhage-Strassen multiplication algorithm. We show a fundamental property (the separation principle): in a modular reduction based on long multiplications, the linear convolution required by one of the two long multiplications can be replaced by a cyclic convolution, and the halves can be separated using other information available due to the intrinsic redundancy of the operations. This reduces the number of operations by about 25%. We demonstrate that both Barrett's and Montgomery's methods can be sped up by using the aforementioned fundamental principle. It is shown that a direct application of this algorithm to modular exponentiation (either using Barrett's or Montgomery's methods) can be expected to yield about about 17% speedup.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133151336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New bit-serial squarers for long numbers in LSB-first form are presented in this paper. The first scheme is a 50% operationally efficient squarer that has half the number of cells of traditional squarers. The second scheme is a 100% operationally efficient squarer; here the number of cells remains unchanged compared to other proposed schemes, but the number of required registers is reduced significantly. Both schemes are presented in non-systolic and systolic forms and are compared with other squarers in the literature in terms of hardware complexity.
{"title":"Long number bit-serial squarers","authors":"E. Chaniotakis, P. Kalivas, K. Pekmestzi","doi":"10.1109/ARITH.2005.28","DOIUrl":"https://doi.org/10.1109/ARITH.2005.28","url":null,"abstract":"New bit serial squarers for long numbers in LSB first form, are presented in this paper. The first presented scheme is a 50% operational efficient squarer than has the half number of cells compared to the traditional squarers. The second scheme is a 100% operational efficient squarer. In this scheme, the number of the cells remain unchanged compared to other proposed schemes but the number of the required registers is reduced significantly. Both schemes are presented in non-systolic and systolic form and are compared against other squarers presented in the bibliography from the aspect of hardware complexity.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123576701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we propose an architecture for the double-precision floating-point multiply-add fused (MAF) operation A + (B x C) that allows the floating-point addition to be computed with lower latency than floating-point multiplication and MAF. Whereas previous MAF architectures compute the three operations with the same latency, the proposed architecture allows the first pipeline stages, those related to the multiplication B x C, to be skipped in the case of an addition. For instance, for a MAF unit pipelined into three or five stages, the latency of floating-point addition is reduced to two or three cycles, respectively. To achieve this latency reduction, the alignment shifter, which in previous organizations operates in parallel with the multiplication, is moved so that the multiplication can be bypassed. To prevent this modification from lengthening the critical path, a double-datapath organization is used, in which alignment and normalization are placed in separate paths. Moreover, we use previously developed techniques that combine the addition with the rounding and perform the normalization before the addition.
{"title":"Floating-point fused multiply-add: reduced latency for floating-point addition","authors":"J. Bruguera, T. Lang","doi":"10.1109/ARITH.2005.22","DOIUrl":"https://doi.org/10.1109/ARITH.2005.22","url":null,"abstract":"In this paper we propose an architecture for the computation of the double-precision floating-point multiply-add fused (MAF) operation A+(B/spl times/C) that permits to compute the floating-point addition with lower latency than floating-point multiplication and MAF. While previous MAF architectures compute the three operations with the same latency, the proposed architecture permits to skip the first pipeline stages, those related with the multiplication B/spl times/C, in case of an addition. For instance, for a MAF unit pipelined into three or five stages, the latency of the floating-point addition is reduced to two or three cycles, respectively. To achieve the latency reduction for floating-point addition, the alignment shifter, which in previous organizations is in parallel with the multiplication, is moved so that the multiplication can be bypassed. To avoid that this modification increases the critical path, a double-datapath organization is used, in which the alignment and normalization are in separate paths. Moreover, we use the techniques developed previously of combining the addition and the rounding and of performing the normalization before the addition.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116941025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}