The residue logarithmic number system (RLNS) represents real values as quantized logarithms which, in turn, are represented using the residue number system (RNS). Compared to the conventional logarithmic number system (LNS) in which quantized logarithms are represented as binary integers, RLNS offers faster multiplication and division times. RLNS and LNS use a table lookup involving all bits for addition. The width, dynamic range, precision and naive table size of RLNS (with careful moduli selection) is as good as those for conventional LNS. Conventional LNS can be more efficient than naive addition lookup. First, commutativity allows interchanging arguments. Second, the addition function is often essentially zero, and does not have to be tabulated. In binary, comparisons are easy. In residue, comparisons are slow. Although RLNS inherently demands comparison, this paper shows a novel way comparisons can be performed in parallel to the lookup from a small table. This paper also describes a novel tool that generates synthesizable Verilog, making RLNS viable in practical applications that can benefit from shorter multiply and divide times.
{"title":"The residue logarithmic number system: theory and implementation","authors":"M. Arnold","doi":"10.1109/ARITH.2005.44","DOIUrl":"https://doi.org/10.1109/ARITH.2005.44","url":null,"abstract":"The residue logarithmic number system (RLNS) represents real values as quantized logarithms which, in turn, are represented using the residue number system (RNS). Compared to the conventional logarithmic number system (LNS) in which quantized logarithms are represented as binary integers, RLNS offers faster multiplication and division times. RLNS and LNS use a table lookup involving all bits for addition. The width, dynamic range, precision and naive table size of RLNS (with careful moduli selection) is as good as those for conventional LNS. Conventional LNS can be more efficient than naive addition lookup. First, commutativity allows interchanging arguments. Second, the addition function is often essentially zero, and does not have to be tabulated. In binary, comparisons are easy. In residue, comparisons are slow. Although RLNS inherently demands comparison, this paper shows a novel way comparisons can be performed in parallel to the lookup from a small table. This paper also describes a novel tool that generates synthesizable Verilog, making RLNS viable in practical applications that can benefit from shorter multiply and divide times.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124141776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research has demonstrated the vulnerability of certain smart card architectures to power and electromagnetic analysis when multiplier operations are insufficiently shielded from external monitoring. In this paper several standard multipliers are investigated in more detail in order to provide the foundation for understanding potential weaknesses and enabling the subsequent successful repair of those systems. A model is built which accurately predicts power use as a function of the Hamming weights of inputs without the combinatorial explosion of exhaustive simulation. This confirms that power use is indeed data dependent at least for those multipliers. Laboratory experiments confirm that EMR also corresponds closely to these power predictions over a wide range of frequencies.
{"title":"Data dependent power use in multipliers","authors":"C. D. Walter, David Samyde","doi":"10.1109/ARITH.2005.14","DOIUrl":"https://doi.org/10.1109/ARITH.2005.14","url":null,"abstract":"Recent research has demonstrated the vulnerability of certain smart card architectures to power and electromagnetic analysis when multiplier operations are insufficiently shielded from external monitoring. In this paper several standard multipliers are investigated in more detail in order to provide the foundation for understanding potential weaknesses and enabling the subsequent successful repair of those systems. A model is built which accurately predicts power use as a function of the Hamming weights of inputs without the combinatorial explosion of exhaustive simulation. This confirms that power use is indeed data dependent at least for those multipliers. Laboratory experiments confirm that EMR also corresponds closely to these power predictions over a wide range of frequencies.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128284055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper presents a one-shot batch process that generates a wide range of designs for a group of parallel prefix adders. The prefix adders are represented by two two-dimensional matrices and two vectors. This matrix representation makes it possible to compose two functions for gate sizing which calculate the delay and the total transistor width of the carry propagation graph of adders. After gate sizing, the critical path net-lists of the carry propagation graph are generated from the matrix representation for spice delay calculation. The process is illustrated by generating sets of delay and total transistor width pairs for 32-bit and 64-bit cases.
{"title":"Parallel prefix adder design with matrix representation","authors":"Youngmoon Choi, E. Swartzlander","doi":"10.1109/ARITH.2005.35","DOIUrl":"https://doi.org/10.1109/ARITH.2005.35","url":null,"abstract":"The paper presents a one-shot batch process that generates a wide range of designs for a group of parallel prefix adders. The prefix adders are represented by two two-dimensional matrices and two vectors. This matrix representation makes it possible to compose two functions for gate sizing which calculate the delay and the total transistor width of the carry propagation graph of adders. After gate sizing, the critical path net-lists of the carry propagation graph are generated from the matrix representation for spice delay calculation. The process is illustrated by generating sets of delay and total transistor width pairs for 32-bit and 64-bit cases.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131452149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The pipelined CORDIC with linear approximation to rotation has been proposed to achieve reductions in delay, power and area; however, the schemes for rotation (multiplication) and vectoring (division) complicate implementation in a single unit. In this work, we improve the linear approximation scheme, leading to a unified implementation for rotation and vectoring where fully parallel tree multipliers are used instead of the second half of CORDIC iterations. We also combine the linear approximation to rotation with the scale factor compensation so that the compensation is performed concurrently with the rotation process. Comparison with other designs is also provided.
{"title":"Low latency pipelined circular CORDIC","authors":"E. Antelo, J. Villalba","doi":"10.1109/ARITH.2005.30","DOIUrl":"https://doi.org/10.1109/ARITH.2005.30","url":null,"abstract":"The pipelined CORDIC with linear approximation to rotation has been proposed to achieve reductions in delay, power and area; however, the schemes for rotation (multiplication) and vectoring (division) complicate implementation in a single unit. In this work, we improve the linear approximation scheme, leading to a unified implementation for rotation and vectoring where fully parallel tree multipliers are used instead of the second half of CORDIC iterations. We also combine the linear approximation to rotation with the scale factor compensation so that the compensation is performed concurrently with the rotation process. Comparison with other designs is also provided.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128154748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient adder design requires proper selection of a recurrence algorithm and its realization. Each of the algorithms: Weinberger's, Ling's and Doran's were analyzed for its flexibility in representation and suitability for realization in CMOS. We describe general techniques for developing efficient realizations based on CMOS technology constraints when using Ling's algorithm. From these techniques we propose two high-performance realizations that achieve 1 FO4 delay improvement at the same energy and 50% energy reduction at the same delay than existing Ling and Weinberger designs.
{"title":"Efficient mapping of addition recurrence algorithms in CMOS","authors":"B. Zeydel, Ties Kluter, V. Oklobdzija","doi":"10.1109/ARITH.2005.19","DOIUrl":"https://doi.org/10.1109/ARITH.2005.19","url":null,"abstract":"Efficient adder design requires proper selection of a recurrence algorithm and its realization. Each of the algorithms: Weinberger's, Ling's and Doran's were analyzed for its flexibility in representation and suitability for realization in CMOS. We describe general techniques for developing efficient realizations based on CMOS technology constraints when using Ling's algorithm. From these techniques we propose two high-performance realizations that achieve 1 FO4 delay improvement at the same energy and 50% energy reduction at the same delay than existing Ling and Weinberger designs.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126848237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu
This paper describes an improved version of the Tenca-Koc unified scalable radix-2 Montgomery multiplier with half the latency for small and moderate precision operands and half the queue memory requirement. Like the Tenca-Koc multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2/sup n/) up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.
{"title":"An improved unified scalable radix-2 Montgomery multiplier","authors":"D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, S. Hsu","doi":"10.1109/ARITH.2005.9","DOIUrl":"https://doi.org/10.1109/ARITH.2005.9","url":null,"abstract":"This paper describes an improved version of the Tenca-Koc unified scalable radix-2 Montgomery multiplier with half the latency for small and moderate precision operands and half the queue memory requirement. Like the Tenca-Koc multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2/sup n/) up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129724388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel error-free (infinite-precision) architecture for the fast implementation of both 8/spl times/8 2D discrete cosine transform and inverse DCT. The architecture uses a new algebraic integer quantization of a 1D radix-8 DCT that allows the separable computation of a 2D 8/spl times/8 DCT without any intermediate number representation conversions. This is a considerable improvement on previously introduced algebraic integer encoding techniques to compute both DCT and IDCT which eliminates the requirements to approximate the transformation matrix elements by obtaining their exact representations and hence mapping the transcendental functions without any errors. Using this encoding scheme, an entire 8/spl times/8 1D DCT-SQ (scalar quantization) algorithm can be implemented with only 24 adders. Apart from the multiplication-free nature, this new mapping scheme fits to this algorithm, eliminating any computational or quantization errors and resulting short-word-length and high-speed-design.
{"title":"Error-free computation of 8/spl times/8 2D DCT and IDCT using two-dimensional algebraic integer quantization","authors":"K. Wahid, V. Dimitrov, G. Jullien","doi":"10.1109/ARITH.2005.20","DOIUrl":"https://doi.org/10.1109/ARITH.2005.20","url":null,"abstract":"This paper presents a novel error-free (infinite-precision) architecture for the fast implementation of both 8/spl times/8 2D discrete cosine transform and inverse DCT. The architecture uses a new algebraic integer quantization of a 1D radix-8 DCT that allows the separable computation of a 2D 8/spl times/8 DCT without any intermediate number representation conversions. This is a considerable improvement on previously introduced algebraic integer encoding techniques to compute both DCT and IDCT which eliminates the requirements to approximate the transformation matrix elements by obtaining their exact representations and hence mapping the transcendental functions without any errors. Using this encoding scheme, an entire 8/spl times/8 1D DCT-SQ (scalar quantization) algorithm can be implemented with only 24 adders. Apart from the multiplication-free nature, this new mapping scheme fits to this algorithm, eliminating any computational or quantization errors and resulting short-word-length and high-speed-design.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132186608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modular reduction is a fundamental operation in cryptographic systems. Most well known modular reduction methods including Barrett's and Montgomery's algorithms leverage some-pre computations to avoid divisions so that the main complexity of these methods lies in a sequence of two long multiplications. For large wordlengths a multiplication which is tantamount to a linear convolution is performed via the fast Fourier transform (FFT) or other transform-based techniques as in the Schonhage-Strassen multiplication algorithm. We show a fundamental property (the separation principle): in a modular reduction based on long multiplications, the linear convolution required by one of the two long multiplications can be replaced by a cyclic convolution, and the halves can be separated using other information available due to the intrinsic redundancy of the operations. This reduces the number of operations by about 25%. We demonstrate that both Barrett's and Montgomery's methods can be sped up by using the aforementioned fundamental principle. It is shown that a direct application of this algorithm to modular exponentiation (either using Barrett's or Montgomery's methods) can be expected to yield about about 17% speedup.
{"title":"Fast modular reduction for large wordlengths via one linear and one cyclic convolution","authors":"D. Phatak, T. Goff","doi":"10.1109/ARITH.2005.21","DOIUrl":"https://doi.org/10.1109/ARITH.2005.21","url":null,"abstract":"Modular reduction is a fundamental operation in cryptographic systems. Most well known modular reduction methods including Barrett's and Montgomery's algorithms leverage some-pre computations to avoid divisions so that the main complexity of these methods lies in a sequence of two long multiplications. For large wordlengths a multiplication which is tantamount to a linear convolution is performed via the fast Fourier transform (FFT) or other transform-based techniques as in the Schonhage-Strassen multiplication algorithm. We show a fundamental property (the separation principle): in a modular reduction based on long multiplications, the linear convolution required by one of the two long multiplications can be replaced by a cyclic convolution, and the halves can be separated using other information available due to the intrinsic redundancy of the operations. This reduces the number of operations by about 25%. We demonstrate that both Barrett's and Montgomery's methods can be sped up by using the aforementioned fundamental principle. It is shown that a direct application of this algorithm to modular exponentiation (either using Barrett's or Montgomery's methods) can be expected to yield about about 17% speedup.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133151336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New bit serial squarers for long numbers in LSB first form, are presented in this paper. The first presented scheme is a 50% operational efficient squarer than has the half number of cells compared to the traditional squarers. The second scheme is a 100% operational efficient squarer. In this scheme, the number of the cells remain unchanged compared to other proposed schemes but the number of the required registers is reduced significantly. Both schemes are presented in non-systolic and systolic form and are compared against other squarers presented in the bibliography from the aspect of hardware complexity.
{"title":"Long number bit-serial squarers","authors":"E. Chaniotakis, P. Kalivas, K. Pekmestzi","doi":"10.1109/ARITH.2005.28","DOIUrl":"https://doi.org/10.1109/ARITH.2005.28","url":null,"abstract":"New bit serial squarers for long numbers in LSB first form, are presented in this paper. The first presented scheme is a 50% operational efficient squarer than has the half number of cells compared to the traditional squarers. The second scheme is a 100% operational efficient squarer. In this scheme, the number of the cells remain unchanged compared to other proposed schemes but the number of the required registers is reduced significantly. Both schemes are presented in non-systolic and systolic form and are compared against other squarers presented in the bibliography from the aspect of hardware complexity.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123576701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a set of tools for mechanical reasoning of numerical bounds using interval arithmetic. The tools implement two techniques for reducing decorrelation: interval splitting and Taylor's series expansions. Although the tools are designed for the proof assistant system PVS, expertise on PVS is not required. The ultimate goal of the tools is to provide guaranteed proofs of numerical properties with a minimal human-theorem prover interaction.
{"title":"Guaranteed proofs using interval arithmetic","authors":"M. Daumas, G. Melquiond, C. Muñoz","doi":"10.1109/ARITH.2005.25","DOIUrl":"https://doi.org/10.1109/ARITH.2005.25","url":null,"abstract":"This paper presents a set of tools for mechanical reasoning of numerical bounds using interval arithmetic. The tools implement two techniques for reducing decorrelation: interval splitting and Taylor's series expansions. Although the tools are designed for the proof assistant system PVS, expertise on PVS is not required. The ultimate goal of the tools is to provide guaranteed proofs of numerical properties with a minimal human-theorem prover interaction.","PeriodicalId":194902,"journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125171060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}