In this paper, we focus on the relation collection step of the Function Field Sieve (FFS), which is to date the best algorithm known for computing discrete logarithms in small-characteristic finite fields of cryptographic sizes. Denoting such a finite field by F_{p^n}, where p is much smaller than n, the main idea behind this step is to find polynomials of the form a(t) - b(t)x in F_p[t][x] which, when considered as principal ideals in carefully selected function fields, can be factored into products of low-degree prime ideals. Such polynomials are called "relations", and current record-sized discrete-logarithm computations need billions of them. Collecting relations is therefore a crucial and extremely expensive step in FFS, and a practical implementation thereof requires heavy use of cache-aware sieving algorithms, along with efficient polynomial arithmetic over F_p[t]. This paper presents the algorithmic and arithmetic techniques which were put together as part of a new public implementation of FFS, aimed at medium- to record-sized computations.
{"title":"Relation Collection for the Function Field Sieve","authors":"J. Detrey, P. Gaudry, M. Videau","doi":"10.1109/ARITH.2013.28","DOIUrl":"https://doi.org/10.1109/ARITH.2013.28","url":null,"abstract":"In this paper, we focus on the relation collection step of the Function Field Sieve (FFS), which is to date the best algorithm known for computing discrete logarithms in small-characteristic finite fields of cryptographic sizes. Denoting such a finite field by Fpn, where p is much smaller than n, the main idea behind this step is to find polynomials of the form a(t)-b(t)x in Fp[t][x] which, when considered as principal ideals in carefully selected function fields, can be factored into products of low-degree prime ideals. Such polynomials are called \"relations\", and current record-sized discrete-logarithm computations need billions of those. Collecting relations is therefore a crucial and extremely expensive step in FFS, and a practical implementation thereof requires heavy use of cache-aware sieving algorithms, along with efficient polynomial arithmetic over Fp[t]. This paper presents the algorithmic and arithmetic techniques which were put together as part of a new public implementation of FFS, aimed at medium-to record-sized computations.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126944324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIPE (Small Integer Plus Exponent) is a mini-library in the form of a C header file, to perform floating-point computations in very low precisions with correct rounding to nearest in radix 2. The goal of such a tool is to do proofs of algorithms/properties or computations of tight error bounds in these precisions by exhaustive tests, in order to try to generalize them to higher precisions. The currently supported operations are addition, subtraction, multiplication (possibly with the error term), FMA, and miscellaneous comparisons and conversions. Timing comparisons have been done with hardware IEEE-754 floating point and with GNU MPFR.
{"title":"SIPE: Small Integer Plus Exponent","authors":"Vincent Lefèvre","doi":"10.1109/ARITH.2013.22","DOIUrl":"https://doi.org/10.1109/ARITH.2013.22","url":null,"abstract":"SIPE (Small Integer Plus Exponent) is a mini-library in the form of a C header file, to perform floating-point computations in very low precisions with correct rounding to nearest in radix 2. The goal of such a tool is to do proofs of algorithms/properties or computations of tight error bounds in these precisions by exhaustive tests, in order to try to generalize them to higher precisions. The currently supported operations are addition, subtraction, multiplication (possibly with the error term), FMA, and miscellaneous comparisons and conversions. Timing comparisons have been done with hardware IEEE-754 floating point and with GNU MPFR.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130293316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given current hardware trends, ExaScale computing (10^18 floating point operations per second) is projected to be available in less than a decade, achieved by using a huge number of processors, of order 10^9. Given the likely hardware heterogeneity in both platform and network, and the possibility of intermittent failures, dynamic scheduling will be needed to adapt to changing resources and loads. This will make it likely that repeated runs of a program will not execute operations like reductions in exactly the same order. This in turn will make reproducibility, i.e. getting bitwise identical results from run to run, difficult to achieve, because floating point operations like addition are not associative, so computing sums in different orders often leads to different results. Indeed, this is already a challenge on today's platforms.
{"title":"Numerical Reproducibility and Accuracy at ExaScale","authors":"J. Demmel, Hong Diep Nguyen","doi":"10.1109/ARITH.2013.43","DOIUrl":"https://doi.org/10.1109/ARITH.2013.43","url":null,"abstract":"Given current hardware trends, ExaScale computing (1018 floating point operations per second) is projected to be available in less than a decade, achieved by using a huge number of processors, of order 109. Given the likely hardware heterogeneity in both platform and network, and the possibility of intermittent failures, dynamic scheduling will be needed to adapt to changing resources and loads. This will make it likely that repeated runs of a program will not execute operations like reductions in exactly the same order. This in turn will make reproducibility, i.e. getting bitwise identical results from run to run, difficult to achieve, because floating point operations like addition are not associative, so computing sums in different orders often leads to different results. Indeed, this is already a challenge on today's platforms.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114767674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents improved architectures for a floating-point fused two-term dot product unit. The floating-point fused dot product unit is useful for a wide variety of digital signal processing (DSP) applications including complex multiplication and fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations. In order to improve the performance, a new alignment scheme, early normalization, a four-input leading zero anticipation (LZA), a dual-path algorithm, and pipelining are applied. The proposed designs are implemented for single precision and synthesized with a 45nm standard cell library. The proposed dual-path design reduces the latency by 25% compared to the traditional floating-point fused dot product unit. Based on a data flow analysis, the proposed design can be split into three pipeline stages. Since the latencies of the three stages are fairly well balanced, the throughput is increased by a factor of 2.8 compared to the non-pipelined dual-path design.
{"title":"Improved Architectures for a Floating-Point Fused Dot Product Unit","authors":"Jongwook Sohn, E. Swartzlander","doi":"10.1109/ARITH.2013.26","DOIUrl":"https://doi.org/10.1109/ARITH.2013.26","url":null,"abstract":"This paper presents improved architectures for a floating-point fused two-term dot product unit. The floating-point fused dot product unit is useful for a wide variety of digital signal processing (DSP) applications including complex multiplication and fast Fourier transform (FFT) and discrete cosine transform (DCT) butterfly operations. In order to improve the performance, a new alignment scheme, early normalization, a four-input leading zero anticipation (LZA), a dual-path algorithm, and pipelining are applied. The proposed designs are implemented for single precision and synthesized with a 45nm standard cell library. The proposed dual-path design reduces the latency by 25% compared to the traditional floating-point fused dot product unit. Based on a data flow analysis, the proposed design can be split into three pipeline stages. Since the latencies of the three stages are fairly well balanced, the throughput is increased by a factor of 2.8 compared to the non-pipelined dual-path design.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132351398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a software-oriented algorithm that allows one to quickly compare a binary64 floating-point (FP) number and a decimal64 FP number, assuming the "binary encoding" of the decimal formats specified by the IEEE 754-2008 standard for FP arithmetic is used. It is a two-step algorithm: a first pass, based on the exponents only, makes it possible to quickly eliminate most cases; when the first pass does not suffice, a more accurate second pass is required. We provide an implementation of several variants of our algorithm, and compare them.
{"title":"Comparison between Binary64 and Decimal64 Floating-Point Numbers","authors":"N. Brisebarre, M. Mezzarobba, J. Muller, C. Lauter","doi":"10.1109/ARITH.2013.23","DOIUrl":"https://doi.org/10.1109/ARITH.2013.23","url":null,"abstract":"We introduce a software-oriented algorithm that allows one to quickly compare a binary64 floating-point (FP) number and a decimal64 FP number, assuming the \"binary encoding\" of the decimal formats specified by the IEEE 754-2008 standard for FP arithmetic is used. It is a two-step algorithm: a first pass, based on the exponents only, makes it possible to quickly eliminate most cases, then when the first pass does not suffice, a more accurate second pass is required. We provide an implementation of several variants of our algorithm, and compare them.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115341779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reproducibility, i.e. getting bitwise identical floating-point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [1]. However, the combination of dynamic scheduling of parallel computing resources and floating-point non-associativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating-point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [2], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm trades off efficiency and accuracy: we reproducibly attain reasonably accurate results (with an absolute error bound c · n^2 · macheps · max |v_i| for a small constant c) with just 2n + O(1) floating-point operations, and quite accurate results (with an absolute error bound c · n^3 · macheps^2 · max |v_i|) with 5n + O(1) floating-point operations, both with just two reduction operations. Higher accuracies are also possible by increasing the number of error-free transformations. As long as the same rounding mode is used, results computed by the proposed algorithms are reproducible for any run on any platform.
{"title":"Fast Reproducible Floating-Point Summation","authors":"J. Demmel, Hong Diep Nguyen","doi":"10.1109/ARITH.2013.9","DOIUrl":"https://doi.org/10.1109/ARITH.2013.9","url":null,"abstract":"Reproducibility, i.e. getting the bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [1]. However, the combination of dynamic scheduling of parallel computing resources, and floating point nonassociativity, make attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [2], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm trades off efficiency and accuracy: we reproducibly attain reasonably accurate results (with an absolute error bound c · n2 · macheps · max |vi| for a small constant c) with just 2n + O(1) floating-point operations, and quite accurate results (with an absolute error bound c · n3 · macheps2 · max |vi| with 5n + O(1) floating point operations, both with just two reduction operations. Higher accuracies are also possible by increasing the number of error-free transformations. As long as the same rounding mode is used, results computed by the proposed algorithms are reproducible for any run on any platform.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124855960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper deals with the accuracy of complex division in radix-two floating-point arithmetic. Assuming that a fused multiply-add (FMA) instruction is available and that no underflow/overflow occurs, we study how to ensure high relative accuracy in the componentwise sense. Since this essentially reduces to evaluating accurately three expressions of the form ac + bd, an obvious approach would be to perform three calls to Kahan's compensated algorithm for 2x2 determinants. However, in the context of complex division, two of those expressions are such that ac and bd have the same sign, suggesting that cheaper schemes should be used here (since cancellation cannot occur). We first give a detailed accuracy analysis of such schemes for the sum of two nonnegative products, providing not only sharp bounds on both their absolute and relative errors, but also sufficient conditions for the output of one of them to coincide with the output of Kahan's algorithm. By combining Kahan's algorithm with this particular scheme, we then deduce two new division algorithms. Our first algorithm is a straight-line program whose componentwise relative error is always at most 5u + 13u^2, with u the unit roundoff; we also provide examples of inputs for which the error of this algorithm approaches 5u, thus showing that our upper bound is essentially the best possible. When tests are allowed, we show with a second algorithm that the bound above can be further reduced to 4.5u + 9u^2, and that this improved bound is reasonably sharp.
{"title":"On the Componentwise Accuracy of Complex Floating-Point Division with an FMA","authors":"C. Jeannerod, N. Louvet, J. Muller","doi":"10.1109/ARITH.2013.8","DOIUrl":"https://doi.org/10.1109/ARITH.2013.8","url":null,"abstract":"This paper deals with the accuracy of complex division in radix-two floating-point arithmetic. Assuming that a fused multiply-add (FMA) instruction is available and that no underflow/overflow occurs, we study how to ensure high relative accuracy in the component wise sense. Since this essentially reduces to evaluating accurately three expressions of the form ac+bd, an obvious approach would be to perform three calls to Kahan's compensated algorithm for 2 by 2 determinants. However, in the context of complex division, two of those expressions are such that ac and bd have the same sign, suggesting that cheaper schemes should be used here (since cancellation cannot occur). We first give a detailed accuracy analysis of such schemes for the sum of two nonnegative products, providing not only sharp bounds on both their absolute and relative errors, but also sufficient conditions for the output of one of them to coincide with the output of Kahan's algorithm. By combining Kahan's algorithm with this particular scheme, we then deduce two new division algorithms. Our first algorithm is a straight-line program whose component wise relative error is always at most 5u+13u2 with u the unit round off, we also provide examples of inputs for which the error of this algorithm approaches 5u, thus showing that our upper bound is essentially the best possible. When tests are allowed we show with a second algorithm that the bound above can be further reduced to 4.5u+9u2, and that this improved bound is reasonably sharp.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130589541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mathematical values are usually computed using well-known mathematical formulas without much thought about their accuracy, which may turn out to be awful for particular instances. This is the case for the computation of the area of a triangle: when the triangle is needle-like, the common formula has very poor accuracy. Kahan proposed in 1986 an algorithm he claimed correct within a few ulps. Goldberg took over this algorithm in 1991 and gave a precise error bound. This article presents a formal proof of this algorithm, an improvement of its error bound, and new investigations in the case of underflow.
{"title":"How to Compute the Area of a Triangle: A Formal Revisit","authors":"S. Boldo","doi":"10.1109/ARITH.2013.29","DOIUrl":"https://doi.org/10.1109/ARITH.2013.29","url":null,"abstract":"Mathematical values are usually computed using well-known mathematical formulas without thinking about their accuracy, which may turn awful with particular instances. This is the case for the computation of the area of a triangle. When the triangle is needle-like, the common formula has a very poor accuracy. Kahan proposed in 1986 an algorithm he claimed correct within a few ulps. Goldberg took over this algorithm in 1991 and gave a precise error bound. This article presents a formal proof of this algorithm, an improvement of its error bound and new investigations in case of underflow.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130149775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The speed and levels of integration of modern devices have risen to the point that arithmetic can be performed very fast and with high precision. Precise arithmetic comes at a hidden cost: by computing results past the precision they require, systems use their resources inefficiently. Numerous designs over the past fifty years have demonstrated scalable efficiency by utilizing approximate logarithms. Many such designs are based on a linear approximation algorithm developed by Mitchell. This paper evaluates a truncated form of binary logarithm as a replacement for Mitchell's algorithm. The truncated approximate logarithm simultaneously improves the efficiency and precision of Mitchell's approximation while remaining simple to implement.
{"title":"Truncated Logarithmic Approximation","authors":"Michael B. Sullivan, E. Swartzlander","doi":"10.1109/ARITH.2013.34","DOIUrl":"https://doi.org/10.1109/ARITH.2013.34","url":null,"abstract":"The speed and levels of integration of modern devices have risen to the point that arithmetic can be performed very fast and with high precision. Precise arithmetic comes at a hidden cost-by computing results past the precision they require, systems inefficiently utilize their resources. Numerous designs over the past fifty years have demonstrated scalable efficiency by utilizing approximate logarithms. Many such designs are based off of a linear approximation algorithm developed by Mitchell. This paper evaluates a truncated form of binary logarithm as a replacement for Mitchell's algorithm. The truncated approximate logarithm simultaneously improves the efficiency and precision of Mitchell's approximation while remaining simple to implement.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115257861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exascale-level computers might be available in less than a decade. Computer architects are already thinking of, and planning to achieve, such levels of performance. It is reasonable to expect that researchers and engineers will carry out scientific and engineering computations more complex than ever before, and will attempt breakthroughs not possible today. If the size of the problems solved on such machines scales accordingly, we may face new issues related to precision, accuracy, performance, and programmability. The paper examines some relevant aspects of this problem.
{"title":"Precision, Accuracy, and Rounding Error Propagation in Exascale Computing","authors":"Marius Cornea","doi":"10.1109/ARITH.2013.42","DOIUrl":"https://doi.org/10.1109/ARITH.2013.42","url":null,"abstract":"Exascale level computers might be available in less than a decade. Computer architects are already thinking of, and planning to achieve such levels of performance. It is reasonable to expect that researchers and engineers will carry out scientific and engineering computations more complex than ever before, and will attempt breakthroughs not possible today. If the size of the problems solved on such machines scales accordingly, we may face new issues related to precision, accuracy, performance, and programmability. The paper examines some relevant aspects of this problem.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122165839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}