Current processors typically embed many cores running at high speed. The main goal of this paper is to assess the efficiency of software parallelism for low level arithmetic operations by providing a thorough comparison of several parallel modular multiplications. Famous methods such as Barrett, Montgomery as well as more recent algorithms are compared together with a novel k-ary multipartite multiplication which allows to split the computations into independent processes. Our experiments show that this new algorithm is well suited to software parallelism.
{"title":"Parallel Modular Multiplication on Multi-core Processors","authors":"Pascal Giorgi, L. Imbert, T. Izard","doi":"10.1109/ARITH.2013.20","DOIUrl":"https://doi.org/10.1109/ARITH.2013.20","url":null,"abstract":"Current processors typically embed many cores running at high speed. The main goal of this paper is to assess the efficiency of software parallelism for low level arithmetic operations by providing a thorough comparison of several parallel modular multiplications. Famous methods such as Barrett, Montgomery as well as more recent algorithms are compared together with a novel k-ary multipartite multiplication which allows to split the computations into independent processes. Our experiments show that this new algorithm is well suited to software parallelism.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128067369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.
{"title":"Floating Point Architecture Extensions for Optimized Matrix Factorization","authors":"A. Pedram, A. Gerstlauer, R. V. D. Geijn","doi":"10.1109/ARITH.2013.21","DOIUrl":"https://doi.org/10.1109/ARITH.2013.21","url":null,"abstract":"This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla
Floating point multiply-accumulate (FPMAC) unitis the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and SSE and also extensively used in server processors employed for engineering and scientific applications. Consequently design of FPMAC is of vital consideration since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on optimal computations in the critical path and therefore making it the fastest FPMAC design as of today in literature. The design is based on the premise of isolating and optimizing the critical path computation in FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of the accumulate add for near path into the Wallace tree for eliminating a 3:2compressor from near path critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder from multiplier giving both timing and power benefits. Our design by premise of splitting consumes lesser power for each operation where only the required logic for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the support for all rounding modes to adhere to IEEE standard for double precision FPMAC which is critical for employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being 30% smaller in area than it.
{"title":"Split-Path Fused Floating Point Multiply Accumulate (FPMAC)","authors":"S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla","doi":"10.1109/ARITH.2013.32","DOIUrl":"https://doi.org/10.1109/ARITH.2013.32","url":null,"abstract":"Floating point multiply-accumulate (FPMAC) unitis the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and SSE and also extensively used in server processors employed for engineering and scientific applications. Consequently design of FPMAC is of vital consideration since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on optimal computations in the critical path and therefore making it the fastest FPMAC design as of today in literature. The design is based on the premise of isolating and optimizing the critical path computation in FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of the accumulate add for near path into the Wallace tree for eliminating a 3:2compressor from near path critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder from multiplier giving both timing and power benefits. Our design by premise of splitting consumes lesser power for each operation where only the required logic for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the support for all rounding modes to adhere to IEEE standard for double precision FPMAC which is critical for employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being 30% smaller in area than it.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The series expansion at the origin of the Airy function Ai(x) is alternating and hence problematic to evaluate for x > 0 due to cancellation. Based on a method recently proposed by Gawronski, Müller, and Rein hard, we exhibit two functions F and G, both with nonnegative Taylor expansions at the origin, such that Ai(x) = G(x)/F(x). The sums are now well-conditioned, but the Taylor coefficients of G turn out to obey an ill-conditioned three-term recurrence. We use the classical Miller algorithm to overcome this issue. We bound all errors and our implementation allows an arbitrary and certified accuracy, that can be used, e.g., for providing correct rounding in arbitrary precision.
Airy函数Ai(x)原点处的级数展开是交替的,因此由于消去,当x > 0时很难求值。基于Gawronski, m ller和Rein hard最近提出的一种方法,我们展示了两个函数F和G,它们在原点处都具有非负泰勒展开,使得Ai(x) = G(x)/F(x)。这些和现在是条件良好的,但是G的泰勒系数服从一个条件不良的三项递归式。我们使用经典的米勒算法来克服这个问题。我们绑定了所有的错误,并且我们的实现允许任意和认证的精度,例如,在任意精度下提供正确的舍入。
{"title":"Multiple-Precision Evaluation of the Airy Ai Function with Reduced Cancellation","authors":"S. Chevillard, M. Mezzarobba","doi":"10.1109/ARITH.2013.33","DOIUrl":"https://doi.org/10.1109/ARITH.2013.33","url":null,"abstract":"The series expansion at the origin of the Airy function Ai(x) is alternating and hence problematic to evaluate for x > 0 due to cancellation. Based on a method recently proposed by Gawronski, Müller, and Rein hard, we exhibit two functions F and G, both with nonnegative Taylor expansions at the origin, such that Ai(x) = G(x)/F(x). The sums are now well-conditioned, but the Taylor coefficients of G turn out to obey an ill-conditioned three-term recurrence. We use the classical Miller algorithm to overcome this issue. We bound all errors and our implementation allows an arbitrary and certified accuracy, that can be used, e.g., for providing correct rounding in arbitrary precision.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123618775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}