Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00017
Y. Durand, E. Guthmuller, C. F. Tortolero, Jérôme Fereyre, Andrea Bocco, Riccardo Alidori
Linear algebra kernels such as linear solvers and eigensolvers are the working engines underneath many scientific applications. The growing scale of these applications has led researchers to rely on high-precision computing to improve their efficiency and stability. In this work, we investigate the impact of arbitrary extended precision on multiple variants of the Conjugate Gradient method (CG). We show how our VRP processor improves the convergence and the efficiency of these kernels. We also illustrate how our set of tools (library, software environment) makes it possible to migrate legacy applications quickly and intuitively while preserving high performance. We observe up to an 8x improvement in kernel iteration count and up to a 40% improvement in latency. Nevertheless, the main benefit is the stability gained from the added precision, which makes it possible to solve larger, ill-conditioned systems without costly compensating techniques.
{"title":"Accelerating Variants of the Conjugate Gradient with the Variable Precision Processor","authors":"Y. Durand, E. Guthmuller, C. F. Tortolero, Jérôme Fereyre, Andrea Bocco, Riccardo Alidori","doi":"10.1109/ARITH54963.2022.00017","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00017","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
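To make the role of working precision in CG concrete, here is a minimal sketch of the textbook Conjugate Gradient iteration with a selectable precision. It is an illustration only: Python's software `decimal` arithmetic stands in for hardware extended precision, and the function names (`conjugate_gradient`, `dot`, `matvec`) and the `precision_digits` parameter are assumptions of this sketch, not the VRP library or the authors' kernels.

```python
from decimal import Decimal, getcontext


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, precision_digits=50, tol="1e-10", max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A.

    Returns (x, iterations). precision_digits sets the working
    precision in decimal digits (hypothetical knob for this sketch).
    """
    getcontext().prec = precision_digits
    A = [[Decimal(v) for v in row] for row in A]
    b = [Decimal(v) for v in b]
    tol = Decimal(tol)

    x = [Decimal(0)] * len(b)
    r = b[:]                      # residual b - A x for the start x = 0
    p = r[:]                      # first search direction
    rs_old = dot(r, r)
    if rs_old == 0:               # b = 0: x = 0 is already the solution
        return x, 0

    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new.sqrt() < tol:   # stop on the residual 2-norm
            return x, k
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

Running the same system with different values of `precision_digits` and comparing the returned iteration counts is one simple way to observe how precision affects convergence on ill-conditioned matrices.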
Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00030
David M. Russinoff, J. Bruguera, C. Chau, M. Manjrekar, Nicholas Pfister, Harsha Valsaraju
We present a hybrid methodology for the formal verification of arithmetic RTL designs that combines sequential logic equivalence checking with interactive theorem proving in a two-step process. First, an intermediate model of the design is extracted by hand and coded in Restricted Algorithmic C, a simple C subset augmented by the C++ register class templates of Algorithmic C, which provide the bit manipulation features of Verilog. The model is designed to mirror the RTL microarchitecture closely enough to allow efficient equivalence checking, yet to be sufficiently abstract to be amenable to formal analysis. The model is then automatically translated to the logic of the ACL2 theorem prover, which is used to establish correctness with respect to an architectural specification. As an illustration, we describe the modeling and proof of correctness of a chained multiply-add module, designed to test techniques for area and power reduction and intended for implementation in future Arm graphics processors.
{"title":"Formal Verification of a Chained Multiply-Add Design: Combining Theorem Proving and Equivalence Checking","authors":"David M. Russinoff, J. Bruguera, C. Chau, M. Manjrekar, Nicholas Pfister, Harsha Valsaraju","doi":"10.1109/ARITH54963.2022.00030","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00030","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00012
J. Bruguera
Digit-recurrence algorithms are widely used in current microprocessors to compute floating-point division and square root. These iterative algorithms offer a good trade-off in terms of performance, area and power. Commercial processors have non-pipelined division and square root units in which part of the logic is reused over several cycles. The main drawbacks of these non-pipelined units are the long latency of traditional division and square root implementations, the low bandwidth (or throughput) caused by reusing part of the logic over several cycles, and the hardware complexity of separate logic for division and square root. We present a radix-64 floating-point division and square root algorithm with a common iteration for division and square root, in which each radix-64 iteration is made of two simpler radix-8 iterations. The radix-64 algorithm enables low-latency operations, and the common division and square root radix-64 iteration yields some area reduction. The algorithm is mapped onto a low-latency, high-bandwidth pipelined unit.
{"title":"Low-Latency and High-Bandwidth Pipelined Radix-64 Division and Square Root Unit","authors":"J. Bruguera","doi":"10.1109/ARITH54963.2022.00012","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00012","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
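The basic digit-recurrence idea, one quotient digit per iteration from the residual recurrence w[j+1] = r*w[j] - q[j+1]*d, can be sketched as follows. This is a deliberately simplified, non-redundant integer version: the function name `digit_recurrence_divide` is hypothetical, and digit selection here uses an exact integer division, whereas hardware units typically use redundant signed-digit sets with selection based on a truncated residual estimate. It is not the radix-64 unit described in the abstract.

```python
def digit_recurrence_divide(x, d, radix=8, digits=6):
    """Return (quotient_digits, final_residual) for the fraction x/d.

    Requires 0 <= x < d, so every digit lies in [0, radix - 1].
    Each loop pass is one step of the recurrence
        w[j+1] = radix * w[j] - q[j+1] * d.
    """
    if not 0 <= x < d:
        raise ValueError("expected 0 <= x < d")
    w = x                      # initial residual w[0] = x
    quotient_digits = []
    for _ in range(digits):
        w *= radix             # shift the residual left by one digit
        q = w // d             # digit selection (exact, for simplicity)
        w -= q * d             # subtract the selected multiple of d
        quotient_digits.append(q)
    return quotient_digits, w


# 1/3 in radix 8 is 0.2525..., so the first four digits are 2, 5, 2, 5.
print(digit_recurrence_divide(1, 3, radix=8, digits=4))  # → ([2, 5, 2, 5], 1)
```

In this non-redundant setting, two consecutive radix-8 digits (q1, q2) combine into the single radix-64 digit 8*q1 + q2, which gives a rough intuition for building a higher-radix iteration out of two lower-radix ones.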
Pub Date: 2022-09-01, DOI: 10.1109/arith54963.2022.00011
John Osorio Ríos, Adrià Armejach, E. Petit, G. Henry, Marc Casas
{"title":"A BF16 FMA is All You Need for DNN Training","authors":"John Osorio Ríos, Adrià Armejach, E. Petit, G. Henry, Marc Casas","doi":"10.1109/arith54963.2022.00011","DOIUrl":"https://doi.org/10.1109/arith54963.2022.00011","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00015
Jongwook Sohn, David K. Dean, Eric E. Quintana, Wing Shek Wong
This paper presents an enhanced floating-point adder (FADD) design for the Intel E-Core processor. Floating-point addition and subtraction are two of the most widely used operations in many applications. The proposed FADD executes in 2 cycles, is fully pipelined, handles SSE/AVX operations for scalar/packed IEEE single and double precision, and supports all four rounding modes. The proposed FADD also fully supports both denormal inputs and underflow outputs without microcode assistance. To achieve the 2-cycle FADD with full denormal support, several optimization techniques are applied: split path algorithm, early alignment and sticky logic, parallel addition, rounding and all-ones detection, and modified leading zero anticipation (LZA) for masking the underflow. As a result, the proposed FADD achieves not only full denormal support but also about 12.5% lower latency than traditional FADD designs.
{"title":"Enhanced Floating-Point Adder with Full Denormal Support","authors":"Jongwook Sohn, David K. Dean, Eric E. Quintana, Wing Shek Wong","doi":"10.1109/ARITH54963.2022.00015","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00015","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00023
Fangan-Yssouf Dosso, J. Robert, P. Véron
{"title":"PMNS for efficient arithmetic and small memory cost","authors":"Fangan-Yssouf Dosso, J. Robert, P. Véron","doi":"10.1109/ARITH54963.2022.00023","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00023","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00021
Teodor-Dumitru Ene, J. Stine
Rephrasing binary addition as a parallel prefix tree problem allows for the generation of high-performance architectures with logarithmic delay. Modern literature and implementations seek to explore this prefix tree design space in order to identify optimal circuits for each target application. This paper broadens the scope of the design space by treating both preprocessing and post-processing nodes as malleable parts of the tree structure. Structures obtained through this novel approach are shown to have superior performance. Implementation results are presented using the SkyWater Open Source 130 nm PDK, and the open-source tools developed for this paper are made available.
{"title":"Point-Targeted Sparseness and Ling Transforms on Parallel Prefix Adder Trees","authors":"Teodor-Dumitru Ene, J. Stine","doi":"10.1109/ARITH54963.2022.00021","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00021","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
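For readers unfamiliar with the prefix formulation of addition, the sketch below models a standard Kogge-Stone prefix tree in software, with its three stages (preprocessing, prefix tree, post-processing) kept separate. It is a bit-level illustration of the classic structure only, with a hypothetical function name `kogge_stone_add`; it does not implement the sparseness or Ling transforms studied in the paper.

```python
def kogge_stone_add(a, b, width=8):
    """Add two unsigned integers with a Kogge-Stone prefix tree.

    Returns (sum modulo 2**width, carry_out).
    """
    # Preprocessing: per-bit generate (g) and propagate (p) signals.
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]

    # Prefix tree: after the level with distance `dist`, G[i] and P[i]
    # cover the bit group ending at i of length up to 2 * dist.
    # The number of levels is ceil(log2(width)).
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2

    # Post-processing: sum bit i is p[i] XOR the carry into bit i,
    # and the carry into bit i is the group generate G[i - 1].
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[width - 1]


print(kogge_stone_add(200, 100))  # → (44, 1), since 300 = 256 + 44
```

Treating the first and last stages as fixed is the conventional view; the abstract's point is that they can instead be considered part of the design space.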
Pub Date: 2022-09-01, DOI: 10.1109/arith54963.2022.00031
Louise Ben Salem-Knapp, S. Boldo, William Weens
{"title":"Bounding the Round-Off Error of the Upwind Scheme for Advection","authors":"Louise Ben Salem-Knapp, S. Boldo, William Weens","doi":"10.1109/arith54963.2022.00031","DOIUrl":"https://doi.org/10.1109/arith54963.2022.00031","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
Pub Date: 2022-09-01, DOI: 10.1109/ARITH54963.2022.00029
Malek Safieh, F. D. Santis
Gaussian integers are the complex numbers whose real and imaginary parts are both integers. When Gaussian integers are equipped with modulo operations, they form Gaussian integer rings or fields, depending on the specific choice of the modulus. Arithmetic on Gaussian integers can offer advantages in terms of operand size and improved parallelism, due to independent calculation of the real and imaginary parts. However, although Gaussian integer modulo reduction is the fundamental operation that enables computations in finite Gaussian integer rings and fields, efficient algorithms for it have not been widely investigated so far. In this work, we fill this gap and present efficient reduction algorithms for Gaussian integer moduli of special forms. Indeed, we demonstrate that there exist different classes of Gaussian integer moduli allowing for very fast reductions. Finally, we show that the computational complexity of the proposed algorithms is significantly reduced compared with generic Gaussian integer reduction methods known to date, e.g., Montgomery-based reduction for Gaussian integers.
{"title":"Efficient Reduction Algorithms for Special Gaussian Integer Moduli","authors":"Malek Safieh, F. D. Santis","doi":"10.1109/ARITH54963.2022.00029","DOIUrl":"https://doi.org/10.1109/ARITH54963.2022.00029","journal":{"name":"2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)"},"publicationDate":"2022-09-01"}
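As background for what "Gaussian integer modulo reduction" computes, the sketch below shows the generic rounding-division reduction z mod m = z - round(z / m) * m, carried out entirely in integer arithmetic. This is the baseline definition, not the special-moduli algorithms proposed in the paper and not a Montgomery-based method. The tuple representation (re, im) and the names `round_div` and `gaussian_mod` are assumptions made for this illustration.

```python
def round_div(n, d):
    """Round n / d to the nearest integer (ties round up), for d > 0."""
    return (2 * n + d) // (2 * d)


def gaussian_mod(z, m):
    """Reduce the Gaussian integer z modulo m.

    z and m are (real, imag) integer tuples and m must be non-zero.
    The result r satisfies z = q * m + r with norm(r) <= norm(m) / 2.
    """
    (a, b), (c, d) = z, m
    norm = c * c + d * d
    if norm == 0:
        raise ZeroDivisionError("modulus must be non-zero")
    # z / m = z * conj(m) / norm(m), where
    # z * conj(m) = (a*c + b*d) + (b*c - a*d) i.
    q_re = round_div(a * c + b * d, norm)
    q_im = round_div(b * c - a * d, norm)
    # q * m = (q_re*c - q_im*d) + (q_re*d + q_im*c) i.
    return (a - (q_re * c - q_im * d), b - (q_re * d + q_im * c))


print(gaussian_mod((7, 3), (4, 1)))  # → (-1, 1)
```

The real and imaginary parts of the quotient are rounded independently, which illustrates the parallelism the abstract mentions; the cost of the two divisions by norm(m) is what specialized reduction algorithms aim to avoid.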