首页 > 最新文献

2013 IEEE 21st Symposium on Computer Arithmetic最新文献

英文 中文
Parallel Modular Multiplication on Multi-core Processors 多核处理器上的并行模块化乘法
Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.20
Pascal Giorgi, L. Imbert, T. Izard
Current processors typically embed many cores running at high speed. The main goal of this paper is to assess the efficiency of software parallelism for low level arithmetic operations by providing a thorough comparison of several parallel modular multiplications. Famous methods such as Barrett, Montgomery as well as more recent algorithms are compared together with a novel k-ary multipartite multiplication which allows to split the computations into independent processes. Our experiments show that this new algorithm is well suited to software parallelism.
目前的处理器通常嵌入许多高速运行的核心。本文的主要目标是通过提供几个并行模块化乘法的全面比较来评估软件并行性对低级算术运算的效率。著名的方法如巴雷特,蒙哥马利以及最近的算法与一种新颖的k-ary多部乘法相比较,该乘法允许将计算拆分为独立的过程。实验表明,该算法非常适合软件并行化。
{"title":"Parallel Modular Multiplication on Multi-core Processors","authors":"Pascal Giorgi, L. Imbert, T. Izard","doi":"10.1109/ARITH.2013.20","DOIUrl":"https://doi.org/10.1109/ARITH.2013.20","url":null,"abstract":"Current processors typically embed many cores running at high speed. The main goal of this paper is to assess the efficiency of software parallelism for low level arithmetic operations by providing a thorough comparison of several parallel modular multiplications. Famous methods such as Barrett, Montgomery as well as more recent algorithms are compared together with a novel k-ary multipartite multiplication which allows to split the computations into independent processes. Our experiments show that this new algorithm is well suited to software parallelism.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128067369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
Floating Point Architecture Extensions for Optimized Matrix Factorization 优化矩阵分解的浮点架构扩展
Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.21
A. Pedram, A. Gerstlauer, R. V. D. Geijn
This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.
本文研究了在求解密集线性系统和线性最小二乘问题时遇到的算法映射到自定义线性代数处理器。具体来说,重点是Cholesky, LU(部分枢轴)和QR分解。作为研究的一部分,我们揭示了重新设计浮点单元及其周围数据路径以支持这些复杂操作的好处。我们将展示如何在体系结构中添加适度的复杂性,从而极大地减轻算法的复杂性。我们研究了设计权衡和架构修改的有效性,以证明我们可以将功率和性能效率提高到只有全定制ASIC设计才能达到的水平。可行性研究表明,我们对MAC单元的扩展可以将所需向量范数运算的速度提高一倍,同时减少60%的能量。类似地,对于LU分解,可以实现高达20%的加速和15%的能源节省。我们展示了如何在这些操作的复杂内核中保持这种效率。
{"title":"Floating Point Architecture Extensions for Optimized Matrix Factorization","authors":"A. Pedram, A. Gerstlauer, R. V. D. Geijn","doi":"10.1109/ARITH.2013.21","DOIUrl":"https://doi.org/10.1109/ARITH.2013.21","url":null,"abstract":"This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems to a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations. As part of the study, we expose the benefits of redesigning floating point units and their surrounding data-paths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design trade-offs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study shows that our extensions to the MAC units can double the speed of required vector-norm operations while reducing energy by 60%. Similarly, up to 20% speedup with 15% savings in energy can be achieved for LU factorization. We show how such efficiency is maintained even in the complex inner kernels of these operations.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Split-Path Fused Floating Point Multiply Accumulate (FPMAC) 分路融合浮点乘法累加(FPMAC)
Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.32
S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla
Floating point multiply-accumulate (FPMAC) unitis the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and SSE and also extensively used in server processors employed for engineering and scientific applications. Consequently design of FPMAC is of vital consideration since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on optimal computations in the critical path and therefore making it the fastest FPMAC design as of today in literature. The design is based on the premise of isolating and optimizing the critical path computation in FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of the accumulate add for near path into the Wallace tree for eliminating a 3:2compressor from near path critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder from multiplier giving both timing and power benefits. Our design by premise of splitting consumes lesser power for each operation where only the required logic for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the support for all rounding modes to adhere to IEEE standard for double precision FPMAC which is critical for employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being 30% smaller in area than it.
浮点乘积单元是现代处理器的骨干,是决定微处理器频率、功耗和面积的关键电路。FPMAC单元在当代客户端微处理器中广泛使用,随着ISA对AVX和SSE等指令的支持,FPMAC单元进一步扩散,并广泛用于工程和科学应用的服务器处理器。因此,FPMAC的设计是至关重要的考虑因素,因为它在此类系统中主导着功率和性能权衡决策。在这项工作中,我们展示了一种新颖的FPMAC设计,它专注于关键路径上的最佳计算,因此使其成为目前文献中最快的FPMAC设计。该设计以隔离和优化FPMAC运行中的关键路径计算为前提。在这项工作中,我们有三个关键创新来创建一种新的双精度FPMAC,在时序关键路径上具有最少的门级:a)根据指数差拆分远近路径(d=Exy-Ez ={-2, -1, 0,1}为近路径,其余为远路径),b)将近路径的累加提前注入到Wallace树中,以消除近路径关键逻辑中的3:2压缩器,利用近路径和稀疏Wallace树中的小对齐偏移进行53位尾数乘法。c)结合轮加和累加,以消除乘法器中的完成加法器,同时提供时间和功率优势。我们基于分割的设计为每个操作消耗更少的功率,其中每个操作只需要切换所需的逻辑。分割路径也为纯基于指数差分信号的逻辑门的未使用部分(近15-20%)的时钟或功率门提供了巨大的机会。我们还演示了对所有舍入模式的支持,以遵守双精度FPMAC的IEEE标准,这对于在当代处理器系列中使用该设计至关重要。演示的设计在时间上比最著名的IBM Power6[6]的硅实现高出14%,同时具有相似的面积,并且由于拆分处理而具有额外的功耗优势。该设计还与Lang等人[5]最著名的定时设计进行了比较,性能优于后者7%,而面积比后者小30%。
{"title":"Split-Path Fused Floating Point Multiply Accumulate (FPMAC)","authors":"S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla","doi":"10.1109/ARITH.2013.32","DOIUrl":"https://doi.org/10.1109/ARITH.2013.32","url":null,"abstract":"Floating point multiply-accumulate (FPMAC) unitis the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and SSE and also extensively used in server processors employed for engineering and scientific applications. Consequently design of FPMAC is of vital consideration since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on optimal computations in the critical path and therefore making it the fastest FPMAC design as of today in literature. The design is based on the premise of isolating and optimizing the critical path computation in FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of the accumulate add for near path into the Wallace tree for eliminating a 3:2compressor from near path critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder from multiplier giving both timing and power benefits. Our design by premise of splitting consumes lesser power for each operation where only the required logic for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the support for all rounding modes to adhere to IEEE standard for double precision FPMAC which is critical for employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being 30% smaller in area than it.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
Multiple-Precision Evaluation of the Airy Ai Function with Reduced Cancellation 减少消去的Airy函数的多精度评价
Pub Date : 2012-12-19 DOI: 10.1109/ARITH.2013.33
S. Chevillard, M. Mezzarobba
The series expansion at the origin of the Airy function Ai(x) is alternating and hence problematic to evaluate for x > 0 due to cancellation. Based on a method recently proposed by Gawronski, Müller, and Rein hard, we exhibit two functions F and G, both with nonnegative Taylor expansions at the origin, such that Ai(x) = G(x)/F(x). The sums are now well-conditioned, but the Taylor coefficients of G turn out to obey an ill-conditioned three-term recurrence. We use the classical Miller algorithm to overcome this issue. We bound all errors and our implementation allows an arbitrary and certified accuracy, that can be used, e.g., for providing correct rounding in arbitrary precision.
Airy函数Ai(x)原点处的级数展开是交替的,因此由于消去,当x > 0时很难求值。基于Gawronski, m ller和Rein hard最近提出的一种方法,我们展示了两个函数F和G,它们在原点处都具有非负泰勒展开,使得Ai(x) = G(x)/F(x)。这些和现在是条件良好的,但是G的泰勒系数服从一个条件不良的三项递归式。我们使用经典的米勒算法来克服这个问题。我们绑定了所有的错误,并且我们的实现允许任意和认证的精度,例如,在任意精度下提供正确的舍入。
{"title":"Multiple-Precision Evaluation of the Airy Ai Function with Reduced Cancellation","authors":"S. Chevillard, M. Mezzarobba","doi":"10.1109/ARITH.2013.33","DOIUrl":"https://doi.org/10.1109/ARITH.2013.33","url":null,"abstract":"The series expansion at the origin of the Airy function Ai(x) is alternating and hence problematic to evaluate for x > 0 due to cancellation. Based on a method recently proposed by Gawronski, Müller, and Rein hard, we exhibit two functions F and G, both with nonnegative Taylor expansions at the origin, such that Ai(x) = G(x)/F(x). The sums are now well-conditioned, but the Taylor coefficients of G turn out to obey an ill-conditioned three-term recurrence. We use the classical Miller algorithm to overcome this issue. We bound all errors and our implementation allows an arbitrary and certified accuracy, that can be used, e.g., for providing correct rounding in arbitrary precision.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123618775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
期刊
2013 IEEE 21st Symposium on Computer Arithmetic
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1