Low-Latency and High-Bandwidth Pipelined Radix-64 Division and Square Root Unit
J. Bruguera
2022 IEEE 29th Symposium on Computer Arithmetic (ARITH)
Publication date: 2022-09-01
DOI: 10.1109/ARITH54963.2022.00012 (https://doi.org/10.1109/ARITH54963.2022.00012)
Citations: 1
Abstract
Digit-recurrence algorithms are widely used in current microprocessors to compute floating-point division and square root. These iterative algorithms offer a good trade-off between performance, area and power. Commercial processors have non-pipelined division and square root units in which part of the logic is reused over several cycles. The main drawbacks of these non-pipelined units are the long latency of traditional division and square root implementations, the low bandwidth (or throughput) caused by reusing part of the logic over several cycles, and their hardware complexity, with separate logic for division and square root. We present a radix-64 floating-point division and square root algorithm with a common iteration for division and square root, in which each radix-64 iteration is made of two simpler radix-8 iterations. The radix-64 algorithm enables low-latency operations, and the common division and square root radix-64 iteration results in some area reduction. The algorithm is mapped onto a low-latency and high-bandwidth pipelined unit.
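The abstract describes the algorithm only at a high level, so the following Python sketch illustrates the general idea under stated assumptions; it is not the paper's algorithm. It assumes a maximally redundant radix-8 digit set {-7, ..., 7}, exact rational residuals (`fractions.Fraction`), and digit selection by exhaustive search for the smallest next residual, whereas practical hardware selects digits from a short estimate of a redundant residual. Both operations share one step, w' = 8w - F(s), where F(s) = s·d for division and F(s) = s·(2S + s·8^-(j+1)) for square root (S is the partial root after j digits). One radix-64 iteration chains two such radix-8 steps and packs their digits as 8·hi + lo. The names `radix8_step`, `divide` and `square_root`, and the input normalizations |x| ≤ d/2 and 1/4 ≤ x < 1, are hypothetical choices made for this illustration.

```python
from fractions import Fraction
from typing import Callable, List, Tuple

R = 8                      # radix of one elementary sub-iteration
DIGITS = range(-7, 8)      # assumption: maximally redundant radix-8 digit set


def radix8_step(w: Fraction, f: Callable[[int], Fraction]) -> Tuple[int, Fraction]:
    """Hypothetical common step: w_next = 8*w - f(s).

    The digit s is found by exhaustive search for the smallest |w_next|
    (exact selection, for illustration; hardware uses residual estimates).
    """
    s = min(DIGITS, key=lambda t: abs(R * w - f(t)))
    return s, R * w - f(s)


def divide(x: Fraction, d: Fraction, n: int) -> Tuple[Fraction, List[int]]:
    """Approximate x/d with n radix-64 iterations (assumes d > 0, |x| <= d/2)."""
    assert d > 0 and abs(x) <= d / 2
    w, q, digits64 = x, Fraction(0), []
    for k in range(1, n + 1):
        # One radix-64 iteration = two chained radix-8 steps with F(s) = s*d.
        hi, w = radix8_step(w, lambda s: s * d)
        lo, w = radix8_step(w, lambda s: s * d)
        q64 = R * hi + lo                 # radix-64 digit in [-63, 63]
        digits64.append(q64)
        q += Fraction(q64, 64 ** k)       # invariant: x = d*q + w * 64**-k
    return q, digits64


def square_root(x: Fraction, n: int) -> Tuple[Fraction, List[int]]:
    """Approximate sqrt(x) with n radix-64 iterations (assumes 1/4 <= x < 1)."""
    assert Fraction(1, 4) <= x < 1
    S, w, j, digits64 = Fraction(1), x - 1, 0, []   # invariant: w = 8**j * (x - S**2)
    for _ in range(n):
        pair = []
        for _ in range(2):                # two radix-8 steps per radix-64 iteration
            j += 1
            h = Fraction(1, R ** j)       # weight of the new digit, 8**-j
            # Same step as division, with F(s) = s*(2*S + s*h).
            s, w = radix8_step(w, lambda t: t * (2 * S + t * h))
            S += s * h
            pair.append(s)
        digits64.append(R * pair[0] + pair[1])
    return S, digits64


if __name__ == "__main__":
    q, _ = divide(Fraction(1, 4), Fraction(3, 4), 4)
    root, _ = square_root(Fraction(1, 2), 4)
    print(float(q), float(root))   # approximately 1/3 and sqrt(1/2)
```

Chaining the two radix-8 steps and packing them as 8·hi + lo yields one radix-64 digit per iteration, so n iterations produce roughly 6n result bits. In this sketch, the only difference between the two operations is the subtracted term F(s) passed to `radix8_step`, which mirrors the abstract's point about sharing one iteration between division and square root.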